
Edinburgh Research Explorer

CCG Parsing Algorithm with Incremental Tree Rotation

Citation for published version:
Stanojevic, M & Steedman, M 2019, CCG Parsing Algorithm with Incremental Tree Rotation. in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). vol. 1, Association for Computational Linguistics, Minneapolis, Minnesota, pp. 228–239, 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Minneapolis, United States, 2/06/19. <https://www.aclweb.org/anthology/N19-1020>

Link:
Link to publication record in Edinburgh Research Explorer

Document Version:
Peer reviewed version

Published In:
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

General rights
Copyright for the publications made accessible via the Edinburgh Research Explorer is retained by the author(s) and/or other copyright owners, and it is a condition of accessing these publications that users recognise and abide by the legal requirements associated with these rights.

Take down policy
The University of Edinburgh has made every reasonable effort to ensure that Edinburgh Research Explorer content complies with UK legislation. If you believe that the public display of this file breaches copyright please contact [email protected] providing details, and we will remove access to the work immediately and investigate your claim.

Download date: 13. Feb. 2021


CCG Parsing Algorithm with Incremental Tree Rotation

Milos Stanojevic
School of Informatics

University of Edinburgh
[email protected]

Mark Steedman
School of Informatics

University of Edinburgh
[email protected]

Abstract

The main obstacle to incremental sentence processing arises from right-branching constituent structures, which are present in the majority of English sentences, as well as from optional constituents that adjoin on the right, such as right adjuncts and right conjuncts. In CCG, many right-branching derivations can be replaced by semantically equivalent left-branching incremental derivations.

The problem of right-adjunction is more resistant to solution, and has been tackled in the past using revealing-based approaches that often rely either on higher-order unification over lambda terms (Pareschi and Steedman, 1987) or on heuristics over dependency representations that do not cover the whole CCGbank (Ambati et al., 2015).

We propose a new incremental parsing algorithm for CCG following the same revealing tradition of work, but with a purely syntactic approach that does not depend on access to a distinct level of semantic representation. This algorithm can cover the whole CCGbank, with greater incrementality and accuracy than previous proposals.

1 Introduction

Combinatory Categorial Grammar (CCG) (Ades and Steedman, 1982; Steedman, 2000) is a mildly context-sensitive grammar formalism that is attractive both from a cognitive and from an engineering perspective. Compared to other grammar formalisms, the aspect in which CCG excels is incremental sentence processing. CCG has a very flexible notion of constituent structure which allows (mostly) left-branching derivation trees that are easier to process incrementally. Take for instance the derivation tree in Figure 1a. If we use a non-incremental shift-reduce parser (as done in the majority of transition-based parsers for CCG (Zhang and Clark, 2011; Xu et al., 2014; Xu, 2016)) we will be able to establish the semantic connection between the subject "Nada" and the verb "eats" only when we reach the end of the sentence. This is undesirable for several reasons. First, human sentence processing is much more incremental, so that the meaning of the prefix "Nada eats" is available as soon as it is read (Marslen-Wilson, 1973). Second, if we want a predictive model, either for better parsing or for language modelling, it is crucial to establish relations between the words in the prefix as early as possible.

To address this problem, a syntactic theory needs to be able to represent partial constituents like "Nada eats" and to have mechanisms to build them just by observing the prefix. In CCG, solutions for these problems come out of the theory naturally. CCG categories can represent partial structures, and these partial structures can combine into bigger (partial) structures using CCG combinators recursively. Figure 1b shows how CCG can incrementally process the example sentence via a different derivation tree that generates the same semantics more incrementally by being left-branching.

This way of doing incremental processing seems straightforward except for one obstacle: optional constituents that attach from the right, i.e. right adjuncts. Because they are optional, it is impossible to predict them with certainty. This forces an eager incremental processor to make an uninformed decision very early and, if later that decision turns out to be wrong, to backtrack to repair the mistake. This behaviour would imply that human processors have difficulty in processing right adjuncts, but that does not seem to be the case. For instance, let's say that after incrementally processing "Nada eats apples" we encounter the right adjunct "regularly", as in Figure 2a. The parser will be stuck at this point because there is no way to attach the right adjunct of a verb phrase to a sentence constituent.


Nada eats apples (lexical categories: Nada := NP, eats := S\NP/NP, apples := NP)

(a) Right-branching derivation: eats apples ⇒ S\NP (>); Nada eats apples ⇒ S (<).

(b) Left-branching derivation: Nada ⇒ S/(S\NP) (>T); Nada eats ⇒ S/NP (>B); Nada eats apples ⇒ S (>).

Figure 1: Semantically equivalent CCG derivations.

Nada eats apples regularly (regularly := S\NP\(S\NP))

(a) Problem – the S\NP that needs to be modified was never built: the left-branching derivation produces only S/(S\NP), S/NP and S.

(b) Incremental tree rotation reveals S\NP: Nada ⇒ S/(S\NP) (>T); eats apples ⇒ S\NP (>); Nada eats apples ⇒ S (>).

(c) The right adjunct is attached to the revealed node: eats apples ⇒ S\NP (>); eats apples regularly ⇒ S\NP (<); Nada eats apples regularly ⇒ S (>).

Figure 2: Right adjunction.

A simple solution would be some sort of limited backtracking where we would look whether we could extract the verb phrase, attach its right adjunct, and then put the derivation back together. But how do we extract the verb phrase "eats apples" when that constituent was never built during the incremental left-branching derivation?

Pareschi and Steedman (1987) proposed to reveal the constituent that is needed, the verb phrase in our example, by having an elegant way of reanalysing the derivation. This reanalysis does not repeat parsing from scratch but instead runs a single CCG combinatory rule backwards. In the example at hand, first we recognise that right adjunction needs to take place because we have a category of shape X\X (concretely (S\NP)\(S\NP), but in the present CCG notation slashes "associate to the left", so we drop the first pair of brackets). Thanks to the type of the adjunct we know that the constituent that needs to be revealed is of type X, in our case S\NP. Now, we take the constituent on the left of the right adjunct, in our example the constituent S, and look for a CCG category Y and a combinatory rule C that satisfy the relation C(Y, S\NP) = S. The solution to this type equation is Y = NP and C = <.

To confine revealing to delivering constituents that the parser could have built if it had been less greedy for incrementality, and to exclude revelation of unsupported types, such as PP in Figure 2a, the process must be constrained by the actual derivation. Pareschi and Steedman proposed to do so by accessing the semantic representation in parallel, using higher-order unification, which is in general undecidable and may be unsound unless defined over a specific semantic representation.

Ambati et al. (2015) propose an alternative method for revealing where dependencies are used as a semantic representation (instead of first-order logic) and special heuristics are used for revealing (instead of higher-order unification). This is computationally a much more efficient approach and appears sound, but it requires distinct revealing rules for each constituent type and has specific difficulties with punctuation.

In this paper we propose a method of revealing that does not depend on any specific choice of semantic representation, can discover multiple possible revealing options if they are available, is sound and complete and computationally efficient, and gives state-of-the-art parsing results. The algorithm works by building left-branching derivations incrementally, but, following Niv (1993, 1994), as soon as a left-branching derivation is built, its derivation tree is rebalanced to be right-branching. When all such constituents' derivation trees are right-branching, revealing becomes a trivial operation where we just traverse the right spine looking for the constituent(s) of the right type to be modified by the right adjunct.

We call this rebalancing operation tree rotation, since this is the technical term established in the field of data structures for a similar operation on balanced binary search trees (Adelson-Velskii and Landis, 1962; Guibas and Sedgewick, 1978; Okasaki, 1999; Cormen et al., 2009).


Figure 2b shows the right-rotated derivation of "Nada eats apples" next to the adjunct. Here we can just look up the required S\NP and attach the right adjunct to it, as in Figure 2c.

2 Combinatory Categorial Grammar

CCG is a lexicalized grammar formalism where each lexical item in a derivation has a category assigned to it which expresses the ways in which the lexical item can be used in the derivation. These categories are put together using combinatory rules. The binary combinatory rules we use are:

X/Y    Y        ⇒ X        (>)
Y      X\Y      ⇒ X        (<)
X/Y    Y/Z      ⇒ X/Z      (>B)
Y\Z    X\Y      ⇒ X\Z      (<B)
Y/Z    X\Y      ⇒ X/Z      (<B×)
Y/Z|W  X\Y      ⇒ X/Z|W    (<B2×)
X/Y    Y/Z|W    ⇒ X/Z|W    (>B2)

Each binary combinatory rule has one primary and one secondary category as its inputs. The primary functor is the one that selects, while the secondary category is the one that is selected. In forward combinatory rules the primary functor is always the left argument, while in backward combinatory rules it is always the right one.

It is useful to look at the mentioned combinatory rules in a generalised way. For instance, if we look at the forward combinatory rules we can see that they all follow the same pattern of combining X/Y with a category that starts with Y. The only difference among them is how many subcategories follow Y in the secondary category. In the case of forward function application there will be nothing following Y, so we can treat forward function application as a generalised forward composition combinator of the zeroth order, >B0. Standard forward function composition >B is then the generalised composition of first order, >B1, while >B2 is the generalised composition of second order. The same generalisation can be applied to backward combinators. There is a low bound on the order of combinatory rules, around 2 or 3.
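To make the generalisation concrete, here is a minimal sketch of generalised forward composition over a bare-bones category datatype, written in Scala (the language the released parser is implemented in). The names Cat, Atom, Fwd, Bwd and forwardCompose are illustrative and are not taken from the released code; >B0 corresponds to forward application and >Bn to forward composition of order n.

// A minimal sketch (not the released implementation) of generalised forward
// composition >B^n over a bare-bones category datatype. n = 0 is forward
// application (>), n = 1 is ordinary forward composition (>B), and so on.
sealed trait Cat
case class Atom(name: String) extends Cat               // e.g. NP, S, PP
case class Fwd(res: Cat, arg: Cat) extends Cat          // X/Y
case class Bwd(res: Cat, arg: Cat) extends Cat          // X\Y

object Combinators {
  // Peel the last n arguments off the secondary category so that what remains
  // can be matched against the primary's argument Y; remember how to glue them back.
  private def peel(c: Cat, n: Int,
                   acc: List[Cat => Cat]): Option[(Cat, List[Cat => Cat])] =
    if (n == 0) Some((c, acc))
    else c match {
      case Fwd(res, arg) => peel(res, n - 1, ((x: Cat) => Fwd(x, arg)) :: acc)
      case Bwd(res, arg) => peel(res, n - 1, ((x: Cat) => Bwd(x, arg)) :: acc)
      case _             => None
    }

  // >B^n :  X/Y   Y|Z1..|Zn  =>  X|Z1..|Zn
  def forwardCompose(n: Int, primary: Cat, secondary: Cat): Option[Cat] =
    primary match {
      case Fwd(x, y) =>
        peel(secondary, n, Nil).collect {
          case (core, wrappers) if core == y =>
            wrappers.foldLeft(x)((c, wrap) => wrap(c))
        }
      case _ => None
    }
}

// Example: (S\NP)/NP combined with NP by >B0 gives S\NP, i.e.
// Combinators.forwardCompose(0,
//   Fwd(Bwd(Atom("S"), Atom("NP")), Atom("NP")), Atom("NP"))
//   == Some(Bwd(Atom("S"), Atom("NP")))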

Following Hockenmaier and Steedman (2007), the proclitic character of conjunctions is captured in a syncategorematic rule combining them with the right conjunct, with the result later combining with the left conjunct1:

conj  X        ⇒ X[conj]   (>Φ)
X     X[conj]  ⇒ X         (<Φ)

Some additional unary and binary type-changing rules are also needed to process the derivations in CCGbank (Hockenmaier and Steedman, 2007). We use the same type-changing rules as those described in Clark and Curran (2007).

Among the unary combinatory rules the most important one is type-raising. The first reason for that is that it allows CCG to handle constructions like argument cluster coordination in a straightforward way. Second, it allows CCG to be much more incremental, as seen from the example in Figure 1b. Type-raising rules are expressed in the following way:

X ⇒ Y/(Y\X)   (>T)
X ⇒ Y\(Y/X)   (<T)

Type-raising is strictly limited to applying to category types that are arguments, such as NP, PP, etc., making it analogous to grammatical case in languages like Latin and Japanese, in spite of the lack of morphological case in English.

3 Parsing

CCG derivations can be parsed with the same shift-reduce mechanism used for CFG parsing (Steedman, 2000). In the context of CFG parsing the shift-reduce algorithm is not incremental, because CFG structures are mostly right-branching, but in CCG, by changing the derivation via the combinatory rules, we also change the level of incrementality of the algorithm.

As usual, the shift-reduce algorithm consists of a stack of the constituents built so far and a buffer with words that are yet to be processed. Parsing starts with the stack empty and the buffer containing the whole sentence. The end state is a stack with only one element and an empty buffer. Transitions between parser states are:

• shift(X) – moves the first word from the buffer to the stack and labels it with category X,
• reduceUnary(C) – applies a unary combinatory rule C to the topmost constituent on the stack,
• reduceBinary(C) – applies a binary combinatory rule C to the two topmost constituents on the stack.

1 This notation differs unimportantly from Steedman (2000), who uses a ternary coordination rule, and from more recent work in which conjunctions are X\X/X.


CCG shift-reduce parsers are often built over right-branching derivations that obey the Eisner normal form (Eisner, 1996). Processing left-branching derivations is not any different except that it requires the opposite normal form.

Our revealing algorithm adds a couple of modifications to this default shift-reduce algorithm. First, it guarantees that all the trees stored on the stack are right-branching – this still allows left-branching parsing and only adds the requirement of adjusting newly reduced trees on the stack to be right-leaning. Second, it adds revealing transitions that exploit the right-branching guarantee to apply right adjunction. Both tree rotation and revealing are performed efficiently, as described in the following subsections.
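The overall shape of this transition system can be sketched as follows (illustrative names and string-labelled trees, not the released implementation). The rotate-to-right repair of Section 3.1 is passed in as a function so that the right-branching invariant is restored after every binary reduction.

// Schematic shift-reduce state (illustrative only). Each stack element is a
// derivation tree; reduceBinary re-establishes the right-branching invariant by
// calling the rotate-to-right repair described in Section 3.1.
case class Tree(label: String, children: List[Tree] = Nil)

case class ParserState(stack: List[Tree], buffer: List[String]) {

  // shift(X): move the next word onto the stack, labelled with supertag X.
  def shift(supertag: String): ParserState = buffer match {
    case word :: rest => ParserState(Tree(supertag, List(Tree(word))) :: stack, rest)
    case Nil          => this
  }

  // reduceUnary(C): apply a unary rule to the topmost tree.
  def reduceUnary(rule: Tree => Tree): ParserState = stack match {
    case top :: rest => ParserState(rule(top) :: rest, buffer)
    case Nil         => this
  }

  // reduceBinary(C): combine the two topmost trees, then repair the result so
  // that it is right-branching again.
  def reduceBinary(rule: (Tree, Tree) => Tree,
                   rotateRight: Tree => Tree): ParserState = stack match {
    case right :: left :: rest =>
      ParserState(rotateRight(rule(left, right)) :: rest, buffer)
    case _ => this
  }

  def isFinal: Boolean = buffer.isEmpty && stack.size == 1
}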

3.1 Tree rotation

A naive way of enforcing the right-branching guarantee is to do a complete transformation of the subtree on the stack into a right-branching one. However, that would be unnecessarily expensive. Instead we do incremental tree rotation to the right. If we assume that all the elements on the stack respect this right-branching form (our inductive case), this state can be disturbed only by the reduceBinary transition (shift just adds a single word which is trivially right-branching, and reduceUnary does not influence the direction of branching). The reduceBinary transition will take the two topmost elements on the stack, which are already right-branching, and put them as children of some new binary node. We need to repair that potential "imperfection" on top of the tree. This is done by recursively rotating the nodes as in Figure 3a.2

This figure shows one of the sources of CCG's spurious ambiguity: a parent-child relation of combinatory rules with the same directionality. Here we concentrate on forward combinators because they are the most frequent in our data (most backward combinators disappear with the addition of forward type-raising and the addition of special right-adjunct transitions), but the same method can be applied to backward combinatory rules as a mirror image.

2 Although we do not discuss the operations on the semantic predicate-argument structure that correspond to tree rotation, the combinatory semantics of the rules themselves guarantees that such operations can be done uniformly and in parallel.

(a) Rotate to right:

    >Bx( >By(α, β), γ )   ⇒   >B(x+y−1)( α, >Bx(β, γ) )        if y ≠ 0

(b) Rotate to left:

    >Bx( α, >By(β, γ) )   ⇒   >By( >B(x−y+1)(α, β), γ )        if x ≥ y

Figure 3: Tree rotation operations. The rotation is applied recursively to the newly created inner node (marked with a red square in the original figure). Variables x and y represent the orders of the combinatory rules.

Having two combinatory rules of the same directionality is a necessary but not a sufficient condition for spurious ambiguity: as the side condition in Figure 3a shows, the lower combinator must not be >B0. The tree rotation function assumes that both of the children are "perfect" (meaning right-branching3) and that the only imperfection is at the root node. The method repairs this imperfection at the root by applying the tree rotation transformation, but it also creates a new node as a right child, and that node might be imperfect. That is why the method goes down the right node recursively until all the imperfections are removed and the whole tree becomes fully right-branching. In the worst case the method will reach the bottom of the tree, but often only 3 or 4 nodes need to be transformed to make the tree perfectly right-branching. The worst-case complexity of repairing the imperfection is O(n), which makes the complexity of the whole parsing algorithm O(n²) for building a single derivation.
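The repair step of Figure 3a can be sketched as follows, assuming nodes store a plain string category and the order of the forward composition (>Bx) that built them; recombine is a caller-supplied helper (an assumption of this sketch, not something in the paper) that recomputes the category of a newly created node via generalised composition.

// Rotate-to-right repair (Figure 3a), a sketch. FwdNode(x, ...) stands for a node
// built by generalised forward composition >B^x.
sealed trait Deriv { def cat: String }
case class Word(form: String, cat: String) extends Deriv
case class FwdNode(order: Int, cat: String, left: Deriv, right: Deriv) extends Deriv

object Rotate {
  // `recombine(order, left, right)` must return the category that >B^order assigns
  // to the new node; it is supplied by the caller (e.g. via the category algebra).
  def toRight(node: Deriv, recombine: (Int, Deriv, Deriv) => String): Deriv =
    node match {
      // Imperfection at the root: >B^x whose left child is >B^y with y != 0.
      case FwdNode(x, cat, FwdNode(y, _, alpha, beta), gamma) if y != 0 =>
        // The new right child (>B^x over beta and gamma) may itself be imperfect,
        // so the repair recurses down the right spine until none remain.
        val fresh    = FwdNode(x, recombine(x, beta, gamma), beta, gamma)
        val newRight = toRight(fresh, recombine)
        FwdNode(x + y - 1, cat, alpha, newRight)
      case other => other   // already right-branching at the root
    }
}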

As a running example we will use the derivation tree in Figure 4a, for which a transition sequence is given in Figure 4b. Here tree rotation is used in transitions 6 and 8, which introduce imperfections. In transition 6 a single tree rotation at the top was enough to correct the imperfection, while in transition 8 the recursive tree rotation function went to depth two.

If the upper and lower combinators are both >B2, the topmost combinator on the right will become >B3, a combinatory rule that may be unnecessary for defining the competence grammar of human languages, but which is required if parsing performance is to be as incremental as possible.

3 By right-branching we mean as right-branching as is allowed by the CCG formalism and the predicate-argument structure.


Harry might find happiness and forget me

Lexical categories: Harry := NP, might := S\NP/(S\NP), find := S\NP/NP, happiness := NP, and := conj, forget := S\NP/NP, me := NP.

Harry ⇒ S/(S\NP) (>T);  Harry might ⇒ S/(S\NP) (>B1)
find happiness ⇒ S\NP (>B0);  forget me ⇒ S\NP (>B0);  and forget me ⇒ S\NP[conj] (>Φ)
find happiness and forget me ⇒ S\NP (<Φ)
Harry might find happiness and forget me ⇒ S (>B0)

(a) Derivation tree.

transition                 stack after the transition
 1  shift                  Harry
 2  reduceUnary(>T)        (>T Harry)
 3  shift                  (>T Harry) | might
 4  reduceBinary(>B1)      (>B1 (>T Harry) might)
 5  shift                  (>B1 (>T Harry) might) | find
 6  reduceBinary(>B1)      (>B1 (>B1 (>T Harry) might) find)
      rotate to right      (>B1 (>T Harry) (>B1 might find))
 7  shift                  (>B1 (>T Harry) (>B1 might find)) | happiness
 8  reduceBinary(>B0)      (>B0 (>B1 (>T Harry) (>B1 might find)) happiness)
      rotate to right      (>B0 (>T Harry) (>B0 (>B1 might find) happiness))
      rotate to right      (>B0 (>T Harry) (>B0 might (>B0 find happiness)))
 9  shift                  (>B0 (>T Harry) (>B0 might (>B0 find happiness))) | and
10  shift                  (>B0 (>T Harry) (>B0 might (>B0 find happiness))) | and | forget
11  shift                  (>B0 (>T Harry) (>B0 might (>B0 find happiness))) | and | forget | me
12  reduceBinary(>B0)      (>B0 (>T Harry) (>B0 might (>B0 find happiness))) | and | (>B0 forget me)
13  reduceBinary(>Φ)       (>B0 (>T Harry) (>B0 might (>B0 find happiness))) | (>Φ and (>B0 forget me))
14  reveal                 two options for right adjunction, both with the <Φ combinator:
                             option 1: (>B0 might (>B0 find happiness))
                             option 2: (>B0 find happiness)
15  pick option 2          (>B0 (>T Harry) (>B0 might (<Φ (>B0 find happiness) (>Φ and (>B0 forget me)))))

(b) Transition sequence for the derivation tree (stack elements are separated by |).

Figure 4: Example of the algorithm run over a sentence with tensed VP coordination.

Fortunately, the configuration with two connected >B2 combinatory rules appears very rarely in CCGbank.

Many papers have been published on using left-branching CCG derivations but, to the best of our knowledge, none of them explains how they are constructed from right-branching CCGbank trees. A very simple algorithm for that can be made using our tree rotation function. Here we use rotation in the opposite direction, i.e. rotation to the left (Figure 3b). We cannot apply this operation from the top node of the CCGbank tree because that tree does not satisfy the assumption of the algorithm: its immediate children are not "perfect" (here perfect means being left-branching). That is why we start from the bottom of the tree with the terminal nodes, which are trivially "perfect", and apply the tree transformation to each node in a post-order traversal.
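The same machinery runs in the opposite direction to turn CCGbank's right-branching trees into left-branching ones, as described above. The following sketch reuses the Deriv encoding and the caller-supplied recombine helper from the rotate-to-right sketch, and assumes the x ≥ y side condition from Figure 3b.

// Converting a right-branching (CCGbank-style) derivation into a left-branching one,
// a sketch that mirrors the rotate-to-right code above. Rotation to the left
// (Figure 3b) is applied to every node in a post-order traversal, so children are
// already "perfect" (here: left-branching) when their parent is repaired.
object ToLeft {
  def toLeft(node: Deriv, recombine: (Int, Deriv, Deriv) => String): Deriv =
    node match {
      // Root >B^x whose right child is >B^y with x >= y: rotate it to the left.
      case FwdNode(x, cat, alpha, FwdNode(y, _, beta, gamma)) if x >= y =>
        val fresh   = FwdNode(x - y + 1, recombine(x - y + 1, alpha, beta), alpha, beta)
        val newLeft = toLeft(fresh, recombine)   // the new left child may be imperfect
        FwdNode(y, cat, newLeft, gamma)
      case other => other
    }

  // Post-order driver: first make both children left-branching, then repair the root.
  def convert(node: Deriv, recombine: (Int, Deriv, Deriv) => String): Deriv =
    node match {
      case FwdNode(x, cat, l, r) =>
        toLeft(FwdNode(x, cat, convert(l, recombine), convert(r, recombine)), recombine)
      case leaf => leaf
    }
}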

This incremental tree rotation algorithm is inspired by the AVL self-balancing binary search trees (Adelson-Velskii and Landis, 1962) and Red-Black trees (Guibas and Sedgewick, 1978; Okasaki, 1999). The main difference is that here we are trying to do the opposite of AVL: instead of making the tree perfectly balanced we are trying to make it perfectly unbalanced, i.e. leaning to the right (or left). Also, our imperfections start at the top and are pushed to the bottom of the tree, which is in contrast to AVL trees where imperfections start at the bottom and get pushed to the top.

The last important point about tree rotation concerns punctuation rules. All punctuation is attached to the left of the highest possible node in the case of left-branching derivations (Hockenmaier and Bisk, 2010), while in the right-branching derivations we lower the punctuation to the bottom left neighbouring node. Punctuation has no influence on the predicate-argument structure, so it is safe to apply this transformation.


3.2 Revealing transitions

If the topmost element on the stack is of the form X\X and the second topmost element on the stack has on its right edge one or more constituents of a type X|$, we allow the reveal transition.4 This is a more general way of revealing than the approaches of Pareschi and Steedman (1987) and Ambati et al. (2015), who attempt to reveal only constituents of type X, while we reveal any type that has X as its prime element (that is the meaning of the X|$ notation).

We also treat X[conj] as a right adjunct of the left conjunct. Similarly to the previous case, if the topmost element on the stack is X[conj] and the right edge of the second topmost element on the stack has constituent(s) of type X, they are revealed for possible combination via the <Φ combinator.

If the reveal transition is selected, as in transition 14 in Figure 4b, the parser enters a mode of choosing among the different constituents labelled X|$ that could be modified by the right adjunct X\X. After a particular X|$ node is chosen, X\X is combined with it and the rest of the tree above the X node is rebuilt in the same way. This rebuild is fully deterministic and is done quickly, even though in principle it could take O(n) to compute. Even in the worst-case scenario, it does not make the complexity of the algorithm go higher than O(n²).
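The candidate search performed by the reveal transition can be sketched as follows, under the same illustrative category encoding as in the Section 2 sketch (repeated here so the fragment stands alone): given an adjunct of shape X\X, we walk the right spine of the tree below it and collect every node whose category has X as its prime element, i.e. every node of the form X|$.

// A sketch of candidate collection for the reveal transition (illustrative names).
sealed trait Cat
case class Atom(name: String) extends Cat
case class Fwd(res: Cat, arg: Cat) extends Cat   // X/Y
case class Bwd(res: Cat, arg: Cat) extends Cat   // X\Y

case class Node(cat: Cat, children: List[Node] = Nil)

object Reveal {
  // Does `c` have the form x|$, i.e. x followed by any number of further arguments?
  def hasPrime(c: Cat, x: Cat): Boolean =
    c == x || (c match {
      case Fwd(res, _) => hasPrime(res, x)
      case Bwd(res, _) => hasPrime(res, x)
      case _           => false
    })

  // All right-spine nodes that an adjunct X\X could modify, top of the spine first.
  def candidates(tree: Node, adjunct: Cat): List[Node] = adjunct match {
    case Bwd(x, y) if x == y =>
      def spine(t: Node): List[Node] = {
        val here = if (hasPrime(t.cat, x)) List(t) else Nil
        here ::: t.children.lastOption.map(spine).getOrElse(Nil)
      }
      spine(tree)
    case _ => Nil   // not an adjunct of shape X\X
  }
}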

The ability of our algorithm to choose among different possible revealing options is unique among all the proposals for revealing. For transition 15 in Figure 4b the parser can choose whether to adjoin (coordinate) to a verb phrase that already contains a left modifier or to one without. This is similar to the Selective Modifier Placement strategy from older Augmented Transition Network (ATN) systems (Woods, 1973), which finds all the attachment options that are syntactically legal and then allows the parser to choose among them using some criteria. Woods (1973) suggests using lexical semantic information for this selection, but in his ATN system only handwritten semantic selection rules were used. Here we will also use selection based on the lexical content, but it will be broad coverage and learned from the data. This ability to semantically select the modifier's attachment point is essential for good parsing results, as will be shown later.

4 The "$" notation is from Steedman (2000), where $ is used as a (potentially empty) placeholder variable ranging over multiple arguments.

4 Neural Model

The neural probabilistic model that chooses which transition should be taken next conditions on the whole state of the configuration, in a similar way to the RNNG parser (Dyer et al., 2016). The words in the sentence are first embedded using the concatenation of the top layers of ELMo embeddings (Peters et al., 2018), which are L2-normalised and then refined with two layers of bi-LSTM (Graves et al., 2005). The neural representation of a terminal is composed of the concatenated ELMo embedding and a supertag embedding.

The representation of a subtree combines the following (a rough sketch of how these pieces are assembled follows the list):

• span representation – we subtract the representation of the leftmost terminal from the representation of the rightmost terminal, as done in the LSTM-Minus architecture (Wang and Chang, 2016),
• combinator and category embeddings,
• head words encoding – because each constituent can have a set of heads, for instance arising from coordination, we model the representation of the heads with the DeepSets architecture (Zaheer et al., 2017) over the representations of the head terminals.
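A rough sketch of how such a rebranching-invariant subtree representation could be assembled from its parts: plain arrays stand in for DyNet expressions, the names are ours, and a plain element-wise sum stands in for the DeepSets aggregation over head terminals.

// Illustrative composition of a subtree representation: the span vector is the
// difference between the rightmost and leftmost terminal vectors (LSTM-Minus style),
// the head representation is an order-invariant sum over head terminals, and the
// combinator and category embeddings are concatenated on top.
object SubtreeRepr {
  type Vec = Array[Double]

  private def minus(a: Vec, b: Vec): Vec = a.zip(b).map { case (x, y) => x - y }
  private def sum(vs: Seq[Vec]): Vec =
    vs.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })  // assumes at least one head

  def of(leftmostTerminal: Vec, rightmostTerminal: Vec,
         headTerminals: Seq[Vec], combEmbedding: Vec, catEmbedding: Vec): Vec = {
    val span  = minus(rightmostTerminal, leftmostTerminal)  // LSTM-Minus span encoding
    val heads = sum(headTerminals)                          // invariant to set order and size
    span ++ combEmbedding ++ catEmbedding ++ heads          // final concatenation
  }
}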

We do not use recursive neural networks like Tree-LSTM (Tai et al., 2015) to encode subtrees because of the frequency of tree rotation. These operations are fast, but they would trigger frequent recomputation of the neural tree representation, so we opted for a mechanism that is invariant to rebranching.

The stack representation is encoded using a Stack-LSTM (Dyer et al., 2015). The configuration representation is the concatenation of the stack representation and the representation of the rightmost terminal in the stack. The next non-revealing transition is chosen by a two-layer feed-forward network.

If the reveal transition is triggered, the system needs to choose which among the candidate nodes X|$ to adjoin the right modifier X\X to. The number of these candidates can vary, so we cannot use a simple feed-forward network to choose among them. Instead, we use the mechanism of Pointer networks (Vinyals et al., 2015), which works in a similar way to attention (Bahdanau et al., 2014) except that the attention weights are interpreted as probabilities of selecting any particular node. Attention is computed over the representations of each candidate node.


                        Waiting time   Connectedness
Right-branching         4.29           5.01
Left-branching          2.32           3.15
Ambati et al. (2015)*   0.69           2.15
Revealing (ours)        0.46           1.72

Table 1: Train set measures of incrementality. *: taken from Ambati et al. (2015).

Because we expect that there could be some preference for attaching adjuncts high or low in the tree, we add to the context representation of each node two position embeddings (Vaswani et al., 2017) that encode the candidate node's height and depth in the current tree.
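The selection step can be pictured as follows; this is an illustrative sketch with plain arrays rather than the actual DyNet computation graph, and a dot product stands in for the learned attention score.

// Pointer-style selection over reveal candidates: each candidate representation is
// extended with its height and depth position embeddings, scored against the
// configuration vector, and the scores are normalised with a softmax so they can be
// read as probabilities of choosing each candidate.
object PointerChoice {
  type Vec = Array[Double]

  private def dot(a: Vec, b: Vec): Double = a.zip(b).map { case (x, y) => x * y }.sum

  def probabilities(configuration: Vec,
                    candidates: Seq[Vec],
                    heightEmb: Seq[Vec],
                    depthEmb: Seq[Vec]): Seq[Double] = {
    val scores = candidates.indices.map { i =>
      val context = candidates(i) ++ heightEmb(i) ++ depthEmb(i)
      // The configuration vector is assumed to have matching length; in the real
      // model this score comes from a learned attention function.
      dot(configuration, context)
    }
    val m    = scores.max
    val exps = scores.map(s => math.exp(s - m))   // numerically stable softmax
    val z    = exps.sum
    exps.map(_ / z)
  }
}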

We optimize for maximum log-likelihood on the training set, using only the most frequent supertags and the most important combinators. To avoid discarding sentences with rare supertags and type-changing rules, we use all supertags and combinatory rules during training but do not add their probability to the loss function. The number of supertags used is 425, as in the EasyCCG parser, and the combinatory rules used are the same as in the C&C parser. The loss is minimised for 15 epochs on the training portion of CCGbank (Hockenmaier and Steedman, 2007) using Adam with learning rate 0.001. Dimensionality is set to 128 in all cases, except for ELMo, which is set to 300. Dropout is applied only to the ELMo input, with a rate of 0.2. The parser is implemented in Scala using the DyNet toolkit (Neubig et al., 2017) and is available at https://github.com/stanojevic/rotating-ccg.

5 Experiments

5.1 How incremental is the Revealing algorithm?

To measure the incrementality of the proposed algorithm we use two evaluation metrics: waiting time and connectedness. Waiting time is the average number of nodes that need to be shifted before the dependency between two nodes is established. The minimal value for a fully incremental algorithm is 0 (the single shift that is always necessary is not counted). Connectedness is defined as the average stack size before a shift operation is performed (the initial two shifts are forced, so they are not taken into the average). The minimal value for connectedness is 1 (a small sketch of the connectedness computation is given after Table 2). We have computed these measures on the training portion of CCGbank for the standard non-incremental right-branching derivations, the more incremental left-branching derivations and our revealing derivations. We also include numbers for the previous revealing proposal of Ambati et al. (2015), taken from their paper, but these numbers should be taken with caution, because it is not clear from the paper whether the authors computed them in the same way and on the same portion of the dataset as we did. The results in Table 1 show that our revealing derivations are significantly more incremental, even in comparison to previous revealing proposals, and barely use more than the minimal amount of stack memory.

            heads  SMP  LF    UF    Sup. Tag
Left        yes    —    89.2  95.1  95.0
Right       yes    —    89.1  95.0  95.1
Revealing   no     yes  89.3  95.2  94.9
Revealing   yes    no   88.8  94.9  94.9
Revealing   yes    yes  89.5  95.4  95.1

Table 2: Development set F1 results with greedy decoding for CCG dependencies.
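As an illustration of the connectedness metric defined above, the following sketch recomputes it from an abstract trace of parser actions; the action names are ours, and the reveal transition is treated like a binary reduction because it also replaces two stack elements by one.

// Connectedness: average stack size just before each shift, skipping the first two
// (forced) shifts, following the definition in Section 5.1.
sealed trait Action
case object Shift        extends Action
case object ReduceUnary  extends Action
case object ReduceBinary extends Action
case object Reveal       extends Action   // attaches the top X\X into the element below it

object Connectedness {
  def of(trace: Seq[Action]): Double = {
    var stackSize  = 0
    var shiftsSeen = 0
    var total      = 0
    var measured   = 0
    for (a <- trace) a match {
      case Shift =>
        shiftsSeen += 1
        if (shiftsSeen > 2) { total += stackSize; measured += 1 }
        stackSize += 1
      case ReduceUnary           => ()              // top element replaced in place
      case ReduceBinary | Reveal => stackSize -= 1  // two elements become one
    }
    if (measured == 0) 1.0 else total.toDouble / measured
  }
}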

5.2 Which algorithm gives the best parsing results?

We have tested on the development set which of the parsing algorithms gives the best parsing accuracy. All the algorithms use the same neural architecture and training method, except for the revealing operations that require additional mechanisms to choose the node for revealing. This allows us to isolate machine learning factors and see which of the parsing strategies works the best.

There are two methods that are often used for evaluating CCG parsers. They are both based on "deep" dependencies extracted from the derivation trees. The first is from Clark et al. (2002) and is closer to the categorial grammar view of dependencies. The second is from Clark and Curran (2007) and is meant to be more formalism independent and closer to standard dependencies (Caroll et al., 1998). We opt for the first option for development, as we find it more robust and reliable, but we report both types on the test set.

Table 2 shows the results on the development set. The heads column shows whether the head-words representation is used for computing the representation of the nodes in the tree. The SMP column shows whether Selective Modifier Placement is used: whether we choose where to attach the right adjunct based only on the position embeddings or also on the node's lexical content. First, we can see that the Revealing approach that uses the head representation and does selective modifier placement outperforms all the other models, both on labelled and unlabelled dependencies. Ablation experiments show that SMP was the crucial component: without it the Revealing model is much worse. This is clear evidence that attachment heuristics are not enough, and also that previous approaches that extract only a single revealing option are sub-optimal.

Figure 5: Influence of beam size on the dev results (labelled F1 on the development set for beam sizes 1, 2, 4, 8 and 16, for the Revealing, Left-branching and Right-branching models).

A possible reason why the Revealing model works better than the Left- and Right-branching models is that the Left and Right models need to commit early on whether there will be a right adjunct in the future or not. If they make a mistake during greedy decoding there will be no way to repair that mistake. This is not an issue for the Revealing model because it can attach right adjuncts at any point and does not need to forecast them. A natural question then is whether these improvements of the Revealing model persist if we use a bigger beam. Figure 5 shows exactly that experiment. We see that the model that gains the most from a bigger beam is the Left-branching one, which is expected since that is the model that commits to its predictions the most: it commits with type-raising, unlike the Right model, and it commits to predicting right adjunction, unlike the Revealing model. With an increased beam the Left model equals the Revealing greedy model. But if all the models use the same beam, the Revealing model remains the best. An interesting result is that a small beam of size 4 is enough to get the maximal improvement. This probably reflects the low degree of lexical ambiguity that is unresolved at each point during parsing.

                            LF    UF    Sup. Tag
Lewis and Steedman (2014)   81.3  88.6  93.0
Ambati et al. (2015)        81.4  89.0  91.2
Hockenmaier (2003)          84.4  92.0  92.2
Zhang and Clark (2011)      85.5  —     93.1
Clark and Curran (2007)     87.6  93.0  94.3
Revealing (beam=1)          89.8  95.5  95.2
Revealing (beam=4)          90.2  95.8  95.4

Table 3: Test set F1 results for labelled and unlabelled CCG dependencies extracted using scripts from the Hockenmaier (2003) parser.

                            tri-train  ELMo  Dev LF  Test LF
Clark and Curran (2007)     —          —     83.8    85.2
Xu et al. (2016)            —          —     87.5    87.8
Lewis and Steedman (2014)   —          —     87.5    88.1
Vaswani et al. (2016)       —          —     87.8    88.3
Lee et al. (2016)           yes        —     88.4    88.7
Yoshikawa et al. (2017)     —          —     86.8    87.7
Yoshikawa et al. (2017)     yes        —     87.7    88.8
Yoshikawa et al. pc         yes        yes   90.4    90.4
Revealing (beam=1)          —          yes   90.8    90.5

Table 4: F1 results for labelled dependencies extracted with the generate program of the C&C parser (Clark and Curran, 2007).

5.3 Comparison to other published models

We compute test set results for our Revealing model and compare them to most of the previous results on CCGbank using both types of dependencies. Table 3 shows results with Clark et al. (2002) style dependencies. Here we get state-of-the-art results by a large margin, probably mostly thanks to the machine learning component of our parser. An interesting comparison to be made is against the EasyCCG parser of Lewis and Steedman (2014). This parser uses a neural supertagger whose accuracy is not too far from ours, but the dependencies extracted by our parser are much more accurate. This shows that the richer probabilistic model we use contributes more to the good results than the exact A* search that EasyCCG does with a more simplistic model. Another comparison of relevance would be with the revealing model of Ambati et al. (2015), but the comparison of the algorithms is difficult since the machine learning components are very different: Ambati uses a structured perceptron while our model is a heavily parametrized neural network.

Page 10: Edinburgh Research Explorer · mental sentence processing. CCG has a very flex-ible notion of constituent structure which allows (mostly) left-branching derivation trees that are

In Table 4 we show results with the second type of dependencies used for CCG evaluation. Here the comparison is slightly fairer because all the models, except Clark and Curran (2007), are neural and use external embeddings, although one could argue that the type of embeddings we use (ELMo) makes the main difference. Of the presented models only Xu et al. (2016) is transition based. All other models have a global search, either via CKY or A* search. Our revealing-based parser that does greedy search outperforms all of them, including those trained on large amounts of unlabelled data using semi-supervised techniques (Lee et al., 2016; Yoshikawa et al., 2017).

After completing our work, the authors of Yoshikawa et al. (2017) drew our attention to an unpublished augmented version of their parser that uses ELMo embeddings, which we mark as Yoshikawa et al. pc in Table 4. Our parser outperforms this version of the parser too, although by a smaller margin, which just shows the importance of ELMo embeddings. ELMo embeddings are an effective way of doing semi-supervised training because they extract useful information from unlabelled data. Yoshikawa's parser additionally uses tri-training as another form of semi-supervised training, one which more directly optimises the model for CCG parsing (unlike ELMo, which is trained for non-CCG objectives). Tri-training seems to be responsible for around 1% of F1 score, so if added to our model it could potentially increase the improvement over Yoshikawa's model in a comparable setting.

6 Other relevant work

Recurrent Neural Network Grammar (RNNG) (Dyer et al., 2016) is a fully incremental top-down parsing model. Because it is top-down it has no issues with right-branching structures, but right adjuncts would still make parsing more difficult for RNNG because they would have to be predicted even earlier than in Left- and Right-branching derivations in CCG.

Left-corner parsers (which can be seen as a more constrained version of the CCG left-branching parsing strategy) seem more psychologically realistic than top-down parsers (Abney and Johnson, 1991; Resnik, 1992; Johnson and Roark, 2000; Stanojevic and Stabler, 2018). Some proposals for handling right adjunction in left-corner parsing are based on an extension to generalized left-corner parsers (Demers, 1977; Hale, 2014) that can force some grammar rules (in particular right-adjunction rules) to be less incremental. Our approach does not decrease the incrementality of the parser in this way. On the contrary, having a special mechanism for right adjunction makes the parser both more incremental and more accurate.

Revealing based on higher-order unification by Pareschi and Steedman (1987) was also proposed by Steedman (1990) as the basis for the CCG explanation of gapping. The present derivation-based mechanism for revealing does not extend to gapping, and targets only derivations that can be explained with a standard CCG grammar derived from CCGbank. While that guarantees that we stay in the safe zone of sound and complete "standard" CCG derivations, it would be good future work to extend support to gapping and other types of derivations not present in CCGbank.

Niv (1993, 1994) proposed an alternative to the unification-based account of Pareschi and Steedman, similar to our proposal for online tree rotation. Niv's parser is mostly a formal treatment of left-to-right rotations evaluated against psycholinguistic garden paths, but it lacks the wide-coverage implementation and statistical parsing model as a basis for resolving attachment ambiguities.

7 Conclusion

We have presented a revealing-based incremental parsing algorithm that has special transitions for handling right adjunction. The parser is neutral with regard to the particular semantic representation used. It is computationally efficient, and can reveal all possible constituent types. It is the most incremental CCG parser yet proposed, and has state-of-the-art results against all published parsers under both dependency-recovery measures that are in use for the purpose.

Acknowledgments

This work was supported by the ERC H2020 Advanced Fellowship GA 742137 SEMANTAX grant.

References

Steven P. Abney and Mark Johnson. 1991. Memory requirements and local ambiguities of parsing strategies. Journal of Psycholinguistic Research, 20:233–249.


G. M. Adelson-Velskii and E. M. Landis. 1962. An algorithm for the organization of information. Soviet Mathematics Doklady, 3(2):263–266.

Anthony Ades and Mark Steedman. 1982. On the order of words. Linguistics and Philosophy, 4:517–558.

Bharat Ram Ambati, Tejaswini Deoskar, Mark Johnson, and Mark Steedman. 2015. An Incremental Algorithm for Transition-based CCG Parsing. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 53–63. Association for Computational Linguistics.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

John Caroll, Ted Briscoe, and Antonio Sanfilippo. 1998. Parser evaluation: a survey and a new proposal. In First International Conference on Language Resources & Evaluation: Granada, Spain, 28–30 May 1998, pages 447–456. European Language Resources Association.

Stephen Clark and James R. Curran. 2007. Wide-coverage efficient statistical parsing with CCG and log-linear models. Computational Linguistics, 33(4):493–552.

Stephen Clark, Julia Hockenmaier, and Mark Steedman. 2002. Building deep dependency structures with a wide-coverage CCG parser. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 327–334. Association for Computational Linguistics.

Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. 2009. Introduction to Algorithms, 3rd edition. The MIT Press.

Alan J. Demers. 1977. Generalized left corner parsing. In 4th Annual ACM Symposium on Principles of Programming Languages, pages 170–181.

Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-based dependency parsing with stack long short-term memory. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 334–343, Beijing, China. Association for Computational Linguistics.

Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. 2016. Recurrent neural network grammars. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 199–209.

Jason Eisner. 1996. Efficient normal-form parsing for Combinatory Categorial Grammar. In Proceedings of the 34th Annual Meeting on Association for Computational Linguistics, pages 79–86. Association for Computational Linguistics.

Alex Graves, Santiago Fernández, and Jürgen Schmidhuber. 2005. Bidirectional LSTM Networks for Improved Phoneme Classification and Recognition. In Proceedings of the 15th International Conference on Artificial Neural Networks: Formal Models and Their Applications – Volume Part II, ICANN'05, pages 799–804, Berlin, Heidelberg. Springer-Verlag.

Leo J. Guibas and Robert Sedgewick. 1978. A dichromatic framework for balanced trees. In Proceedings of the 19th Annual Symposium on Foundations of Computer Science, SFCS '78, pages 8–21, Washington, DC, USA. IEEE Computer Society.

John T. Hale. 2014. Automaton Theories of Human Sentence Comprehension. CSLI, Stanford.

Julia Hockenmaier. 2003. Data and models for statistical parsing with Combinatory Categorial Grammar. Ph.D. thesis, University of Edinburgh, College of Science and Engineering, School of Informatics.

Julia Hockenmaier and Yonatan Bisk. 2010. Normal-form Parsing for Combinatory Categorial Grammars with Generalized Composition and Type-raising. In Proceedings of the 23rd International Conference on Computational Linguistics, COLING '10, pages 465–473, Stroudsburg, PA, USA. Association for Computational Linguistics.

Julia Hockenmaier and Mark Steedman. 2007. CCGbank: a corpus of CCG derivations and dependency structures extracted from the Penn Treebank. Computational Linguistics, 33(3):355–396.

Mark Johnson and Brian Roark. 2000. Compact non-left-recursive grammars using the selective left-corner transform and factoring. In Proceedings of the 18th International Conference on Computational Linguistics, COLING, pages 355–361.

Kenton Lee, Mike Lewis, and Luke Zettlemoyer. 2016. Global Neural CCG Parsing with Optimality Guarantees. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2366–2376. Association for Computational Linguistics.

Mike Lewis and Mark Steedman. 2014. A* CCG parsing with a supertag-factored model. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 990–1000.

William Marslen-Wilson. 1973. Linguistic structure and speech shadowing at very short latencies. Nature, 244:522–523.


Graham Neubig, Chris Dyer, Yoav Goldberg, Austin Matthews, Waleed Ammar, Antonios Anastasopoulos, Miguel Ballesteros, David Chiang, Daniel Clothiaux, Trevor Cohn, Kevin Duh, Manaal Faruqui, Cynthia Gan, Dan Garrette, Yangfeng Ji, Lingpeng Kong, Adhiguna Kuncoro, Gaurav Kumar, Chaitanya Malaviya, Paul Michel, Yusuke Oda, Matthew Richardson, Naomi Saphra, Swabha Swayamdipta, and Pengcheng Yin. 2017. DyNet: The dynamic neural network toolkit. arXiv preprint arXiv:1701.03980.

Michael Niv. 1993. A Computational Model of Syntactic Processing: Ambiguity Resolution from Interpretation. Ph.D. thesis, University of Pennsylvania. IRCS Report 93-27.

Michael Niv. 1994. A psycholinguistically motivated parser for CCG. In Proceedings of the 32nd Annual Meeting on Association for Computational Linguistics, pages 125–132. Association for Computational Linguistics.

Chris Okasaki. 1999. Red-black trees in a functional setting. Journal of Functional Programming, 9(4):471–477.

Remo Pareschi and Mark Steedman. 1987. A Lazy Way to Chart-parse with Categorial Grammars. In Proceedings of the 25th Annual Meeting on Association for Computational Linguistics, ACL '87, pages 81–88, Stroudsburg, PA, USA. Association for Computational Linguistics.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proc. of NAACL.

Philip Resnik. 1992. Left-corner parsing and psychological plausibility. In Proceedings of the 14th International Conference on Computational Linguistics, COLING 92, pages 191–197.

Milos Stanojevic and Edward Stabler. 2018. A sound and complete left-corner parsing for minimalist grammars. In Proceedings of the Eight Workshop on Cognitive Aspects of Computational Language Learning and Processing, pages 65–74. Association for Computational Linguistics.

Mark Steedman. 1990. Gapping as constituent coordination. Linguistics and Philosophy, 13(2):207–263.

Mark Steedman. 2000. The Syntactic Process. MIT Press, Cambridge, MA.

Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1556–1566, Beijing, China. Association for Computational Linguistics.

Ashish Vaswani, Yonatan Bisk, Kenji Sagae, and Ryan Musa. 2016. Supertagging With LSTMs. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 232–237. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.

Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2692–2700. Curran Associates, Inc.

Wenhui Wang and Baobao Chang. 2016. Graph-based Dependency Parsing with Bidirectional LSTM. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2306–2315. Association for Computational Linguistics.

William Woods. 1973. An experimental parsing system for Transition Network Grammars. In Randall Rustin, editor, Natural Language Processing, pages 111–154. Algorithmics Press, New York.

Wenduan Xu. 2016. LSTM shift-reduce CCG parsing. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1754–1764.

Wenduan Xu, Michael Auli, and Stephen Christopher Clark. 2016. Expected F-measure training for shift-reduce parsing with recurrent neural networks. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics.

Wenduan Xu, Stephen Clark, and Yue Zhang. 2014. Shift-reduce CCG parsing with a dependency model. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 218–227.

Masashi Yoshikawa, Hiroshi Noji, and Yuji Matsumoto. 2017. A* CCG parsing with a supertag and dependency factored model. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 277–287. Association for Computational Linguistics.

M. Zaheer, S. Kottur, M. Ravanbakhsh, B. Poczos, R. Salakhutdinov, and A. Smola. 2017. Deep sets. In NIPS.


Yue Zhang and Stephen Clark. 2011. Shift-reduce CCG parsing. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies – Volume 1, pages 683–692. Association for Computational Linguistics.

