

Proceedings of the Joint Workshop on Multiword Expressions and WordNet (MWE-WN 2019), pages 79–91, Florence, Italy, August 2, 2019. ©2019 Association for Computational Linguistics


Without lexicons, multiword expression identification will never fly: A position statement

Agata Savary
University of Tours
[email protected]

Silvio Ricardo Cordeiro
Paris-Diderot University

[email protected]

Carlos Ramisch
Aix Marseille University, Université de Toulon, CNRS, LIS, Marseille
[email protected]

Abstract

Because most multiword expressions (MWEs), especially verbal ones, are semantically non-compositional, their automatic identification in running text is a prerequisite for semantically-oriented downstream applications. However, recent developments, driven notably by the PARSEME shared task on automatic identification of verbal MWEs, show that this task is harder than related tasks, despite recent contributions both in multilingual corpus annotation and in computational models. In this paper, we analyse possible reasons for this state of affairs. They lie in the nature of the MWE phenomenon, as well as in its distributional properties. We also offer a comparative analysis of the state-of-the-art systems, which exhibit particularly strong sensitivity to unseen data. On this basis, we claim that, in order to make strong headway in MWE identification, the community should bend its mind into coupling identification of MWEs with their discovery, via syntactic MWE lexicons. Such lexicons need not necessarily achieve a linguistically complete modelling of MWEs' behavior, but they should provide minimal morphosyntactic information to cover some potential uses, so as to complement existing MWE-annotated corpora. We define requirements for such a minimal NLP-oriented lexicon, and we propose a roadmap for the MWE community driven by these requirements.

1 Introduction

Multiword expression (MWE) is a generic term which encompasses a large variety of linguistic objects: compounds (to and fro, crystal clear, a slam dunk ‘an easily achieved victory’)1, verbal idioms (to take pains ‘to try hard’), light-verb constructions (to pay a visit), verb-particle constructions (to take off), institutionalized phrases (traffic light), multiword terms (neural network) and multiword named entities (Federal Bureau of Investigation). They all share the characteristic of exhibiting lexical, morphosyntactic, semantic, pragmatic and/or statistical idiosyncrasies (Baldwin and Kim, 2010). Most notably, they usually display non-compositional semantics, i.e. their meaning cannot be deduced from the meanings of their components and from their syntactic structure in a way deemed regular for the given language. Computational methods are, conversely, mostly compositional, therefore they often fail to model and process MWEs appropriately. Special, MWE-dedicated, treatment can be envisaged, provided that we know which parts of the text are concerned, i.e. we should be able to perform MWE identification.

1 Henceforth, we highlight in bold the lexicalized components of MWEs, i.e. those always realized by the same lexemes.

MWE identification (MWEI) consists in automatically annotating MWE occurrences in running text (Constant et al., 2017). In other words, we need to be able to distinguish MWEs (e.g. take pains) from regular word combinations (e.g. take gloves) in context. This task proves very challenging for some categories of MWEs, as evidenced by two recent PARSEME shared tasks on automatic identification of verbal MWEs (Savary et al., 2017; Ramisch et al., 2018). We claim that the difficulty of this task lies in the nature of idiosyncrasies that various categories of MWEs exhibit with respect to regular word combinations. Namely, whereas many constructions (e.g. named entities) have a good generalisation potential for machine learning NLP methods, other MWEs, e.g. verbal ones, are mostly regular at the level of tokens, so the generalisation power of mainstream machine learning is relatively weak for them. However, they are idiosyncratic at the level of types (sets of surface realizations of the same expression), therefore type-specific information, exploited by MWE discovery methods and encoded in lexicons, should be very helpful for MWEI.

This paper is a position statement based on an analysis of the state of the art in MWEI. We claim that, in order to make strong headway in MWEI, the community should bend its mind into coupling this task with MWE discovery via syntactic MWE lexicons. Such lexicons need not necessarily achieve a linguistically complete modelling of MWEs' behavior, but they should provide minimal morphosyntactic information to cover some potential uses, so as to complement existing MWE-annotated corpora. This also implies that, in building such lexicons, we can take advantage of the rich body of works dedicated to MWE discovery methods (Evert, 2005; Pecina, 2008; Seretan, 2011; Ramisch, 2015), provided that they are extended, so as to: (i) cover most syntactic types of MWEs, (ii) produce not only lists of newly discovered MWE entries but also their type-specific morphosyntactic properties.

The remainder of this paper is organized as follows. We discuss some linguistic properties of MWEs (Sec. 2) and state-of-the-art results (Sec. 3) relevant to our claims. We propose a scenario for coupling MWEI with MWE discovery via syntactic MWE lexicons (Sec. 6). Finally, we conclude by proposing a roadmap for the future efforts of the MWE community (Sec. 7).

2 The nature of MWEs

We propose to divide MWE categories roughly into two meta-categories, depending on the nature of the processes which provoke their lexicalization, that is, the assignment of conventional, fixed, non-compositional meanings. On the one hand, there are multiword named entities (NEs) and multiword terms, henceforth called sublanguage MWEs (SL-MWEs), whose form-meaning association is usually determined by sublanguage experts. Because such expert groups are more or less restricted and have dedicated nomenclature instruments (scientific publications, naming committees, etc.), and because technological domains and real-world entities to name develop rapidly, multiword terms and NEs strongly proliferate. On the other hand, general language MWEs (GL-MWEs)2 are coined by much larger communities of speakers via informal processes, and take longer to be established in a language. This proliferation speed property (henceforth referred to as Pprolif) is the first SL-MWE vs. GL-MWE discrepancy we are interested in.

2 The border between SL-MWEs and GL-MWEs is fuzzy, but this characterization is useful for our argumentation.

The second property (henceforth, Pdiscr) is the nature of discrepancies which statistically distinguish MWEs from regular word combinations. SL-MWEs exhibit peculiarities at the level of tokens (individual occurrences). For instance multiword NEs are usually capitalized and often contain, follow or precede trigger words (Bureau, river, Mr.). Multiword terms often contain words which are less likely in general than in technical language (neural). GL-MWEs, conversely, are mostly regular at the level of tokens (e.g. they use no capitalization, are rarely signaled by triggers, and contain common frequent words) but idiosyncratic at the level of types (sets of surface realizations of the same expression). For instance, to take pains ‘to try hard’ does not admit noun inflection (i.e. to take the pain cannot be interpreted idiomatically), while similar regular word combinations like to take gloves and to relieve pains have very similar meaning to their morphosyntactic variants to take the glove or to relieve the pain.

The third relevant property (Psim) is the component similarity among MWEs. A strong similarity, whether at the level of surface forms or at the level of semantics, often occurs between components of different SL-MWEs. For instance, new multiword terms are often created by modification or specialization of previously existing ones (neural network, neural net, recurrent neural network, neural network pushdown automata, etc.). Also, many types of NEs come in series in which some components are identical and some others vary within a given semantic class, e.g. American/Brazilian/French/Ethiopian Red Cross, Nigerian Red Cross Society, Iranian/Iraki Red Crescent Society, Saudi Red Crescent Authority. In GL-MWEs, the degree of Psim depends on the category. It is stronger in light-verb constructions, i.e. verb-noun combinations in which the verb is semantically void or bleached, and the noun is predicative3, as in to make a decision and to pay a visit. Many light-verb constructions are similar to each other because of the predicative nature of the nouns but also because they contain one of the few very frequent light verbs like make, take, etc. (Savary et al., 2018). Note, however, that these verbs are also highly frequent in regular constructions, i.e. Psim is moderate but Pdiscr is still restricted to the level of types. Component similarities are weaker among inherently reflexive verbs, like (PL) znaleźć się ‘find oneself’. On the one hand, inherently reflexive verbs always contain a (mostly uninflected) reflexive clitic (here: się) governed by a verb. On the other hand, semantically similar verbs do not systematically form inherently reflexive verbs, e.g. (PL) wyszukać ‘find’ is a synonym of znaleźć ‘find’ but *wyszukać się ‘find oneself’ is ungrammatical. Finally, verbal idioms, which cover diverse syntactic structures, are largely dissimilar to each other but similar to regular constructions, e.g. to take pains ‘to try hard’ is a MWE but to take aches is not.

3 A noun is predicative if it has at least one semantic argument, according to the PARSEME guidelines (http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.1).

The fourth property (Pambig) is the very low ambiguity of word combinations appearing in MWEs. These combinations are ambiguous because they can occur both with idiomatic and with literal readings, as in examples (1) vs. (2) below. Ambiguity is considered one of the major challenges posed by MWEs in NLP (Constant et al., 2017). However, recent work (Savary et al., 2019) shows that, although most combinations of MWEs' components could potentially be used literally, they are rarely used so in corpora. Namely, in 5 languages from different language genera, the idiomaticity rate of verbal MWEs, i.e. the proportion of idiomatic occurrences with respect to the total number of idiomatic and literal occurrences, ranges from 0.96 to 0.98. This means that, whenever the morphosyntactic conditions for an idiomatic reading are fulfilled, this reading occurs almost always. A similarly high idiomaticity rate (0.95) was also observed for Polish on other, non-verbal categories of MWEs: nominal, adjectival, and adverbial GL-MWEs, as well as multiword NEs (Waszczuk et al., 2016). This property might be related to the fact that ambiguity is reduced with the addition of words to the context, a hypothesis that has been employed in word-sense disambiguation for many years (Yarowsky, 1993).

(1) We often took pains not to harm them. ‘We often tried hard not to harm them.’

(2) I could not take the pain any longer.
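For reference, the idiomaticity rate discussed above is simply the following ratio (our notation, computed per MWE category and language):

\[ \textrm{idiomaticity rate} = \frac{\#\,\textrm{idiomatic occurrences}}{\#\,\textrm{idiomatic occurrences} + \#\,\textrm{literal occurrences}} \]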

Finally, the fifth property (Pzipf) we are interested in is the Zipfian distribution of MWEs. As with most language phenomena, few MWE types occur frequently in texts, and there is a long tail of MWEs occurring rarely (Ha et al., 2002; Ryland Williams et al., 2015). The success of machine learning generalization relies on dealing with rare or unseen events, based on their similarity with frequent ones. Such similarity is hard to define for the heterogeneous phenomena included under the MWE denomination.

3 State of the art in MWE identification

In this section we offer a comparative analysis of state-of-the-art results with respect to two axes: SL-MWEs vs. GL-MWEs and seen vs. unseen data. All results are indicated in terms of the F1-measure, with the exact-match metric. In other words, a prediction for a text fragment is considered correct only when the identified unit corresponds to exactly the same words as in the gold standard.4 For most SL-MWE results, the F1-measure additionally accounts for categorisation, i.e. a correctly identified span of words must also be assigned the correct NE category.
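As an illustration, the exact-match (MWE-based) metric described above can be computed in a few lines; this is only a sketch under our own representation, in which every gold or predicted MWE is a set of token positions, and it ignores the categorisation component used for most SL-MWE results.

from typing import FrozenSet, Set, Tuple

Span = FrozenSet[Tuple[int, int]]  # a MWE = set of (sentence_id, token_id) pairs

def exact_match_f1(gold: Set[Span], pred: Set[Span]) -> float:
    # A predicted MWE counts as correct only if its token positions
    # coincide exactly with those of a gold MWE.
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {frozenset({(1, 2), (1, 5)}),   # e.g. "took ... pains"
        frozenset({(2, 0), (2, 1)})}
pred = {frozenset({(1, 2), (1, 5)}),   # exact match
        frozenset({(2, 0)})}           # partial span: counts as wrong
print(exact_match_f1(gold, pred))      # 0.5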

3.1 Identification of sublanguage MWEs

For SL-MWEs, identification methods have been developed for decades, but most often fuse multiword objects with single-word ones. Two typical examples are NE recognition and term identification. In these two domains, state-of-the-art results have been encouraging or good already in early systems and evaluation campaigns.

In the CoNLL 2002 and 2003 shared tasks on NE recognition (Tjong Kim Sang, 2002; Tjong Kim Sang and De Meulder, 2003), dedicated mainly to person, organization and location names, the top-3 systems obtained F1-measures of 0.71, 0.74, 0.77, and 0.86, with datasets of 20,000, 13,000, 18,000 and 35,000 annotated NEs, for German, Dutch, Spanish and English, respectively. All of these systems used machine learning techniques such as hidden Markov models, decision trees, MaxEnt classifiers, conditional random fields, support-vector machines, recurrent neural networks, with features that often included external entity list lookup.5 Yadav and Bethard (2018) provide more recent state-of-the-art results for NE recognition based on neural networks on the same datasets. There, the best results mostly exceed 0.78 for German, 0.85 for Dutch and Spanish, and 0.9 for English, even without external dictionary lookup. In Slavic languages, where NE recognition is substantially hardened by the rich declension of nouns and adjectives, stable benchmarking data are still missing.6 Sample results can be cited in Polish, where relatively rich NE-annotated corpora and lexicons are available. Reference tools achieve the F1-measure of 0.71 (Marcinczuk et al., 2017) and 0.77 (Waszczuk et al., 2013) with methods based on conditional random fields.

4 The same metric is called MWE-based, as opposed to token-based, in the PARSEME shared task campaigns.

5 Results of the same systems without external entity list lookup are not provided.

As for term identification, several domain-specific benchmarking datasets allowed for system development and comparison. For instance, the best systems for biomedical term identification obtain F1-measures of about 0.81, 0.85 and 0.88 on disorder, chemical and gene/protein names, respectively (Campos et al., 2012).

While single-word and multiword NEs and terms are fused in the above results, good hints exist that the results on multiword NEs and terms are comparable or better than results on single-word items. Firstly, the majority of NEs and terms in corpora consist of several words. For instance, in the 110,000-token English Wiki50 corpus (Vincze et al., 2011), around 65% of annotated NEs and terms consist of at least 2 words. Also in the JNLPBA and i2b2 shared tasks on biomedical and medical NE recognition, 55% and 58%, respectively, of all terms are multiword terms (Campos et al., 2012). Secondly, some NE recognition efforts were explicitly dedicated to boosting performance for multiword NEs and terms. For instance, Downey et al. (2007) achieve F1=0.74 on the recognition of multiword named entities in a web corpus with a very simple system based on n-gram statistics. A baseline system using bidirectional recurrent neural networks (BiLSTM) by Campos et al. (2012) achieves F1-measures of 0.74 and 0.81 on bigrams, which are the most frequent multiword terms in the i2b2 and JNLPBA corpora.

3.2 Identification of general-language MWEs

Within GL-MWEs, multilingual benchmarking data are available mainly for verbal MWEs via editions 1.0 and 1.1 of the PARSEME shared tasks (Savary et al., 2017; Ramisch et al., 2018). In edition 1.1, the scores (across 19 languages) for the top-3 systems range from 0.5 to 0.58. The per-language scores vary greatly due to corpus size variety and typological differences between languages. Table 1 shows the corpus sizes and the best system F1-measure for the 6 languages whose corpora contain at least 5,000 annotated verbal MWEs.7 The results of the best systems, with and without neural networks, never exceed 0.68, with the exception of Romanian, which has a low percentage of unseen data in the test corpus.

6 In the first shared task on NE recognition in Balto-Slavic languages (Piskorski et al., 2017), only test data but no annotated training data were published.

                   BG     FR     PL     PT     RO     TR
#verbal MWEs      6.7K   5.7K   5.2K   5.5K   5.9K   7.1K
unseen ratio       .33    .50    .28    .28    .05    .75
Best non-NN F1     .63    .56    .67    .62    .83    .45
Best NN F1         .66    .61    .64    .68    .87    .59

Table 1: Sizes of the corpora (in thousands of annotated verbal MWEs), the ratio of unseen verbal MWEs in the test corpora and the best system performance, without (non-NN) and with neural networks (NN), in the PARSEME shared task 1.1 for the 6 languages with the largest corpora.

These results are not directly comparable to those from Sec. 3.1 because evaluation measures partly differ (e.g., NE recognition includes categorisation), the sets of languages hardly overlap, and corpus sizes are largely below those of the CoNLL corpora.8 Still, it is clear that MWEI is a particularly hard problem and it is important to understand the vulnerabilities (if any) of current approaches.

3.3 Challenges of unseen data

The PARSEME shared task 1.1 introduced phenomenon-specific evaluation measures which focus on known challenges posed by MWEs.

7 Hungarian is left out because its corpus consists of specialized law texts. Language codes in the tables are: Bulgarian (BG), French (FR), Polish (PL), Portuguese (PT), Romanian (RO), Turkish (TR).

8 The PARSEME shared task 1.0 results for Czech, with 12,000 annotated verbal MWEs, come up to F1 = 0.72 with a non-neural system. This might be comparable to the CoNLL-2002 results for Dutch, with 13,000 annotated NEs and the top F1-measure of 0.74 for a non-neural system. However, as many as 69% of the annotated verbal MWEs in the Czech corpus are inherently reflexive verbs (IRVs), such as se bavit ‘amuse oneself’ ⇒ ‘play’, which are relatively easy to predict due to the moderate strength of Psim. The Czech corpus was not annotated from scratch but converted from a previously annotated resource, and inherently reflexive verbs are probably over-represented there. The rate of inherently reflexive verbs in other Slavic languages in the PARSEME corpora ranges from 0.3 to 0.48.


Thus, results were reported separately for continuous vs. discontinuous, multi-token vs. single-token, seen vs. unseen, and identical-to-train vs. variant-of-train verbal MWEs.9 The most dramatic performance differences appear in the seen vs. unseen opposition. A verbal MWE from the corpus is considered seen if another verbal MWE with the same multiset of lemmas is annotated at least once in the training corpus. For instance, given the occurrence of has a new look in the training corpus, the following verbal MWEs from the test corpus would be considered:

• seen: has a new look, had an appealing look, has a look of innocence, the look that he had

• unseen: has a look at this report, gave a look to the book, walk that he had, took part, etc.
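The seen/unseen distinction above is purely a matter of comparing multisets of lemmas. A minimal sketch of this check follows (the toy representation is ours, assuming the lemmatized lexicalized components of each annotated verbal MWE are available):

from collections import Counter

def lemma_multiset(lemmas):
    # Order-insensitive, duplicate-aware key for a MWE occurrence.
    return frozenset(Counter(lemmas).items())

# Lemmas of the lexicalized components of training-corpus verbal MWEs.
train_keys = {lemma_multiset(["have", "look"])}   # from "has a new look"

def is_seen(lemmas, train_keys=train_keys):
    return lemma_multiset(lemmas) in train_keys

print(is_seen(["have", "look"]))   # True:  "had an appealing look" is seen
print(is_seen(["have", "walk"]))   # False: "walk that he had" is unseen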

Tab. 2 shows the PARSEME shared task 1.1 results achieved on seen and unseen data for 3 of the 6 previously analysed languages. French and Turkish were left out since no lemmas are provided for 20-30% of their test data. Romanian is skipped because only 5% of its test corpus corresponds to unseen data. We focus on the overall best systems in the closed and open track10: TRAVERSAL (Waszczuk, 2018) and SHOMA (Taslimipoor and Rohanian, 2018). The former applies sequential conditional random fields extended to tree structures, while the latter feeds word embeddings to convolutional and recurrent neural networks, which are given to a decision layer based on conditional random fields. On unseen data in the 3 languages under study, TRAVERSAL's score never exceeds 0.20, and the performance is 3.9 (for Portuguese) to 6.1 (for Bulgarian) times worse than on seen data. SHOMA's generalization power is greater: it achieves a score of 0.18 (for Polish) to 0.31 (for Bulgarian and Portuguese) on unseen data, which is still 2.5 (for Portuguese) to 4.6 (for Polish) times worse than for seen expressions.

It is also interesting to see which unseen verbal MWE categories have been correctly identified by both systems. Tab. 2 reveals that generalization is the strongest for inherently reflexive verbs and light-verb constructions, likely due to the moderate inter-MWE component similarity (Psim) discussed in Sec. 2.

9 http://multiword.sourceforge.net/sharedtaskresults2018

10 In the closed track, systems are only allowed to use the provided training/development data. In the open track, they can additionally use external resources (lexicons, word embeddings, language models trained on external data, etc.).

Still, it is far below the generalization power in SL-MWEs (see below), probably because Pdiscr is related to types but not tokens.

As far as SL-MWE identification is concerned, we are aware of only one study explicitly dedicated to the impact of unseen data. Namely, Augenstein et al. (2017) compare the performance of 3 state-of-the-art named-entity recognition tools on 19 NE-annotated datasets in English. For the CoNLL corpora cited in Sec. 3.1, the scores achieved on unseen data range from 0.81 to 0.94. The scores for out-of-domain unseen data are significantly lower but still exceed 0.61 for the 2 best systems. Unseen NEs are defined in this study as those with surface forms present only in the test, but not in the training data, which differs from the PARSEME shared task 1.1 definition (where data with different surface forms are considered seen if they have seen multisets of lemmas). Still, morphosyntactic variability in English NEs should be relatively low, therefore we may safely deduce that MWEI on unseen data performs significantly better on SL-MWEs in a morphologically-poor language than on GL-MWEs in morphologically-rich languages. We believe that this is more related to the SL-MWE vs. GL-MWE distinction than to typological differences between languages.11

To conclude, the challenges posed by unseen data to MWEI seem significantly harder for GL-MWEs than for SL-MWEs. We attribute this fact to the different nature of the two phenomena. SL-MWEs differ from regular word combinations at the level of tokens (Pdiscr) and exhibit strong similarities among components (Psim). These properties can be leveraged by machine learning tools, whether supervised (e.g. using character-level features or word embeddings, to account for surface and semantic similarity of NE components, respectively) or unsupervised (e.g. based on contrastive measures for terms), notably to generalize over unseen data. Conversely, GL-MWEs are mostly idiosyncratic at the level of types but not tokens (Pdiscr) and show moderate or weak component similarities (Psim). These characteristics are hard to tackle by systems which model MWEI as a tagging problem, except if features based on type-specific idiosyncrasies are used.

11 PARSEME shared task 1.1 results for identical-to-train vs. variant-of-train items, presented in the next section, corroborate this intuition: TRAVERSAL and SHOMA handle morphosyntactic variability much better than lexical novelty.


                            BG                        PL                        PT
                    IRV  LVC  VID  All        IRV  LVC  VID  All        IRV  LVC  VID  All
TRAVERSAL  seen     .89  .63  .55  .76        .92  .76  .57  .85        .89  .77  .69  .78
           unseen   .26  .06  .07  .13        .26  .20  .04  .17        .12  .25  .07  .20
SHOMA      seen     .92  .65  .58  .78        .90  .69  .58  .82        .86  .88  .84  .87
           unseen   .59  .21  .10  .31        .24  .19  .04  .18        .42  .35  .08  .31

Table 2: PARSEME shared task 1.1 identification scores on seen and unseen data for TRAVERSAL and SHOMA. Verbal MWE categories are inherently reflexive verbs (IRVs), light-verb constructions (LVCs) and verbal idioms (VIDs).

The few token-specific hints (if any) which may help such systems generalize over unseen data are mostly limited to the presence of particular light verbs or function words. Their role resembles the one of trigger words and nested entities in NE recognition (Sec. 2), but, differently from the latter, they are also highly frequent in regular constructions, which hinders their discriminative power for GL-MWEs.

3.4 Progress potential in seen data

Since unseen GL-MWEs prove drastically hard to identify, it is interesting to understand how much progress might be achieved on seen data. We believe that this potential for improvement is relatively high due to several factors.

Firstly, the low effective ambiguity of MWEs (Pambig) means that identifying morphosyntactically well-formed combinations of previously seen MWE components constitutes a strong baseline for MWEI. For instance, Pasquer et al. (2018b) propose a very simple baseline for verb-noun MWE identification in which previously seen verb-noun pairs are tagged as MWEs as soon as they have the same lemmas as a seen MWE and maintain a direct dependency relation, whatever the label and direction of this dependency. This very simple method achieves F1=0.88 on French. A comparable result was observed in the 2016 DiMSUM shared task (Schneider et al., 2014), in which a rule-based baseline was ranked second. This system extracted MWEs from the training corpus and then annotated them in the test corpus based on lemma/part-of-speech matching and heuristics such as allowing a limited number of intervening words (Cordeiro et al., 2016).
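The Pasquer et al. (2018b) baseline mentioned above essentially reduces to a lexicon lookup plus a dependency test. The following sketch conveys its spirit only; the token representation, names and toy data are ours, not the authors':

def seen_pair_baseline(sentence, seen_pairs):
    # Tag a verb-noun pair as a MWE if (a) its lemma pair was annotated as a
    # MWE in the training data and (b) the two tokens are linked by a direct
    # dependency, in either direction and whatever its label.
    predictions = []
    for i, lem_i, pos_i, head_i in sentence:
        for j, lem_j, pos_j, head_j in sentence:
            if pos_i == "VERB" and pos_j == "NOUN":
                directly_linked = head_j == i or head_i == j
                if directly_linked and frozenset({lem_i, lem_j}) in seen_pairs:
                    predictions.append((i, j))
    return predictions

seen_pairs = {frozenset({"take", "pain"})}               # from the training corpus
sent = [(1, "he", "PRON", 2), (2, "take", "VERB", 0),
        (3, "great", "ADJ", 4), (4, "pain", "NOUN", 2)]  # (id, lemma, POS, head id)
print(seen_pair_baseline(sent, seen_pairs))              # [(2, 4)]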

Secondly, there is a large gap to bridge for seen data whose surface form is not identical to the ones seen in train. Tab. 3 shows that, indeed, the difference between identical-to-train and variant-of-train scores ranges from 0.12 (in Polish for TRAVERSAL and Portuguese for SHOMA) to 0.37 (in Bulgarian for SHOMA). At the same time, Pasquer et al. (2018a) show that morphosyntactic variability, relatively high in verbal MWEs, can be neutralized with dedicated methods. Namely, co-occurrences of previously seen MWE components can be effectively recognized by a Naive Bayes classifier, with features leveraging type-specific idiosyncrasies (Pdiscr). This method scored the best in the PARSEME shared task 1.1 for Bulgarian, even if it was restricted to the seen data only.

                                   BG     PL     PT
TRAVERSAL  identical to train     .85    .92    .87
           variants of train      .55    .80    .72
SHOMA      identical to train     .89    .95    .93
           variants of train      .52    .71    .81

Table 3: PARSEME shared task 1.1 identification scores on identical-to-train and variant-of-train data for TRAVERSAL and SHOMA.

Thirdly, significant progress can also be achieved if another important challenge is explicitly addressed: discontinuity of verbal MWEs. For instance, Rohanian et al. (2019) employ neural methods combining convolution and self-attention mechanisms and obtain impressive improvements over the best PARSEME shared task systems.

Finally, not only annotated training corpora but also MWE lexicons can provide information about seen data. The next two sections describe the state of the art in lexical description of MWEs, and integration of MWE lexicons in NLP methods.

4 Lexicons of MWEs

Describing MWEs in dictionaries dedicated to human users has a long-standing lexicographic tradition, but its synergies with NLP have not been straightforward (Gantar et al., 2018). More formal linguistic modeling of MWEs has also been carried out for decades, notably in the frameworks of the Lexicon Grammar (Gross, 1986) and of the Explanatory Combinatorial Dictionary (Mel'cuk et al., 1988; Pausé, 2018). These approaches assume that units of meaning are located at the level of elementary sentences (predicates with their arguments) rather than of words, and MWEs, especially verbal, are special instances of predicates in which some arguments are lexicalized. Those works paved the way towards systematic syntactic description of MWEs, but suffered from insufficient formalization and required substantial accommodation to be applicable to NLP (Constant and Tolone, 2010; Lareau et al., 2012).

With the growing understanding of the challenges which MWEs pose to NLP, a large number of (fully or partly) NLP-dedicated lexicons have been created for many languages (Losnegaard et al., 2016). These resources can be classified notably along 3 axes, according to (i) the account of the morpho-syntactic structure of a MWE and its variants, (ii) lexicon-corpus coupling, (iii) number of entries.

Along axis (i), there is a gradation in the complexity of the related formalisms. The simplest are raw lists of MWEs, sometimes accompanied with selected morphosyntactic variants, collected from large corpora or automatically generated (Steinberger et al., 2011).

More elaborate are approaches based on finite-state-related formalisms. They usually indicate the morphological categories and features of individual MWE components, and offer rule-based combinatorial description of their variability patterns (Karttunen et al., 1992; Breidt et al., 1996; Oflazer et al., 2004; Silberztein, 2005; Krstev et al., 2010; Al-Haj et al., 2014; Lobzhanidze, 2017; Czerepowicka and Savary, 2018). They mostly cover continuous (e.g. nominal) MWEs in which morphosyntactic phenomena remain local (Savary, 2008). Therefore, additionally to the intentional format, i.e. rules describing the analysis and production of MWE instances, they often come with an extensional format, which stores the MWE instances (inflected forms) themselves. Plain-text extensional lists can be straightforwardly matched against a text. Such finite-state frameworks do not account for deep syntax and for interactions of MWE lexicalized components with external elements. Therefore, they are not well adapted to verbal MWEs.

Finally, there exist syntactic lexicons in which MWEs are most often covered jointly with single words. On the one hand, there are approaches meant to be theory-neutral (Grégoire, 2010; Przepiórkowski et al., 2017; McShane et al., 2015), i.e. they implicitly assume the existence of regular grammar rules, and explicitly describe only those MWE properties which do not conform to these rules. Although these lexicons suffer from insufficient formalization (Lichte et al., 2019), they could be successfully applied to parsing after ad hoc conversion to particular grammar formalisms. On the other hand, some approaches accommodate some types of MWEs directly in the lexicons of computational grammars within particular grammatical frameworks: head-driven phrase structure grammar (Sag et al., 2002; Copestake et al., 2002; Villavicencio et al., 2004; Bond et al., 2015; Herzig Sheinfux et al., 2015), lexical functional grammar (Attia, 2006; Dyvik et al., 2019), tree-adjoining grammar (Abeillé and Schabes, 1989, 1996; Vaidya et al., 2014; Lichte and Kallmeyer, 2016), and dependency grammar (Diaconescu, 2004).

Along axis (ii), most recent approaches are usually coupled with corpora, but to a different degree. PDT-Vallex (Urešová, 2012) is a Czech valency dictionary fully aligned with the Prague Dependency Treebank, i.e. new frames were added as they were encountered during manual annotation of the corpus. These frames are also linked to their corpus instances. Similarly, SemLex (Bejcek and Stranák, 2010) is a MWE lexicon bootstrapped from pre-existing dictionaries (not necessarily corpus-based) and further developed hand-in-hand with the PDT annotation. It contains syntactic structures of MWE entries to which corpus occurrences are linked. In Walenty (Przepiórkowski et al., 2014), a Polish valency dictionary, the initial set of entries stems from pre-existing single-word e-dictionaries, which were then extended to MWEs and described as exhaustively as possible as to their valency frames. All frames are documented with attested examples, preferably but not necessarily from the National Corpus of Polish. In DUELME (Grégoire, 2010), a Dutch MWE lexicon, all MWEs were automatically acquired from a large raw corpus on the basis of a short list of morpho-syntactic patterns. Lexicon entries contain example sentences illustrating the use of MWEs. Finally, when MWEs were directly accommodated in implemented formal grammars, the choice of MWEs to model is rarely documented but was probably motivated by a possibly high syntactic and semantic variety of constructions rather than by corpus frequencies, even if attested examples support the grammar engineering.

Along axis (iii), the sizes of the existing MWE lexical resources vary greatly, from several dozen to several tens of thousands of MWE entries. This coverage is often inversely correlated with the richness and precision of the linguistic description.

5 MWE lexicons in MWE identification

Handcrafted MWE lexicons, such as those addressed in the previous section, can significantly enhance MWEI. In sequence tagging MWEI methods, such resources can be used as sources of lexical features (Schneider et al., 2014). In parsing-based approaches they may serve as a basis for word-lattice representation of an input sentence, in which the compositional vs. MWE interpretation of a word sequence is represented jointly (Constant et al., 2013). The impact of lexical resources on MWEI is explicitly addressed by Riedl and Biemann (2016). Using a CRF-based MWEI system, they show that the addition of an automatically discovered lexicon of MWEs can benefit MWEI quality.
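As a rough illustration of the first option, a lexicon can be turned into token-level features for a sequence tagger by marking the tokens covered by a lexicon match. The sketch below is deliberately naive (it assumes a lemmatized sentence and contiguous, uninflected matches, whereas real MWEs are often discontinuous and variable), but it shows the kind of signal a CRF-style identifier can consume:

def lexicon_features(lemmas, lexicon):
    # lexicon: a set of MWE entries, each a tuple of component lemmas.
    feats = [{"in_lexicon_match": False} for _ in lemmas]
    for entry in lexicon:
        n = len(entry)
        for start in range(len(lemmas) - n + 1):
            if tuple(lemmas[start:start + n]) == entry:
                for k in range(start, start + n):
                    feats[k]["in_lexicon_match"] = True
    return feats

lexicon = {("traffic", "light"), ("take", "pain")}
lemmas = ["the", "traffic", "light", "turn", "red"]
print(lexicon_features(lemmas, lexicon))   # tokens 1 and 2 are flagged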

The systems competing in the PARSEME shared tasks used lexical resources to a much lesser degree. In both editions only one, rule-based, system applied a MWE lexicon, for French in edition 1.0 and for English, French, German and Greek in edition 1.1 (Nerima et al., 2017). Other systems, even those from the open track, employed only one type of external resource, namely word embeddings, but no MWE lexicons. This is probably due mainly to the fact that the competition was meant to promote cross-lingual methods, but few or no MWE lexical frameworks offer large MWE lexicons for many languages. The resources covered by the (Losnegaard et al., 2016) survey are numerous and cover at least 19 languages, but their formats are not uniform so MWE identifiers cannot easily integrate them. Another reason might be that the complex constraints imposed by MWEs, especially verbal ones, call for complex formalisms, whose expressive power is hard to accommodate with mainstream machine learning methods. Still, current MWE identifiers are able to benefit from rich joint syntactic and MWE annotation, notably to neutralize variability (cf. Sec. 3).

6 Towards syntactic lexicons for MWE identification

As discussed in Sec. 2, MWEs exhibit a Zipfian distribution (Pzipf), which means that the power to generalize over unseen data is crucial for high-quality MWEI. However, as seen in Sec. 3, current MWEI methods badly fail on unseen data. At the same time, performance on seen items can be very high if morphosyntactic variability is appropriately accounted for.

The straightforward idea is then to maximize the quantity of the seen data. This proposal is of course trivial with respect to most learning problems in NLP. But we believe that its applicability is particularly relevant in the domain of GL-MWE identification for at least four reasons. Firstly, there is a particularly acute discrepancy between the performance on seen vs. unseen data, as discussed in Sec. 3, so the potential of the gain in this respect is huge. Secondly, unsupervised discovery of (previously unseen) MWEs has a rich bibliography and proves particularly effective when type-specific idiosyncrasies are exploited (Pdiscr), for instance, in verb-noun idiom discovery (Fazly et al., 2009). Thirdly, the low effective ambiguity of word combinations occurring in MWEs (Pambig) implies scarcity of naturally occurring negative examples. Therefore, the Zipfian distribution (Pzipf) can be partly balanced, with minor bias, by complementing a (small) annotated corpus with several minimal positive occurrence examples for lower-frequency MWEs discovered in very large corpora by unsupervised methods. Fourthly, the relatively low proliferation speed (Pprolif) of GL-MWEs makes them good candidates for large-coverage lexical encoding. Thus, it should be possible to produce relatively stable and high-quality lexical resources via manual validation of unsupervised discovery methods.

The conclusions from Sec. 4 and 5 also speak in favor of the use of lexical MWE resources in MWEI, especially if they are offered in a unified format for many languages, and if they carry information similar to what can be found in treebanks.

These observations lead us to propose the following scenario for future development in MWEI.

• Automatic identification of GL-MWEs should be systematically coupled with MWE discovery via syntactic lexicons.

• In such lexicons, for each MWE type, one should be able to retrieve at least: (i) the lemmas and parts of speech of its lexicalized components, (ii) its syntactically least marked dependency structure preserving the idiomatic reading (Savary et al., 2019),12 (iii) the description of some of its morphosyntactic variants13 preserving the idiomatic reading, e.g. those judged most frequent or most discriminating.

• If the lexicon is stored in an intentional format, it should be distributed with its extensional equivalent. The simplest form of an extensional format is a set of corpus examples for each MWE entry, with syntactic and MWE annotation.

• The extensional format should be compatible with standard corpus formats,14 so as to require minimal effort from corpus-based tools in completing the existing corpora with the lexicon examples.

• The lexicon should encode with high priority those MWEs which occur rarely or never in the reference corpora, i.e. the corpora annotated for MWEs and used for training MWE identifiers. This is in sharp contrast to the existing NLP-oriented MWE lexicons more or less strongly coupled with reference corpora (see Sec. 4).

Note that exhaustiveness of this description, and notably of the morphosyntactic variation, is not required. This feature should make the lexical encoding adventure relatively feasible, with the help of fully and/or semi-automatic methods.
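To make these requirements concrete, a minimal entry could look as follows. This is only an illustration of points (i)-(iii) and of the extensional part; the field names and the serialization are ours and do not correspond to any existing lexicon format:

entry = {
    "components": [                      # (i) lexicalized lemmas and parts of speech
        {"lemma": "take", "upos": "VERB"},
        {"lemma": "pain", "upos": "NOUN"},
    ],
    "canonical_structure": {             # (ii) least marked dependency structure
        "tokens": ["take", "pains"],
        "deps": [("pains", "obj", "take")],
    },
    "variants": [                        # (iii) selected morphosyntactic variants
        {"pattern": "plural object, no determiner", "example": "took pains"},
    ],
    "examples": [                        # extensional part: annotated corpus examples
        "We often took pains not to harm them.",
    ],
}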

7 Roadmap

To complement the proposal of MWE discovery/identification interface from the previous section, we suggest that the MWE community should more thoroughly address the challenges posed to MWEI by unseen data. In the short run, future shared tasks on MWEI might, for instance, propose subtasks dedicated specifically to unseen data.

12 A form with a finite verb is less marked than one with an infinitive or a participle, a non-negated form is less marked than a negated one, the active voice is less marked than the passive, a form with an extraction is more marked than without, etc.

13 Following (Savary et al., 2019), we understand a variant of a given MWE as a set of all its occurrences sharing the same coarse syntactic structure, i.e. the same lexicalized lemmas, POS and dependency relations.

14 PARSEME corpora for verbal MWEs use an extension of the CoNLL-U format (https://universaldependencies.org/format.html) called cupt (http://multiword.sourceforge.net/cupt-format).

New MWEI tools may leverage the type-specific idiosyncrasy of MWEs (Pdiscr), so as to achieve better generalization over unseen data.

The community should also put more effort into the development of large-coverage syntactic MWE lexicons. To this end, the MWE discovery task should be redefined so that not only bare lists of MWE candidates but also their syntactic structures for at least some morphosyntactic variants are extracted (Weller and Heid, 2010). Many existing discovery methods are dedicated to selected MWE categories, syntactic patterns and languages. New methods should, conversely, be more generic so as to cover the large variety of MWE categories and adapt to many languages. In order to incrementally achieve high quality for such resources (e.g. via manual validation), MWE discovery should not be performed from scratch, but should take as input and enrich existing MWE lexicons. MWE discovery evaluation measures should explicitly account for this enrichment aspect.

Steps should also be taken towards defining MWE lexicon formats which would be compatible with the recommendations from Sec. 6. To this end, a shared task on lexicon format definitions and/or lexicon construction methods could be organized. A mid- to long-term objective of the community would then be to produce unified multilingual reference datasets which would consist both of MWE-annotated corpora (extended to new, non-verbal MWE categories) and of NLP-oriented MWE lexicons. We believe that these steps are necessary to bridge the performance gap between MWEI and other NLP tasks, so that MWEI becomes a regular component of traditional NLP text analysis pipelines.

Acknowledgments

This work was funded by the French PARSEME-FR project (ANR-14-CERA-0001).15 We are grateful to Jakub Waszczuk and Kilian Evang for their valuable feedback at an early stage of our proposal. We also thank the anonymous reviewers for their useful comments.

15 http://parsemefr.lif.univ-mrs.fr/


References

Anne Abeillé and Yves Schabes. 1989. Parsing idioms in lexicalized TAGs. In Proceedings of the 4th Conference of the European Chapter of the ACL, EACL'89, Manchester, pages 1–9.

Anne Abeillé and Yves Schabes. 1996. Non-compositional discontinuous constituents in Tree Adjoining Grammar. In Harry Bunt and Arthur van Horck, editors, Discontinuous Constituency, pages 279–306. Mouton de Gruyter, Berlin, Germany.

Hassan Al-Haj, Alon Itai, and Shuly Wintner. 2014. Lexical representation of multiword expressions in morphologically-complex languages. International Journal of Lexicography, 27(2):130–170.

Mohammed A. Attia. 2006. Accommodating multiword expressions in an Arabic LFG grammar. In Proceedings of the 5th International Conference on Advances in Natural Language Processing, FinTAL'06, pages 87–98, Berlin. Springer.

Isabelle Augenstein, Leon Derczynski, and Kalina Bontcheva. 2017. Generalisation in named entity recognition. Comput. Speech Lang., 44(C):61–83.

Timothy Baldwin and Su Nam Kim. 2010. Multiword expressions. In Nitin Indurkhya and Fred J. Damerau, editors, Handbook of Natural Language Processing, 2nd edition, pages 267–292. CRC Press, Taylor and Francis Group, Boca Raton, FL, USA.

Eduard Bejcek and Pavel Stranák. 2010. Annotation of multiword expressions in the Prague dependency treebank. Language Resources and Evaluation, 44(1–2):7–21.

Francis Bond, Jia Qian Ho, and Dan Flickinger. 2015. Feeling our way to an analysis of English possessed idioms. In Proceedings of the 22nd International Conference on Head-Driven Phrase Structure Grammar, pages 61–74, Stanford, CA. CSLI Publications.

Elisabeth Breidt, Frédérique Segond, and Guiseppe Valetto. 1996. Formal Description of Multi-Word Lexemes with the Finite-State Formalism IDAREX. In Proceedings of COLING-96, Copenhagen, pages 1036–1040.

David Campos, Sérgio Matos, and José Luís Oliveira. 2012. Biomedical named entity recognition: A survey of machine-learning tools. In Shigeaki Sakurai, editor, Theory and Applications for Advanced Text Mining, chapter 8. IntechOpen, Rijeka.

Matthieu Constant, Gülsen Eryigit, Johanna Monti, Lonneke van der Plas, Carlos Ramisch, Michael Rosner, and Amalia Todirascu. 2017. Multiword expression processing: A survey. Computational Linguistics, 43(4):837–892.

Matthieu Constant, Joseph Le Roux, and Anthony Sigogne. 2013. Combining compound recognition and PCFG-LA parsing with word lattices and conditional random fields. TSLP Special Issue on MWEs: from theory to practice and use, part 2 (TSLP), 10(3).

Matthieu Constant and Elsa Tolone. 2010. A generic tool to generate a lexicon for NLP from Lexicon-Grammar tables. In Michele De Gioia, editor, Actes du 27e Colloque international sur le lexique et la grammaire (L'Aquila, 10-13 septembre 2008). Seconde partie, volume 1 of Lingue d'Europa e del Mediterraneo, Grammatica comparata, pages 79–93. Aracne. ISBN 978-88-548-3166-7.

Ann Copestake, Fabre Lambeau, Aline Villavicencio, Francis Bond, Timothy Baldwin, Ivan A. Sag, and Dan Flickinger. 2002. Multiword expressions: linguistic precision and reusability. In Proceedings of LREC 2002.

Silvio Cordeiro, Carlos Ramisch, and Aline Villavicencio. 2016. UFRGS&LIF at SemEval-2016 task 10: Rule-based MWE identification and predominant-supersense tagging. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 910–917, San Diego, California, USA. Association for Computational Linguistics.

Monika Czerepowicka and Agata Savary. 2018. SEJF - A Grammatical Lexicon of Polish Multiword Expressions, volume 10930 of Lecture Notes in Computer Science. Springer Cham.

Stefan Diaconescu. 2004. Multiword expression translation using generative dependency grammar. In Advances in Natural Language Processing. EsTAL 2004, volume 3230 of Lecture Notes in Computer Science, pages 243–254, Berlin, Heidelberg. Springer.

Doug Downey, Matthew Broadhead, and Oren Etzioni. 2007. Locating complex named entities in web text. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, IJCAI'07, pages 2733–2739, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Helge Dyvik, Gyri Smørdal Losnegaard, and Victoria Rosén. 2019. Multiword expressions in an LFG grammar for Norwegian. In Yannick Parmentier and Jakub Waszczuk, editors, Representation and Parsing of Multiword Expressions, pages 41–72. Language Science Press, Berlin.

Stefan Evert. 2005. The statistics of word co-occurrences: Word pairs and collocations. Ph.D. thesis, Univ. of Stuttgart, Stuttgart, Germany.

Afsaneh Fazly, Paul Cook, and Suzanne Stevenson. 2009. Unsupervised type and token identification of idiomatic expressions. Computational Linguistics, 35(1):61–103.

Polona Gantar, Lut Colman, Carla Parra Escartín, and Héctor Martínez Alonso. 2018. Multiword Expressions: Between Lexicography and NLP. International Journal of Lexicography.

Nicole Grégoire. 2010. DuELME: a Dutch electronic lexicon of multiword expressions. Language Resources and Evaluation, 44(1-2).

Maurice Gross. 1986. Lexicon-grammar: The Representation of Compound Words. In Proceedings of the 11th Conference on Computational Linguistics, COLING '86, pages 1–6, Stroudsburg, PA, USA. Association for Computational Linguistics.

Le Quan Ha, E. I. Sicilia-Garcia, Ji Ming, and F. J. Smith. 2002. Extension of Zipf's law to words and phrases. In Proceedings of the 19th International Conference on Computational Linguistics - Volume 1, COLING '02, pages 1–6, Stroudsburg, PA, USA. Association for Computational Linguistics.

Livnat Herzig Sheinfux, Tali Arad Greshler, Nurit Melnik, and Shuly Wintner. 2015. Hebrew verbal multiword expressions. In Proceedings of the 22nd International Conference on Head-Driven Phrase Structure Grammar, Nanyang Technological University (NTU), Singapore, pages 122–135, Stanford, CA. CSLI Publications.

Lauri Karttunen, Ronald M. Kaplan, and Annie Zaenen. 1992. Two-Level Morphology with Composition. In Proceedings of COLING-92, Nantes, pages 141–148.

Cvetana Krstev, Ranka Stankovic, Ivan Obradovic, Duško Vitaš, and Milos Utvic. 2010. Automatic Construction of a Morphological Dictionary of Multi-Word Units. LNAI, 6233:226–237.

François Lareau, Mark Dras, Benjamin Boerschinger, and Myfany Turpin. 2012. Implementing lexical functions in XLE.

Timm Lichte and Laura Kallmeyer. 2016. Same syntax, different semantics: A compositional approach to idiomaticity in multi-word expressions. In Empirical Issues in Syntax and Semantics 11, pages 111–140, Paris. CSSP.

Timm Lichte, Simon Petitjean, Agata Savary, and Jakub Waszczuk. 2019. Lexical encoding formats for multi-word expressions: The challenge of “irregular” regularities. In Yannick Parmentier and Jakub Waszczuk, editors, Representation and Parsing of Multiword Expressions, pages 41–72. Language Science Press, Berlin.

Irina Lobzhanidze. 2017. Computational Model of Modern Georgian Language and Searching Patterns for On-line Dictionary of Idioms. In Twelfth International Tbilisi Symposium on Language, Logic and Computation, 18-22 September 2017, Lagodekhi, Georgia.

Gyri Smørdal Losnegaard, Federico Sangati, Carla Parra Escartín, Agata Savary, Sascha Bargmann, and Johanna Monti. 2016. PARSEME survey on MWE resources. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris, France. European Language Resources Association (ELRA).

Michał Marcinczuk, Jan Kocon, and Marcin Oleksy. 2017. Liner2 — a generic framework for named entity recognition. In Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing, pages 86–91, Valencia, Spain. Association for Computational Linguistics.

Marjorie McShane, Sergei Nirenburg, and Stephen Beale. 2015. The Ontological Semantic treatment of multiword expressions. Lingvisticæ Investigationes, 38(1):73–110.

Igor Mel'cuk, Nadia Arbatchewsky-Jumarie, Louise Dagenais, Léo Elnitsky, Lidija Iordanskaja, Marie-Noëlle Lefebvre, and Suzanne Mantha. 1988. Dictionnaire explicatif et combinatoire du français contemporain: Recherches lexico-sémantiques, volume II of Recherches lexico-sémantiques. Presses de l'Univ. de Montréal.

Luka Nerima, Vasiliki Foufi, and Eric Wehrli. 2017. Parsing and MWE detection: Fips at the PARSEME shared task. In Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017), pages 54–59, Valencia, Spain. Association for Computational Linguistics.

Kemal Oflazer, Özlem Çetonoglu, and Bilge Say. 2004. Integrating Morphology with Multi-word Expression Processing in Turkish. In Second ACL Workshop on Multiword Expressions, July 2004, pages 64–71.

Caroline Pasquer, Carlos Ramisch, Agata Savary, and Jean-Yves Antoine. 2018a. VarIDE at PARSEME Shared Task 2018: Are variants really as alike as two peas in a pod? In Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018), pages 283–289. Association for Computational Linguistics.

Caroline Pasquer, Agata Savary, Carlos Ramisch, and Jean-Yves Antoine. 2018b. If you've seen some, you've seen them all: Identifying variants of multiword expressions. In Proceedings of COLING 2018, the 27th International Conference on Computational Linguistics. The COLING 2018 Organizing Committee.

Marie-Sophie Pausé. 2018. Modelling French idioms in a lexical network. Studi e Saggi Linguistici, 55(2):137–155.

Pavel Pecina. 2008. Lexical association measures: Collocation extraction. Ph.D. thesis, Faculty of Mathematics and Physics, Charles Univ. in Prague, Prague, Czech Republic.

Jakub Piskorski, Lidia Pivovarova, Jan Šnajder, Josef Steinberger, and Roman Yangarber. 2017. The first cross-lingual challenge on recognition, normalization, and matching of named entities in Slavic languages. In Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing, pages 76–85, Valencia, Spain. Association for Computational Linguistics.

Adam Przepiórkowski, Jan Hajic, Elzbieta Hajnicz, and Zdenka Urešová. 2017. Phraseology in two Slavic valency dictionaries: Limitations and perspectives. International Journal of Lexicography, 30(1):1–38.

Adam Przepiórkowski, Elzbieta Hajnicz, Agnieszka Patejuk, and Marcin Wolinski. 2014. Extended phraseological information in a valence dictionary for NLP applications. In Proceedings of the Workshop on Lexical and Grammatical Resources for Language Processing (LG-LP 2014), pages 83–91, Dublin, Ireland. Association for Computational Linguistics and Dublin City University.

Carlos Ramisch. 2015. Multiword expressions acquisition: A generic and open framework, volume XIV of Theory and Applications of Natural Language Processing. Springer. https://doi.org/10.1007/978-3-319-09207-2.

Carlos Ramisch, Silvio Ricardo Cordeiro, Agata Savary, Veronika Vincze, Verginica Barbu Mititelu, Archna Bhatia, Maja Buljan, Marie Candito, Polona Gantar, Voula Giouli, Tunga Güngör, Abdelati Hawwari, Uxoa Iñurrieta, Jolanta Kovalevskaite, Simon Krek, Timm Lichte, Chaya Liebeskind, Johanna Monti, Carla Parra Escartín, Behrang QasemiZadeh, Renata Ramisch, Nathan Schneider, Ivelina Stoyanova, Ashwini Vaidya, and Abigail Walsh. 2018. Edition 1.1 of the PARSEME Shared Task on automatic identification of verbal multiword expressions. In Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018), pages 222–240. Association for Computational Linguistics.

Martin Riedl and Chris Biemann. 2016. Impact of MWE resources on multiword recognition. In Proceedings of the 12th Workshop on Multiword Expressions (MWE 2016), Berlin, Germany.

Omid Rohanian, Shiva Taslimipoor, Samaneh Kouchaki, Le An Ha, and Ruslan Mitkov. 2019. Bridging the gap: Attending to discontinuity in identification of multiword expressions. CoRR, abs/1902.10667.

Jake Ryland Williams, Paul R. Lessard, Suma Desu, Eric M. Clark, James P. Bagrow, Christopher M. Danforth, and Peter Sheridan Dodds. 2015. Zipf's law holds for phrases, not words. Scientific Reports, 5.

Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann Copestake, and Dan Flickinger. 2002. Multiword Expressions: A Pain in the Neck for NLP. In Proceedings of CICLING'02. Springer.

Agata Savary. 2008. Computational Inflection of Multi-Word Units. A contrastive study of lexical approaches. Linguistic Issues in Language Technology, 1(2):1–53.

Agata Savary, Marie Candito, Verginica Barbu Mititelu, Eduard Bejcek, Fabienne Cap, Slavomír Céplö, Silvio Ricardo Cordeiro, Gülsen Eryigit, Voula Giouli, Maarten van Gompel, Yaakov HaCohen-Kerner, Jolanta Kovalevskaite, Simon Krek, Chaya Liebeskind, Johanna Monti, Carla Parra Escartín, Lonneke van der Plas, Behrang QasemiZadeh, Carlos Ramisch, Federico Sangati, Ivelina Stoyanova, and Veronika Vincze. 2018. PARSEME multilingual corpus of verbal multiword expressions. In Stella Markantonatou, Carlos Ramisch, Agata Savary, and Veronika Vincze, editors, Multiword expressions at length and in depth. Extended papers from the MWE 2017 workshop, pages 87–147. Language Science Press, Berlin.

Agata Savary, Silvio Ricardo Cordeiro, Timm Lichte, Carlos Ramisch, Uxoa Iñurrieta, and Voula Giouli. 2019. Literal occurrences of multiword expressions: Rare birds that cause a stir. The Prague Bulletin of Mathematical Linguistics, 112:5–54.

Agata Savary, Carlos Ramisch, Silvio Cordeiro, Federico Sangati, Veronika Vincze, Behrang QasemiZadeh, Marie Candito, Fabienne Cap, Voula Giouli, Ivelina Stoyanova, and Antoine Doucet. 2017. The PARSEME Shared Task on automatic identification of verbal multiword expressions. In Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017), pages 31–47, Valencia, Spain. Association for Computational Linguistics.

Nathan Schneider, Emily Danchik, Chris Dyer, and Noah A. Smith. 2014. Discriminative lexical semantic segmentation with gaps: running the MWE gamut. Transactions of the ACL, 2:193–206.

Violeta Seretan. 2011. Syntax-based collocation extraction. Text, Speech and Language Technology. Springer.

Max Silberztein. 2005. NooJ's dictionaries. In Proceedings of LTC'05, Poznan, pages 291–295. Wydawnictwo Poznanskie.

Ralf Steinberger, Bruno Pouliquen, Mijail Kabadjov, Jenya Belyaeva, and Erik van der Goot. 2011. JRC-NAMES: A freely available, highly multilingual named entity resource. In Proceedings of the International Conference Recent Advances in Natural Language Processing 2011, pages 104–110, Hissar, Bulgaria. Association for Computational Linguistics.

Shiva Taslimipoor and Omid Rohanian. 2018. SHOMA at PARSEME Shared Task on automatic identification of VMWEs: Neural multiword expression tagging with high generalisation. CoRR, abs/1809.03056.

Erik F. Tjong Kim Sang. 2002. Introduction to the CoNLL-2002 shared task: Language-independent named entity recognition. In Proceedings of the 6th Conference on Natural Language Learning - Volume 20, COLING-02, pages 1–4, Stroudsburg, PA, USA. Association for Computational Linguistics.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142–147.

Zdenka Urešová. 2012. Building the PDT-Vallex valency lexicon. In On-line Proceedings of the Fifth Corpus Linguistics Conference, University of Liverpool.

Ashwini Vaidya, Owen Rambow, and Martha Palmer. 2014. Light verb constructions with ‘do’ and ‘be’ in Hindi: A TAG analysis. In Proceedings of the Workshop on Lexical and Grammatical Resources for Language Processing, pages 127–136.

Aline Villavicencio, Ann Copestake, Benjamin Waldron, and Fabre Lambeau. 2004. Lexical Encoding of MWEs. In ACL Workshop on Multiword Expressions: Integrating Processing, July 2004, pages 80–87.

Veronika Vincze, István Nagy T., and Gábor Berend. 2011. Multiword expressions and named entities in the Wiki50 corpus. In Proceedings of the International Conference Recent Advances in Natural Language Processing 2011, pages 289–295, Hissar, Bulgaria. Association for Computational Linguistics.

Jakub Waszczuk. 2018. TRAVERSAL at PARSEME Shared Task 2018: Identification of verbal multiword expressions using a discriminative tree-structured model. In Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018), pages 275–282. Association for Computational Linguistics.

Jakub Waszczuk, Katarzyna Glowinska, Agata Savary, Adam Przepiórkowski, and Michal Lenart. 2013. Annotation tools for syntax and named entities in the National Corpus of Polish. IJDMMM, 5(2):103–122.

Jakub Waszczuk, Agata Savary, and Yannick Parmentier. 2016. Promoting multiword expressions in A* TAG parsing. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 429–439, Osaka, Japan. The COLING 2016 Organizing Committee.

Marion Weller and Ulrich Heid. 2010. Extraction of German Multiword Expressions from Parsed Corpora Using Context Features. In LREC.

Vikas Yadav and Steven Bethard. 2018. A survey on recent advances in named entity recognition from deep learning models. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2145–2158, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

David Yarowsky. 1993. One sense per collocation. In Human Language Technology: Proceedings of a Workshop Held at Plainsboro, New Jersey, March 21-24, 1993.

