Ranking Transfer Languages with Pragmatically-Motivated Features for Multilingual Sentiment Analysis

Jimin Sun¹* Hwijeen Ahn²* Chan Young Park³* Yulia Tsvetkov³ David R. Mortensen³
¹Seoul National University, Republic of Korea
²Sogang University, Republic of Korea
³Language Technologies Institute, Carnegie Mellon University, USA
[email protected], [email protected], {chanyoun, ytsvetko, dmortens}@cs.cmu.edu

    Abstract

Cross-lingual transfer learning studies how datasets, annotations, and models can be transferred from resource-rich languages to improve language technologies in resource-poor settings. Recent works have shown that we can further benefit from the selection of the best transfer language. In this paper, we propose three pragmatically-motivated features that can help guide the optimal transfer language selection problem for cross-lingual transfer. Specifically, the proposed features operationalize cross-cultural similarities that manifest in various linguistic patterns: language context-level, sharing of multi-word expressions, and the use of emotion concepts. Our experimental results show that these features significantly improve the prediction of optimal transfer languages over baselines in sentiment analysis, but are less useful for dependency parsing. Further analyses show that the proposed features indeed capture the intended cross-cultural similarities and align well with existing work in sociolinguistics and linguistic anthropology.

    1 Introduction

Cross-lingual transfer of linguistic annotations, models, and raw corpora has been widely used in multilingual natural language processing tasks, including machine translation (Zoph et al., 2016; Johnson et al., 2017; Neubig and Hu, 2018), multilingual dependency parsing (Ammar et al., 2016; Ponti et al., 2018), and multilingual sentiment analysis (Prettenhofer and Stein, 2010; Zhou et al., 2016). In this paper, we focus on settings in which cross-lingual transfer learning is realized by incorporating annotated datasets from different transfer languages to improve task performance in the target language. This setup has been shown to be

    *The first three authors contributed equally.

especially helpful in low-resource scenarios (Das and Hasegawa-Johnson, 2015; Agić et al., 2016).

However, not all transfer languages are equally helpful. Previous work has shown that selecting the right set of training languages can significantly boost the performance of cross-lingual models (Paul et al., 2009; Lin et al., 2019; Wang and Neubig, 2019; Wang et al., 2020). More specifically, Lin et al. (2019) explore the problem of language selection, where a model predicts the most effective set of training languages for a given target language. They proposed a framework that uses various syntactic and semantic features and showed that automatic language selection could significantly improve the effectiveness of cross-lingual learning.

Prior studies mainly introduced shallow lexical, syntactic, and semantic features of languages to understand the effectiveness of cross-lingual transfer. However, these features may be insufficient when the cross-lingual task is driven by pragmatic knowledge, as in the cross-lingual analysis of sentiment and emotion.¹ The expression of subtle sentiment and emotion, such as subjective well-being (Smith et al., 2016), anger (Oster, 2019), or irony (Karoui et al., 2017), varies significantly by culture. Mohammad et al. (2016) have shown that, even with sound machine translation systems, achieving cross-lingual transfer by translating low-resource languages to high-resource languages and applying models trained on the high-resource languages is impeded by culture-specific

¹In linguistics, pragmatics has both a broad and a narrow sense. Narrowly, the term is used to refer to formal pragmatics. In the broad sense, which we employ in this paper, pragmatics refers to contextual factors in language use. We are particularly concerned with cross-cultural pragmatics and with finding quantifiable linguistic measures that correspond to aspects of cultural context. These measures are not the cultural characteristics that would be identified by anthropological linguists themselves but are rather intended to be measurable correlates of these characteristics.


concepts. Some languages, for instance Chinese and Korean, are used in East Asia in similar cultural contexts but possess significantly different syntactic structures (Jackson et al., 2019). In such cases, cross-cultural similarity can be one of the most important indicators for predicting cross-lingual transfer quality.

To operationalize cross-cultural similarity, we focus on three distinct aspects of the intersection of language and culture. First, every language and culture relies on a different level of context in communication. Western European languages, such as German and English, are generally considered low-context languages, whereas Korean and Japanese are considered high-context languages. Second, similar cultures construct and construe figurative language similarly (Casas and Campoy, 1995; Vulanović, 2014). Finally, emotion semantics is similar between languages that are culturally related (Jackson et al., 2019). For example, in Persian (an Indo-Iranian language of Iran), both 'grief' and 'regret' are expressed with the same word ænduh, whereas 'grief' is colexified with 'anxiety' as dard in the Sirkhi dialect of Dargwa (a Dagestanian language of Russia).

The key contribution of our work is pragmatically-driven features that capture cross-cultural similarity: the language context-level ratio, literal translation quality, and emotion semantic distance (§3). Extensive analysis of each feature verifies that they indeed capture the intended linguistic patterns and thereby align with prior work from sociolinguistics and linguistic anthropology (§4). We further evaluate each feature's effectiveness by incorporating them into a transfer-language ranking model, focusing on two NLP tasks: sentiment analysis and dependency parsing (§6). Our results corroborate our hypothesis that the pragmatically-motivated features boost the baseline for sentiment analysis but not for dependency parsing, suggesting a connection between sentiment and pragmatics (§7).²

    2 Problem Formulation

We define our task as the language selection problem: given the target language $l_{tg}$, our model ranks transfer languages $l_{tf}$ by their usefulness when transferred to $l_{tg}$. Formally, we define the transferability of a language pair $(l_{tf}, l_{tg})$ as how useful

²Both code and data used in this paper are available at https://github.com/hwijeen/langrank.

Figure 1: An example of training a ranking model with four languages. First, we obtain the optimal ranking $r$ based on zero-shot transfer performances $z$ with task-specific cross-lingual models (Step 1). Then, we use language similarity features $f$ and the optimal ranking to train the ranking model (Step 2).

language $l_{tf}$ is to a model for $l_{tg}$. The effectiveness of cross-lingual transfer is often measured by joint training or zero-shot transfer performance (Wu and Dredze, 2019; Schuster et al., 2019). In this work, we quantify transferability as zero-shot performance, following Lin et al. (2019). For a given target language $l_{tg}$ and $n$ candidate (source) transfer languages $L_{tf} = \{l_{tf}^{(1)}, \ldots, l_{tf}^{(n)}\}$, our goal is to train a model that ranks the languages in $L_{tf}$ by their transferability.

Figure 1 illustrates the training procedure of the transfer-language ranking model, which follows the setup in Lin et al. (2019). Before training, we first need to extract optimal transferability rankings, which serve as the training data of the language ranking model (Step 1). For a given target language $l_{tg}$, we evaluate the zero-shot performance of a model trained solely on transfer language $l_{tf}$ and tested on $l_{tg}$, denoted $z_{tf,tg}$. After evaluating $z_{tf,tg}$ for each candidate transfer language in $L_{tf}$, we obtain the optimal ranking of languages $r_{tg}$ by sorting the languages according to their transferability to $l_{tg}$. Note that the optimal rankings depend on the task and its characteristics.

Next, we train the language ranking model (Step 2). The ranking model predicts the transferability ranking of the candidate transfer languages. Each (source, target) pair $(l_{tf}, l_{tg})$ is represented as a vector of language features $f_{tf,tg}$, which may include phonological similarity, typological similarity, and word overlap, to name a few. The ranking model takes $f_{tf,tg}$ of every $l_{tf} \in L_{tf}$ as input and predicts the transferability ranking $\hat{r}_{tg}$.


Using $r_{tg}$ from the previous step as training data, the ranking model learns to find optimal transfer languages using language features. Once the model is trained, it can be used to predict transferability for an unseen language pair, without the expensive computation process of Step 1.
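Concretely, Step 1 amounts to sorting candidates by their zero-shot scores, and Step 2 to converting rank positions into relevance grades for a learning-to-rank model. A minimal sketch in Python, with hypothetical variable and function names (the authors' actual implementation is in the langrank repository cited in footnote 2):

```python
def optimal_ranking(zero_shot_scores):
    """Step 1: zero_shot_scores maps each candidate transfer language to its
    zero-shot performance z_{tf,tg} on the target language. Returns r_tg,
    the candidates sorted from most to least transferable."""
    return sorted(zero_shot_scores, key=zero_shot_scores.get, reverse=True)

def ranking_training_data(pair_features, ranking):
    """Step 2: pair each candidate's feature vector f_{tf,tg} with a relevance
    grade derived from its position in the optimal ranking r_tg."""
    n = len(ranking)
    return [(pair_features[lang], n - rank) for rank, lang in enumerate(ranking)]
```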

    3 Pragmatically-motivated Features

Our main contribution is in proposing novel features to include in $f$ that correlate with cultural similarities across languages. We hypothesize that these cultural similarities are essential for effectively ranking transfer languages in pragmatics-driven tasks.

Language Context-level Ratio The language context-level ratio (LCR) feature approximates the extent to which a pair of languages differs in leaving the identity of entities and predicates to context. For example, the English sentence Did you eat lunch? explicitly indicates the pronoun you, whereas the equivalent Korean sentence 점심 먹었니? (= Did eat lunch?) omits the pronoun. This is related to the concept of context-level, which is considered one of the distinctive attributes of a language's pragmatics in linguistics and communication studies (Nada et al., 2001). If two languages have similar levels of context, their speakers are more likely to be from similar cultures (Nada et al., 2001). To capture this linguistic quality, we compute the pronoun- and verb-token ratios, $ptr(l_k)$ and $vtr(l_k)$, for each language $l_k$ using part-of-speech tagging results. We first run language-specific POS taggers over each language's large monolingual corpus.³ Next, we compute $ptr$ as the number of pronoun tokens over the number of all tokens; $vtr$ is obtained likewise with verb tokens. Low $ptr$ and $vtr$ values may indicate that a language leaves the identity of entities and predicates, respectively, to context.

We then compare these values between the target language $l_{tg}$ and transfer language $l_{tf}$, which leads to the following definition of LCR:

$$\mathrm{LCR\text{-}pron}(l_{tf}, l_{tg}) = \frac{ptr(l_{tg})}{ptr(l_{tf})}, \qquad \mathrm{LCR\text{-}verb}(l_{tf}, l_{tg}) = \frac{vtr(l_{tg})}{vtr(l_{tf})}$$

³The list of POS taggers, tokenizers, and monolingual corpora used in the paper is in Appendix A.2.
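As a sketch of how these ratios might be computed, assuming a corpus already tagged with Universal POS labels (the taggers in Appendix A.2 may use other tagsets, so the tag names below are an assumption):

```python
from collections import Counter

def token_ratios(tagged_corpus):
    """tagged_corpus: iterable of (token, pos) pairs for one language.
    Returns (ptr, vtr): pronoun- and verb-token ratios over all tokens."""
    counts = Counter(pos for _, pos in tagged_corpus)
    total = sum(counts.values())
    return counts["PRON"] / total, counts["VERB"] / total

def lcr(ratios_tf, ratios_tg):
    """LCR-pron and LCR-verb for a (transfer, target) pair, per the
    definitions above."""
    (ptr_tf, vtr_tf), (ptr_tg, vtr_tg) = ratios_tf, ratios_tg
    return ptr_tg / ptr_tf, vtr_tg / vtr_tf
```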

Literal Translation Quality Literal translation quality (LTQ) quantifies how well literal translation, i.e., word-by-word translation using a bilingual dictionary,⁴ works for a given language pair's multiword expressions (MWEs). The motivation is that culturally similar languages share figurative language, including idiomatic MWEs and metaphors. For example, like father like son in English can be translated word-by-word into a similar idiom, tel père tel fils, in French. However, in Japanese, a similar idiom, 蛙の子は蛙 (Kaeru no ko wa kaeru) "A frog's child is a frog.", cannot be literally translated.

Since we do not have a well-curated list of MWEs in every language, we follow the MWE extraction approach of Tsvetkov and Wintner (2010): for each language, we use PMI³ (Daille, 1994) to extract the top-k MWEs from a large news-crawl corpus (Goldhahn et al., 2012). However, the news-crawl corpus is often noisy, and the extracted MWEs thus contain many data-specific artifact n-grams. To filter these out, we exploit another smaller but reasonably large monolingual corpus, the TED talk dataset (Qi et al., 2018), and choose the top-k MWEs in terms of PMI³ that appear in both monolingual corpora. In this paper, we used k = 500.

After retrieving the MWEs, we use a bilingual dictionary for $l_{tf}$ and $l_{tg}$ and a parallel corpus between the pair to measure $\mathrm{LTQ}(l_{tf}, l_{tg})$.⁵ For each n-gram among $l_{tg}$'s MWEs, we first look for target sentences in the parallel corpus that contain the n-gram. Then, for each found sentence, we look at each word of the n-gram and its potential translations in the transfer language using the bilingual dictionary. For each word in the n-gram, if any of its translations appears in the source sentence, we count a hit; otherwise, a miss. We calculate the hit ratio as $\frac{hit}{hit+miss}$ for each n-gram found in the parallel corpus. Finally, we average the hit ratios of all n-grams and set this as $\mathrm{LTQ}(l_{tf}, l_{tg})$.⁶
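A simplified sketch of this hit-ratio computation, assuming the MWEs, sentence pairs, and dictionary are already loaded, and using a set-based containment check in place of exact contiguous n-gram matching (the z-score standardization of footnote 6 is omitted):

```python
def ltq(mwes_tg, parallel_corpus, dictionary):
    """mwes_tg: target-language MWEs as tuples of words.
    parallel_corpus: list of (src_tokens, tgt_tokens) sentence pairs.
    dictionary: maps a target word to a set of transfer-language translations."""
    ratios = []
    for ngram in mwes_tg:
        hit = miss = 0
        for src_tokens, tgt_tokens in parallel_corpus:
            if not set(ngram) <= set(tgt_tokens):  # simplified containment check
                continue
            for word in ngram:
                if dictionary.get(word, set()) & set(src_tokens):
                    hit += 1
                else:
                    miss += 1
        if hit + miss > 0:  # n-gram found at least once in the parallel corpus
            ratios.append(hit / (hit + miss))
    return sum(ratios) / len(ratios) if ratios else 0.0
```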

Emotion Semantics Distance Emotion semantic distance (ESD) measures how similarly emotions are worded across languages. It is inspired by Jackson et al. (2019), who use colexification patterns to capture the semantic similarity of languages.

⁴https://github.com/kakaobrain/word2word

⁵We used the TED talk dataset for the parallel corpus.
⁶We further standardize the score (z-score) over the transfer language.


However, colexification patterns require human annotation, and existing annotations may not be comprehensive. Here, we extend the method by using cross-lingual word embeddings.

We define ESD as the average distance of emotion word vectors in the transfer and target languages, after aligning the word embeddings into the same space. More specifically, we use the 24 emotion concepts defined in Jackson et al. (2019) and use bilingual dictionaries to expand each concept into every other language. We then remove the emotion words from the bilingual dictionaries and use the remaining word pairs to align the word embeddings of the source into the space of the target language.⁷ Theoretically, if words of the same emotion concept in different languages had exactly the same meaning, they would be aligned to the same point despite the lack of supervision. However, because each language possesses different emotion semantics, the emotions in each language are scattered into different positions. Finally, we define ESD as the average cosine distance between languages:

$$\mathrm{ESD}(l_{tf}, l_{tg}) = \frac{1}{|E|} \sum_{e \in E} \cos(v_{tf,e}, v_{tg,e})$$

where $E$ is the set of emotion concepts and $v_{tf,e}$ is the aligned word vector of language $l_{tf}$ for emotion concept $e$.
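Reading cos(·,·) here as cosine distance, the computation over already-aligned vectors is straightforward; a sketch, assuming each language provides a dict from emotion concept to its aligned embedding:

```python
import numpy as np

def esd(vectors_tf, vectors_tg, emotion_concepts):
    """vectors_*: dict mapping an emotion concept to its word vector, already
    aligned into a shared space. Returns the average cosine distance."""
    distances = []
    for e in emotion_concepts:
        v, w = vectors_tf[e], vectors_tg[e]
        cos_sim = v @ w / (np.linalg.norm(v) * np.linalg.norm(w))
        distances.append(1.0 - cos_sim)  # cosine distance
    return sum(distances) / len(distances)
```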

    4 Feature Analysis

In this section, we verify whether each pragmatically-motivated feature correlates with the intended pragmatic information.

4.1 LCR and Language Context-level

ptr approximates how often discourse entities are indexed with pronouns rather than left conjecturable from context. Similarly, vtr estimates the rate at which predicates appear explicitly as verbs. To examine to what extent these features reflect context-levels, we plot languages on a two-dimensional plane in Figure 2, where the x-axis indicates ptr and the y-axis indicates vtr. The plot reveals a clear pattern of context-levels across languages. German, one of the low-context languages (Hall, 1989), possesses the second-largest value of ptr. On the other extreme are located Korean and Japanese with low ptr, which are representative high-context languages.

⁷We followed Lample et al. (2018) to generate these supervised cross-lingual word embeddings.

Figure 2: Plot of languages in the ptr–vtr plane. German, a representative example of a low-context language, is located in the right corner. On the other hand, high-context languages (e.g., Korean, Japanese) are located in the lower left.

One thing to notice is the isolated location of Turkish, with a high vtr. This is morphosyntactically plausible, as a lot of information is expressed by affixation to verbs in Turkish.

    4.2 LTQ and MWEs

Since human-curated lists of figurative-language MWEs (gold MWEs) are not available for all languages, LTQ uses n-grams with high PMI scores (PMI MWEs) as proxies. Nonetheless, for languages that do have manual annotations, we can use them to evaluate the quality of the selected MWEs and the resulting LTQ. We collected ground-truth MWEs in multiple languages from Wiktionary.⁸ We discarded languages with fewer than 2,000 phrases on the list, leaving four languages (English, French, German, Spanish) for analysis.

First, we checked how many PMI MWEs actually appear among the gold MWEs. Out of the top 500 PMI bigrams and trigrams, 19.0% and 3.8%, respectively, were included in the gold MWE list. For example, the trigrams keep an eye and take into account from the PMI MWEs were considered to be in the gold MWEs, as keep an eye peeled and take into account were on the list.

Secondly, to validate using PMI MWEs as proxies, we compare the LTQ of the PMI MWEs with the LTQ using the gold MWEs.

⁸For example, https://en.wiktionary.org/wiki/Category:English_idioms


(a) Network of languages based on Emotion Semantics Distance.

    (b) Network of languages based on syntactic distance.

Figure 3: Networks of languages. When a language is ranked among the top 2 closest languages (k = 2), an edge exists between the two languages. Color-coded cultural areas are defined according to Siegel (1977).

More specifically, using the same procedure and dataset explained in §3, we obtained the LTQ scores of each language pair, with target languages limited to the four European languages mentioned above. For each target language, we then measured the Pearson correlation coefficient between the LTQ scores from the two lists. The average coefficient was 0.92, which indicates a strong correlation between the LTQ of the two lists and thus justifies using PMI MWEs for all other languages.

    4.3 ESD and Cultural Grouping

We investigate what is captured by ESD by visualizing and examining the nearest neighbors of the emotion vectors.⁹ Using word collocations, Jackson et al. (2019) revealed that the emotion hope clusters with different kinds of emotions depending on the language family it belongs to. For instance, in Tai-Kadai languages hope appears in the same cluster as want and pity, while hope clusters with good and love in the Nakh-Daghestanian language family. Our results derived from ESD also support this finding, even without using word collocations. For instance, the nearest neighbors of the French word for hope were worry and regret, while for Hindi hope they were joy and good.

We further investigate the suggested ESD feature and show that it is an indicator of cultural similarity. We present a network of languages based on ESD in Figure 3a. Languages are represented as nodes and color-coded according to the predefined cultural areas in Table 1. To draw edges, we set each language as the target language and sort the other languages according to ESD.

⁹An emotion vector visualization demo can be found at https://bit.ly/emotion_vecs.

When a language is in the list of the top-k closest languages, an edge exists between the two languages.

In Figure 3, we compare two graphs based on different linguistic features. Figure 3a uses ESD to draw edges between languages, while Figure 3b uses the syntactic distance provided by the URIEL package (Littell et al., 2017). We see that languages sharing the same cultural area form more cohesive clusters in Figure 3a than in Figure 3b. The portion of edges within cultural areas was 76% of all edges in Figure 3a, versus 59% in Figure 3b. These results indicate that ESD effectively extracts linguistic information that aligns well with the commonly shared perception of cultural groups.
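A sketch of this graph construction and of the within-culture edge fraction, assuming a precomputed ESD matrix and a mapping from language to cultural area (names and the exact edge criterion are our simplifications):

```python
import networkx as nx

def esd_network(esd, langs, culture, k=2):
    """esd[i][j]: ESD with langs[i] as target and langs[j] as transfer language.
    culture: dict mapping a language to its cultural area (as in Table 1)."""
    graph = nx.Graph()
    graph.add_nodes_from(langs)
    for i, target in enumerate(langs):
        neighbors = sorted((esd[i][j], langs[j])
                           for j in range(len(langs)) if j != i)
        for _, neighbor in neighbors[:k]:  # top-k closest languages
            graph.add_edge(target, neighbor)
    within = sum(culture[a] == culture[b] for a, b in graph.edges())
    return graph, within / graph.number_of_edges()
```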

    5 Dataset

We apply the proposed features to train a ranking model for two distinct tasks: multilingual sentiment analysis (SA) and multilingual dependency parsing (DEP). We hypothesize that higher-order information such as pragmatics will assist sentiment analysis, while it may be insignificant for dependency parsing, where lower-order information such as syntax is relatively stressed. This section reports the dataset used for each task.

Sentiment Analysis As there is no single sentiment analysis dataset covering a wide variety of languages, we collected various review datasets from different sources. All samples are labeled as either positive or negative; for datasets rated with scores ranging from 1 to 5, we mapped 1–2 to negative and 4–5 to positive. We settled on a dataset consisting of 16 languages categorized into five distinct cultural groups: Western Europe, Eastern Europe, East Asia, South Asia, and the Middle East.


Cultural Area   Language   Domain        Size
West Europe     German     product      56333
                French     product      20771
                English    restaurant    1472
                Spanish    restaurant    1396
                Dutch      restaurant    1089
East Europe     Russian    restaurant    2289
                Czech      movie        54540
                Polish     product      26284
East Asia       Chinese    electronics   2333
                Korean     movie        18000
                Japanese   product      21095
South Asia      Hindi      product       2707
                Tamil      movie          417
Middle East     Arabic     hotel         4111
                Persian    product       3904
                Turkish    restaurant     907

Table 1: Data statistics for the sentiment analysis task. Datasets are reviews from different domains and of different sizes. We divided the 16 languages into five cultural groups based on cultural similarities.

Table 1 summarizes the data in each group, with sizes and domains.

Since the data came from heterogeneous sources,¹⁰ each language's dataset varied in size and domain: sizes range from 417 (Tamil) to 625,918 (German), and domains include hotel, restaurant, product, and movie reviews. To alleviate the size disparity between languages, we randomly subsampled the datasets larger than 100K (Japanese, German, French, Korean) while preserving their label distributions.
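Such distribution-preserving subsampling can be done per label class; a minimal sketch (the 100K threshold is from the text; the function name is ours):

```python
import random
from collections import defaultdict

def stratified_downsample(examples, max_size=100_000, seed=0):
    """examples: list of (text, label) pairs. Subsamples each label class
    proportionally so the overall label distribution is preserved."""
    if len(examples) <= max_size:
        return examples
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    keep_frac = max_size / len(examples)
    sample = []
    for items in by_label.values():
        sample.extend(rng.sample(items, round(len(items) * keep_frac)))
    return sample
```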

Dependency Parsing To compare the effectiveness of the proposed features on a syntax-focused task, we selected datasets for the same set of 16 languages from Universal Dependencies v2.2 (Nivre et al., 2018).

    6 Evaluation Setup

In this section, we describe the cross-lingual models used in each of the two tasks and the transfer-language ranking model with its evaluation metrics.

SA Cross-lingual model We performed supervised fine-tuning of multilingual BERT (mBERT) (Devlin et al., 2019) for the sentiment analysis task, as it has shown strong results on various text classification tasks in cross-lingual settings (Sun et al., 2019; Xu et al., 2019; Li et al., 2019).

¹⁰A detailed list is provided in Appendix A.1.

mBERT is a multilingual extension of BERT pretrained on 104 languages, including the 16 languages we used throughout our experiments. The model has been shown to be highly effective in cross-lingual transfer, even between languages using different scripts (K et al., 2019; Pires et al., 2019). We used a concatenation of mean and max pooling of the representations from mBERT's penultimate layer, as it outperformed the standard practice of using the [CLS] token. The concatenated representation was passed to a fully connected layer for prediction. The performance measure is the macro F1 score on the held-out test set. To extract the optimal transfer rankings, we conducted zero-shot transfer with mBERT: we fine-tuned mBERT on the transfer-language data and tested it on the target-language data.
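The mean+max pooling head can be written compactly in PyTorch; a sketch assuming Hugging Face-style model outputs with output_hidden_states=True, so that hidden_states[-2] is the penultimate layer:

```python
import torch

def pool_penultimate(hidden_states, attention_mask):
    """hidden_states: tuple of per-layer tensors of shape (batch, seq, dim).
    Returns the concatenated masked mean and max pooling, (batch, 2 * dim)."""
    h = hidden_states[-2]                                # penultimate layer
    mask = attention_mask.unsqueeze(-1).float()          # (batch, seq, 1)
    mean = (h * mask).sum(1) / mask.sum(1).clamp(min=1)  # masked mean pooling
    maxp = h.masked_fill(mask == 0, float("-inf")).max(1).values  # masked max
    return torch.cat([mean, maxp], dim=-1)  # feed to a fully connected layer
```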

DEP Cross-lingual model For dependency parsing, we adopted the setting of Ahmad et al. (2018), performing cross-lingual zero-shot transfer on the same set of languages as in the sentiment analysis task. We trained deep biaffine attentional graph-based models (Dozat and Manning, 2016), which have achieved state-of-the-art performance in dependency parsing for many languages. To cope with the multilingual vocabulary, we adopted an offline embedding method (Smith et al., 2017) that maps pretrained word embeddings of different languages into the same space. Performance was evaluated using labeled attachment scores (LAS).
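The offline mapping of Smith et al. (2017) is an orthogonal transformation with a closed-form (Procrustes) solution computed from a seed dictionary; a sketch of that alignment step (their inverted-softmax retrieval is omitted):

```python
import numpy as np

def orthogonal_map(src, tgt):
    """src, tgt: (n, d) arrays whose i-th rows embed the two sides of the
    i-th seed translation pair. Returns an orthogonal W with src @ W ≈ tgt,
    the solution of the orthogonal Procrustes problem."""
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt
```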

Ranking model For the transfer-language ranking model, we used gradient-boosted decision trees (Ke et al., 2017) trained with LambdaRank (Burges et al., 2007), one of the state-of-the-art models for ranking tasks. As in prior work, we optimized normalized discounted cumulative gain (NDCG) to train the model (Järvelin and Kekäläinen, 2002).
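With LightGBM's scikit-learn-style interface this corresponds to an LGBMRanker with the lambdarank objective; a sketch where X, y, and group_sizes come from the Step 1/Step 2 construction above (variable names are ours):

```python
import lightgbm as lgb

# X: one row of pair features f_{tf,tg} per (transfer, target) pair.
# y: relevance grades derived from the optimal rankings r_tg.
# group_sizes: number of candidate transfer languages per target language,
# so each target language forms one query group.
ranker = lgb.LGBMRanker(objective="lambdarank", metric="ndcg")
ranker.fit(X, y, group=group_sizes)
scores = ranker.predict(X_test)  # sort candidates by score for the ranking
```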

Evaluation metric We evaluate the ranking models' performance with two standard metrics designed for ranking problems: Mean Average Precision (MAP) and NDCG. MAP is computed by averaging the precision at each relevant item (AP) and then averaging the AP scores over multiple ranking tasks. Following Lin et al. (2019), we set the relevant items to be the top-3 languages in terms of zero-shot performance. NDCG enables more

fine-grained grading, considering ranking positions rather than assuming a binary concept of relevance/irrelevance. Here, we use NDCG@3 as the evaluation metric. We report the model's average test performance using leave-one-out cross-validation.
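For reference, both metrics are short to implement; a sketch with binary relevance for AP (top-3 gold languages) and graded gains for NDCG@3 (function names are ours):

```python
import math

def average_precision(pred_order, relevant):
    """pred_order: languages in predicted rank order; relevant: the set of
    top-3 languages by zero-shot performance."""
    hits, ap = 0, 0.0
    for i, lang in enumerate(pred_order, start=1):
        if lang in relevant:
            hits += 1
            ap += hits / i  # precision at this relevant position
    return ap / len(relevant)

def ndcg_at_k(gains, k=3):
    """gains: graded relevance of each item, in predicted rank order."""
    def dcg(seq):
        return sum(g / math.log2(i + 2) for i, g in enumerate(seq[:k]))
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0
```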

    7 Experiments

We investigate the performance of the ranking model with the proposed features on two distinct downstream tasks: sentiment analysis (SA) and dependency parsing (DEP).

7.1 Baseline

Lin et al. (2019) We briefly describe the 13 features used in Lin et al. (2019) to train the ranking model. The dataset sizes of the transfer language (tf size) and the target language (tg size), as well as the ratio between the two (ratio size), are included. The type-token ratio (TTR) is a measure of lexical diversity, defined as the ratio between the number of unique words and the number of tokens. word overlap measures the lexical similarity between a pair of languages. The other features are various types of distances between a pair of languages queried from the URIEL package (Littell et al., 2017): geographic (geo), genetic (gen), inventory (inv), syntactic (syn), phonological (phon), and featural (feat), adopted from linguistic databases such as WALS (Dryer and Haspelmath, 2013), Glottolog (Hammarström et al., 2020), and PHOIBLE (Moran and McCloy, 2019).

Lin et al. (2019) – TTR Prior work suggests that the type-token ratio (TTR) encodes a significant amount of cultural information (Richards, 1987). Therefore, to examine more precisely the cultural information contained in each pragmatically-inspired feature and its contribution to performance, we exclude TTR from the 13 features introduced in Lin et al. (2019) and set this as another baseline.

7.2 Individual Feature Contribution

We added the three pragmatically-inspired features one by one on top of the Lin et al. (2019) – TTR baseline, as shown in Table 2. We also compare these results with the baseline and with the baseline plus all three pragmatically-inspired features (ALL).

The results show that adding an individual pragmatically-inspired feature always improved

                          SA            DEP
                        MAP   NDCG    MAP   NDCG
Lin et al. (2019) – TTR 55.6  84.8    46.2  82.3
  + LCR                 50.4  86.5    43.8  80.9
  + LTQ                 55.6  86.6    45.1  81.9
  + ESD                 53.8  84.9    44.4  80.8
Lin et al. (2019)       53.5  86.5    46.5  82.2
Lin et al. (2019) + ALL 57.3  90.9    43.4  80.5

Table 2: Evaluation results on sentiment analysis and dependency parsing. When the proposed features are added to the baseline, performance improves for sentiment analysis (SA) but not for dependency parsing (DEP).

the baseline in either MAP or NDCG for sentiment analysis. In contrast, for dependency parsing, the pragmatically-inspired features degraded performance in most cases. In particular, when TTR was excluded from the baseline, a slight improvement in performance was observed. These contrasting results indicate that the pragmatic features capture additional information that helps sentiment analysis but disturbs tasks distant from pragmatics, exemplified by DEP in our case.

    7.3 Group-wise Contribution

As shown in the previous experiment, the same pragmatic information can be helpful to different extents depending on the downstream task. We further investigate what kind of information aids each task by conducting group-wise comparisons. To this end, we group the features into five categories: Data-specific, Typology, Geography, Orthography, and Pragmatic. The Data-specific features are tf size, tg size, and ratio size. The Typology features are the gen, syn, feat, phon, and inv distances. The Geography feature is the geo distance in isolation. The Orthography feature is the word overlap between languages. Finally, the Pragmatic group consists of TTR and the three proposed features: LCR, LTQ, and ESD.

Table 3 reports the performance of models trained with each feature category. Interestingly, the two tasks showed significantly different distributions: SA had the best performance with the Pragmatic group, while DEP had it with the Typology group. This again confirms that the features indicating cross-lingual transferability can differ depending on the target task. More surprisingly, in SA, using the Pragmatic features alone performed comparably to using all the features, reported in

                  SA            DEP
               MAP   NDCG    MAP   NDCG
Data-specific  50.7  85.4    15.6  55.0
Typology       17.4  60.7    39.2  79.8
Geography       5.7  55.0     9.7  65.1
Orthography    19.3  56.6    21.4  60.5
Pragmatic      58.7  88.0    23.6  71.8

Table 3: Evaluation results for the feature groups. The Pragmatic group played the most important role for sentiment analysis; for dependency parsing, the typological features were the most important.

Table 2 as Lin et al. (2019) + ALL.

    8 Analysis

The improvement in the ranking model's performance on sentiment analysis showed that the proposed features provide meaningful information. In this section, we provide a qualitative analysis with an example ranking prediction and show how the features relate to geographical distance.

8.1 Controlled experiment

The performance of cross-lingual transfer depends not only on the cultural similarity between the transfer and target languages but also on other factors, including dataset size and label distribution. To better understand the importance of cultural similarity in sentiment analysis, we conducted a controlled experiment: we fixed the dataset size and label distribution for all languages and extracted the optimal transferability rankings. Note that all data were down-sampled to match the size and label distribution of the second-smallest dataset, Turkish.¹¹ The rankings from the controlled experiment were then used to train two ranking models with different features: the 13 features from Lin et al. (2019) and the proposed 3 pragmatic features.

Table 4 shows the predicted and optimal relative rankings when the target language is Turkish. When Turkish is the target, Arabic, Japanese, and Korean are a particularly interesting subset of transfer languages. Korean and Japanese are similar both typologically and culturally. Turkish and Korean are typologically very similar, yet in cultural terms Turkish is more similar to Arabic. Therefore, we specifically focus on how the predicted rankings of these three languages differ

¹¹The performance of the smallest language (Tamil; 417 samples) was significantly low.

     Lin et al. (2019)   Pragmatic   Optimal
1    jpn                 ara         ara
2    ara                 jpn         kor
3    kor                 kor         jpn

Table 4: Relative ranking of the transfer languages Arabic, Japanese, and Korean when the target language is Turkish.

according to the features used to represent the language pair.

In the controlled setting, the relative optimal ranking of the three languages is Arabic, followed by Korean and Japanese. The optimal ranking indicates the important role of cultural resemblance, considering the rich historical relationship shared between Arabic- and Turkish-speaking communities. The model with pragmatic features was able to choose Arabic as the best transfer language, suggesting that the cultural similarity information imposed by the features helped the ranking model learn the cultural tie between the two languages. On the other hand, the baseline model of Lin et al. (2019) ranked Japanese the highest (over Arabic), possibly because its features focus on typological similarity over cultural similarity.

    8.2 Correlation with Geographical Distance

Regarding the clusters of languages in Figure 3a, some might suspect that geographic distance (geo) could substitute for the suggested pragmatic features. For instance, Korean and Japanese were the most relevant languages for Chinese in Figure 3a, which can also be explained by geographical proximity. Do our features add additional pragmatic information, or can they be subsumed by geographical distance?

To verify this, we evaluate Pearson's correlation coefficient between the pragmatic features and geographical distance. The most correlated feature, ESD, had a positive correlation (r = 0.4) with geographic distance. The least correlated feature was LCR-verb (r = 0.027); LTQ and LCR-pron correlated at −0.31 and 0.17, respectively. These results suggest that the pragmatic features contain extra information that cannot be entirely subsumed by geographic distance.

  • 9 Related Work

Auxiliary Language Selection in Multilingual Tasks There has been active work on leveraging multiple languages to improve cross-lingual systems (Neubig and Hu, 2018; Ammar et al., 2016). Adapting auxiliary-language datasets to the target-language task can be practiced through either language selection or data selection. Previous work on language selection mostly relied on leveraging syntactic or semantic resemblance between languages (e.g., n-gram overlap) to choose the best transfer languages (Zoph et al., 2016; Wang and Neubig, 2019). Meanwhile, work on data selection finds applicable samples in transfer languages that enhance performance on the target-language task (Wang and Neubig, 2019; Do and Gaspers, 2019), motivated by previous studies in domain adaptation (Ruder and Plank, 2017; Plank and van Noord, 2011). Our approach is an extension of the former, language-level selection, but focuses on pragmatic similarity, which has been left unexplored by previous studies.

Cross-lingual Sentiment Classification Cross-lingual sentiment classification (CLSC) has been studied primarily in low-resource settings. Traditional methods in CLSC often rely on machine translation systems (Wan, 2009) or bilingual resources (Barnes et al., 2018) to transfer resources from high- to low-resource languages. Approaches such as Chen et al. (2018) attempt to eliminate this need by introducing an adversarial network that promotes language-invariant features. Recent work on multilingual pretrained language models has facilitated seamless transfer by providing a universal vocabulary that supports more than a hundred languages (Devlin et al., 2019; Lample and Conneau, 2019). Many subsequent studies have examined the cross-lingual ability of these models (Wu and Dredze, 2019; Pires et al., 2019; K et al., 2019). Still, our work is the first to focus on aiding knowledge transfer in CLSC by operationalizing pragmatic knowledge.

    10 Conclusion

In this work, we propose three pragmatically-inspired features that can help determine the optimal transfer languages: the language context-level ratio, literal translation quality, and emotion semantic distance. Our features aim to capture linguistic patterns that indicate cultural similarities

between languages, and our analyses confirm that they correlate well with the existing literature. Experimental results show that appending these features to a transfer-language ranking model can significantly improve performance in sentiment analysis, though not as much in dependency parsing. These results suggest the importance of pragmatic information for sentiment-related tasks, and we expect to see even greater performance gains on more pragmatically-driven tasks such as hate speech detection and sarcasm identification. We leave this exploration for future work.

References

Željko Agić, Anders Johannsen, Barbara Plank, Héctor Martínez Alonso, Natalie Schluter, and Anders Søgaard. 2016. Multilingual projection for parsing truly low-resource languages. Transactions of the Association for Computational Linguistics, 4:301–312.

Wasi Uddin Ahmad, Zhisong Zhang, Xuezhe Ma, Eduard H. Hovy, Kai-Wei Chang, and Nanyun Peng. 2018. Near or far, wide range zero-shot cross-lingual dependency parsing. CoRR, abs/1811.00570.

Waleed Ammar, George Mulcaire, Miguel Ballesteros, Chris Dyer, and Noah A. Smith. 2016. Many languages, one parser. Transactions of the Association for Computational Linguistics, 4:431–444.

Jeremy Barnes, Roman Klinger, and Sabine Schulte im Walde. 2018. Bilingual sentiment embeddings: Joint projection of sentiment across languages. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2483–2493, Melbourne, Australia. Association for Computational Linguistics.

Christopher J. Burges, Robert Ragno, and Quoc V. Le. 2007. Learning to rank with nonsmooth cost functions. In B. Schölkopf, J. C. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 193–200. MIT Press.

Rafael Monroy Casas and J. M. Hernández Campoy. 1995. A sociolinguistic approach to the study of idioms: Some anthropolinguistic sketches. Cuadernos de Filología Inglesa, 4.

Xilun Chen, Yu Sun, Ben Athiwaratkun, Claire Cardie, and Kilian Weinberger. 2018. Adversarial deep averaging networks for cross-lingual sentiment classification. Transactions of the Association for Computational Linguistics, 6:557–570.

Béatrice Daille. 1994. Approche mixte pour l'extraction automatique de terminologie: statistiques lexicales et filtres linguistiques. Ph.D. thesis, Université Paris 7.


Amit Das and Mark Hasegawa-Johnson. 2015. Cross-lingual transfer learning during supervised training in low resource scenarios. In Sixteenth Annual Conference of the International Speech Communication Association.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Quynh Do and Judith Gaspers. 2019. Cross-lingual transfer learning with data selection for large-scale spoken language understanding. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1455–1460, Hong Kong, China. Association for Computational Linguistics.

Timothy Dozat and Christopher D. Manning. 2016. Deep biaffine attention for neural dependency parsing. CoRR, abs/1611.01734.

Matthew S. Dryer and Martin Haspelmath, editors. 2013. WALS Online. Max Planck Institute for Evolutionary Anthropology, Leipzig.

Dirk Goldhahn, Thomas Eckart, and Uwe Quasthoff. 2012. Building large monolingual dictionaries at the Leipzig corpora collection: From 100 to 200 languages. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 759–765, Istanbul, Turkey. European Language Resources Association (ELRA).

Edward Twitchell Hall. 1989. Beyond Culture. Anchor.

Harald Hammarström, Robert Forkel, Martin Haspelmath, and Sebastian Bank. 2020. Glottolog 4.2.1. Max Planck Institute for the Science of Human History.

Joshua Conrad Jackson, Joseph Watts, Teague R. Henry, Johann-Mattis List, Robert Forkel, Peter J. Mucha, Simon J. Greenhill, Russell D. Gray, and Kristen A. Lindquist. 2019. Emotion semantics show both cultural variation and universal structure. Science, 366(6472):1517–1522.

Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS), 20(4):422–446.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351.

Karthikeyan K, Zihan Wang, Stephen Mayhew, and Dan Roth. 2019. Cross-lingual ability of multilingual BERT: An empirical study.

Jihen Karoui, Farah Benamara, Véronique Moriceau, Viviana Patti, Cristina Bosco, and Nathalie Aussenac-Gilles. 2017. Exploring the impact of pragmatic phenomena on irony detection in tweets: A multilingual corpus study. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 262–272, Valencia, Spain. Association for Computational Linguistics.

Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. 2017. LightGBM: A highly efficient gradient boosting decision tree. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 3146–3154. Curran Associates, Inc.

Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. Advances in Neural Information Processing Systems (NeurIPS).

Guillaume Lample, Alexis Conneau, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2018. Word translation without parallel data. In International Conference on Learning Representations.

Xin Li, Lidong Bing, Wenxuan Zhang, and Wai Lam. 2019. Exploiting BERT for end-to-end aspect-based sentiment analysis. In Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019), pages 34–41, Hong Kong, China. Association for Computational Linguistics.

Yu-Hsiang Lin, Chian-Yu Chen, Jean Lee, Zirui Li, Yuyan Zhang, Mengzhou Xia, Shruti Rijhwani, Junxian He, Zhisong Zhang, Xuezhe Ma, Antonios Anastasopoulos, Patrick Littell, and Graham Neubig. 2019. Choosing transfer languages for cross-lingual learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3125–3135, Florence, Italy. Association for Computational Linguistics.

Patrick Littell, David R. Mortensen, Ke Lin, Katherine Kairis, Carlisle Turner, and Lori Levin. 2017. URIEL and lang2vec: Representing languages as typological, geographical, and phylogenetic vectors. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 8–14, Valencia, Spain. Association for Computational Linguistics.

Saif M. Mohammad, Mohammad Salameh, and Svetlana Kiritchenko. 2016. How translation alters sentiment. Journal of Artificial Intelligence Research, 55:95–130.


Steven Moran and Daniel McCloy, editors. 2019. PHOIBLE 2.0. Max Planck Institute for the Science of Human History, Jena.

Korac-Kakabadse Nada, Kouzmin Alexander, Korac-Kakabadse Andrew, and Savery Lawson. 2001. Low- and high-context communication patterns: towards mapping cross-cultural encounters. 8(2):3–24.

Graham Neubig and Junjie Hu. 2018. Rapid adaptation of neural machine translation to new languages. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 875–880, Brussels, Belgium. Association for Computational Linguistics.

Joakim Nivre, Mitchell Abrams, Željko Agić, et al. 2018. Universal Dependencies 2.2. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.

Ulrike Oster. 2019. Cross-cultural semantic and pragmatic profiling of emotion words. Regulation and expression of anger in Spanish and German. Current Approaches to Metaphor Analysis in Discourse, 39:35.

Michael Paul, Hirofumi Yamamoto, Eiichiro Sumita, and Satoshi Nakamura. 2009. On the importance of pivot language selection for statistical machine translation. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, pages 221–224, Boulder, Colorado. Association for Computational Linguistics.

Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In ACL.

Barbara Plank and Gertjan van Noord. 2011. Effective measures of domain similarity for parsing. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1566–1576, Portland, Oregon, USA. Association for Computational Linguistics.

Edoardo Maria Ponti, Roi Reichart, Anna Korhonen, and Ivan Vulić. 2018. Isomorphic transfer of syntactic structures in cross-lingual NLP. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1531–1542, Melbourne, Australia. Association for Computational Linguistics.

Peter Prettenhofer and Benno Stein. 2010. Cross-language text classification using structural correspondence learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1118–1127, Uppsala, Sweden. Association for Computational Linguistics.

Ye Qi, Devendra Sachan, Matthieu Felix, Sarguna Padmanabhan, and Graham Neubig. 2018. When and why are pre-trained word embeddings useful for neural machine translation? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 529–535, New Orleans, Louisiana. Association for Computational Linguistics.

Brian Richards. 1987. Type/token ratios: What do they really tell us? Journal of Child Language, 14(2):201–209.

Sebastian Ruder and Barbara Plank. 2017. Learning to select data for transfer learning with Bayesian optimization. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 372–382, Copenhagen, Denmark. Association for Computational Linguistics.

Tal Schuster, Ori Ram, Regina Barzilay, and Amir Globerson. 2019. Cross-lingual alignment of contextual word embeddings, with applications to zero-shot dependency parsing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1599–1613, Minneapolis, Minnesota. Association for Computational Linguistics.

Bernard J. Siegel. 1977. Encyclopedia of Anthropology. David E. Hunter and Phillip Whitten, eds. New York. American Anthropologist, 79(2):452–454.

Laura Smith, Salvatore Giorgi, Rishi Solanki, Johannes Eichstaedt, H. Andrew Schwartz, Muhammad Abdul-Mageed, Anneke Buffone, and Lyle Ungar. 2016. Does 'well-being' translate on Twitter? In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2042–2047.

Samuel L. Smith, David H. P. Turban, Steven Hamblin, and Nils Y. Hammerla. 2017. Offline bilingual word vectors, orthogonal transformations and the inverted softmax. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net.

Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang. 2019. How to fine-tune BERT for text classification? In Chinese Computational Linguistics, pages 194–206, Cham. Springer International Publishing.

Yulia Tsvetkov and Shuly Wintner. 2010. Extraction of multi-word expressions from small parallel corpora. In Coling 2010: Posters, pages 1256–1264, Beijing, China. Coling 2010 Organizing Committee.

Jelena Vulanović. 2014. Cultural markedness and strategies for translating idiomatic expressions in the epic poem "The Mountain Wreath" into English. Mediterranean Journal of Social Sciences, 5(13):210.


Xiaojun Wan. 2009. Co-training for cross-lingual sentiment classification. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 235–243, Suntec, Singapore. Association for Computational Linguistics.

Xinyi Wang and Graham Neubig. 2019. Target conditioned sampling: Optimizing data selection for multilingual neural machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5823–5828, Florence, Italy. Association for Computational Linguistics.

Xinyi Wang, Yulia Tsvetkov, and Graham Neubig. 2020. Balancing training for multilingual neural machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.

Shijie Wu and Mark Dredze. 2019. Beto, bentz, becas: The surprising cross-lingual effectiveness of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 833–844, Hong Kong, China. Association for Computational Linguistics.

Hu Xu, Bing Liu, Lei Shu, and Philip Yu. 2019. BERT post-training for review reading comprehension and aspect-based sentiment analysis. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2324–2335, Minneapolis, Minnesota. Association for Computational Linguistics.

Xinjie Zhou, Xiaojun Wan, and Jianguo Xiao. 2016. Cross-lingual sentiment classification with bilingual document representation learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1403–1412, Berlin, Germany. Association for Computational Linguistics.

Barret Zoph, Deniz Yuret, Jonathan May, and Kevin Knight. 2016. Transfer learning for low-resource neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1568–1575, Austin, Texas. Association for Computational Linguistics.


A Supplemental Material

    A.1 Dataset for Sentiment Analysis

Dataset                               Language   Domain        Size   POS/NEG
SemEval-2016 Aspect Based             Chinese    electronics   2333   1.53
Sentiment Analysis¹²                  Arabic     hotel         4111   1.54
                                      English    restaurant    1472   2.14
                                      Dutch      restaurant    1089   1.43
                                      Spanish    restaurant    1396   2.82
                                      Russian    restaurant    2289   3.81
                                      Turkish    restaurant     907   1.32
SentiPers¹³                           Persian    product       3904   1.8
Amazon Customer Reviews               French     product      20771   8.0
                                      German     product      56333   6.56
                                      Japanese   product      21095   8.05
CSFD CZ¹⁴                             Czech      movie        54540   1.04
Naver Sentiment Movie Corpus¹⁵        Korean     movie        18000   1.0
Tamil Movie Review Dataset¹⁶          Tamil      movie          417   0.48
PolEval 2017¹⁷                        Polish     product      26284   1.38
Aspect based Sentiment Analysis¹⁸     Hindi      product       2707   3.22

Table 5: Datasets for sentiment analysis.

¹²http://alt.qcri.org/semeval2016/task5/
¹³https://arxiv.org/ftp/arxiv/papers/1801/1801.07737.pdf
¹⁴http://nlp.kiv.zcu.cz/research/sentiment
¹⁵https://github.com/e9t/nsmc
¹⁶https://www.kaggle.com/sudalairajkumar/tamil-nlp
¹⁷http://clip.ipipan.waw.pl/PolEval?action=AttachFile&do=view&target=poleval-2017-task-1ab-gold-2.0-tei.tar.gz
¹⁸http://www.lrec-conf.org/proceedings/lrec2016/pdf/698_Paper.pdf


A.2 List of POS Taggers and Monolingual Corpora

Language   POS Tagger         Tokenizer
Arabic     RDR POS Tagger¹⁹   PyArabic²⁰
Chinese    Jieba²¹            Jieba
Danish     RDR POS Tagger     NLTK
Dutch      RDR POS Tagger     NLTK
Greek      RDR POS Tagger     NLTK
English    RDR POS Tagger     NLTK
French     RDR POS Tagger     NLTK
German     RDR POS Tagger     NLTK
Hindi      RDR POS Tagger     NLTK
Japanese   Kytea²²            Kytea
Korean     Mecab²³            Mecab
Persian    RDR POS Tagger     NLTK
Russian    RDR POS Tagger     NLTK
Spanish    RDR POS Tagger     NLTK
Tamil      RDR POS Tagger     NLTK
Turkish    RDR POS Tagger     NLTK

Table 6: List of POS taggers.

¹⁹https://github.com/datquocnguyen/RDRPOSTagger
²⁰https://github.com/linuxscout/pyarabic
²¹https://github.com/fxsjy/jieba
²²https://github.com/neubig/kytea
²³https://github.com/konlpy/konlpy/

