
Single and Combined Features for the Detection of Anglicisms in

German and Afrikaans

Bachelor's Thesis at Cognitive Systems Lab
Prof. Dr.-Ing. Tanja Schultz
Department of Informatics

Karlsruhe Institute of Technology

by

Sebastian Leidig

Supervisors:

Dipl.-Inform. Tim Schlippe
Prof. Dr.-Ing. Tanja Schultz

Date of registration: 01. November 2013
Date of completion: 31. January 2014

KIT – Universität des Landes Baden-Württemberg und nationales Forschungszentrum in der Helmholtz-Gemeinschaft www.kit.edu


Ich erkläre hiermit, dass ich die vorliegende Arbeit selbständig verfasst und keine anderen als die angegebenen Quellen und Hilfsmittel verwendet habe.

Karlsruhe, den 31. Januar 2014


Abstract

We develop, analyze and combine features for the automatic detection of Anglicisms included in German and Afrikaans text, which can improve automatic speech recognition, speech synthesis and other fields such as natural language processing.

To evaluate our methods, we collected and annotated two German word lists from different domains (IT, general news). We also applied our detection methods to an Afrikaans word list from the NCHLT corpus.

Our features are based on grapheme perplexity, grapheme-to-phoneme (G2P) confidence, Google hit counts as well as spell-checker dictionary and Wiktionary lookup. With our G2P confidence and Wiktionary features we introduce new approaches to detect Anglicisms. Comparing features based on English models and models of the matrix language allows us to refrain from determining thresholds in a supervised way. Furthermore, we do not rely on training data that needs to be expensively annotated – instead we use available resources like word lists and pronunciation dictionaries. Our best single feature is based on the G2P confidence, with an f-score of up to 70.39%.

Combining our features using a voting, decision tree or support vector machine (SVM) approach gives us further improvements, especially where the single features performed poorly. We achieve up to 44% relative improvement in f-score on our Afrikaans data. Our best result with a combination is an f-score of 75.44%.

Zusammenfassung

Wir entwickeln, analysieren und kombinieren Methoden zur automatischen Erkennung von Anglizismen in deutschem und afrikaansem Text. Die automatische Erkennung kann zur Verbesserung von Spracherkennern, Sprachsynthese und in anderen Bereichen für die Verarbeitung natürlicher Sprache eingesetzt werden.

Um unsere Ansätze zu evaluieren, haben wir deutsche Wortlisten aus unterschiedlichen Bereichen (IT, allgemeine Nachrichten) gesammelt und annotiert. Zusätzlich wurden unsere Erkennungsmethoden auf eine afrikaanse Wortliste des NCHLT-Korpus angewendet.

Unsere Methoden basieren auf Graphem-Perplexität, Graphem-zu-Phonem-Konfidenz (G2P), Google-Suchergebnissen sowie auf Rechtschreib-Wörterbüchern und Wiktionary. Mit der G2P-Konfidenz und dem Wiktionary-Ansatz entwickeln wir neue Methoden zur Erkennung von Anglizismen. Durch den Vergleich zwischen englischen und deutschen bzw. afrikaansen Modellen vermeiden wir, Schwellwerte überwacht zu trainieren. Zudem setzen wir nicht auf aufwändig annotierte Trainingsdaten, sondern nutzen Ressourcen wie Wortlisten oder Aussprachewörterbücher. Unser erfolgreichster Ansatz mit einem F-Score von bis zu 70.39% basiert auf der G2P-Konfidenz.

Die Kombination unserer Methoden durch ein Wahlverfahren, einen Entscheidungsbaum oder eine Support Vector Machine (SVM) bringt weitere Verbesserungen – insbesondere für die Daten, auf denen die einzelnen Methoden nur eine schlechte Erkennungsleistung lieferten. Wir erreichen eine relative Verbesserung des F-Score von bis zu 44% auf unseren afrikaansen Daten. Unsere beste Erkennungsleistung durch eine Kombination liefert einen F-Score von 75.44%.


Contents

1 Introduction
  1.1 Motivation and objectives
  1.2 Challenges
    1.2.1 Ambiguous words
    1.2.2 Closely related languages
    1.2.3 Variance between domains
  1.3 Contribution and outline

2 Basics
  2.1 Types of foreign words and Anglicisms
    2.1.1 Loan word
    2.1.2 Pseudo-Anglicism
    2.1.3 Hybrid word
    2.1.4 Other forms of language contact
      2.1.4.1 Code-Switching
      2.1.4.2 Loan translation
  2.2 Metrics
    2.2.1 Precision
    2.2.2 Recall
    2.2.3 F-Score

3 Related work
  3.1 Foreign word detection
  3.2 Named entity recognition

4 Test sets
  4.1 German
    4.1.1 Annotation guidelines
    4.1.2 Word list from the IT domain
    4.1.3 Word list from the news domain
  4.2 Afrikaans
    4.2.1 The NCHLT-af word list
  4.3 Statistics

5 Features for Anglicism detection
  5.1 Grapheme Perplexity Feature
    5.1.1 Grapheme-level models
    5.1.2 Perplexity
    5.1.3 Training grapheme-level language models
    5.1.4 Detection based on absolute perplexity threshold
    5.1.5 Detection based on perplexity difference
    5.1.6 Results
  5.2 G2P Confidence Feature
    5.2.1 G2P conversion
    5.2.2 Phonetisaurus
    5.2.3 Pronunciation dictionaries
    5.2.4 Detection based on G2P confidence
    5.2.5 Results
    5.2.6 Effect of dictionary size
  5.3 Hunspell Lookup Features
    5.3.1 Hunspell spell-checker
    5.3.2 Detection based on Hunspell lookups
      5.3.2.1 English Hunspell lookup
      5.3.2.2 Matrix language Hunspell lookup
      5.3.2.3 Combination of lookup features
    5.3.3 Results
  5.4 Wiktionary Lookup Feature
    5.4.1 Wiktionary
    5.4.2 Detection based on Wiktionary
    5.4.3 Results
  5.5 Google Hit Counts Feature
    5.5.1 Detection based on Google hit counts
    5.5.2 Estimated size of the web corpus
    5.5.3 Results
  5.6 Performance gap between German test sets
  5.7 Summary

6 Comparison and combination of features
  6.1 Combination of features
    6.1.1 Voting
      6.1.1.1 Results
    6.1.2 Decision Tree
      6.1.2.1 Results
      6.1.2.2 Decision Trees based on continuous features
    6.1.3 Support Vector Machine
      6.1.3.1 Results
  6.2 Results
    6.2.1 Abbreviations, hybrid and other foreign words

7 Conclusion
  7.1 Future work
    7.1.1 Improved handling of ambiguous words
    7.1.2 Additional features

Bibliography


1. Introduction

1.1 Motivation and objectives

As English is the prime tongue of international communication, English terms are widespread in many languages. This is particularly true for the IT sector but not limited to that domain. Anglicisms – i.e. words borrowed from English into another language (the so-called matrix language) – nowadays come naturally to most people, but this mix of languages poses a challenge to systems dealing with language or speech.

Automatic speech recognition (ASR) and speech synthesis systems need correct pronunciations for these words and names of English origin. Yet a grapheme-to-phoneme (G2P) model of the matrix language often does not give appropriate pronunciations for these words. Most people nowadays are fluent in English and pronounce Anglicisms according to their original pronunciation. Automatic detection of Anglicisms would enable us to use more adequate, English pronunciation rules to generate pronunciations for them. The improved pronunciations in this scenario result exclusively from better G2P rules and not from an adapted phoneme set. [Mansikkaniemi and Kurimo, 2012] show that adding pronunciation variants for foreign words to the pronunciation dictionary reduces the word error rate of their Finnish ASR system. The pronunciation variants are generated by a G2P model of a mix of foreign languages.

Another possible application of Anglicism detection is the field of natural language processing. [Alex, 2008a] shows that different parsers profit from the detection of foreign words.

In this thesis we develop new methods to automatically detect Anglicisms in word lists of different matrix languages and advance existing approaches. The term matrix language designates the main language of a text, from which we try to distinguish the inclusions of English origin. We evaluated our features on the two matrix languages German and Afrikaans. However, our methods can easily be adapted to new matrix languages.

In our scenario we receive single words from texts in the matrix language as input. As output we produce a classification between the classes English and native.

Figure 1.1: Overview of the Anglicism detection system

While some related work heavily relies on information about the word context, like part-of-speech tags, in this work we concentrate on context-independent features and use only the single word by itself to classify it. This gives some extra flexibility, as the input does not have to be a well-formed sentence. Other features which use information about the word context to detect Anglicisms could still easily be integrated in future work.

We leverage different sources of expert knowledge and unannotated training text to create features that are mostly language-independent and cheap to set up. By developing features based on commonly available training data such as unannotated word lists, spell-checker or pronunciation dictionaries, we avoid the expensive step of hand-annotating Anglicisms in large amounts of training data. This also enables us to use available frameworks for the implementation of our approaches (e.g. tools for grapheme or G2P models).

To be adapted to a specific matrix language, some of our features require more expensive resources like pronunciation dictionaries of that language. For English and many other languages, pronunciation dictionaries are available. To account for scenarios where a pronunciation dictionary is not available or of poor quality, we also evaluated our G2P-dependent feature in simulated situations with few resources.

1.2 Challenges

1.2.1 Ambiguous words

When a word of English origin is used, speakers may adapt the word to their language. Grammatical inflections may be applied, especially for verbs (e.g. in German: "to download" -> "downloaden"). German also uses many compound words, which can consist of German and English parts (e.g. "Shuttleflug").


These hybrid words have properties of English as well as of the matrix language German. The definition of these words as either English or native is somewhat ambiguous. We marked them as English in our reference annotation because for applications in speech it would be important to also find hybrid English words, which contain at least one English part.

Naturally, classification of such hybrid words is especially challenging and is the cause of many of our classification errors. We have annotated those hybrid words in our German test sets for detailed analysis. Methods of compound splitting and word stemming should be looked into in future work to improve the handling of these kinds of words.

1.2.2 Closely related languages

We find a lot of words with the same spelling in both English and the matrix language which are not borrowed from English (e.g. "Information", "Evolution", "Hand", ... between German and English). Apart from some exceptions, whose classification depends on the word context, such words are words of the matrix language. The portion of those words is significant: more than 12% of the words in the cleaned German GlobalPhone dictionary [Schultz et al., 2013] are also found in the English CMUdict [Carnegie Mellon University, 2007].

All languages we evaluate in this work – English, German and Afrikaans – are West Germanic languages [van der Auwera and König, 1994] and therefore closely related to each other. Distinction between these languages can be expected to be more difficult than between languages with less common origins.

1.2.3 Variance between domains

The performance of Anglicism detection depends on the domain of the test set. We find a major performance gap between our German test sets, similar to the observations of [Alex, 2005]. In section 5.6 we analyze this in detail.

Our combination of features particularly improves the detection on those "difficult" domains.

1.3 Contribution and outline

We developed and evaluated a set of different features to detect English words in word lists:

• Grapheme perplexity

• G2P confidence

• Dictionary lookup

• Wiktionary lookup

• Google hit count


Those features were separately tuned and evaluated before we proceeded to combine them. For the combination we experimented with different methods, the simplest of which is sketched after this list:

• Voting

• Decision Tree

• Support Vector Machine (SVM)
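
To make the simplest of these combination schemes concrete, here is a minimal majority-voting sketch. The feature functions and their interface are hypothetical placeholders, not the exact implementation from chapter 6:

```python
def majority_vote(word, features):
    """Classify a word as English if more than half of the binary features vote English.

    `features` is a list of callables, each mapping a word to True (English) or
    False (native), e.g. the five single features listed above.
    """
    votes = sum(1 for feature in features if feature(word))
    return votes > len(features) / 2
```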

The remainder of this thesis is organized as follows:

Chapter 2 contains additional theoretical background about foreign word inclusions, evaluation metrics and the tools we used. In chapter 3, relevant related work is summarized. Chapter 4 presents information about the domain, characteristics and creation of our test data sets. The features and their required resources are described in detail in chapter 5, where we also present the analysis and results in each feature's section. Our approaches to combine them are shown in chapter 6, where we also compare the different features, their influence and the combined detection performance. Chapter 7 summarizes our work and gives an outlook on possible future work.


2. Basics

2.1 Types of foreign words and Anglicisms

As speakers of different languages interact – now more than ever – languages influence each other. While this has recently become very common in areas like information technology, it is not a new phenomenon.

The term Anglicism denotes the influence of the English language on other languages [Anderman and Rogers, 2005] and covers the phenomena briefly described in the following sections. In the following, we use embedded language (also called donor language in some work) to describe the language from which the influence originates (in the case of Anglicisms this is English), whereas matrix language (also recipient language) is the main language, which is influenced.

In order to improve pronunciation generation for Anglicisms after detecting them, our work focuses on the detection of the following categories of Anglicisms:

2.1.1 Loan word

A loan word is a lexical item borrowed from an embedded language. Unlike a code-switch, a loan word is integrated into the matrix language. Spelling and pronunciation can be adapted to the matrix language to a varying extent [Haspelmath, 2007].

2.1.2 Pseudo-Anglicism

Spelling and pronunciation of a pseudo-Anglicism in the matrix language are based on English. The word does not have the same (or any) meaning in English, though [Anderman and Rogers, 2005]. Examples of German pseudo-Anglicisms are "Beamer" (for English "projector") and "Handy" (for English "mobile phone").

2.1.3 Hybrid word

A hybrid word consists of parts originating from different languages [Matthews, 2007]. In general, words like "Aquaphobia" – from Latin "aqua" (water) and Greek "phobia" (fear) – are considered hybrid words. In this work we only deal with hybrid words that have an English part. Examples in German are compounds like "Schadsoftware" and grammatically conjugated forms of loaned English verbs like "downloaden".


2.1.4 Other forms of language contact

There are further phenomena of language contact, briefly described in the following. We deliberately do not detect code-switching in our work because it has different characteristics and our focus is on common Anglicisms rather than more elaborate code-switches. Neither do we deal with loan translations, as they do not need to be handled differently from native words in speech systems.

2.1.4.1 Code-Switching

Code-switching describes a speaker using partial sentences of an embedded language in the matrix language. In a code-switch the words are copied as they are from the embedded language and not adapted in structure or pronunciation [Alex, 2008a].

2.1.4.2 Loan translation

A loan translation is a word created in the matrix language by literal translation of a term. The expression and meaning are borrowed from the embedded language but the words themselves are native to the matrix language [Carstensen et al., 2001a]. Examples in English are "flea market" from the French "marché aux puces" or "loan word" from the German "Lehnwort".

2.2 Metrics

The ratio of English and native words is unbalanced in our test sets. There are far fewer English words because our objective is to detect Anglicisms included in text of a matrix language. F-score, which gives equal weight to precision and recall, is therefore a better measure than simple overall accuracy. The related work consistently reports f-score or precision and recall.

In the following sections we briefly define these metrics, which we use throughout this work.

2.2.1 Precision

Precision measures which part of the detected instances is in fact relevant:

\text{precision} = \frac{\sum \text{true positive}}{\sum \text{tested positive}}

where a true positive is an instance correctly detected as relevant and a tested positive is any instance detected as relevant [Powers, 2011].

For Anglicism detection, precision therefore is the portion of real Anglicisms among all words classified as Anglicism by the detection system:

\text{precision} = \frac{\sum \text{detected Anglicisms}}{\sum \text{all words detected as Anglicism}}


2.2.2 Recall

Recall measures which part of the relevant instances is correctly detected:

\text{recall} = \frac{\sum \text{true positive}}{\sum \text{positive}}

where a true positive again is an instance correctly detected as relevant and a positive is any instance that is relevant according to the reference annotation [Powers, 2011].

For Anglicism detection, recall therefore is the portion of Anglicisms correctly detected by the system among all Anglicisms in the test set:

\text{recall} = \frac{\sum \text{detected Anglicisms}}{\sum \text{all Anglicisms in the test set}}

2.2.3 F-Score

F-score considers both precision and recall [Powers, 2011]:

\text{f-score} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}

In its more general form, the F_\beta measure can put different emphasis on precision and recall using the parameter \beta. This would be relevant if, for example, detecting all Anglicisms is more important than avoiding the wrong detection of native words. There is a trade-off between precision and recall: a system tuned for broad detection usually returns more relevant instances (increasing recall) but also more irrelevant instances (decreasing precision).

When applying our Anglicism detection to generate pronunciation variants for English words, it might be useful to give a higher weight to either precision or recall. If precision has a higher weight, the addition of wrong and confusing pronunciation variants for native words is avoided. On the other hand, if recall has a higher weight, more words are detected and the necessary pronunciation variants for most of the Anglicisms can be added. The best choice for this remains to be examined in future work.

To evaluate the general detection performance and compare our results to the related work, we only use the traditional F1 score, which gives equal weight to precision and recall.
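
As a concrete reference, the following sketch computes all three metrics from binary gold and predicted labels. It is a direct transcription of the formulas above, not code from the thesis:

```python
def precision_recall_f1(gold, predicted):
    """Compute precision, recall and F1 for binary Anglicism labels (True = English)."""
    true_pos = sum(1 for g, p in zip(gold, predicted) if g and p)  # correctly detected Anglicisms
    tested_pos = sum(predicted)  # all words detected as Anglicism
    pos = sum(gold)              # all Anglicisms in the test set
    precision = true_pos / tested_pos if tested_pos else 0.0
    recall = true_pos / pos if pos else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```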


3. Related work

Different methods have been developed to detect foreign or English inclusions in text. [Ahmed, 2005] intends to improve text-to-speech by identifying the language of foreign inclusions and choosing the correct grapheme-to-phoneme models to generate pronunciations. [Alex, 2008a] focuses on the use of English inclusion detection to improve part-of-speech parsing and shows that an added foreign word detection component can improve parser quality. [Mansikkaniemi and Kurimo, 2012] add pronunciation variants for automatically detected foreign words; their Finnish automatic speech recognition system's word error rate was thereby reduced by up to 8.8% relative. Handling foreign words in a customized way is also relevant to IT products, and a simple approach to detect foreign words in the pronunciation dictionary and generate different pronunciations for them has already been patented [Alewine et al., 2011].

In addition to an overview of the work on foreign word detection (sometimes also referred to as "Foreign Entity Recognition" (FER)), we summarize some related work from the fields of "Named Entity Recognition" (NER) and language identification. While these have different objectives, approaches to detect specific kinds of words in NER or to classify languages give us some further concepts to consider.

3.1 Foreign word detection

Most groups use grapheme-level methods based on grapheme n-gram likelihoods. [Mansikkaniemi and Kurimo, 2012] focus on the effects of pronunciation variants on ASR and use a simple perplexity threshold, treating the 30% of words with the highest perplexity as foreign word candidates. More advanced methods compare probabilities of n-grams between models for different languages. [Jeong et al., 1999] and [Kang and Choi, 2002] compare syllable probabilities between a Korean and a foreign model and additionally extract the foreign word stem. [Ahmed et al., 2005] develop a "Cumulative Frequency Addition" that distinguishes between a number of different languages: n-gram frequencies within each language model and over all languages are calculated to classify a word. We adapt the comparison of n-gram probabilities to Anglicism detection for our "grapheme perplexity" feature.


[Kang and Choi, 2002] point out that specific resources can be used to set up different aspects of a detection system (e.g. an annotated treebank to train the parser and a foreign dictionary to do a lookup) rather than only using large amounts of annotated text for machine learning. We pick up this idea and only use easily available information sources for our features.

Another common approach to foreign word detection is a dictionary lookup. While [Andersen, 2005] finds that grapheme n-grams perform better than lexicon lookup, their combination gives the best results. [Alex, 2005] also uses a dictionary lookup to reduce the number of English word candidates before applying more costly features.

An innovative method is the comparison of the number of Google search results found for different languages [Alex, 2005]. We implement this method ourselves to compare results on our test sets and include it in our feature combination. The details of the approach are described in section 5.5.

[Kundu and Chandra, 2012] interpolate probabilities of grapheme and phoneme language models for English and Bangla. Their classification is based on a comparison between those probabilities. The phoneme sequences are generated with a G2P converter producing the pronunciations for Bangla and English transliterated words alike. Our G2P Confidence Feature uses a similar approach, also basing its classification on a combination of phoneme- and grapheme-level information. For our feature we compare probabilities of graphone-level models.

Detection performance is usually evaluated in terms of f-score. Results vary for the different methods and setups in the related work. [Kang and Choi, 2002] achieve 88.0% f-score detecting foreign transliterations in Korean. [Ahmed et al., 2005] reach 79.9% distinguishing between several language pairs. Detecting English inclusions in German text, [Alex, 2005]'s experiments are very similar to ours and give comparable results of up to 77.2% f-score.

3.2 Named entity recognition

Named Entity Recognition (NER) classifies person names, organization names and locations among other categories like times and dates. As this task is also about detecting special word categories – among the person, organization and location names even a lot of foreign words – we can take some ideas from this field of research for Anglicism detection as well.

Often the local context or specific trigger words are used for detection. Part-of-speech (POS) tags, capitalization and punctuation are also common features ([Munro et al., 2003], [Wolinski et al., 1995]).

[Baluja et al., 2000] train a decision tree to combine a diverse set of Boolean features.To combine our features, we apply a similar approach.

While [Bikel et al., 1997] work with word-based Hidden Markov Models (HMMs), [Klein et al., 2003] switch to character-level HMMs, thereby achieving a large error reduction. We also develop advanced character-level features such as Grapheme Perplexity or G2P Confidence at the graphone level.


4. Test sets

As no annotated German word lists for the task of Anglicism detection are freely available, we created two German word lists covering different domains. [Alex, 2005] finds that classification performance of their English word detection system varies widely between data from different domains. In order to also study the effect of the domain, we collected texts from different domains and created two independent word lists as test sets: one from the computer and software related domain (IT) and another from general news articles.

The following sections describe our annotation process and the domains of these word lists, Microsoft-de and Spiegel-de. We also analyse the composition of our two German word lists and the Afrikaans NCHLT-af word list.

We evaluate our Anglicism detection on these lists of unique words, not considering their frequency in the original texts.

4.1 German

4.1.1 Annotation guidelines

In our German word lists, English words and some additional word categories for further analysis were annotated. Section 2.1 defines the different types of language contact mentioned here. Like [Alex, 2005], we base our annotation on the agreement of annotators. In case of disagreement we consulted the well-known German dictionary Duden (www.duden.de) and checked the context in which the word occurred in the text. Annotation of the German word lists follows these guidelines:

English: All English words were tagged as "English". This comprises all types of words, including proper names and also pseudo-Anglicisms. Words that could be German as well as English (homomorph words) were not tagged as English (e.g. Admiral, Evolution, ...). Neither are loan translations tagged as English. Words that contain an English part (see Hybrid foreign word) were tagged as English because a monolingual German G2P model cannot generate correct pronunciations for those words.


Abbreviation: Abbreviations were tagged as "abbreviation". We did not distinguish between English and German abbreviations, as our focus is to detect Anglicisms. Therefore no abbreviations were tagged as English.

Other foreign word: Foreign words that are neither German nor English were tagged as "foreign". As we limit our algorithms to classify exclusively between the categories English and native (in this case German), these words fall into the native category.

Hybrid foreign word: Words that contain an English plus a German part were tagged as "hybrid" in addition to "English". This covers, for example, compound words with a German and an English part (e.g. "Schadsoftware") and grammatically conjugated forms of English verbs (e.g. "downloaden").

4.1.2 Word list from the IT domain

The IT domain word list Microsoft-de contains about 4.6k types crawled from the German website of Microsoft, www.microsoft.de. To reduce the effort of hand-annotating a vast amount of data, this word list only contains frequent types that occurred more than once in the crawled text.

Before extracting the types for our word list, some normalization and cleanup was performed on the crawled text (a sketch of these steps follows the list):

• Removed all HTML

• Normalized numbers into their written form

• Removed sentences containing more than 80% capital characters

• Replaced all punctuation including hyphens by spaces

• Removed words consisting only of numbers

• Removed single character words

• Removed words that only occurred once
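
A minimal sketch of this cleanup, assuming plain text that has already been stripped of HTML; the number normalization and the capital-letter sentence filter are omitted, and the regex and threshold are illustrative rather than the exact ones used:

```python
import re
from collections import Counter

def extract_types(text, min_count=2):
    """Normalize text and return the sorted list of frequent unique words (types)."""
    # Replace all punctuation including hyphens by spaces.
    text = re.sub(r"[^\w\s]|_", " ", text)
    counts = Counter(text.split())
    return sorted(
        word for word, count in counts.items()
        if count >= min_count            # drop words occurring only once
        and not word.isdigit()           # drop words consisting only of numbers
        and len(word) > 1                # drop single-character words
    )
```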

4.1.3 Word list from the news domain

The general news domain word list Spiegel-de contains about 6.6k types from 35 articles covering the domain of German political and business news. The texts were taken manually from the website of the German news journal Spiegel, www.spiegel.de, rather than crawled automatically, to keep the word list clean of advertisements, user comments and other unwanted content.

Before extracting the types for our word list, some normalization was performed on the collected text:

• Replaced all punctuation including hyphens by spaces

• Removed words consisting only of numbers

• Removed single character words


4.2 Afrikaans

4.2.1 The NCHLT-af word list

The NCHLT-af word list contains about 9.4k types taken from the Afrikaans part of the NCHLT corpus [Heerden et al., 2012], which contains text collections in the eleven official languages of South Africa. In our Afrikaans test set, English and foreign words as well as abbreviations were annotated by [Basson and Davel, 2013] for their work. The authors kindly provided this annotated word list for our experiments.

4.3 Statistics

                                    Microsoft-de   Spiegel-de   NCHLT-af
Tokens in corpus (running words)       86,213        26,097         -
Types in test set (unique words)        4,558*        6,596       9,380
English words                             686           260         198
Other foreign words                       107           160          49
Abbreviations                             180            58         298

Table 4.1: Number of words in different categories (* frequent words occurring >1; NCHLT-af was only provided to us as a list of unique words)

Table 4.1 shows the number of words in the original texts as well as the composition of the different word categories in the word lists. This is also illustrated in figure 4.1.

Figure 4.1: Foreign words in different word lists

Especially in the IT domain we find many foreign words and abbreviations: they make up more than 21% of the Microsoft-de word list, where 15% of all words are English. In the general news domain (Spiegel-de) we find only approximately 4% English words. This large difference in terms of English words has a big influence on the Anglicism detection performance, which is analyzed in section 5.6.


About 10% of the English words in our German word lists from each domain are "hybrid" words, consisting of German and English parts (e.g. "Schadsoftware"), as shown in table 4.2.

                          Microsoft-de   Spiegel-de
English types                  686           260
Included hybrid types     70 (10.20%)   29 (11.15%)

Table 4.2: Number of English hybrid words


5. Features for Anglicism detection

To detect Anglicisms we advanced existing methods and developed entirely new features. In addition to the evaluation of new approaches, an important goal was an inexpensive setup and the portability to new languages.

In the following, we use the term matrix language for the main language of the text in which we detect inclusions of English as the embedded language.

In contrast to standard supervised machine learning, our features do not rely on training data that is annotated specifically for the task of Anglicism detection. The test sets presented in chapter 4 are only used for the evaluation of our single features and never for their training. Instead we use common resources like word lists and pronunciation or spell-checker dictionaries. The only exception is our feature combinations, described in chapter 6, which are trained through cross-validation on the test sets.

To avoid supervised training of thresholds, we often base our classification on the difference between results calculated on an English model and a model of the matrix language. This also improves detection performance significantly.

The innovative approach based on grapheme-to-phoneme (G2P) conversion confidence is one of our most successful features, as shown in section 5.2. Our novel Wiktionary Lookup Feature especially improved the feature combination, as described in chapter 6.

5.1 Grapheme Perplexity Feature

5.1.1 Grapheme-level models

A grapheme is the smallest semantically distinguishing unit in a written language [Duden, 2014]. Grapheme-level detection of foreign words is based on the assumption that common grapheme sequences differ depending on the language. For example, in our word lists English words often end with "cs" or "ly", while German words often end with "ln" or "en".


Figure 5.1: Overview of the Anglicism detection system

A grapheme (or character) n-gram is a sequence of n graphemes. Grapheme-level language models are trained from lists of training words. These models are a statistical representation of the grapheme n-grams over all training words. In addition to the graphemes, word boundary symbols are included to specifically identify the grapheme n-grams at the beginning and end of words.
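
To illustrate, a minimal sketch of grapheme n-gram extraction with explicit word boundary symbols; the symbol names are illustrative (language model toolkits typically add their own boundary markers):

```python
def grapheme_ngrams(word, n=5):
    """All grapheme n-grams of a word, padded with word boundary symbols so that
    the n-grams at the beginning and end of the word are identified specifically."""
    symbols = ["<s>"] + list(word.lower()) + ["</s>"]
    return [tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1)]

# e.g. grapheme_ngrams("Handy", n=3)[0] == ("<s>", "h", "a")
```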

We used the SRI Language Modeling Toolkit (SRILM) [Stolcke, 2002] to build our models; it is freely available at http://www.speech.sri.com/projects/srilm/.

The detection based on grapheme n-gram models deals well with conjugations and small variations of words. Unknown forms of a word can still be recognized because the overall grapheme sequences stay similar. Therefore many works in the fields of Named Entity Recognition and Foreign Entity Recognition are based on grapheme n-grams ([Klein et al., 2003], [Mansikkaniemi and Kurimo, 2012], [Jeong et al., 1999], [Kang and Choi, 2002], [Ahmed et al., 2005]).

5.1.2 Perplexity

Perplexity is a measure used to compare probability models. It is defined as a function of the cross-entropy H in [Indurkhya and Damerau, 2010]:

PPL(w) = 2^{H(w)}

with the cross-entropy

H(w) = -\frac{1}{|w|} \log_2 \Pr(w) = -\frac{1}{|w|} \log_2 \big( \Pr(c_1 \mid \text{<s>}) \, \Pr(c_2 \mid \text{<s>} c_1) \, \Pr(c_3 \mid \text{<s>} c_1 c_2) \cdots \big)

for a grapheme-level language model with respect to a sample word w with a length of |w| characters. c_n denotes the n-th character of the word w and <s> represents the beginning word boundary.

If a word w fits the model well, it has a higher probability Pr(w), a more negative cross-entropy H(w) and therefore a lower perplexity, because the cross-entropy stands in the exponent.

Like [Mansikkaniemi and Kurimo, 2012] we chose perplexity for our feature because it is normalized with respect to the word length. Our Anglicism detection based on perplexity difference, as described in section 5.1.5, achieves similar results with log probabilities.
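
To make the definition concrete, a small sketch that computes the perplexity of a word from per-grapheme log2 probabilities, as a language model toolkit would report them:

```python
def word_perplexity(log2_probs):
    """Perplexity of a word given log2 P(c_n | history) for each of its graphemes.

    PPL(w) = 2^H(w) with H(w) = -(1/|w|) * log2 Pr(w).
    """
    cross_entropy = -sum(log2_probs) / len(log2_probs)
    return 2.0 ** cross_entropy

# e.g. word_perplexity([-2.0, -1.0, -3.0]) == 4.0
```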

5.1.3 Training grapheme-level language models

We experimented with different training word lists and parameters to build grapheme-level language models. The best detection performance was achieved using models built from lists of unique, lowercased training words with an n-gram order of 5. The training word lists are based on the words in the CMU Pronouncing Dictionary (CMUdict) [Carnegie Mellon University, 2007] for the English model and the German GlobalPhone dictionary [Schultz et al., 2013] for the German model. The Afrikaans model was trained on words crawled from the Afrikaans news website www.rapport.co.za. Table 5.1 gives details about the source and amount of training data for our models.

Language model   Training source                       Training words
English          CMUdict words                         116k
German           GP-de dict. words                     37k
Afrikaans        crawled types (www.rapport.co.za)     27k

Table 5.1: Training words for grapheme-level language models

The training word lists contained one word per line, with spaces separating the individual characters in order to train at the grapheme level.
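
A sketch of this preparation together with the SRILM training call; the file names are placeholders and smoothing options are omitted, so consult the SRILM documentation for a production setup:

```python
import subprocess

def write_grapheme_corpus(words, path):
    """One word per line, lowercased, with spaces between the characters, so that
    SRILM treats each grapheme as a token and each word as a sentence
    (adding its own <s>/</s> word boundary symbols)."""
    with open(path, "w", encoding="utf-8") as f:
        for word in sorted(set(words)):
            f.write(" ".join(word.lower()) + "\n")

# Example training words; in our setup these come from e.g. the GP-de dictionary.
training_words = ["gehen", "Haus", "Wortliste"]
write_grapheme_corpus(training_words, "graphemes-de.txt")

# Train a 5-gram grapheme-level model with SRILM's ngram-count tool.
subprocess.run(["ngram-count", "-text", "graphemes-de.txt",
                "-order", "5", "-lm", "graphemes-de.lm"], check=True)
```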

5.1.4 Detection based on absolute perplexity threshold

Our first approach to detect Anglicisms is based on an absolute threshold T for the perplexity. The same method is used by [Mansikkaniemi and Kurimo, 2012]. The perplexity is calculated on a grapheme-level model of the matrix language. All words with a perplexity higher than T are classified as English, all words with a lower perplexity than T are classified as native. The underlying assumption is that in a model of the matrix language the grapheme sequence of an English word is less likely than the grapheme sequence of a word from the matrix language. Accordingly, the perplexity of English words should be higher than the perplexity of native words.

The approach has some shortcomings:

First, we need to train a suitable value for the threshold T in a supervised way. This requires additional word lists for training in which all Anglicisms are annotated. Our oracle experiments in figure 5.2 show that the optimal T varies for the test sets from different domains. Therefore the threshold would have to be trained for each language and domain individually.

Figure 5.2: Detection performance in relation to an absolute perplexity threshold

More importantly, with such a threshold we cannot exclusively detect English words but rather detect "uncommon" words more generally. Non-English foreign words and especially abbreviations also have a high perplexity and are therefore misclassified as English words. While we reach decent results of up to 73.36% f-score for the detection of all foreign words including abbreviations with this method, the best result when detecting only Anglicisms falls to 55.94% f-score.

As our final Grapheme Perplexity Feature for Anglicism detection we therefore developed a method based on the perplexity difference, described in the next section, which mitigates the aforementioned problems.

5.1.5 Detection based on perplexity difference

As described in the preceding section, Anglicism detection with an absolute perplexity threshold poses some challenges. Our approach of using the perplexity difference between two models allows:

• Unsupervised classification based on a direct comparison of perplexities for an English model and a model of the matrix language

• Focused detection of Anglicisms instead of a broad recognition of uncommon words

Figure 5.3 depicts the steps of our Grapheme Perplexity Feature:


Figure 5.3: Classification with the grapheme perplexity difference

1. Preparation: training of grapheme-level language models from training word lists for English and the matrix language

2. Calculation of the perplexity on the English model and the model of the matrix language for a word from the test set

3. Comparison of the two perplexities and classification towards the model whose perplexity is lower for the word

The feature uses the difference of the English and matrix language perplexities. We calculate

d = ppl_{matrix lang.}(w) - ppl_{English}(w)

and classify a word w as English if the difference d is greater than zero. We generically assume a threshold of zero, which leads to a simple comparison of which perplexity is smaller.
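
A minimal sketch of this comparison, assuming helpers that score a word against one grapheme-level model each (e.g. by calling SRILM's ngram tool on it):

```python
def classify_by_ppl_difference(word, ppl_matrix, ppl_english, threshold=0.0):
    """Return True (English) if d = PPL_matrix(w) - PPL_English(w) exceeds the threshold.

    ppl_matrix and ppl_english are callables that return the perplexity of the
    word under the respective grapheme-level language model; with the default
    threshold of zero this reduces to checking which model fits the word better.
    """
    d = ppl_matrix(word) - ppl_english(word)
    return d > threshold
```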

This zero threshold is not an optimal choice, as shown in table 5.2. We still make this trade-off to refrain from supervised training of a better threshold. Figure 5.4 illustrates the f-scores in relation to the perplexity difference threshold in detail. The different optimal thresholds seem to be related to the portion of Anglicisms in the test set: Microsoft-de contains almost four times as many Anglicisms as Spiegel-de. The general performance gap between the two German domains is analyzed in section 5.6.

The classification of a word w should only depend on the word itself and the model. A further normalization by standard score (z-score) over the perplexities of all words of the test set led to worse results.


Test set        Threshold = 0    Optimal threshold
                F-score          F-score    Threshold
Microsoft-de    67.17%           68.56%     0.5
Spiegel-de      36.00%           45.61%     6.5
NCHLT-af        25.75%           29.87%     2.5

Table 5.2: Detection performance with different thresholds for the grapheme perplexity difference

Figure 5.4: F-score in relation to perplexity difference threshold

5.1.6 Results

The results of the final version of our Grapheme Perplexity Feature, described in the previous section, are shown in table 5.3 for the different test sets.

We achieve a good recall, for which the Grapheme Perplexity Feature is one of our best features. This means most Anglicisms in the test sets are indeed detected. Precision is considerably lower, indicating that the feature wrongly detects a lot of words which are not Anglicisms.

The different performance between the two German domains Microsoft-de and Spiegel-de is caused by the different portions of Anglicisms in the test sets. It is further analyzed in section 5.6.

Test set        F-score    Precision    Recall
Microsoft-de    67.17%     55.85%       84.26%
Spiegel-de      36.00%     22.73%       86.54%
NCHLT-af        25.75%     15.22%       83.50%

Table 5.3: Performance of the Grapheme Perplexity Feature

5.2 G2P Confidence Feature

5.2.1 G2P conversion

Grapheme-to-Phoneme (G2P) conversion tries to predict a word's pronunciation based on its spelling. State-of-the-art G2P converters train probabilistic models with a pronunciation dictionary that contains words alongside their correct pronunciation. From this, the G2P relations of a language can be learned and afterwards applied to words whose pronunciation is unknown.

Naturally, these relations are language-specific. Some languages like Spanish or Italian have a consistent G2P correspondence, and pronunciations are thus straightforward to generate. English pronunciations are much harder to predict because of many inconsistencies [Novak et al., 2012].

5.2.2 Phonetisaurus

We use Phonetisaurus [Novak et al., 2012], an open source WFST-based G2P conversion toolkit, for our experiments.

Phonetisaurus takes the following steps to predict pronunciations:

1. Alignment of graphemes and phonemes in the training dictionary (creating graphones)

2. Training of a graphone-level language model

3. Prediction of pronunciations for novel words

In the alignment step, graphemes are combined with the phonemes from the corresponding pronunciation. We call the resulting grapheme-phoneme clusters graphones, as the term is used in other literature [Bisani and Ney, 2008].

From all graphone sequences of the training dictionary, a graphone-level language model is then trained. The n-gram order of our graphone model is 7, which is the proposed default value in Phonetisaurus' documentation [Novak, 2012].

To predict pronunciations, Phonetisaurus searches for the shortest path in the G2P model which corresponds to the input grapheme sequence. As path cost, the graphones' negative log probabilities are summed up. This value can be seen as a confidence measure: it is used to rank different pronunciation variants. In our experiments, we find that this G2P confidence can also be compared between a word's pronunciation variants generated from different G2P models.

Sequitur G2P [Bisani and Ney, 2008], another G2P tool, has a different concept of confidence scores. Instead of a global score in relation to the whole model, Sequitur G2P calculates a probability distribution between the pronunciation variants of one word. A comparison of these probabilities between different models is not useful for our purpose of detecting Anglicisms.


5.2.3 Pronunciation dictionaries

To train a G2P model, a lot of example word-pronunciation pairs are needed. Fortunately, pronunciation dictionaries for many languages are available. For our G2P Confidence Feature we depend on a G2P model for English and one for the matrix language.

We use the CMU Pronouncing Dictionary (CMUdict) [Carnegie Mellon University, 2007] for English, the German GlobalPhone dictionary (GP-de) [Schultz et al., 2013] and the Afrikaans pronunciation dictionary (dict-af) created by [Engelbrecht and Schultz, 2005]. Table 5.4 shows the size of the dictionaries.

Pronunciation dictionary    Word-pronunciation pairs
CMUdict                     133,229
GP-de                       37,753
dict-af                     42,153

Table 5.4: Size of pronunciation dictionaries used for training

5.2.4 Detection based on G2P confidence

Our G2P Confidence Feature is conceptually similar to our Grapheme Perplexity Feature; we only compare scores for a word at the graphone level instead of the grapheme level. This added pronunciation information leads to a better distinction between Anglicisms and native words.

Figure 5.5: Classification with the G2P Confidence Feature

Figure 5.5 illustrates our steps to detect Anglicisms based on G2P confidence:

1. Preparation: training of G2P (graphone) models from English and matrix language pronunciation dictionaries


2. Prediction of pronunciation for a word from the test set

3. Comparison of the G2P confidences and classification towards the language for which the confidence is better

As described, we use Phonetisaurus for pronunciation prediction and rely on its confidence measure, the negative log probability of the graphone sequence. The G2P confidence of the first-best pronunciation for a word is used, while the generated pronunciation itself is discarded.

The feature uses the difference of the G2P confidences for English and the matrix language. We calculate

d = G2Pconf_{matrix lang.}(w) - G2Pconf_{English}(w)

and classify a word w as English if the difference d is greater than zero. We generically assume a threshold of zero, which leads to a simple comparison of which G2P confidence is smaller. Like for the grapheme perplexity difference, this is not an optimal choice, as table 5.5 shows. Again we make this trade-off to refrain from supervised training of a better threshold.
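
The same comparison at the graphone level, as a sketch. The helper g2p_confidence is a hypothetical wrapper around Phonetisaurus; its concrete invocation depends on the toolkit version and is therefore omitted:

```python
def g2p_confidence(model_path, word):
    """Hypothetical wrapper: run Phonetisaurus with the G2P model at model_path
    and return the path cost (negative log probability) of the first-best
    pronunciation; the generated pronunciation itself is discarded."""
    raise NotImplementedError

def classify_by_g2p_confidence(word, matrix_model, english_model, threshold=0.0):
    """Return True (English) if the English G2P model explains the word with a
    lower path cost than the matrix language model by more than the threshold."""
    d = g2p_confidence(matrix_model, word) - g2p_confidence(english_model, word)
    return d > threshold
```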

Test set        Threshold = 0    Optimal threshold
                F-score          F-score    Threshold
Microsoft-de    70.39%           71.40%     1.0
Spiegel-de      40.56%           45.00%     1.0
NCHLT-af        23.94%           40.23%     10.0

Table 5.5: Detection performance with different thresholds for the G2P confidence difference

For the two German test sets Microsoft-de and Spiegel-de, the optimal threshold is 1 in both cases. This depends on the language or the dictionary: for the Afrikaans test set NCHLT-af we only reach a good detection performance from a threshold of 7 or higher, as shown in figure 5.6.

5.2.5 Results

The results of our G2P Confidence Feature are shown in table 5.6. They are similar to the performance of our Grapheme Perplexity Feature described in section 5.1, with some improvements for the German test sets. An in-depth comparison of all features is done in section 5.6.

We achieve a good recall, for which the G2P Confidence Feature, together with the Grapheme Perplexity Feature, is our best feature. This means most Anglicisms in the test sets are indeed detected. The precision is considerably lower, indicating that the feature wrongly detects a lot of words which are not Anglicisms.


Figure 5.6: F-score in relation to the G2P confidence difference threshold

Test set        F-score    Precision    Recall
Microsoft-de    70.39%     59.44%       86.30%
Spiegel-de      40.56%     29.74%       83.91%
NCHLT-af        23.94%     14.21%       75.86%

Table 5.6: Performance of the G2P Confidence Feature

5.2.6 Effect of dictionary size

A pronunciation dictionary is expensive to create, and good quality resources may not be available for a language. With fewer word-pronunciation pairs for training, our G2P Confidence Feature's detection performance will obviously deteriorate because the underlying G2P model is less well defined. We simulated such resource-constrained situations to study the effect of the training dictionary size on the detection performance.

We selected random sets of word-pronunciation pairs from the German GlobalPhone pronunciation dictionary (GP-de) and evaluated detection performance for dictionary sizes of 200, 500, 1k, 5k and 10k entries. The English G2P model was always trained on the full CMUdict, as there is no problem obtaining this: the CMUdict is freely available, and the English model does not need to be replaced when adapting our system to a new matrix language.
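
A sketch of this simulation, assuming the dictionary is held as a list of (word, pronunciation) pairs; the seed and sizes are illustrative:

```python
import random

def dictionary_subsets(entries, sizes=(200, 500, 1_000, 5_000, 10_000), seed=0):
    """Yield (size, subset) pairs of randomly sampled word-pronunciation entries
    to simulate small training dictionaries."""
    rng = random.Random(seed)
    for size in sizes:
        yield size, rng.sample(entries, size)
```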

Table 5.7 shows the detection performance for our standard G2P Confidence Feature with the different dictionary sizes. The f-score keeps improving with a larger dictionary to train the German G2P model. Particularly with the fixed threshold of zero a large dictionary is necessary. For smaller G2P models the optimal threshold is much further from zero, as shown in figure 5.7.


Dictionary size    Microsoft-de                     Spiegel-de
(entries)          F-score   Precision   Recall     F-score   Precision   Recall
200                21.76%    13.37%      58.45%     14.22%    7.68%       96.55%
500                46.96%    31.37%      93.29%     16.90%    9.27%       96.17%
1k                 51.02%    35.36%      91.55%     19.04%    10.59%      93.87%
5k                 60.02%    45.15%      89.50%     25.82%    15.07%      90.04%
10k                64.26%    50.50%      88.34%     31.02%    18.79%      88.89%
full dict (37k)    70.39%    59.44%      86.30%     40.56%    26.74%      83.91%

Table 5.7: Detection performance with different dictionary sizes for the default threshold of zero

For a G2P model built from a small dictionary, the recall is very high, while the precision is extremely low. Most words' graphone sequences are not covered in such a G2P model. Accordingly, their probability is low, and in comparison many words appear more likely under the English G2P model.

Figure 5.7: F-score in relation to the G2P confidence difference threshold for different dictionary sizes on the Microsoft-de test set

5.3 Hunspell Lookup Features

5.3.1 Hunspell spell-checker

Hunspell¹ is an open source spell-checker and morphological analyzer used in software like OpenOffice. It supports complex compounding, morphological analysis and stemming. Word forms are recognized based on rules defined in the spell-checker dictionary of a language.

¹ hunspell.sourceforge.net


Hunspell spell-checker dictionaries are freely available for many languages, including English, German and Afrikaans. For our features we used these Hunspell resources: the American English dictionary (en_US), the "frami" version of the German dictionary (de_DE-frami) and the Afrikaans dictionary (af_ZA).

5.3.2 Detection based on Hunspell lookups

Our Hunspell Lookup Features simply check whether a word is found in the dictionary of the language. We use two independent features with this concept:

• English Hunspell Lookup and

• Matrix language Hunspell Lookup

Figure 5.8 illustrates the steps of our English Hunspell Lookup Feature. The lookup includes an automatic check whether the word in question can be derived by the morphological or compound rules in the dictionary.

Figure 5.8: Classification with the English Hunspell Lookup Feature

5.3.2.1 English Hunspell lookup

If the word is found in or derived from the English dictionary, it is classified as English. Any word not in the English dictionary is classified as native. This feature is language independent and can be used without modification for any matrix language.

5.3.2.2 Matrix language Hunspell lookup

The Matrix Language Hunspell Lookup Feature accordingly does a lookup in the spell-checker dictionary of the matrix language. In this case a word found in or derived from the matrix language dictionary is classified as native. Any word not in the matrix language dictionary is classified as English.
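Both lookups amount to a single spell-checker query per word. A minimal sketch using the pyhunspell bindings follows; the dictionary paths are system-dependent assumptions:

```python
import hunspell

en = hunspell.HunSpell('/usr/share/hunspell/en_US.dic',
                       '/usr/share/hunspell/en_US.aff')
de = hunspell.HunSpell('/usr/share/hunspell/de_DE_frami.dic',
                       '/usr/share/hunspell/de_DE_frami.aff')

def english_lookup(word):
    # English if found in (or derivable from) the English dictionary.
    return 'English' if en.spell(word) else 'native'

def matrix_lookup(word):
    # Native if found in (or derivable from) the matrix language dictionary.
    return 'native' if de.spell(word) else 'English'
```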

5.3.2.3 Combination of lookup features

Our two lookups, the Matrix Language Hunspell Lookup Feature and the English Hunspell Lookup Feature, are independently evaluated. Their classifications can disagree if a word is found in both dictionaries or in neither dictionary.

We also experimented with combinations of both features. We only classified a word as native if it was in the matrix language dictionary while not being in the English dictionary. All other words were classified as English. However, this did not lead to better Anglicism detection.


5.3.3 Results

Table 5.8 shows the performance of our English Hunspell Lookup and Matrix Language Hunspell Lookup Features.

Test set        English Hunspell Lookup          Matrix Language Hunspell Lookup
                F-score   Precision   Recall     F-score   Precision   Recall
Microsoft-de    63.65%    57.13%      71.87%     61.29%    58.09%      64.87%
Spiegel-de      39.17%    26.41%      75.77%     41.65%    35.50%      50.38%
NCHLT-af        31.90%    19.27%      92.50%     12.48%    6.71%       89.50%

Table 5.8: Detection performance of English and Matrix Language Hunspell Lookup

The Afrikaans spell-checker dictionary seems to be of poor quality, leading to weak detection performance: more than 25% of native Afrikaans words were not found in the dictionary, as shown in figure 5.9. On the other hand, the English Hunspell Lookup Feature is our best feature for the NCHLT-af Afrikaans test set.

Figure 5.9 illustrates the portion of English and native words from the test sets found in each dictionary. A significant number of English words are already commonly used in German and included in the German dictionary; the Matrix Language Hunspell Lookup classifies them as native, which leads to false negatives. Non-English foreign words and abbreviations, which are classified as native in our reference, make up most of the "native" words not found in the German dictionary and thus become false positives.

Figure 5.9: Portion of words found in each dictionary

In the English dictionary, the false negatives (Anglicisms which were not detected) are mainly hybrid words which contain an English as well as a German part; they are part of the English class in our reference annotation. There is also a sizable amount of words that are spelled exactly the same in both English and German. Together with names, these make up most of the false positives of the English Hunspell Lookup Feature.

5.4 Wiktionary Lookup Feature

5.4.1 Wiktionary

Wiktionary² is a community-driven online dictionary available in many languages. Like Wikipedia, the content is written by volunteers. Wiktionary is available for over 150 languages, but scope and quality in the different languages vary. While the English and French Wiktionary each contain more than a million entries, the German Wiktionary currently has approximately 355,000 entries and the Afrikaans Wiktionary less than 16,000. However, the Wiktionary project is growing rapidly [Schlippe et al., 2013]. That dynamic growth of Wiktionary is an advantage for our approach because information about recently introduced words is likely to be added in the future.

Wiktionary provides a wide range of information. For example, [Schlippe et al., 2013] have used Wiktionary to extract pronunciations.

For most words, Wiktionary contains a paragraph about the word's origin. The Wiktionary of one language does not only contain words from that language: foreign words, including the name of the source language, are also added. The example in figure 5.10 shows the Anglicism "downloaden" as a German word ("Deutsch" meaning German) which originates from English (explained in the section "Herkunft" meaning origin).

Figure 5.10: Entry of the German Wiktionary containing a paragraph about the word's origin ("Herkunft") and language ("Deutsch" meaning German)

5.4.2 Detection based on Wiktionary

To detect Anglicisms, we only use the information from the matrix language's Wiktionary version. A word is classified as English if:

• There is an entry for this word belonging to the matrix language and the origin section contains a keyword indicating English origin, or

² www.wiktionary.org

Page 37: Single and Combined Features for the Detection of Anglicisms in … · 2014. 4. 7. · Another possible application of Anglicism detection is the eld of natural language processing.

5.4. Wiktionary Lookup Feature 29

• There is no entry belonging to the matrix language, but an entry marked as "English" exists in the matrix language's Wiktionary

Unfortunately, the entries of the Wiktionary versions of different languages do not have a common style and structure. Therefore some language-dependent fine-tuning is necessary.

In the German Wiktionary we check for the keywords "englisch", "engl.", "Anglizismus" and special Wiktionary markup indicating that the word is English. To avoid false positives for loan translations or ancient common origins, we exclude words that contain keywords like "Übersetzung" (translation) and "altenglisch" (Old English) in the origin section.

The German Wiktionary also contains many conjugations and word forms that are linked to their principal form. We follow such links and classify a word based on the Wiktionary entry of its principal form.

The Afrikaans Wiktionary is not as comprehensive; a section about word origin is not available. Therefore we can only rely on the Wiktionary markup indicating that an entry describes an English word.

Words that are not found at all in Wiktionary are treated as native words in our evaluation. When we combine all of our features in chapter 6, we instead give those words a neutral value.

To speed up the procedure and reduce the load on the Wiktionary servers, we used a Wiktionary dump, which makes the complete content of a language's Wiktionary available for download. We first extracted the relevant parts about each word's language and origin from the dump. Against this smaller file, the actual Wiktionary lookup of the words from our test sets can be performed faster.
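A simplified sketch of the resulting keyword check on a pre-extracted origin ("Herkunft") section follows; the helper name and the data format are assumptions, and the real procedure additionally follows links to principal forms and evaluates Wiktionary markup:

```python
ENGLISH_HINTS = ('englisch', 'engl.', 'Anglizismus')
EXCLUSIONS = ('Übersetzung', 'altenglisch')  # loan translations, Old English

def classify_from_origin(origin_text):
    """Classify a word by the origin section of its German Wiktionary entry."""
    if origin_text is None:
        return 'native'   # word not found: treated as native in the evaluation
    if any(k in origin_text for k in EXCLUSIONS):
        return 'native'
    if any(k in origin_text for k in ENGLISH_HINTS):
        return 'English'
    return 'native'
```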

5.4.3 Results

Table 5.9 shows the Anglicism detection performance of our Wiktionary Lookup Feature.

Test set        F-score    Precision    Recall
Microsoft-de    52.44%     71.07%       41.55%
Spiegel-de      36.85%     36.78%       36.85%
NCHLT-af        9.02%      25.00%       5.50%

Table 5.9: Performance of the Wiktionary Lookup Feature

The Afrikaans Wiktionary has very few entries; we found only 3.45% of the words from our test set, as shown in table 5.10. Therefore the Wiktionary Lookup Feature cannot give meaningful results for Afrikaans.

On the German test sets the Wiktionary Lookup Feature's f-scores are lower than the f-scores of most of our other features. Detecting a much smaller portion of the Anglicisms in the test sets, its recall is at least 13% absolute lower than any other feature's. However, in terms of precision, indicating how many non-Anglicisms are wrongly detected, the Wiktionary Lookup is one of our best features. Hence it contributes to a good performance in the combination of our features, as described in chapter 6.

Page 38: Single and Combined Features for the Detection of Anglicisms in … · 2014. 4. 7. · Another possible application of Anglicism detection is the eld of natural language processing.

30 5. Features for Anglicism detection

Test set        Words found    Anglicisms found    F-score on found    F-score on all
Microsoft-de    71.15%         57.58%              71.61%              52.44%
Spiegel-de      74.35%         59.23%              46.27%              36.85%
NCHLT-af        3.45%          0.09%               35.48%              9.02%

Table 5.10: Portions of words found in the Wiktionary

Table 5.10 shows the portion of words found in Wiktionary. As the German Wiktionary also contains many word forms, we have almost 75% coverage of all words from our German test sets. More than half of the annotated Anglicisms also have entries in the German Wiktionary. Evaluating the Anglicism detection only on the words found in Wiktionary, we reach considerably higher f-scores, which would be among our best features'. But as figure 5.11 illustrates, more than half of the Anglicisms from Microsoft-de found in the German Wiktionary are misclassified as native (9% misclassified vs. 6% correctly detected).

Figure 5.11: Portions of words from Microsoft-de found in Wiktionary with correct/wrong information for Anglicism detection

5.5 Google Hit Counts Feature

Our Google Hit Counts Feature is an implementation of the Search Engine Module developed by [Alex, 2008a]. They use the method to detect English words in a two-step approach, first filtering potential English words with a dictionary lookup.

Many search engines offer the advanced option to search exclusively on websites of a specific language. Given a correct language identification by the search engine, the assumption is that an English word is used more frequently on English pages, while a German or Afrikaans word is used more frequently on pages of its own language [Alex, 2008a].

[Alex, 2008b] notes that because current information is dynamically added, this web-based approach also deals well with unknown words like recent borrowings that have not yet been entered into dictionaries.

Page 39: Single and Combined Features for the Detection of Anglicisms in … · 2014. 4. 7. · Another possible application of Anglicism detection is the eld of natural language processing.

5.5. Google Hit Counts Feature 31

5.5.1 Detection based on Google hit counts

Figure 5.12 illustrates the process of the Google Hit Counts Feature:

1. Search of a word from the test set with search results restricted to English

2. Search of a word from the test set with search results restricted to the matrix language

3. Normalization of the numbers of search results from (1.) and (2.) with the estimated size of the web in each language

4. Comparison and classification towards the language for which the normalized number of search results is higher

Figure 5.12: Classification with the Google Hit Counts Feature

As there is much more English than German content on the web (and Afrikaans content is only a fraction of that), the raw numbers of search results have to be normalized before comparison. The normalized number of hits of a word w returned for the search in each language L is calculated as:

hits_normalized(w, L) = hits_absolute(w, L) / web-size(L)

These scores hits_normalized(w, 'English') and hits_normalized(w, 'matrix language') are compared to classify the word w depending on which normalized score is higher.
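A sketch of this comparison follows; get_hit_count() is an assumed callable standing in for a language-restricted search engine query, and the web sizes are taken from table 5.11:

```python
WEB_SIZE = {'English': 3_121_434_523_810,   # estimates from table 5.11
            'German':    184_085_953_431}

def classify_by_hits(word, get_hit_count, matrix_lang='German'):
    scores = {}
    for lang in ('English', matrix_lang):
        # Normalize the raw hit count by the estimated web size per language.
        scores[lang] = get_hit_count(word, lang) / WEB_SIZE[lang]
    return 'English' if scores['English'] > scores[matrix_lang] else 'native'
```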

5.5.2 Estimated size of the web corpus

Following [Alex, 2008a], we need the estimated size of the web corpus that is accessible through the search engine in a specific language. This number, web-size(L) for a language L, is used to normalize the search hits before classification, as shown above.

The estimation method was developed by [Grefenstette and Nioche, 2000]:

Page 40: Single and Combined Features for the Detection of Anglicisms in … · 2014. 4. 7. · Another possible application of Anglicism detection is the eld of natural language processing.

32 5. Features for Anglicism detection

1. The frequencies of the 20 most common words are calculated within a large text corpus of the language.

2. The search engine, limited to pages of the language, is queried for each of these most common words.

3. The number of search hits for each word is divided by its frequency in the training text. The resulting number is an estimate of the total number of search results accessible in that language.

Like [Grefenstette and Nioche, 2000], we then remove the highest and the lowest estimates as potential outliers. The average of the remaining estimates is the final estimation of the web corpus size in the language.
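The estimation reduces to a few lines of Python. In the sketch below, word_freqs maps the 20 most common words to their relative frequencies in the training corpus, and get_hit_count() is again an assumed search engine wrapper:

```python
def estimate_web_size(word_freqs, get_hit_count, lang):
    # One independent estimate of the corpus size per common word.
    estimates = sorted(get_hit_count(w, lang) / freq
                       for w, freq in word_freqs.items())
    trimmed = estimates[1:-1]            # drop highest and lowest as outliers
    return sum(trimmed) / len(trimmed)   # average the remaining estimates
```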

For English and German we used the most common words and frequencies from [Grefenstette and Nioche, 2000] and calculated the web corpus sizes based on new Google hit counts for these words. For Afrikaans this information was not provided by [Grefenstette and Nioche, 2000]. Therefore we calculated the 20 most common words and their frequencies from the Afrikaans Bible. The normalization based on the Bible text resulted in better detection performance than a normalization based on current news articles from <www.rapport.co.za>.

Table 5.11 shows our estimates of the total number of accessible search results in each language.

Language     Estimated size of web    Ratio to English
English      3,121,434,523,810
German       184,085,953,431          1 : 17
Afrikaans    6,941,357,100            1 : 450

Table 5.11: Estimated size of the web in different languages

5.5.3 Results

Table 5.12 shows the Anglicism detection performance of our Google Hit Counts Feature.

On the Spiegel-de test set this is our best feature. The Google Hit Counts Feature also has the highest precision among our features on both German test sets. Among the false positives (non-Anglicisms wrongly detected) are many abbreviations, proper names and homographs (words that are spelled the same in English and German).

The false negatives (Anglicisms that are not detected) consist of English words very commonly used in German, like "Computer", "online" or "Facebook": the normalization gives those words a higher score for the German search than for the English search. Many hybrid words carrying English as well as German characteristics are also among the false negatives.

Page 41: Single and Combined Features for the Detection of Anglicisms in … · 2014. 4. 7. · Another possible application of Anglicism detection is the eld of natural language processing.

5.6. Performance gap between German test sets 33

Test set        F-score    Precision    Recall
Microsoft-de    66.30%     71.31%       61.95%
Spiegel-de      49.03%     40.10%       63.08%
NCHLT-af        26.85%     16.48%       72.50%

Table 5.12: Performance of the Google Hit Counts Feature

5.6 Performance gap between German test sets

For all of our features, the Anglicism detection performance is consistently better on the Microsoft-de test set than on the Spiegel-de test set; this is demonstrated in figure 5.14. Such a large difference between domains was also observed by [Alex, 2005]. As both test sets are German, our features use identical resources for their classification. The vast f-score difference of up to 30% absolute for the same feature is therefore somewhat surprising.

Figure 5.13 gives a more detailed view of the detection performance. It compares all feature performances on Microsoft-de and Spiegel-de with respect to precision and recall, which make up the f-score, as defined in section 2.2.

Figure 5.13: Precision-recall chart of all features on the Microsoft-de and Spiegel-de test sets

For most of our features, the recall is similar on both test sets: a comparable portion of the Anglicisms contained in each test set is correctly detected.

A principal difference between the test sets is visible in terms of precision. All our features consistently have lower precision on Spiegel-de, with a gap of up to 30% absolute for the same feature.

As presented in table 5.13, the portion of false positives is very similar between both test sets: depending on the feature, between 2.60% and 12.07% of native words are wrongly detected as Anglicisms. The different test sets do not have a big influence on this false positive rate.

Page 42: Single and Combined Features for the Detection of Anglicisms in … · 2014. 4. 7. · Another possible application of Anglicism detection is the eld of natural language processing.

34 5. Features for Anglicism detection

Feature                 Microsoft-de    Spiegel-de
G2P Confidence          10.43%          9.47%
Grapheme Perplexity     11.80%          12.07%
Hunspell (native)       8.29%           3.76%
Hunspell (English)      9.56%           8.66%
Wiktionary              3.00%           2.60%
Google Hit Counts       4.42%           3.87%

Table 5.13: Portion of false positives (non-Anglicisms wrongly detected)

Precision is defined as

precision = Σ true positives / Σ tested positive

Therefore on Microsoft-de, with almost four times as many Anglicisms, the portion of false positives is weighted much less, since the features detect a higher absolute number of Anglicisms (true positives). Microsoft-de, from the IT domain, contains approximately 15% Anglicisms; Spiegel-de, from the general news domain, contains only 4% Anglicisms.
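A worked example with illustrative round numbers (assumed for clarity, not measured values) makes the effect concrete. Take 1,000 test words, a recall of 84% and a false positive rate of 4% on both sets:

```latex
% Microsoft-de (~15% Anglicisms: 150 English, 850 native words):
\text{precision} = \frac{0.84 \cdot 150}{0.84 \cdot 150 + 0.04 \cdot 850}
                 = \frac{126}{160} \approx 79\%
% Spiegel-de (~4% Anglicisms: 40 English, 960 native words):
\text{precision} = \frac{0.84 \cdot 40}{0.84 \cdot 40 + 0.04 \cdot 960}
                 = \frac{33.6}{72} \approx 47\%
```

With identical recall and false positive rate, the lower proportion of Anglicisms alone reduces the precision from about 79% to about 47%.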

The different f-scores between Microsoft-de and Spiegel-de are due to the different portions of Anglicisms in the test sets. The domain of the test set and the resulting amount of Anglicisms turn out to play a big role.

5.7 Summary

Figure 5.14 gives an overview of the Anglicism detection performance of all features. Especially our G2P Confidence Feature performs well.

Figure 5.14: Anglicism detection performance of all features

The large performance gap between the two German domains Microsoft-de and Spiegel-de occurs throughout all our features. It is caused by the different portions of Anglicisms in the test sets, as examined in section 5.6.


6. Comparison and combination of features

To include any available information and method, we combine all features which we developed in chapter 5. For the combination we experimented with Voting, Decision Tree and Support Vector Machine (SVM) methods, which are described in the following sections.

Especially for the test sets on which the single features are weak, we get a vast improvement from the combination. The f-score on the Afrikaans test set is almost doubled compared to the best single feature.

Results of the feature combinations and an evaluation of the contribution of each feature are presented in section 6.2.

Hybrid words, non-English foreign words and abbreviations make up the largest part of our misclassifications. We analyze their effect on the detection performance with an oracle experiment in section 6.2.1.

6.1 Combination of features

6.1.1 Voting

To reach a classification based on all features, all Boolean detection hypotheses of the separate features are summed up in a Voting:

1. Separate classification by all features of a word from the test set

2. Calculation of the sum of all separate classification results

3. Final classification by comparison to a threshold for the vote count

In the Voting, we consider a feature classifying the word w as English with +1 and a feature classifying it as native with −1. An exception is the Wiktionary Lookup Feature (see section 5.4): its contribution in the Voting can also be 0 if the word is not found in the native Wiktionary.

vote(w) = Classified_English(w) − Classified_native(w)

The final hypothesis of the Voting is based on a threshold T for this vote. A larger T requires more features to classify the word as English before it is classified as English by the Voting. With T = 0, more than half the features need to vote for a word to be English.
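The Voting reduces to a sum and a threshold comparison. A minimal sketch, where each feature is assumed to be a callable returning +1 (English), −1 (native) or 0 (Wiktionary Lookup, word not found):

```python
def voting(word, features, threshold=0):
    # Sum the individual +1 / -1 / 0 votes of all features.
    total = sum(feature(word) for feature in features)
    return 'English' if total > threshold else 'native'
```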

The threshold was chosen through a 10-fold cross-validation on each test set. Table 6.1 compares the optimal thresholds, which vary for our different test sets. Especially the detection performance on NCHLT-af is significantly improved if some features are not included in the vote. For the final method for Afrikaans, we therefore use a Voting without the Google Hit Counts Feature.

Test set        Threshold = 0    Optimal threshold
                F-score          F-score    Threshold
Microsoft-de    75.44%           75.44%     0
Spiegel-de      56.78%           61.54%     1
NCHLT-af        35.33%           51.66%     4

Table 6.1: Detection performance of Voting with different thresholds

6.1.1.1 Results

The performance of the Voting is shown in table 6.2. The results of all feature combinations are discussed in section 6.2.

Test set        Voting                           Best single feature
                F-score   Precision   Recall     Feature                    F-score
Microsoft-de    75.44%    72.54%      78.57%     G2P Confidence             70.39%
Spiegel-de      61.54%    54.44%      70.77%     Google Hit Counts          49.03%
NCHLT-af*       62.37%    53.38%      75.00%     Hunspell Lookup English    31.90%

Table 6.2: Detection performance of Voting. *Voting on NCHLT-af without the Google Hit Counts Feature.

6.1.2 Decision Tree

A decision tree is a tree structure in which leaves represent class labels and inner nodes represent features. This structure, that is, the thresholds and the order of the features, can be trained through machine learning given some training data. The resulting decision tree can then classify other examples, and the induced rules can be examined and interpreted [Quinlan, 1986].

We used the default parameters of Matlab's decision trees for pruning and other fine-tuning. A decision tree was trained for each test set in a 10-fold cross-validation.

The information from the single features is given as Booleans. Like for the Voting, the input from a feature classifying the word as English is +1 and from a feature classifying the word as native −1. The Wiktionary Lookup Feature can also be 0 if the word is not found in the native Wiktionary.
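The trees were trained in Matlab; purely as an illustration, an analogous setup with scikit-learn on toy stand-in data might look as follows (the data and the library choice are assumptions, not the original code):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Toy data: one row of +1/-1/0 feature votes per word in the order
# (G2P, perplexity, Hunspell native, Hunspell English, Wiktionary, Google);
# labels: 1 = Anglicism, 0 = native.
X = np.array([[+1, +1, +1, +1,  0, +1],
              [-1, -1, -1, -1, -1, -1],
              [+1, -1, +1, +1,  0, +1],
              [-1, -1, -1, +1, -1, -1]] * 10)
y = np.array([1, 0, 1, 0] * 10)

clf = DecisionTreeClassifier()                            # default parameters
scores = cross_val_score(clf, X, y, cv=10, scoring='f1')  # 10-fold CV
print(scores.mean())
# Swapping in sklearn.svm.SVC(kernel='linear') gives the analogous
# setup for the SVM combination of section 6.1.3.
```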


6.1.2.1 Results

The performance of the decision trees is presented in table 6.3. The results of all feature combinations are discussed in section 6.2.

Test set        Decision Tree                    Best single feature
                F-score   Precision   Recall     Feature                    F-score
Microsoft-de    73.10%    79.93%      67.35%     G2P Confidence             70.39%
Spiegel-de      62.23%    70.39%      55.77%     Google Hit Counts          49.03%
NCHLT-af        61.83%    53.90%      72.83%     Hunspell Lookup English    31.90%

Table 6.3: Detection performance of Decision Tree

Figure 6.1 shows the resulting decision tree for the Microsoft-de test set. The first distinction is made with the G2P Confidence Feature, followed by the Google Hit Counts and the Wiktionary Lookup Feature. For the Spiegel-de test set those three features also come first in the decision tree.

Figure 6.1: Decision Tree for the Microsoft-de test set. The leaf "1" denotes a native, "2" an English classification

6.1.2.2 Decision Trees based on continuous features

With Boolean features as input for the decision tree, we discard some information. It may help the Anglicism detection if the decision tree receives the "confidence" of each feature's hypothesis.

Therefore we trained and evaluated decision trees on the continuous values of the single features. For the Grapheme Perplexity Feature, the G2P Confidence Feature and the Google Hit Counts Feature, this is the raw difference between the English model and the model of the matrix language. Instead of using the final classification of +1 or −1 from these features for the decision tree, we provided the differences as input. The Hunspell Lookup Features and the Wiktionary Feature remained with their +1 / −1 / 0 inputs because they are discrete by design.

Table 6.4 compares the result of this approach to the decision tree based on exclusively Boolean input. The continuous feature input improves the detection performance on Microsoft-de but deteriorates the detection on Spiegel-de. Therefore we do not use continuous features in our final combination method.

Test set        Boolean features                 Continuous features
                F-score   Precision   Recall     F-score   Precision   Recall
Microsoft-de    73.10%    79.93%      67.35%     74.38%    74.41%      74.56%
Spiegel-de      62.23%    70.39%      55.77%     52.82%    52.46%      53.76%

Table 6.4: Detection performance of Decision Tree based on continuous input instead of exclusively Boolean input

6.1.3 Support Vector Machine

We also experimented with Support Vector Machines (SVM) as a powerful state-of-the-art classification method [Steinwart and Christmann, 2008]. We used the default parameters of Matlab's SVM with a linear kernel and without any fine-tuning. An SVM was trained for each test set separately in a 10-fold cross-validation.

Like for the Voting and the Decision Tree, the input is Boolean. A feature classifying the word as English is passed into the SVM as +1 and a feature classifying the word as native as −1. The Wiktionary Lookup Feature can also be 0 if the word is not found in the native Wiktionary.

6.1.3.1 Results

The performance of the SVMs is shown in table 6.5. The results of all feature combinations are compared and discussed in section 6.2.

Test set        SVM                              Best single feature
                F-score   Precision   Recall     Feature                    F-score
Microsoft-de    73.64%    77.89%      69.83%     G2P Confidence             70.39%
Spiegel-de      58.28%    68.39%      50.77%     Google Hit Counts          49.03%
NCHLT-af        53.70%    50.00%      58.00%     Hunspell Lookup English    31.90%

Table 6.5: Detection performance of SVM

6.2 Results

The performances of the different combination approaches are shown in table 6.6, and figure 6.2 illustrates the comparison. For two out of three test sets our simple Voting gives the best overall results, although we only use training data to fine-tune the vote threshold, whereas Decision Tree and SVM learned more complex relations from the input features. As we did not spend much time on fine-tuning the parameters of the SVM, some further improvements may be possible.


Feature                 Microsoft-de        Spiegel-de            NCHLT-af
Voting                  75.44%              61.54%                62.37%
Decision Tree           73.10%              62.23%                61.83%
SVM                     73.64%              58.28%                53.70%
Best single feature     70.39%              49.03%                31.90%
                        (G2P Confidence)    (Google Hit Counts)   (Hunspell English)
Relative improvement    +17.06%             +25.90%               +44.74%

Table 6.6: F-scores of the feature combinations in comparison and relative improvement from the best single feature to the best combination (measured as the relative reduction of 1 − f-score)

Figure 6.2: Comparison of the f-scores for the different combination methods and test sets

The improvements compared to the best single feature are striking, almost doubling the f-score on the NCHLT-af test set. Especially on the Afrikaans test set, for which all our single features had poor detection performance, the combination brings a massive relative improvement of 44.74% in terms of 1 − f-score.

Table 6.7 compares the pairwise combinations of our features on the Microsoft-de test set. Only three pairs improve the performance over the best single feature, the G2P Confidence: the Wiktionary Lookup Feature combined with either the G2P Confidence or the Grapheme Perplexity Feature, and the combination of the G2P Confidence and Grapheme Perplexity Features. While the Wiktionary Lookup as a separate feature was not among the top performing ones, it provides important additional information and supports the detection in feature combinations.

This is also shown in figure 6.3, which compares the Voting's performance for combinations of any five of the six features. On both German test sets the Wiktionary Lookup Feature is an important part of the Voting. Apart from the German Hunspell Lookup, which yields a very minor improvement if left out, all features contribute to the good performance of the Voting.


                       G2P          Grapheme     Hunspell   Hunspell   Wiktionary   Google Hit
                       Confidence   Perplexity   Native     English    Lookup       Counts
G2P Confidence         70.39%
Grapheme Perplexity    72.09%       67.17%
Hunspell Native        65.07%       65.61%       61.29%
Hunspell English       70.23%       68.17%       64.30%     63.65%
Wiktionary Lookup      72.21%       72.34%       59.88%     68.44%     52.44%
Google Hit Counts      66.33%       66.27%       59.47%     64.15%     63.49%       66.30%

Table 6.7: F-scores of pairwise combinations by Voting on Microsoft-de (diagonal entries: single features)

Figure 6.3: Relative f-score change of Voting if one feature is left out

6.2.1 Abbreviations, hybrid and other foreign words

A few words are misclassified by all of our features. 0.29% of words from Spiegel-de and 0.97% of words from Microsoft-de are Anglicisms according to our reference annotation but considered as native by every feature. All of those words are English hybrid words.

0.20% of words from Spiegel-de and 0.24% of words from Microsoft-de are not English, yet all of our features detect them as Anglicisms. All of those words are other foreign words or abbreviations.

We have annotated hybrid English words, other foreign words and abbreviations in our German test sets, as described in chapter 4. The classification of these words is somewhat ambiguous because they are either both German and English (hybrid English words) or clearly neither of the two (abbreviations, other foreign words).


In oracle experiments we removed these types of words from the test sets before evaluating the f-score. The results show the performance of our features only on the unambiguous test words.

Figure 6.4 compares the results of our Voting when one or all of those word categories are filtered. Table 6.8 gives the performance and relative improvements.

Figure 6.4: Performance of Voting after removing difficult word categories from the test sets

Without                  Microsoft-de                 Spiegel-de
                         F-score    Relative impr.    F-score    Relative impr.
Abbreviations            79.73%     17.48%            62.80%     3.28%
Other foreign words      78.29%     11.60%            68.91%     19.18%
Hybrid English words     79.06%     14.73%            64.67%     8.15%
All three                87.15%     47.70%            74.65%     34.08%
None (no oracle)         75.44%     -                 61.54%     -

Table 6.8: Performance and relative improvement of Voting after removing difficult word categories from the test sets (relative reduction of 1 − f-score)

After manually removing those words, we achieve a relative improvement of up to 47.70%. The varying contribution of the different word categories depends on the composition of the test set: relative to the whole test set, Spiegel-de has more other foreign words and fewer abbreviations and hybrids.

A lot of potential improvement remains in handling these special word categories. There are possible ways to automatically filter or handle these words and replace our oracle. We did not experiment with this, but word stemming and compound splitting algorithms seem a good way to deal with hybrid English words. Abbreviations might be filtered using regular expressions or an absolute grapheme perplexity threshold.
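As a hedged sketch of the regular expression idea, patterns like the following could catch many abbreviations; the patterns are illustrative assumptions, not rules evaluated in this work:

```python
import re

ABBREV_PATTERN = re.compile(
    r'^(?:[A-Z]{2,}s?'         # all-caps tokens such as "HTML" or "CPUs"
    r'|(?:[A-Za-z]\.){2,})$'   # dotted forms such as "z.B." or "e.g."
)

def is_abbreviation(word):
    return bool(ABBREV_PATTERN.match(word))
```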


7. Conclusion

To detect Anglicisms in text of a matrix language, we developed a set of features and combined them to further improve the performance.

For evaluation, we built two German test sets, one from the IT domain and one from general news articles. We annotated Anglicisms and special word categories in those test sets to allow for detailed analyses.

Our features for the Anglicism detection are based on:

• Grapheme Perplexity

• G2P Confidence

• Native Hunspell Lookup

• English Hunspell Lookup

• Wiktionary Lookup

• Google Hit Counts

With the G2P Confidence Feature we developed an approach which incorporates information from a pronunciation dictionary. This was our most successful single feature.

The Wiktionary Lookup Feature, leveraging web-derived information, is also a new approach; it especially supported the performance of feature combinations.

None of our single features rely on text with Anglicisms annotated for training. The features are instead based on other resources like unannotated word lists or dictionaries.

The combination of the diverse set of features boosted detection performance considerably. Especially for the test sets on which the separate features did not bring satisfactory results, a combination proved very useful.


7.1 Future work

To develop this approach to Anglicism detection further, we primarily see two areas for future work:

• Improvement of how abbreviations, hybrid words and other foreign words are handled

• Development of additional independent features

7.1.1 Improved handling of ambiguous words

As the oracle experiments in section 6.2.1 show, the detection could be significantly improved by proper classification of a few special word categories: abbreviations, hybrid words and other foreign words.

Abbreviations are very different from regular English, German or Afrikaans words. Our features so far are focused exclusively on the Anglicism detection. To improve classification performance, it may be helpful to introduce another class, abbreviation, in addition to the classes English and native. Dedicated features could better filter abbreviations into such a new class. In our experiments with the grapheme perplexity, we noticed that abbreviations have an exceptionally high absolute perplexity. The use of regular expressions to detect abbreviations could also be examined.

Hybrid words are mostly compound words with an English part and a part from the matrix language, or conjugated verbs with an English word stem. Because they carry characteristics of both the matrix language and English, they are not only hard to detect but also ambiguous to categorize in the first place. Compound splitting and word stemming algorithms have been developed for many languages and should be integrated into the Anglicism detection system.

This could not only improve detection performance: a clean separation of English and native word parts would also be useful for many applications of Anglicism detection, like the generation of different pronunciations exclusively for the English parts.

7.1.2 Additional features

Especially in Named Entity Recognition, context information like trigger words or part-of-speech tags is often used as a feature. In this work, however, we focused on context-independent features. It would be interesting to evaluate how much one could gain from this entirely new source of information in the feature combinations.

Another idea is the use of translation: the word could be translated into English and the translation compared to its original form in the matrix language. For the translations, a translation system available online could be used.


Bibliography

[Ahmed, 2005] Ahmed, B. (2005). Detection of Foreign Words and Names in Written Text. PhD thesis, Pace University.

[Ahmed et al., 2005] Ahmed, B., Cha, S.-H., and Tappert, C. (2005). Detection of Foreign Entities in Native Text Using N-gram Based Cumulative Frequency Addition. In Proceedings Student/Faculty Research Day, CSIS, Pace University, May 6th, 2005.

[Alewine et al., 2011] Alewine, N., Janke, E., Sicconi, R., and Sharp, P. (2011). Systems and Methods for Building a Native Language Phoneme Lexicon Having Native Pronunciations of the Non-Native Words Derived from Non-Native Pronunciations.

[Alex, 2005] Alex, B. (2005). An Unsupervised System for Identifying English Inclusions in German Text. In Proceedings of the ACL Student Research Workshop, pages 133–138.

[Alex, 2006] Alex, B. (2006). Integrating Language Knowledge Resources to Extend the English Inclusion Classifier to a New Language. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC'06), pages 2431–2436.

[Alex, 2008a] Alex, B. (2008a). Automatic Detection of English Inclusions in Mixed-lingual Data with an Application to Parsing. PhD thesis, University of Edinburgh.

[Alex, 2008b] Alex, B. (2008b). Comparing Corpus-based to Web-based Lookup Techniques for Automatic English Inclusion Detection. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), pages 2693–2697.

[Alex et al., 2007] Alex, B., Dubey, A., and Keller, F. (2007). Using Foreign Inclusion Detection to Improve Parsing Performance. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 151–160, Prague. Association for Computational Linguistics.

[Alex and Grover, 2004] Alex, B. and Grover, C. (2004). An XML-based Tool for Tracking English Inclusions in German Text.

[Anderman and Rogers, 2005] Anderman, G. M. and Rogers, M. (2005). In and Out of English: For Better, for Worse? Multilingual Matters.


[Andersen, 2005] Andersen, G. (2005). Assessing algorithms for automatic extraction of anglicisms in Norwegian texts. Corpus Linguistics.

[Baluja et al., 2000] Baluja, S., Mittal, V. O., and Sukthankar, R. (2000). Applying Machine Learning for High-Performance Named-Entity Extraction. Computational Intelligence, 16(4):586–595.

[Basson and Davel, 2013] Basson, W. D. and Davel, M. H. (2013). Category-Based Phoneme-To-Grapheme Transliteration. In Interspeech, pages 1956–1960.

[Bikel et al., 1997] Bikel, D. M., Miller, S., Schwartz, R., and Weischedel, R. (1997). Nymble: a High-Performance Learning Name-finder. In Proceedings of the 5th Conference on Applied Natural Language Processing, pages 194–201, Morristown, NJ, USA. Association for Computational Linguistics.

[Bisani and Ney, 2008] Bisani, M. and Ney, H. (2008). Joint-sequence models for grapheme-to-phoneme conversion. Speech Communication, 50(5):434–451.

[Carnegie Mellon University, 2007] Carnegie Mellon University (2007). The CMU Pronouncing Dictionary. http://www.speech.cs.cmu.edu/cgi-bin/cmudict.

[Carstensen et al., 2001a] Carstensen, B., Busse, U., and De Gruyter, W. (2001a). Anglizismen-Wörterbuch. 1. A - E, page 53. Gruyter.

[Carstensen et al., 2001b] Carstensen, B., Busse, U., and De Gruyter, W. (2001b). Anglizismen-Wörterbuch. 1. A - E. Gruyter.

[Chesley and Baayen, 2010] Chesley, P. and Baayen, R. H. (2010). Predicting new words from newer words: Lexical borrowings in French. Linguistics, 48(6):1343–1374.

[Duden, 2014] Duden (accessed 23.01.2014). Graphem. http://www.duden.de/rechtschreibung/Graphem.

[Elworthy, 1998] Elworthy, D. (1998). Language Identification With Confidence Limits. In Proceedings of the 6th Annual Workshop on Very Large Corpora, pages 94–101.

[Engelbrecht and Schultz, 2005] Engelbrecht, H. A. and Schultz, T. (2005). Rapid Development of an Afrikaans-English Speech-to-Speech Translator. In Proceedings of the International Workshop of Spoken Language Translation.

[Filimonov et al., 2010] Filimonov, D., Parada, C., Dredze, M., and Jelinek, F. (2010). Contextual Information Improves OOV Detection in Speech. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 216–224.

[Giguet, 1995] Giguet, E. (1995). Multilingual Sentence Categorization according to Language. In Proceedings of the European Chapter of the Association for Computational Linguistics SIGDAT Workshop "From text to tags: Issues in Multilingual Language Analysis", pages 73–76.


[Grefenstette and Nioche, 2000] Grefenstette, G. and Nioche, J. (2000). Estimation of English and non-English Language Use on the WWW. In Recherche d'Information Assistée par Ordinateur (RIAO), pages 237–246.

[Haspelmath, 2007] Haspelmath, M. (2007). Loanword typology: Steps toward a systematic cross-linguistic study of lexical borrowability. In Aspects of Language Contact: New Theoretical, Methodological and Empirical Findings with Special Focus on Romancisation Processes, pages 1–21. Gruyter.

[Heerden et al., 2012] Heerden, C. V., Davel, M. H., and Barnard, E. (2012). The semi-automated creation of stratified speech corpora. In Proceedings of the Twenty-Fourth Annual Symposium of the Pattern Recognition Association of South Africa, Pretoria, South Africa.

[Indurkhya and Damerau, 2010] Indurkhya, N. and Damerau, F. J. (2010). Handbook of Natural Language Processing, Second Edition. Chapman & Hall/CRC machine learning & pattern recognition series. Taylor & Francis.

[Jeong et al., 1999] Jeong, K. S., Myaeng, S. H., Lee, J. S., and Choi, K.-S. (1999). Automatic identification and back-transliteration of foreign words for information retrieval. Information Processing and Management, 35:523–540.

[Kang and Choi, 2002] Kang, B.-j. and Choi, K.-s. (2002). Effective foreign word extraction for Korean information retrieval. Information Processing and Management, 38.

[Klein et al., 2003] Klein, D., Smarr, J., Nguyen, H., and Manning, C. D. (2003). Named entity recognition with character-level models. In Proceedings of the 7th Conference on Natural Language Learning at HLT-NAACL, volume 4, pages 180–183, Morristown, NJ, USA. Association for Computational Linguistics.

[Kundu and Chandra, 2012] Kundu, B. and Chandra, S. (2012). Automatic Detection of English Words in Benglish Text. In Proceedings of the 4th International Conference on Intelligent Human Computer Interaction (IHCI) 2012.

[Lin and Wu, 2009] Lin, D. and Wu, X. (2009). Phrase Clustering for Discriminative Learning. In Proceedings of the 47th Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP, pages 1030–1038, Singapore.

[Mani et al., 1993] Mani, I., Macmillan, T. R., Luperfoy, S., Lusher, E. P., and Laskowski, S. J. (1993). Identifying Unknown Proper Names in Newswire Text. In Proceedings of the Workshop on Acquisition of Lexical Knowledge from Text, pages 44–54.

[Mansikkaniemi and Kurimo, 2012] Mansikkaniemi, A. A. and Kurimo, M. (2012). Unsupervised Vocabulary Adaptation for Morph-based Language Models. In Proceedings of the HLT-NAACL 2012 Workshop: Will We Ever Really Replace the N-gram Model? On the Future of Language Modeling for HLT, pages 37–40.

[Matthews, 2007] Matthews, P. H. (2007). The Concise Oxford Dictionary of Linguistics. Opr Series. OUP Oxford.


[Miller et al., 2004] Miller, S., Guinness, J., and Zamanian, A. (2004). Name Tagging with Word Clusters and Discriminative Training. In Proceedings of HLT, pages 337–342.

[Munro et al., 2003] Munro, R., Ler, D., and Patrick, J. (2003). Meta-learning orthographic and contextual models for language independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, volume 4, pages 192–195, Morristown, NJ, USA. Association for Computational Linguistics.

[Novak et al., 2011] Novak, J., Yang, D., Minematsu, N., and Hirose, K. (2011). Initial and Evaluations of an Open Source WFST-based Phoneticizer. The University of Tokyo, Tokyo Institute of Technology.

[Novak, 2012] Novak, J. R. (2012). Phonetisaurus ReadMe. https://code.google.com/p/phonetisaurus/wiki/ReadMe.

[Novak et al., 2012] Novak, J. R., Minematsu, N., and Hirose, K. (2012). WFST-based Grapheme-to-Phoneme Conversion: Open Source Tools for Alignment, Model-Building and Decoding. In Proceedings of the 10th International Workshop on Finite State Methods and Natural Language Processing, pages 45–49.

[Pfeifer et al., 1996] Pfeifer, U., Poersch, T., and Fuhr, N. (1996). Retrieval Effectiveness of Proper Name Search Methods. Information Processing & Management, 32(6):667–679.

[Powers, 2011] Powers, D. M. W. (2011). Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation. Journal of Machine Learning Technologies, 2(1):37–63.

[Quinlan, 1986] Quinlan, J. R. (1986). Induction of Decision Trees. Machine Learning, 1(1):81–106.

[Schlippe et al., 2013] Schlippe, T., Ochs, S., and Schultz, T. (2013). Web-based tools and methods for rapid pronunciation dictionary creation. Speech Communication, 56:101–118.

[Schlippe et al., 2012] Schlippe, T., Ochs, S., Vu, N. T., and Schultz, T. (2012). Automatic Error Recovery for Pronunciation Dictionaries. In Interspeech.

[Schultz et al., 2013] Schultz, T., Vu, N. T., and Schlippe, T. (2013). GlobalPhone: A multilingual text & speech database in 20 languages. In Proceedings of the 38th International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 8126–8130, Vancouver, Canada. IEEE.

[Sinha and Thakur, 2005] Sinha, R. M. K. and Thakur, A. (2005). Machine Translation of Bi-lingual Hindi-English (Hinglish) Text. In 10th Machine Translation Summit (MT Summit X), pages 149–156, Phuket, Thailand.

[Steinwart and Christmann, 2008] Steinwart, I. and Christmann, A. (2008). Support Vector Machines. Information Science and Statistics. Springer.


[Stolcke, 2002] Stolcke, A. (2002). SRILM - An Extensible Language Modeling Toolkit. In Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP'02).

[Toole, 2000] Toole, J. (2000). Categorizing Unknown Words: Using Decision Trees to Identify Names and Misspellings. In Proceedings of the 6th Conference on Applied Natural Language Processing, pages 173–179, Morristown, NJ, USA. Association for Computational Linguistics.

[Tran Tri et al., 2007] Tran Tri, Q., Thao Pham, T. X., Hung Ngo, Q., Dinh, D., and Collier, N. (2007). Named Entity Recognition in Vietnamese Documents. Progress in Informatics, (4):5.

[van der Auwera and König, 1994] van der Auwera, J. and König, E. (1994). The Germanic Languages. Arguments of the Philosophers. Routledge.

[Vu et al., 2013] Vu, N. T., Adel, H., and Schultz, T. (2013). An Investigation of Code-Switching Attitude Dependent Language Modeling. In Proceedings of the 1st International Conference on Statistical Language and Speech Processing (SLSP 2013).

[Wolinski et al., 1995] Wolinski, F., Vichot, F., and Dillet, B. (1995). Automatic Processing of Proper Names in Texts. In Proceedings of the 7th Conference on European Chapter of the Association for Computational Linguistics (EACL '95), pages 23–30, Morristown, NJ, USA. Association for Computational Linguistics.


