A Splitter for German compound words

Pasquale Imbemba
Free University of Bozen-Bolzano
Supervisor: Dr. Raffaella Bernardi

Scenario

User input: compound words in German

● Problem for IR
  - retrieval of German books
  - no direct keyword matching

● Problem for CLIR
  - retrieval of IT & EN books
  - no direct translation in the dictionary

Problem: German compound words

Compounding is productive: pre-existing morphemes are combined to form a new word (aka Univerbierung)

Compounds of nouns are the most frequent case

● User input may not be in the lexicon used by CLIR search engines
  Donau + Dampf + Schiff + Fahrt (tr.: steam navigation on the Danube)

● User input may be a lexicalized “compound” word
  Malerei (tr.: painting), not Maler + Ei (tr.: painter and egg)

● Hence the need for a splitter that handles both cases

● Furthermore, language is in continuous evolution (neologisms), so constantly up-to-date lexical resources are needed

State of the art

● TAGH (Berlin-Brandenburg Academy of Sciences / University of Potsdam)
  Weighted FSA: choose the combination with the least cost

● MORPHY (University of Paderborn)
  Reduce to base form and affixes, look them up

● MORPA (Tilburg University)
  Probabilistic calculus to determine segmentation

● De Rijke/Monz (University of Amsterdam)
  Shallow approach
  ● Given a word, if a substring is in the lexicon, subtract it. Repeat until no substring is left.

Tools

● Splitter
  Mechanism to segment nouns
  ● Implemented, evaluated and improved the De Rijke/Monz algorithm using Java

● Lexicon
  Morphy (57,000 nouns), dated (Lezius)
  deWaC (440,000 nouns), recent (Baroni & Kilgarriff)

● Lexical resource to execute lookups on
  Extracted nouns from Morphy & deWaC
  Regular expression filtering on deWaC
  Resources indexed with Lucene
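The slides do not show the regular expressions that were applied to deWaC, so the following is only a minimal sketch of what such a filtering step might look like. The pattern `NOUN_PATTERN` and the function `filter_noun_candidates` are hypothetical names, and the rule "capitalized token made only of German letters" is an assumption, not the filter actually used. The sketch is in Python for brevity, whereas the original tool was written in Java.

```python
import re

# Assumption: a noun candidate starts with a capital letter and contains
# only German letters. The real filter used on deWaC is not given in the slides.
NOUN_PATTERN = re.compile(r"[A-ZÄÖÜ][a-zäöüß]+")


def filter_noun_candidates(tokens):
    """Keep tokens that match the pattern and normalize them to lowercase."""
    return {t.lower() for t in tokens if NOUN_PATTERN.fullmatch(t)}


candidates = ["Donau", "dampf", "Schiff123", "Fahrt", "www.example.org", "Öl"]
print(sorted(filter_noun_candidates(candidates)))
```

Storing the surviving entries in lowercase keeps later lookups independent of where a part appears inside a compound.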

De Rijke/Monz algorithm

split(word)
  for i := 1 to length(word)-1 do
    if isInNounLex(substr(word,1,i)) && split(substr(word,i+1,length(word))) != "" do
      r = split(substr(word,i+1,length(word)))
      return concat(substr(word,1,i), +, r)
  if (isInNounLex(word))
    return word;
  else
    return "";

Example: Ölpreis
  The prefix Öl is in the noun lexicon; the remainder is preis
  r = split(preis) = preis
  Result: Öl + Preis
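The recursive procedure on this slide can be sketched in runnable form as follows. This is an illustrative Python version, not the Java implementation from the talk. The toy `lexicon` is an assumption, entries are stored in lowercase, and the result is returned as a list of parts instead of a "+"-joined string. Unlike the pseudocode, the sketch calls the recursion once per split point and reuses the result.

```python
def split(word, noun_lexicon):
    """Return the lexicon nouns that make up `word`, or [] if none is found."""
    for i in range(1, len(word)):
        head, tail = word[:i], word[i:]
        if head in noun_lexicon:
            rest = split(tail, noun_lexicon)
            if rest:  # the remainder could be segmented as well
                return [head] + rest
    # No split point worked: accept the word only if it is itself a noun.
    return [word] if word in noun_lexicon else []


# Toy lexicon (assumption), stored in lowercase.
lexicon = {"öl", "preis", "donau", "dampf", "schiff", "fahrt",
           "maler", "ei", "malerei"}

print("+".join(split("ölpreis", lexicon)))                # öl+preis
print("+".join(split("donaudampfschifffahrt", lexicon)))  # donau+dampf+schiff+fahrt
print("+".join(split("malerei", lexicon)))                # maler+ei
```

The last call shows the weakness of the plain algorithm: because a split is tried before the whole word is looked up, the lexicalized word Malerei is wrongly segmented into Maler + Ei.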

Enhanced Splitter workflow

● Cascading lexical resources
  Increases split correctness
  Improves overall correctness

● Lookup first
  Lexicalized elements
  Reduces amount of incorrect splits
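The two ideas above, lookup first and cascading lexical resources, could be combined roughly as in the sketch below. The exact workflow is in the splitter diagram, which is not reproduced in this text, so the order of the steps and the order of the lexicons are assumptions. The names `enhanced_split`, `small_lexicon` and `large_lexicon` are hypothetical, and the code is Python rather than the Java of the original tool.

```python
def split(word, noun_lexicon):
    """Plain recursive splitter: list of parts, or [] if no segmentation exists."""
    for i in range(1, len(word)):
        head, tail = word[:i], word[i:]
        if head in noun_lexicon:
            rest = split(tail, noun_lexicon)
            if rest:
                return [head] + rest
    return [word] if word in noun_lexicon else []


def enhanced_split(word, lexicons):
    """Look the whole word up first, then try each lexicon in turn."""
    # Lookup first: a lexicalized word is returned unsplit.
    if any(word in lex for lex in lexicons):
        return [word]
    # Cascade: fall back to the next resource only if the previous one fails.
    for lex in lexicons:
        parts = split(word, lex)
        if parts:
            return parts
    return []


# Toy stand-ins (assumption) for a small, dated lexicon and a large, recent one.
small_lexicon = {"malerei", "maler", "ei", "öl"}
large_lexicon = small_lexicon | {"preis"}

print(enhanced_split("malerei", [small_lexicon, large_lexicon]))  # ['malerei']
print(enhanced_split("ölpreis", [small_lexicon, large_lexicon]))  # ['öl', 'preis']
```

With the lookup in front, Malerei is no longer split into Maler + Ei, and Ölpreis is still segmented once the second resource is consulted.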

Splitter diagram

MuSiL Integration

[Pipeline diagram; stages and sample values as recoverable from the slide]

Query Input: Donaudampfschifffahrt
Name Recognition: Donau | Dampfschifffahrt
Morphological Analysis: Dampfschifffahrt_N
Multiword recognition: Dampfschifffahrt_N
Lookup resources: Multilingual Dictionary, Multilingual Thesaurus
Split and Translate (Splitter):
  Dampf    EN: vapour_N | steam_N (...)        IT: vapore_nm | (...)
  Schiff   EN: ship_N | (...)                  IT: nave_nf | (...)
  Fahrt    EN: drive_N | navigation_N (...)    IT: guida_nf | navigazione_nf (...)
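The split-and-translate step can be illustrated with a small sketch: when the whole word has no dictionary entry, it is split and each part is translated on its own. The translations in `TOY_DICTIONARY` are the ones shown on the slide. The dictionary layout, the function names and the use of the dictionary keys as the noun lexicon are assumptions made for this Python example and do not reflect the MuSiL interfaces.

```python
# Translations taken from the slide; the data structure itself is an assumption.
TOY_DICTIONARY = {
    "dampf": {"EN": ["vapour", "steam"], "IT": ["vapore"]},
    "schiff": {"EN": ["ship"], "IT": ["nave"]},
    "fahrt": {"EN": ["drive", "navigation"], "IT": ["guida", "navigazione"]},
}


def split(word, noun_lexicon):
    """Plain recursive splitter: list of parts, or [] if no segmentation exists."""
    for i in range(1, len(word)):
        head, tail = word[:i], word[i:]
        if head in noun_lexicon:
            rest = split(tail, noun_lexicon)
            if rest:
                return [head] + rest
    return [word] if word in noun_lexicon else []


def split_and_translate(word, dictionary, target):
    """Translate `word` directly if possible, otherwise part by part."""
    if word in dictionary:
        return [dictionary[word][target]]
    parts = split(word, set(dictionary))
    return [dictionary[part][target] for part in parts]


print(split_and_translate("dampfschifffahrt", TOY_DICTIONARY, "EN"))
print(split_and_translate("dampfschifffahrt", TOY_DICTIONARY, "IT"))
```

An empty result means that neither the whole word nor any segmentation of it is covered by the dictionary.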

Evaluation

                              De Rijke/Monz Splitter    Our Splitter
                              deWaC       Morphy        deWaC       Morphy
Lexicon used                  6,201       16,141        6,207       16,135
Total splits                  2,723       4,851         2,022       4,517
Total non-splits              3,478       11,290        4,185       11,618
Total non-splits wrong        1,322       50            1,404       50
Split due to lexical error    2,383       66            1,871       45
Split due to logic error      9           1,067         7           864
Correct splits                331         3,718         144         3,608
Split correctness             12.16%      76.64%        7.12%       79.88%
Correct elements              2,487       14,958        2,925       15,176
Correctness                   40.11%      92.67%        47.12%      94.06%

● Total correctness improved
  ● By increasing the amount of non-splits with deWaC and Morphy
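The percentages in the table can be reproduced from the raw counts. The relations used below are inferred from the figures and are not stated on the slide: the first row equals total splits plus total non-splits, correct elements equal correct splits plus the non-splits that were not wrong, split correctness is correct splits over total splits, and correctness is correct elements over all evaluated words. The function name `correctness_metrics` is hypothetical.

```python
def correctness_metrics(total_splits, total_non_splits, non_splits_wrong, correct_splits):
    """Recompute the derived rows of the evaluation table from the raw counts."""
    total_words = total_splits + total_non_splits
    correct_elements = correct_splits + (total_non_splits - non_splits_wrong)
    return {
        "total_words": total_words,
        "correct_elements": correct_elements,
        "split_correctness_pct": round(100 * correct_splits / total_splits, 2),
        "correctness_pct": round(100 * correct_elements / total_words, 2),
    }


# Counts from the table: De Rijke/Monz with Morphy, then our splitter with Morphy.
baseline = correctness_metrics(4851, 11290, 50, 3718)
ours = correctness_metrics(4517, 11618, 50, 3608)

print(baseline)
print(ours)
```

Read this way, the gain in overall correctness comes mainly from the larger number of words that are correctly left unsplit.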

Complexity of the split function

• De Rijke/Monz
  – Best case:
    • We scan the input word from first to last position
      $T(n) = (n-1) = O(n)$
  – Worst case:
    • Calls to split
    • Exponential growth
      $T(n) = \sum_{i=0}^{n-1} 2^{i} = 2^{n} - 1 = O(2^{n})$

• Our splitter:
  – Best case:
    • The word is found immediately in the lexical resources of nouns
      $T(n) = O(1)$
  – Worst case:
    • Execute the function recursively every time we encounter a word in the lexicon and the remaining substring is not empty (see De Rijke/Monz)
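The exponential worst case can be made visible by counting recursive calls on an adversarial input. In the sketch below every proper prefix is in the lexicon but the word ends in a character that never matches, so every branch is explored and fails. The helper `count_split_calls` and the toy input are assumptions for illustration. The sketch is in Python and mirrors the pseudocode, including its second call to split after a successful test.

```python
def count_split_calls(word, noun_lexicon):
    """Run the plain recursive splitter and return how often split() is invoked."""
    calls = 0

    def split(w):
        nonlocal calls
        calls += 1
        for i in range(1, len(w)):
            # As in the pseudocode: test the remainder, then split it again.
            if w[:i] in noun_lexicon and split(w[i:]):
                return [w[:i]] + split(w[i:])
        return [w] if w in noun_lexicon else []

    split(word)
    return calls


# Adversarial toy input: all runs of "a" are nouns, the final "b" never matches.
lexicon = {"a" * k for k in range(1, 11)}

for n in (2, 3, 4, 5, 11):
    word = "a" * (n - 1) + "b"
    print(n, count_split_calls(word, lexicon))
```

For this input the number of calls doubles with every additional character, in line with the exponential growth named on the slide.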

Performance on MuSiL

● Increased amount of retrieved documents
● More relevant documents are top ranked

                                  Without splitter component    With splitter component
Query                             DE    IT    EN    Precision   DE    IT    EN    Precision
Abenteuer+Geschichten             1     0     0     100%        10    57    2     51%
Beruf+Orientierung                13    0     0     100%        13    392   24    43%
Kommunikation+Politik             -     -     -     -           28    69    317   47%
Wert+Papier+Handel+Gesetz         4     0     0     100%        0     36    90    17%
Doppel+Besteuerung+Abkommen       4     0     2     100%        0     0     15    100%
Aufmerksamkeit+Defizit+Syndrom    44    0     25    36%         3     0     8     73%
Hirn+Leistung+Training            15    0     0     100%        4     48    189   46%
Kunst+Erziehung+Bewegung          1     0     0     100%        144   478   251   36%
Emotion+Regulierung               1     0     0     100%        0     3     66    71%
Unternehmen+Netzwerke             8     0     25    61%         73    61    677   27%

Conclusion and future work

● Good:
  Cascade method
  Deal with lexicalized elements

● Open topics:
  Choose correct segmentation among alternatives
  Metrics for correctness of segmentation
  ● Weights, probability …

Date post:	19-Dec-2015