Date post: | 19-Dec-2015 |
A Splitter for German compound words
Pasquale Imbemba
Free University of Bozen-Bolzano
Supervisor: Dr. Raffaella Bernardi
Scenario
User input: compound words in German
● Problem for IR (retrieval of German books): no direct keyword matching
● Problem for CLIR (retrieval of IT & EN books): no direct translation in the dictionary
Problem: German compound words
Compounding is productive: pre-existing morphemes are combined to form a new word (aka Univerbierung). Noun compounds are the most frequent case.
● User input may not be in the lexicon used by CLIR search engines
  Donau + Dampf + Schiff + Fahrt (tr.: steam navigation on the Danube)
● User input may be a lexicalized “compound” word
  Malerei (tr.: painting), not Maler + Ei (tr.: painter and egg)
● Hence the need for a splitter that handles both cases
● Furthermore, language is in continuous evolution (neologisms), so lexical resources must be kept constantly up to date
State of the art
● TAGH (Berlin-Brandenburg Academy of Sciences / University of Potsdam)
  Weighted FSA: choose the combination with the least cost
● MORPHY (University of Paderborn)
  Reduce to base form and affixes, look them up
● MORPA (Tilburg University)
  Probabilistic calculus to determine segmentation
● De Rijke/Monz (University of Amsterdam)
  Shallow approach: given a word, if a substring is in the lexicon, subtract it. Repeat until no substring is left.
Tools
● Splitter: mechanism to segment nouns
  Implemented, evaluated and improved the De Rijke/Monz algorithm in Java
● Lexicon: lexical resource to execute lookups on
  Morphy (57,000 nouns), dated (Lezius)
  deWaC (440,000 nouns), recent (Baroni & Kilgarriff)
  Nouns extracted from Morphy & deWaC
  Regular expression filtering on deWaC
  Resources indexed with Lucene
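The slides do not give the regular expression used to filter deWaC, so the following is only a sketch of what such a filter might look like. The class `NounFilter`, the pattern itself (one uppercase letter followed by lowercase German letters) and the sample tokens are all assumptions made for illustration; the Lucene indexing step is not shown.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class NounFilter {
    // Assumed pattern, not the one used in the project: a capital letter
    // (including A/O/U umlauts) followed by one or more lowercase German letters.
    static final Pattern NOUN_CANDIDATE =
            Pattern.compile("[A-Z\u00C4\u00D6\u00DC][a-z\u00E4\u00F6\u00FC\u00DF]+");

    // Keeps tokens that match the pattern completely and drops duplicates,
    // preserving the order of first occurrence.
    static List<String> filter(List<String> tokens) {
        return tokens.stream()
                .filter(t -> NOUN_CANDIDATE.matcher(t).matches())
                .distinct()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical corpus tokens, including typical web-corpus noise.
        List<String> tokens = List.of("Schiff", "schiff", "Fahrt2", "http://x", "Dampf", "Schiff", "Ei");
        System.out.println(filter(tokens));
    }
}
```

A filter of this kind removes tokens with digits, URLs and lowercase forms before the remaining entries are indexed.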
De Rijke/Monz algorithm
split(word)
  for i := 1 to length(word) - 1 do
    if isInNounLex(substr(word, 1, i)) and split(substr(word, i+1, length(word))) != "" then
      r := split(substr(word, i+1, length(word)))
      return concat(substr(word, 1, i), "+", r)
  if isInNounLex(word) then
    return word
  else
    return ""
Example (figure): Ölpreis. The prefix Öl is found in the noun lexicon, the remainder is split recursively with r = split(preis) = preis, and the result is Öl + Preis.
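The algorithm above can be sketched in Java roughly as follows. This is an illustrative version, not the project's code: the class name `ShallowSplitter`, the in-memory `Set` standing in for the Lucene-indexed lexicon, and the lowercasing before lookup are assumptions. Unlike the pseudocode, the recursive result is computed once and reused rather than calling `split` twice.

```java
import java.util.Locale;
import java.util.Set;

class ShallowSplitter {
    // Stand-in for isInNounLex: here the lexicon is a plain set of
    // lowercase nouns (an assumption; the slides use a Lucene index).
    static boolean isInNounLex(String s, Set<String> lexicon) {
        return lexicon.contains(s.toLowerCase(Locale.GERMAN));
    }

    // Returns the parts joined by '+', or "" if the word cannot be
    // covered completely by lexicon entries.
    static String split(String word, Set<String> lexicon) {
        for (int i = 1; i < word.length(); i++) {
            String head = word.substring(0, i);
            if (isInNounLex(head, lexicon)) {
                String r = split(word.substring(i), lexicon);
                if (!r.isEmpty()) {
                    return head + "+" + r;
                }
            }
        }
        return isInNounLex(word, lexicon) ? word : "";
    }

    public static void main(String[] args) {
        // "\u00f6l" is "Oel" written with an o-umlaut; escapes keep the source ASCII.
        Set<String> lexicon = Set.of("\u00f6l", "preis", "donau", "dampf", "schiff", "fahrt");
        System.out.println(split("\u00D6lpreis", lexicon));
        System.out.println(split("Donaudampfschifffahrt", lexicon));
    }
}
```

Because the loop returns at the first prefix that leads to a complete segmentation, the shortest matching first element wins. With a lexicon that contains maler, ei and malerei, this sketch would return Maler+ei for Malerei.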
Enhanced Splitter workflow
● Cascading lexical resources
  Increases split correctness
  Improves overall correctness
● Lookup first
  Lexicalized elements
  Reduces the number of incorrect splits
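One way the two ideas could fit together is sketched below. The slides do not specify the order of the resources or exactly where the lookup happens, so the control flow here is an assumption: the whole word is first looked up in every lexicon, and only if it is unknown are the lexicons tried in order for a split. The class `CascadingSplitter` and the sample lexicons are hypothetical.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;

class CascadingSplitter {
    // Shallow recursive split against a single lexicon; "" means no segmentation found.
    static String split(String word, Set<String> lexicon) {
        for (int i = 1; i < word.length(); i++) {
            String head = word.substring(0, i);
            if (lexicon.contains(head)) {
                String rest = split(word.substring(i), lexicon);
                if (!rest.isEmpty()) return head + "+" + rest;
            }
        }
        return lexicon.contains(word) ? word : "";
    }

    // Lookup first, then cascade (assumed workflow, see the lead-in).
    static String splitCascading(String word, List<Set<String>> lexicons) {
        String w = word.toLowerCase(Locale.GERMAN);
        for (Set<String> lexicon : lexicons) {
            if (lexicon.contains(w)) return w;   // lexicalized word: do not split
        }
        for (Set<String> lexicon : lexicons) {
            String result = split(w, lexicon);   // first resource that yields a split wins
            if (!result.isEmpty()) return result;
        }
        return w;                                // unknown everywhere: leave unsplit
    }

    public static void main(String[] args) {
        // Hypothetical miniature lexicons standing in for Morphy and deWaC.
        Set<String> first = Set.of("maler", "ei", "malerei", "dampf", "schiff");
        Set<String> second = Set.of("hirn", "leistung");
        List<Set<String>> cascade = List.of(first, second);
        System.out.println(splitCascading("Malerei", cascade));
        System.out.println(splitCascading("Dampfschiff", cascade));
        System.out.println(splitCascading("Hirnleistung", cascade));
    }
}
```

In this sketch Malerei is returned whole because the lookup succeeds before any split is attempted, whereas the plain recursive split on the same lexicon would produce maler+ei.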
Splitter diagram
MuSiL Integration
Query input: Donaudampfschifffahrt
● Name Recognition: Donau | Dampfschifffahrt
● Morphological Analysis: Dampfschifffahrt_N
● Multilingual Dictionary, multiword recognition: Dampfschifffahrt_N
● Split and Translate (Splitter):
  1  EN: vapour_N | steam_N (...)        IT: vapore_nm | (...)
  2  EN: ship_N | (...)                  IT: nave_nf | (...)
  3  EN: drive_N | navigation_N (...)    IT: guida_nf | navigazione_nf (...)
● Multilingual Thesaurus
Evaluation
                              De Rijke/Monz Splitter    Our Splitter
                              deWaC       Morphy        deWaC       Morphy
Lexicon used                  6,201       16,141        6,207       16,135
Total splits                  2,723       4,851         2,022       4,517
Total non splits              3,478       11,290        4,185       11,618
Total NS wrong                1,322       50            1,404       50
Split due to lexical error    2,383       66            1,871       45
Split due to logic error      9           1,067         7           864
Correct splits                331         3,718         144         3,608
Split correctness             12.16%      76.64%        7.12%       79.88%
Correct elements              2,487       14,958        2,925       15,176
Correctness                   40.11%      92.67%        47.12%      94.06%
● Total correctness improved
● Achieved by increasing the number of non-splits, with both deWaC and Morphy
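The derived rows of the table can be reproduced from the raw counts. The formulas below are inferred from the figures rather than stated on the slide: split correctness is correct splits over total splits, correct elements are correct splits plus the non-splits that were not wrong, and correctness is correct elements over all evaluated words. The class `SplitMetrics` is a hypothetical helper.

```java
import java.util.Locale;

class SplitMetrics {
    // Percentage of produced splits that are correct.
    static double splitCorrectness(int correctSplits, int totalSplits) {
        return 100.0 * correctSplits / totalSplits;
    }

    // Correct splits plus the words that were rightly left unsplit.
    static int correctElements(int correctSplits, int totalNonSplits, int wrongNonSplits) {
        return correctSplits + (totalNonSplits - wrongNonSplits);
    }

    // Percentage of all evaluated words that were handled correctly.
    static double correctness(int correctElements, int totalWords) {
        return 100.0 * correctElements / totalWords;
    }

    public static void main(String[] args) {
        // "Our Splitter" with the Morphy lexicon, counts taken from the table.
        int totalWords = 16135, totalSplits = 4517, nonSplits = 11618, nsWrong = 50, correctSplits = 3608;
        int correct = correctElements(correctSplits, nonSplits, nsWrong);
        System.out.printf(Locale.ROOT, "split correctness %.2f%%, correct elements %d, correctness %.2f%%%n",
                splitCorrectness(correctSplits, totalSplits), correct, correctness(correct, totalWords));
    }
}
```

Read this way, the deWaC columns show why overall correctness can rise while split correctness falls: fewer words are split at all, and correctly unsplit words count towards the total.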
Complexity of the split function
• De Rijke/Monz
  – Best case: we scan the input word from the first to the last position
    $T(n) = (n - 1) = O(n)$
  – Worst case: recursive calls to split, exponential growth
    $T(n) = \sum_{i=0}^{n-1} 2^{i} = 2^{n} - 1 = O(2^{n})$
• Our splitter
  – Best case: the word is found immediately in the lexical resources of nouns
    $T(n) = O(1)$
  – Worst case: the function is executed recursively every time we encounter a word in the lexicon and the remaining substring is not empty (see De Rijke/Monz)
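The exponential worst case can be observed by counting calls on an adversarial input. The sketch below is an assumption-laden experiment, not part of the original evaluation: the lexicon holds a, aa, aaa, ... and the word is a run of a's followed by x, so every prefix is a lexicon hit but no complete segmentation exists. The class `SplitCallCounter` is hypothetical, and this version calls `split` once per matching prefix.

```java
import java.util.HashSet;
import java.util.Set;

class SplitCallCounter {
    static long calls = 0;

    // Same shallow recursive split as in the algorithm slide, instrumented with a call counter.
    static String split(String word, Set<String> lexicon) {
        calls++;
        for (int i = 1; i < word.length(); i++) {
            if (lexicon.contains(word.substring(0, i))) {
                String r = split(word.substring(i), lexicon);
                if (!r.isEmpty()) return word.substring(0, i) + "+" + r;
            }
        }
        return lexicon.contains(word) ? word : "";
    }

    // Number of split calls for the input a...ax of length n
    // with the lexicon {a, aa, ..., a^(n-1)}.
    static long countCalls(int n) {
        Set<String> lexicon = new HashSet<>();
        for (int k = 1; k < n; k++) lexicon.add("a".repeat(k));
        calls = 0;
        split("a".repeat(n - 1) + "x", lexicon);
        return calls;
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 12; n++) {
            System.out.println(n + " characters: " + countCalls(n) + " calls");
        }
    }
}
```

On this input the call count doubles with every additional character, which is consistent with the exponential growth stated for the worst case. Real German nouns rarely come close to this behaviour, since few prefixes of a word are themselves lexicon entries.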
Performance on MuSiL
● Increased number of retrieved documents
● More relevant documents are top ranked
                                  Without splitter component      With splitter component
Query                             DE    IT    EN    Precision     DE    IT    EN    Precision
Abenteuer+Geschichten             1     0     0     100%          10    57    2     51%
Beruf+Orientierung                13    0     0     100%          13    392   24    43%
Kommunikation+Politik             -     -     -     -             28    69    317   47%
Wert+Papier+Handel+Gesetz         4     0     0     100%          0     36    90    17%
Doppel+Besteuerung+Abkommen       4     0     2     100%          0     0     15    100%
Aufmerksamkeit+Defizit+Syndrom    44    0     25    36%           3     0     8     73%
Hirn+Leistung+Training            15    0     0     100%          4     48    189   46%
Kunst+Erziehung+Bewegung          1     0     0     100%          144   478   251   36%
Emotion+Regulierung               1     0     0     100%          0     3     66    71%
Unternehmen+Netzwerke             8     0     25    61%           73    61    677   27%
Conclusion and future work
● Good:
  Cascade method
  Deals with lexicalized elements
● Open topics:
  Choosing the correct segmentation among alternatives
  Metrics for the correctness of a segmentation
● Weights, probability …