The Challenge of Morphology
Mapudungun (Indigenous Language of Chile and Argentina, ~1 Million Speakers)
Allkütulekefun
Allkütu -le -ke -fu -n
Allkütu   Listen
-le       prog.
-ke       habitual
-fu       past
-n        indic.1sg
I used to listen
The Challenge of Morphology
Tasks for Morphology
• Segment Words
• Map Morphemes onto Features
• Learn these tasks
– unsupervised
– from data
– for any language
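The two tasks can be made concrete on the Mapudungun example. The sketch below is only an illustration: the suffix lexicon is hand-written from the gloss of Allkütulekefun (the goal stated above is to learn it from data), and the names `SUFFIX_FEATURES`, `STEMS`, `segment`, and `gloss` are hypothetical.

```python
# Hypothetical hand-written lexicon, copied from the gloss of Allkütulekefun.
# In the unsupervised setting this table would have to be induced from data.
SUFFIX_FEATURES = {
    "le": "prog.",
    "ke": "habitual",
    "fu": "past",
    "n": "indic.1sg",
}
STEMS = {"allkütu": "listen"}


def _split_suffixes(tail):
    """Split `tail` into a sequence of known suffixes (depth-first search)."""
    if not tail:
        return []
    for suffix in SUFFIX_FEATURES:
        if tail.startswith(suffix):
            rest = _split_suffixes(tail[len(suffix):])
            if rest is not None:
                return [suffix] + rest
    return None


def segment(word):
    """Task 1: return [stem, suffix, ...], or None if no full analysis exists."""
    word = word.lower()
    for stem in STEMS:
        if word.startswith(stem):
            rest = _split_suffixes(word[len(stem):])
            if rest is not None:
                return [stem] + rest
    return None


def gloss(morphemes):
    """Task 2: map each morpheme onto its feature label."""
    stem, *suffixes = morphemes
    return [(stem, STEMS[stem])] + [(s, SUFFIX_FEATURES[s]) for s in suffixes]


morphemes = segment("Allkütulekefun")
print(morphemes)
print(gloss(morphemes))
```

The recursive split matters for agglutinative words: four suffixes are stacked on one stem, so stripping a single ending is not enough.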
Leverage the Natural Structure of Morphology
• Paradigm
– Set of affixes that interchangeably attach to a set of stems
– English Example
  • Regular Verbs: Ø.s.ing.ed
  • Regular Adj: Ø.er.est
Example Vocabulary
blame blamed blames roamed roaming roams solve solves solving

Suffixes      Stems
Ø.s.d         blame
Ø.s           blame, solve
Ø.d           blame
s.d           blame
Ø             blame, blames, blamed, roams, roamed, roaming, solve, solves, solving
s             blame, roam, solve
d             blame, roame
e.es.ed       blam
e.es          blam, solv
e.ed          blam
es.ed         blam
e             blam, solv
es            blam, solv
ed            blam, roam
me.mes.med    bla
me.mes        bla
me.med        bla
mes.med       bla
me            bla
mes           bla
med           bla, roa
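The schemes for the example vocabulary can be reproduced mechanically. The sketch below assumes that every split of a word into a non-empty stem and a (possibly empty) suffix is a candidate, and that the stems of a scheme are exactly those that occur with all of its suffixes. The function names are hypothetical.

```python
from collections import defaultdict

VOCABULARY = [
    "blame", "blamed", "blames", "roamed", "roaming",
    "roams", "solve", "solves", "solving",
]
NULL = "Ø"  # the empty suffix


def stems_by_suffix(words):
    """Map each candidate suffix to the set of stems it attaches to."""
    table = defaultdict(set)
    for word in words:
        # Every cut point yields a candidate; the stem must be non-empty.
        for cut in range(1, len(word) + 1):
            stem, suffix = word[:cut], word[cut:]
            table[suffix or NULL].add(stem)
    return table


def scheme(suffixes, table):
    """Return (scheme name, stems that occur with every listed suffix)."""
    stem_sets = [table.get(s, set()) for s in suffixes]
    stems = set.intersection(*stem_sets) if stem_sets else set()
    return ".".join(suffixes), sorted(stems)


table = stems_by_suffix(VOCABULARY)
for suffixes in (["Ø", "s"], ["Ø", "s", "d"], ["s"], ["e", "es"], ["me", "mes", "med"]):
    print(scheme(suffixes, table))
```

Most of these splits are linguistically spurious (for example me.mes.med with stem bla), which is why the schemes have to be searched and filtered rather than used directly.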
Spanish Newswire Corpus
40,011 Tokens
6,975 Types

Suffixes        Stem Type Count   Stems
a.as.o.os.tro   1                 cas
a.as.o.os       43                african, cas, jurídic, l, ...
a.as.o          59                cas, citad, jurídic, l, ...
a.as.os         50                afectad, cas, jurídic, l, ...
a.o.os          105               impuest, indonesi, italian, jurídic, ...
as.o.os         54                cas, implicad, jurídic, l, ...
a.as            199               huelg, incluid, industri, inundad, ...
a.o             214               id, indi, indonesi, inmediat, ...
a.os            134               impedid, impuest, indonesi, inundad, ...
as.o            85                intern, jurídic, just, l, ...
as.os           68                cas, implicad, inundad, jurídic, ...
o.os            268               human, implicad, indici, indocumentad, ...
a.tro           2                 cas, cen
a               1237              huelg, ib, id, iglesi, ...
as              404               huelg, huelguist, incluid, industri, ...
o               1139              hub, hug, human, huyend, ...
os              534               humorístic, human, hígad, impedid, ...
tro             16                catas, ce, cen, cua, ...
Reading a scheme in the network
• Suffixes: a.as.o.os
• Stem Type Count: 43
• Stems: african, cas, jurídic, l, ...
• Level = number of suffixes (Level 5 = 5 suffixes)
• a.as.o.os is the Adjective Inflection Class
From the spurious suffix “tro”
Suffixes        Stem Type Count   Stems
a.as.o.os.tro   1                 cas
a.tro           2                 cas, cen
tro             16                catas, ce, cen, cua, ...
Basic Search Procedure
[Figure: scheme network; axis labels: Decreasing Stem Count, Increasing Suffix Count]
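The slide gives only the two directions of the search (more suffixes, fewer stems), not its details. One way such a search might look is a greedy climb from a single-suffix scheme, adding at each step the suffix that keeps the most stems. Everything in this sketch is an assumption, in particular the greedy choice, the `min_stems` stopping threshold, and the function names; it is not the procedure from the talk.

```python
from collections import defaultdict

NULL = "Ø"  # the empty suffix
VOCABULARY = [
    "blame", "blamed", "blames", "roamed", "roaming",
    "roams", "solve", "solves", "solving",
]


def stems_by_suffix(words):
    """Map each candidate suffix to the set of stems it attaches to."""
    table = defaultdict(set)
    for word in words:
        for cut in range(1, len(word) + 1):
            table[word[cut:] or NULL].add(word[:cut])
    return table


def climb(start, table, min_stems=2):
    """Greedily add suffixes to `start` while at least `min_stems` stems survive.

    Each step moves one level up the network: the suffix count grows by one
    and the stem count can only stay equal or shrink.
    """
    suffixes = [start]
    stems = set(table[start])
    while True:
        best, best_stems = None, set()
        for candidate in sorted(table):  # sorted for deterministic ties
            if candidate in suffixes:
                continue
            kept = stems & table[candidate]
            if len(kept) > len(best_stems):
                best, best_stems = candidate, kept
        if best is None or len(best_stems) < min_stems:
            break
        suffixes.append(best)
        stems = best_stems
    # List the null suffix first, as the scheme names do.
    ordered = sorted(suffixes, key=lambda s: (s != NULL, s))
    return ordered, sorted(stems)


table = stems_by_suffix(VOCABULARY)
print(climb("s", table, min_stems=2))
print(climb("s", table, min_stems=1))
```

On the toy vocabulary, a threshold of two stems stops the climb at Ø.s, while a threshold of one continues to a scheme supported by the single stem blame. This is the trade-off between suffix count and stem count that the axes describe.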
Scaling Up
• Scaling Up
– 1 Million word corpus
– Network built on demand
• New Approach to Search
– High Recall initial search
– Weed the results to improve precision
• Results
– Boost Recall of Suffixes in Spanish
  • from 0.5 to 0.8
– But very low precision currently
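Recall and precision of suffixes can be read as set overlap between the induced suffixes and a reference list. The sketch below assumes that simple definition; the slides do not say how the figures were computed. The suffix sets are made-up toy data, not the results reported above.

```python
def precision_recall(induced, gold):
    """Precision and recall of an induced suffix set against a gold set."""
    induced, gold = set(induced), set(gold)
    correct = induced & gold
    precision = len(correct) / len(induced) if induced else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall


# Toy data (hypothetical): three of five induced suffixes are in the gold set.
gold_suffixes = {"a", "as", "o", "os"}
induced_suffixes = {"a", "as", "o", "tro", "ce"}
print(precision_recall(induced_suffixes, gold_suffixes))
```

A high-recall initial search enlarges the induced set, which tends to pull in spurious suffixes such as tro. The weeding step would then aim to remove those without losing the correct ones.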
Top Examples of Selected Schemes
1 Million Words of Spanish

Part of Speech       # of Suffixes   Suffixes
Noun                 2               Ø.s
Adjective            4               a.as.o.os
Verb (-ar)           16              Ø.ba.ban.da.das.do.dos.n.ndo.r.ron.rse.rá.rán.ría.rían
Noun                 2               Ø.es
Verb (-ar)           18              a.aba.aban.ada.adas.ado.ados.an.ando.ar.ara.aron.arse.ará.arán.e.en.ó
Verb (-ar/-er/-ir)   9               Ø.a.emos.on.se.á.án.ía.ían
Nominalization       2               ones.ón
Noun                 2               l.les
Next Steps for Morphology Induction
• Clean the Selected Schemes
– Current Work
• Convert Paradigms into a Segmenter
– Soon
• Agglutinative sequences of suffixes
– Soon
• Learn Mappings from Morphemes to Features
– Future Goal
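As a rough idea of what converting paradigms into a segmenter could involve, the sketch below looks a word up against a set of schemes and returns every stem + suffix split that some scheme licenses. The segmenter is listed above as future work, so this design is purely an assumption. `SCHEMES` holds two schemes from the English example, and `analyses` is a hypothetical name.

```python
NULL = "Ø"  # the empty suffix

# Two schemes taken from the English example vocabulary: suffixes -> stems.
SCHEMES = {
    ("Ø", "s"): {"blame", "solve"},
    ("e", "es"): {"blam", "solv"},
}


def analyses(word, schemes):
    """Return every (stem, suffix) split of `word` licensed by some scheme."""
    results = []
    for suffixes, stems in schemes.items():
        for suffix in suffixes:
            ending = "" if suffix == NULL else suffix
            if ending and not word.endswith(ending):
                continue
            stem = word[:len(word) - len(ending)]
            if stem in stems:
                results.append((stem, suffix))
    return results


print(analyses("solves", SCHEMES))
print(analyses("blame", SCHEMES))
```

Both example words receive two competing analyses (for example solve + s and solv + es). This is one reason the schemes need cleaning before they can drive a segmenter.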