1
On the Ambiguity of Serbian Texts and Methods to
disambiguate it
Cvetana Krstev, Duško Vitas,
University of Belgrade
8th Intex/Nooj Workshop
2
What is the ambiguity?
• the assignment of different lemmas
• the assignment of different grammatical categories
3
The ambiguity in Serbian
In Serbian, many word forms are homographs although not homophones, because stress marks are not recorded:
gőre adv. up
gőrē adv. worse
gòrē P3s goreti,V+Ek to burn
gòre A3s
gòrē P3s gorjeti,V+Ijk to burn
gòre A3s
gòre fs2 gora forest
      short  long
up    ő      ô
down  ò      ó
gore
4
The ambiguity in Serbian (2)
rodoslovna,rodoslovni.A2+PosQ:akms2g:akms4v:aefs1g:aefs5g:akns2g:aenp1g:aenp4g:aenp5g
rodoslovne,rodoslovni.A2+PosQ:aemp4g:aefs2g:aefp1g:aefp4g:aefp5g
rodoslovni,rodoslovni.A2+PosQ:adms1g:aems4q:aems5g:aemp1g:aemp5g
rodoslovnih,rodoslovni.A2+PosQ:aemp2g:aefp2g:aenp2g
rodoslovnim,rodoslovni.A2+PosQ:aems6g:aemp3g:aemp6g:aemp7g:aefp3g:aefp6g:aefp7g:aens6g:aenp3g:aenp6g:aenp7g
rodoslovnima,rodoslovni.A2+PosQ:aemp3g:aemp6g:aemp7g:aefp3g:aefp6g:aefp7g:aenp3g:aenp6g:aenp7g
rodoslovno,rodoslovni.A2+PosQ:aens1g:aens4g:aens5g
rodoslovnog,rodoslovni.A2+PosQ:adms2g:adms4v:adns2g
rodoslovnoga,rodoslovni.A2+PosQ:adms2g:adms4v:adns2g
rodoslovnoj,rodoslovni.A2+PosQ:aefs3g:aefs7g
rodoslovnom,rodoslovni.A2+PosQ:adms3g:adms7g:aefs6g:adns3g:adns7g
…
9 sets of grammatical categories (for example, rodoslovnima)
e: the form is the same for definite and indefinite
g: the form is the same for animate and inanimate
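The entries above follow a `form,lemma.CODE+Marker:category:category` layout. A minimal sketch of how such a line might be split into its parts, so that the number of category sets per form can be counted, is given below. The function name `parse_entry` and the returned field names are illustrative assumptions, as is the handling of an empty lemma (taken here to mean "same as the form").

```python
def parse_entry(line):
    """Split 'form,lemma.CODE+Marker:cat1:cat2' into its parts (illustrative)."""
    form, rest = line.split(",", 1)
    head, *categories = rest.split(":")      # head is 'lemma.CODE+Marker...'
    lemma, _, code = head.partition(".")
    pos, *markers = code.split("+")
    return {
        "form": form,
        "lemma": lemma or form,              # assumption: empty lemma = same as form
        "pos": pos,
        "markers": markers,
        "categories": categories,
    }


entry = parse_entry(
    "rodoslovnima,rodoslovni.A2+PosQ:"
    "aemp3g:aemp6g:aemp7g:aefp3g:aefp6g:aefp7g:aenp3g:aenp6g:aenp7g"
)
print(entry["lemma"], entry["pos"], len(entry["categories"]))
# rodoslovni A2 9
```

Counting `categories` per form is one simple way to see how much of the ambiguity is purely grammatical rather than lexical.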
5
Disambiguation process
• Reconstructing word forms
• Using filter dictionaries
• Using restricted dictionaries
• Using dictionaries of compounds
• Using disambiguation grammars
6
Reconstructing word forms – date adverbial phrases
7
Reconstructing word forms – date adverbial phrases (2)
i izdavanxem YUBA kartica 20. februara 2002. godine.
celog sistema. Zato je josx pocyetkom 1996. godine jedan
i www.plivamed.net. U petom mjesecu 2001.godine smo oformlx
cxe biti odrzxan u novembru ove godine u Neumu, a za prvog
         Simple forms  Assoc. lemmas  ratio  Lemmas + categ.  ratio
Before   54196         86915          1.60   174079           3.21
After    54126         86768          1.60   173727           3.21
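The two ratio columns appear to be the number of associated lemmas, and the number of lemma plus category readings, each divided by the number of simple forms. A short sketch that reproduces the figures reported above under that assumption (the function name `ambiguity_ratios` is illustrative):

```python
def ambiguity_ratios(simple_forms, assoc_lemmas, lemmas_with_categories):
    """Return (lemmas per form, lemma+category readings per form), rounded to 2 places."""
    return (
        round(assoc_lemmas / simple_forms, 2),
        round(lemmas_with_categories / simple_forms, 2),
    )


print(ambiguity_ratios(54196, 86915, 174079))  # (1.6, 3.21)
print(ambiguity_ratios(54126, 86768, 173727))  # (1.6, 3.21)
```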
8
Reconstructing word forms – forms written with digits, etc.
9
Reconstructing word forms – forms written with digits(2)
sxkovi iznosili oko 500 hilxada maraka. Znacyajna usxteda
poput SAP-ovog ili IBM-ovog, dobijate i organizaciju firme
cyelicyne industrije 1890-ih nije postojao. Ali, poznata je
sveta drma tezxinom od 81,7 milijardi dolara u 160 zemalxa,
odnosno ukupno bezmalo pola milijarde (464 miliona)! Predxe
         Simple forms  Assoc. lemmas  ratio  Lemmas + categ.  ratio
Before   54126         86768          1.60   173727           3.21
After    54064         86507          1.60   173693           3.21
10
Using filter dictionaries
mi,ja.PRO01+Prs:sx3i
mi,mi.PRO03+Prs:px1r
mi,miti.V35+Imperf+Tr+Iref+Ref:Ays:Azs
li,li.PAR
li,liti.V98+Imperf+Tr+It+Iref:Ays:Azs
11
Using filter dictionaries (2)
Very cautious filter dictionary with only 41 entries:
         Simple forms  Assoc. lemmas  ratio  Lemmas + categ.  ratio
Before   54064         86507          1.60   173693           3.21
After    53858         81607          1.52   166908           3.10
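One way to picture a filter dictionary is as a small, high-priority lookup: when a form is listed in it, only the readings listed there survive, and all other forms are left untouched. The sketch below illustrates that idea only. Which readings of "mi" the authors' 41-entry filter actually keeps is not stated in the slides, so the choice here (keep the pronouns, drop the verb "miti") is an assumption, and `apply_filter` is a hypothetical name.

```python
def apply_filter(readings, filter_dictionary):
    """For forms listed in the filter, keep only the filter's readings."""
    return {
        form: filter_dictionary.get(form, entries)
        for form, entries in readings.items()
    }


readings = {
    "mi": [
        "mi,ja.PRO01+Prs:sx3i",
        "mi,mi.PRO03+Prs:px1r",
        "mi,miti.V35+Imperf+Tr+Iref+Ref:Ays:Azs",
    ],
    "li": [
        "li,li.PAR",
        "li,liti.V98+Imperf+Tr+It+Iref:Ays:Azs",
    ],
}

# Assumed filter content: only the pronoun readings of "mi" are retained.
filter_dictionary = {
    "mi": ["mi,ja.PRO01+Prs:sx3i", "mi,mi.PRO03+Prs:px1r"],
}

filtered = apply_filter(readings, filter_dictionary)
print(len(filtered["mi"]), len(filtered["li"]))  # 2 2
```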
12
Using restricted dictionaries
• Dictionaries contain lemmas for both standard pronunciations, Ekavian and Ijekavian. Texts, however, are usually written in only one.
• Dictionaries contain lemmas for both the Serbian and the Croatian language (or variants of Serbo-Croatian)
13
Using restricted dictionaries (2)
crvene,crven.A17+Col:aemp4g:aefs2g:aefp1g:aefp4g:aefp5g
crvene,crveneti.V547+Imperf+It+Iref+Ref+Ek:Pzp:Ays:Azs
crvene,crveniti.V54+Imperf+Tr+Iref:Pzp
crvene,crvenxeti.V747+Imperf+It+Iref+Ref+Ijk:Pzp
         Simple forms  Assoc. lemmas  ratio  Lemmas + categ.  ratio
Before   53858         81607          1.52   166908           3.10
After    53809         80890          1.50   165546           3.08
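The restriction can be sketched as dropping every entry that carries the marker of the pronunciation not used in the text. The example below assumes an Ekavian text and therefore removes entries marked `+Ijk`; the function name `restrict` is illustrative, and the marker is read from the part of the entry before the first colon.

```python
def restrict(entries, excluded_marker):
    """Drop entries whose lemma code carries the excluded marker (e.g. 'Ijk')."""
    kept = []
    for entry in entries:
        head = entry.split(":", 1)[0]        # 'form,lemma.CODE+Marker+...'
        markers = head.split("+")[1:]
        if excluded_marker not in markers:
            kept.append(entry)
    return kept


crvene = [
    "crvene,crven.A17+Col:aemp4g:aefs2g:aefp1g:aefp4g:aefp5g",
    "crvene,crveneti.V547+Imperf+It+Iref+Ref+Ek:Pzp:Ays:Azs",
    "crvene,crveniti.V54+Imperf+Tr+Iref:Pzp",
    "crvene,crvenxeti.V747+Imperf+It+Iref+Ref+Ijk:Pzp",
]

ekavian_only = restrict(crvene, "Ijk")
print(len(ekavian_only))  # 3
```

Entries with no pronunciation marker at all, such as the adjective reading, are kept in either case.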
14
Using dictionary of compounds
bez obzira na,bez obzira na.PREP+C+Ncn+p4
bez,bez.PREP+p2
na,na.INT
na,na.PREP+p4+p7
obzira,obzir.N1:ms2q:mp2q
obzira,obzirati.V519+Imperf+It+Ref:Ays:Azs
         Simple forms  Assoc. lemmas  ratio  Lemmas + categ.  ratio
Before   53809         80890          1.50   165546           3.08
After    48698         72597          1.49   147714           3.03
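A rough way to see why the number of forms drops at this step is a longest-match pass that merges a token sequence into a single unit whenever it is listed as a compound. The sketch below assumes that the compound reading always takes precedence over the readings of its parts, which the slides do not state explicitly; `merge_compounds` and `max_len` are hypothetical names.

```python
def merge_compounds(tokens, compounds, max_len=4):
    """Greedy longest-match: replace listed token sequences by one compound unit."""
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, down to sequences of two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in compounds:
                out.append(candidate)
                i += n
                break
        else:
            # No compound starts here: keep the simple form.
            out.append(tokens[i])
            i += 1
    return out


compounds = {"bez obzira na"}
print(merge_compounds(["bez", "obzira", "na", "sve"], compounds))
# ['bez obzira na', 'sve']
```

Once "bez obzira na" is treated as one unit, the separate readings of "bez", "na" and "obzira" no longer need to be considered at that position.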
15
Using disambiguation grammars – positional constraint
The word form is an interjection if it is followed by an exclamation mark.
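The positional rule can be sketched in a few lines: if a token is immediately followed by "!", and one of its readings is an interjection, only the interjection readings are kept. This is an illustration of the constraint, not the grammar itself; the use of "na" as the example and the helper names are assumptions.

```python
def pos_of(entry):
    """Return the part-of-speech prefix of 'form,lemma.CODE+...:cats' (e.g. 'INT')."""
    head = entry.split(":", 1)[0]
    code = head.split(",", 1)[1].split(".", 1)[1]
    return code.split("+", 1)[0]


def keep_interjection_before_exclamation(tokens, readings):
    """tokens and readings are parallel lists; readings[i] lists entries for tokens[i]."""
    result = []
    for i, entries in enumerate(readings):
        followed_by_mark = i + 1 < len(tokens) and tokens[i + 1] == "!"
        interjections = [e for e in entries if pos_of(e) == "INT"]
        result.append(interjections if followed_by_mark and interjections else entries)
    return result


na = ["na,na.INT", "na,na.PREP+p4+p7"]
print(keep_interjection_before_exclamation(["na", "!"], [na, []]))
# [['na,na.INT'], []]
```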
16
Using disambiguation grammars – positional constraint (2)
After a sentence or phrase boundary, “mi” and “ti” are personal pronouns in the nominative case (once the other possibilities have been excluded)
17
Using disambiguation grammars – sequential constraint
“da” is a conjunction (and not a form of the verb dati, "to give") if it is followed by an auxiliary verb in clitic form
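A sketch of this sequential constraint: when "da" is directly followed by a clitic form of an auxiliary, only its conjunction reading is kept. The clitic list below is deliberately partial (present-tense clitics of "jesam" only) and is an assumption, as are the helper names; a real grammar would consult the dictionary for auxiliary clitics.

```python
# Assumed, partial list of auxiliary clitics.
AUX_CLITICS = {"sam", "si", "je", "smo", "ste", "su"}


def pos_of(entry):
    """Return the part-of-speech prefix of 'form,lemma.CODE+...:cats'."""
    head = entry.split(":", 1)[0]
    code = head.split(",", 1)[1].split(".", 1)[1]
    return code.split("+", 1)[0]


def disambiguate_da(tokens, readings):
    """Keep only CONJ readings of 'da' when the next token is an auxiliary clitic."""
    result = []
    for i, entries in enumerate(readings):
        next_is_clitic = i + 1 < len(tokens) and tokens[i + 1] in AUX_CLITICS
        if tokens[i] == "da" and next_is_clitic:
            conjunctions = [e for e in entries if pos_of(e) == "CONJ"]
            entries = conjunctions or entries
        result.append(entries)
    return result


da = [
    "da,.CONJ",
    "da,.ADV",
    "da,.INT",
    "da,.PAR",
    "da,dati.V103+Perf+Tr+Iref+Ref:Pzs:Ays:Azs",
]
tokens = ["samo", "da", "je", "prihvatila"]
print(disambiguate_da(tokens, [[], da, [], []])[1])
# ['da,.CONJ']
```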
18
Using disambiguation grammars – sequential and positional constraints
sxargarepe evropska unija ne samo da je prihvatila nasxu i
da,.CONJ
da,.ADV
da,.INT
da,.PAR
da,dati.V103+Perf+Tr+Iref+Ref:Pzs:Ays:Azs
         Forms   Assoc. lemmas  ratio  Lemmas + categ.  ratio
Before   48698   72595          1.49   147714           3.03
After    48698   71809          1.47   146491           3.01
19
Using disambiguation grammars – agreement
An adjective, possessive pronoun or numeral has to agree in gender, number, and case with the noun that follows.
20
Using disambiguation grammars – agreement (2)
povecxati nxegov proboj u regionu. Rumunska proporcija
u,.PREP+p2
u,.PREP+p4
u,.PREP+p7
regionu,region.N1:ms3q
regionu,region.N1:ms7q
         Forms   Assoc. lemmas  ratio  Lemmas + categ.  ratio
Before   48698   71809          1.47   146491           3.01
After    48698   66284          1.36   129167           2.65
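The example "u regionu" suggests that case agreement is also checked between a preposition and the noun it governs: the cases allowed by the preposition are intersected with the cases of the noun readings, and only the compatible readings remain. The sketch below illustrates that intersection. It assumes that `+pN` on a preposition marks a governed case and that the digit in a noun category code is its case; the function names are hypothetical.

```python
import re


def prep_cases(entry):
    """Cases governed by a preposition entry, read from its '+pN' markers."""
    return set(re.findall(r"\+p(\d)", entry))


def noun_cases(entry):
    """Cases of a noun entry, read from the digit in each category code."""
    categories = entry.split(":")[1:]
    return {digit for cat in categories for digit in re.findall(r"\d", cat)}


def agree(prep_entries, noun_entries):
    """Keep only preposition and noun readings that share a case."""
    shared = (
        set().union(*map(prep_cases, prep_entries))
        & set().union(*map(noun_cases, noun_entries))
    )
    preps = [e for e in prep_entries if prep_cases(e) & shared]
    nouns = [e for e in noun_entries if noun_cases(e) & shared]
    return preps, nouns


u = ["u,.PREP+p2", "u,.PREP+p4", "u,.PREP+p7"]
regionu = ["regionu,region.N1:ms3q", "regionu,region.N1:ms7q"]
print(agree(u, regionu))
# (['u,.PREP+p7'], ['regionu,region.N1:ms7q'])
```

Here five readings are reduced to two, because case 7 is the only case that both the preposition and the noun allow.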
21
Using disambiguation grammars – agreement of personal names
Special rules for the agreement of first name and surname
22
Using disambiguation grammars – agreement of personal names (2)
raspalio je Mladxan Dinkicx sxakom o okrugli sto "Platne kartice -
Mladxan,Mladxan.N1002+Hum+NProp+First+SR:ms1v
Mladxan,mladxan.A7:akms1g:akms4q
Dinkicx,Dinkicx.N28+NProp+Hum+Last+SR:ms1v
         Forms   Assoc. lemmas  ratio  Lemmas + categ.  ratio
Before   48698   66284          1.36   129167           2.65
After    48698   66255          1.36   129101           2.65
23
The order of grammar application
(Figure: the grammar on the left is applied first; the grammar on the right is applied second.)
24
Careful construction of grammars
Syntactic ambiguity:
Zalagacxu se da ti trosxkovi budu minimalni.
I will do my best to minimize these expenses.
I will do my best to minimize your expenses.
Although some cases are much more frequent...
Kličke je bio voljan da da automobil.
Kličke was willing to give the car.
Mislio sam da ti tvoja gospođa ne da da je viđaš.
I thought that your missus does not let you see her.
25
Thank you!