Building the Turkish Discourse Bank
Deniz Zeyrek and Cem Bozşahin Cognitive Science, Informatics Institute, METU, Ankara
PDTB 2012 Workshop, April 30-May 2 2012, IRCS, University of Pennsylvania
Joint work with Işın Demirşahin, Ayışığı B. Sevdik-Çallı, İhsan Yalçınkaya, Ümit Deniz Turan, Ruken Çakıcı, Hale Balaban-Ögel, Berfin Aktaş
Middle East Technical University
PDTB style annotation (Prasad et al., 2008)
Guidelines (a set of explicit connectives to annotate)
Discourse use of connectives
Abstract object criterion (Asher, 1993)
Predicate-argument structure (Arg1-Arg2-Conn)
Minimality principle 4/30/12 3
Turkish Discourse Bank (TDB): first release February, 2011
8483 annotations on a ~400,000-word corpus
METU Turkish Corpus: ~2 million words
Can be requested from: www.tdb.ii.metu.edu.tr.
includes raw text files, annotation files, annotation guidelines, and a browser
The tagset
Conn Connective
Arg1 First argument of the connective
Arg2 Second argument of the connective
Supp1 Supplement to the first argument
Supp2 Supplement to the second argument
Shared The subject, object or adverbial phrase shared by a relation
Shared supp Supplement for the shared material
Mod Modifier of the connective or the modifier of the relation
Annotation and browser tools
Annotation tool
– produces XML files as annotation data
– uses stand-off annotation method
Browser
– supports regexp search across files
– connective-arg visualization
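Stand-off annotation keeps the raw text untouched and stores only pointers into it. The sketch below illustrates the idea with character offsets; the element names reuse the tagset labels (Conn, Arg1, Arg2), but the `Relation` wrapper, the `begin`/`end` attributes and the helper functions are assumptions made for illustration and are not the actual TDB file format.

```python
import xml.etree.ElementTree as ET

# Raw text stays in its own file; here it is a string for brevity.
raw_text = "Zincirleri çözülmemişti, ama her an koparabilirlerdi."


def span_of(text, fragment):
    """Return (begin, end) character offsets of the first occurrence of fragment."""
    begin = text.index(fragment)
    return begin, begin + len(fragment)


def build_relation(text, conn, arg1, arg2):
    """Build a hypothetical stand-off record: offsets only, no copied text."""
    relation = ET.Element("Relation")
    for tag, fragment in (("Conn", conn), ("Arg1", arg1), ("Arg2", arg2)):
        begin, end = span_of(text, fragment)
        ET.SubElement(relation, tag, begin=str(begin), end=str(end))
    return relation


def resolve(text, relation):
    """Recover the annotated strings by slicing the raw text with the stored offsets."""
    return {
        child.tag: text[int(child.get("begin")):int(child.get("end"))]
        for child in relation
    }


relation = build_relation(
    raw_text, "ama", "Zincirleri çözülmemişti", "her an koparabilirlerdi"
)
xml_string = ET.tostring(relation, encoding="unicode")
restored = resolve(raw_text, ET.fromstring(xml_string))
```

Because the XML holds offsets rather than text, several annotators can annotate the same raw file independently, and overlapping or discontinuous spans do not disturb the source.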
Markables
Coordinating conjunctions
Subordinators
– complex subordinators (için ‘for’, karşın ‘although, despite’)
– simplex subordinators, i.e. converbs (-IncA ‘when’, -ken ‘while, now that’)
Discourse adverbials
A simplex subordinator: -dEn ‘since’
Arg2-Conn-Arg1
Ne kadar üzüleceğini bil-DIĞ-İm-Den, yarına erteledim yolculuğumu.
How much you would worry I knew-Dık-Agr-Conn, I postponed my journey for tomorrow.
A complex subordinator: kadar ‘until’ (two-part connectives)
Arg2-Conn-Arg1
Gergin gövdemi uykulu bir karanlığın içinde hafiflemiş, erimiş duy-Unca-Ya kadar bekledim.
I waited until I felt-Conn-Dat my stressed body relaxing and melting in darkness.
Word-order flexibility: the shared tag
Arg2-Conn-Arg1
Sınırlı olmasına rağmen {bu devrimci kongreler}, sarayın değil, halkın demokratik ihtilalinin eseriydiler.
Despite the fact that they were limited, {these revolutionist congresses} were not a result of the empire but the people’s democratic rebellion.
Annotation procedure
The regexp search returns various uses of one root – neden ‘reason’ returns:
• subordinator tokens: – V-mEsI neden-iy-le ‘reason+POSS+INS’ (‘V-NOM due to’)
• phrasal expressions: – neden-le ‘reason+INS’ (as in bu nedenle ‘due to this’)
POSS: possessive, INS: instrumental, NOM: nominalizing suffix
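A root-based search of this kind can be sketched with a regular expression that retrieves every form built on neden and then sorts the hits into candidate classes for the annotators. The pattern, the form lists and the sample text are assumptions made for illustration (the first sentence adapts an example from this talk, the other two are constructed); this is not the query used in the TDB browser.

```python
import re

# Any word starting with the root "neden", optionally followed by the
# postposition "ile" (covers both "nedeniyle" and "nedeni ile").
HIT = re.compile(r"\bneden\w*(?:\s+ile\b)?")

SUBORDINATOR_FORMS = {"nedeniyle", "nedeni ile"}
PHRASAL_FORMS = {"nedenle", "nedenlerle"}
DEICTICS = {"bu", "o"}


def previous_word(text, start):
    """Return the lower-cased word immediately before position start, if any."""
    words = re.findall(r"\w+", text[:start])
    return words[-1].lower() if words else ""


def classify_hits(text):
    """Label each hit as a subordinator candidate, phrasal candidate, or other use."""
    results = []
    for match in HIT.finditer(text):
        form = " ".join(match.group().split())  # normalize inner whitespace
        if form in SUBORDINATOR_FORMS:
            label = "subordinator candidate"
        elif form in PHRASAL_FORMS and previous_word(text, match.start()) in DEICTICS:
            label = "phrasal expression candidate"
        else:
            label = "other use"
        results.append((form, label))
    return results


sample = (
    "Ölüm cezasının henüz kaldırılmamış olması nedeniyle ilerleme sağlanamamış. "
    "Bu nedenle ziyaret ertelenmişti. "
    "Bunun nedenini bilmiyorum."
)
hits = classify_hits(sample)
```

The labels are only candidates: whether a hit is a discourse connective still has to be decided by an annotator using the abstract object criterion.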
77 connective tokens searched
143 connectives (types) annotated including phrasal expressions
Agreement statistics: Krippendorff’s alpha (word boundary)
Connective                         Type                  Arg1         Arg2
ama ‘but’                          coordinating conj.    0.95630455   0.84201956
ne .. ne ‘neither .. nor’                                1            0.93903977
için ‘due to, in order to’         subordinator          0.16414972   0.80815035
amacıyla ‘for the purpose of’                            0.43203524   0.95423687
böylece ‘thus’                     discourse adverbial   0.7870529    0.99896985
tersine ‘on the contrary’                                0.43093222   1
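The slides do not spell out how the word-boundary figures were computed. As a rough sketch, assuming each word is one unit that an annotator either includes in an argument span (1) or leaves out (0), Krippendorff's alpha for two annotators and nominal labels can be computed as follows. The function name and the toy label sequences are illustrative and are not TDB data.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(coder_a, coder_b):
    """Krippendorff's alpha for two coders, nominal labels, no missing values."""
    if len(coder_a) != len(coder_b):
        raise ValueError("both coders must label the same units")
    n = 2 * len(coder_a)  # total number of pairable values
    totals = Counter(coder_a) + Counter(coder_b)  # marginal label frequencies
    # Each disagreeing unit contributes two ordered mismatching pairs.
    observed = 2 * sum(1 for a, b in zip(coder_a, coder_b) if a != b)
    # Mismatching pairs expected by chance, over all ordered label pairs c != k.
    expected = sum(totals[c] * totals[k] for c, k in permutations(totals, 2))
    if expected == 0:
        return float("nan")  # only one label was ever used: alpha is undefined
    return 1.0 - (n - 1) * observed / expected


# Toy word-level labels: 1 = word inside the annotated argument, 0 = outside.
annotator_a = [0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
annotator_b = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]
alpha = krippendorff_alpha_nominal(annotator_a, annotator_b)  # about 0.095
```

Under this reading, a value of 1 in the table means both annotators selected exactly the same words, while low values such as the Arg1 figure for için indicate substantial disagreement on span boundaries.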
Nominalizations as arguments
clauses based on the factive nominalizer –DıK and the nonfactive nominalizer –ACAK,
clauses based on the infinitives –MA or –MAK
clauses formed on the nonfinite nominal marker –ıŞ (Csató, 1998:230).
Total number of annotations
                                       Single   Parallel   Total
Coordinating Conjunction   Tokens          15         12      27
                           Relations     4348        129    4477
Subordinating Conjunction  Tokens          31          1      32
                           Relations     2285          2    2287
Discourse Adverbial        Tokens          32         18      50
                           Relations     1152         73    1225
Phrasal Expression         Tokens          40          1      41
                           Relations      490          4     494
Total                      Tokens         118         32     150
                           Relations     8275        208    8483
Dependency configurations
We found 8 dependency configurations between connectives and their arguments (Aktaş et al. 2010, a la Lee et al. 2006).
One pattern (nested relation) seems novel.
Types of dependencies in discourse (Lee et al. 2006)
Tree-structured
– independent relations
– full embedding
Non-tree structured
– shared argument
– properly contained argument
– properly contained relation
– nested relations
– pure crossing
1. Independent Relations
{Köpekler} bahçenin öbür tarafında olmalarına rağmen (1) havlamaya başlamışlardı. Zincirleri çözülmemişti, ama (2) her an koparabilirlerdi.
Although(1) on the other side of the garden, {the dogs} had started to bark. (Their) chains were not loose, but (2)(they) could break (them) any minute now.
2. Full Embedding
İlgisizliğim seni şaşırtabilir ama (1) üvey babamı görmek istemediğim için (2) yıllardır o eve gitmiyorum.
My lack of interest might surprise you but(1) since(2) I don't want to see my step father, I haven't been to that house for years.
ama ‘but’ (1)
Arg1: my lack .. you Arg2: since .. years
için ‘since’ (2)
Arg1: I don’t .. father Arg2: I haven’t .. years
3. Shared Argument
Çevremde, o an duygularımdaki kargaşa yüzünden sahiplerini seçemediğim gülüşmeler oldu. Sonra (1) babam da güldü. Ama (2) onunkisi keyifli, neredeyse övünçlü bir gülüştü.
There were laughs whose owners I could not identify due to my emotional turmoil at the time. Then (1) my father laughed too. But (2) his was a joyful, almost a boastful laugh.
sonra ‘then’ (1)
Arg1: there were .. time Arg2: my father … laugh
ama ‘but’ (2)
Arg1: then my … too Arg2: his was .. laugh
4. Properly Contained Argument
Kapıdan girdi ve (1) “söyler misin, hiç etkilenmedin mi yazdıklarından?” dedi. “Tersine (2), çok etkilendim.”
She entered through the door and (1) asked “Tell me, are you not touched at all by what he wrote?”. “On the contrary (2), I am very much affected.”
ve ‘and’ (1)
Arg1: she entered ..door Arg2: asked … wrote?
tersine ‘on the contrary’ (2)
Arg1: are you ..wrote? Arg2: I am .. affected
5. Properly Contained Relation
Proleterya devrimi, kapitalizmin en fazla geliştiği (dolayısıyla (2) toplumsal çelişkilerin en keskin olduğu) yerde gündeme gelir ve (1) devrim tek ülkede değil Avrupa çapında olanaklıdır.
The proletarian revolution is likely to emerge in places where capitalism is highly developed (thus(2) social conflicts are the sharpest), and (1) the revolution is not likely to happen in a single country, but across Europe.
ve ‘and’ (1)
Arg1: the proletarian … sharpest Arg2: the revolution … Europe
dolayısıyla ‘thus’ (2)
Arg1: the proletarian … developed Arg2: the social … sharpest
6. Nested Relations
Ben de sahiplerim açısından şanslı kedilerden sayılırım. Hem iyi baktılar hem de (2) kediliğimden özveride bulunmamı, özgürlüğümü satmamı hiç istemediler. Ama (1) her kedi bizim kadar şanslı olmayabilir.
I am also one of the lucky cats in terms of my owners. They both took good care of me and (2) didn’t demand that I let go of my cathood and sell out my freedom. But (1) not all cats may be as lucky as we are.
ama ‘but’ (1)
Arg1: I am .. owners Arg2: not all … we are
hem ... hem de ‘both …and’ (2)
Arg1: they .. me Arg2: didn’t … freedom
7. Pure Crossing
Sonra ansızın sesler gelir. Ayak sesleri. Birilerinin ya işi vardır, aceleyle yürürler, ya koşarlar. O zaman (1) kız katılaşır ansızın. Oğlan da katılaşır ve (2) her koşunun gizli bir isteği var. Bunu biz bilemeyiz.
Then suddenly there are voices. Footsteps. People must be running errands, walking in a hurry or running. Then (1) all of a sudden the girl freezes. The boy also freezes, and (2) each run ought to have a secret wish. That we cannot know.
o zaman ‘then’ (1)
Arg1: then suddenly … voices Arg2: all of a … freezes
ve ‘and’ (2)
Arg1: people … running Arg2: each run … wish
8. Partial Overlap
Schily'nin Ankara'yı 20-21 Haziran'da ziyaret etmesi planlanmış, ancak Türkiye'de ölüm cezasının o tarihte henüz kaldırılmamış olması nedeniyle (1) sürdürülen teknik müzakerelerde ilerleme sağlanamamış ve (2) ziyaret ertelenmişti.
It was planned that Schily would visit Ankara on June 20-21, however, since (1) capital punishment had not been abolished then in Turkey, no progress could be made in the ongoing technical debates, and (2) the visit was postponed.
nedeniyle ‘since’ (1)
Arg1: capital … debates Arg2: no progress … debates
ve ‘and’ (2)
Arg1: it was planned … debates Arg2: the visit .. postponed
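The eight configurations illustrated above can be approximated as tests on how the spans of two relations overlap. The sketch below is one simplified reading of the categories, written for contiguous word-index spans; the `Relation` class, the helper names and the order of the checks are assumptions made for illustration, and this is not the procedure used to identify the configurations in TDB.

```python
from dataclasses import dataclass
from typing import Tuple

Span = Tuple[int, int]  # half-open word-index interval [start, end)


@dataclass(frozen=True)
class Relation:
    conn: Span
    arg1: Span
    arg2: Span

    @property
    def args(self):
        return (self.arg1, self.arg2)

    @property
    def extent(self) -> Span:
        """Smallest span covering the connective and both arguments."""
        parts = (self.conn, self.arg1, self.arg2)
        return (min(s for s, _ in parts), max(e for _, e in parts))


def overlaps(a: Span, b: Span) -> bool:
    return a[0] < b[1] and b[0] < a[1]


def contains(outer: Span, inner: Span) -> bool:
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def properly_contains(outer: Span, inner: Span) -> bool:
    return contains(outer, inner) and outer != inner


def classify(r1: Relation, r2: Relation) -> str:
    """Name the configuration formed by two relations (simplified reading)."""
    if not overlaps(r1.extent, r2.extent):
        return "independent relations"
    for outer, inner in ((r1, r2), (r2, r1)):
        if any(arg == inner.extent for arg in outer.args):
            return "full embedding"
    for outer, inner in ((r1, r2), (r2, r1)):
        if any(properly_contains(arg, inner.extent) for arg in outer.args):
            return "properly contained relation"
    pairs = [(a, b) for a in r1.args for b in r2.args]
    if any(a == b for a, b in pairs):
        return "shared argument"
    if any(properly_contains(a, b) or properly_contains(b, a) for a, b in pairs):
        return "properly contained argument"
    if any(overlaps(a, b) for a, b in pairs):
        return "partial overlap"
    # No argument overlap, yet the relations interleave in the text.
    if contains(r1.extent, r2.extent) or contains(r2.extent, r1.extent):
        return "nested relations"
    return "pure crossing"


# Toy spans modeled on example 6: relation (2) sits between Arg1 and Arg2 of (1).
ama = Relation(conn=(30, 31), arg1=(0, 10), arg2=(31, 40))
hem = Relation(conn=(15, 17), arg1=(11, 15), arg2=(17, 28))
label = classify(ama, hem)
```

Discontinuous arguments and shared material (the Shared tag) are not modeled here; handling them would require sets of spans rather than single intervals.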
Summary
TDB uses PDTB guidelines
It annotates:
– explicit connectives
– nominalizations as arguments
– phrasal expressions
– shared tag
Shared arguments, nested relations and partially overlapping argument configurations are quite frequent
we cannot appeal to discourse notions like anaphora, attribution or a parenthetical
genuine discourse structures?
Prospects
– the role of lexical relations
– frequency of types of complex dependencies
– genre-complex dependencies
– superimposed structures?
– centers (proved very useful for discourse use of pro-drop)
Bibliography
Asher, N. (1993). Reference to Abstract Objects in Discourse. Kluwer Academic Publishers.
Aktaş, B., Bozşahin, C., and Zeyrek, D. (2010). Discourse Relation Configurations in Turkish and an Annotation Environment. In Proceedings of the Fourth Linguistic Annotation Workshop (LAW IV).
Csató, É. Á. (1998). Turkish. In L. Johanson and É. Á. Csató, editors, The Turkic Languages (Routledge Language Family Series), pages 203-235. Routledge, London.
Lee, A., Prasad, R., Joshi, A., Dinesh, N., and Webber, B. (2006). Complexity of dependencies in discourse. In Proceedings of the 5th International Workshop on Treebanks and Linguistic Theories.
Lehmann, C. (1988). Towards a typology of clause linkage. In J. Haiman and S. A. Thompson, editors, Clause Combining in Grammar and Discourse, pages 181-225. John Benjamins Publishing, Amsterdam; Philadelphia.
Prasad, R., Dinesh, N., Lee, A., Miltsakaki, E., Robaldo, L., Joshi, A., and Webber, B. (2008). The Penn Discourse Treebank 2.0. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08).
Most frequent 5 explicit connectives in TDB and the frequency of their non-discourse use
Discourse connectives Other uses Total instances
Conn # % # % # %
ve ‘and’ 2112 28.2 5389 71.8 7501 100.0
için ‘because’ 1102 50.9 1063 49.1 2165 100.0
ama ‘but’ 1024 90.6 106 9.4 1130 100.0
sonra ‘later’ 713 56.7 544 43.3 1257 100.0
ancak ‘however’ 419 79.1 111 20.9 530 100.0
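The percentages in the table follow directly from the counts: the discourse share of a connective is its discourse count divided by its total number of instances. A small sketch that recomputes them, using the counts from the table (the variable and function names are illustrative):

```python
# (discourse uses, other uses) per connective, copied from the table above.
counts = {
    "ve": (2112, 5389),
    "için": (1102, 1063),
    "ama": (1024, 106),
    "sonra": (713, 544),
    "ancak": (419, 111),
}


def discourse_share(discourse, other):
    """Percentage of instances that function as discourse connectives."""
    return round(100 * discourse / (discourse + other), 1)


shares = {conn: discourse_share(d, o) for conn, (d, o) in counts.items()}
# ve is the most frequent form but has the lowest discourse share (28.2%),
# while ama is used as a discourse connective in 90.6% of its instances.
```

This makes the practical point of the table explicit: a plain string search for ve would return mostly non-discourse hits, so each hit has to be checked against the annotation guidelines.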
The connective için
Goal-driven
inf (-mAk) için          510
-mA-POSS-AGR için        239
-mA için                   6
-Iş-POSS-AGR için          6
-Iş için                   2
-Im-POSS-AGR için          7
Goal total               770
Cause-driven
-dIğI-AGR için           276
-(A)cAğI-AGR için         12
Cause total              288
İçin total              1058
Phrasal expressions
functionality of the annotation tool
the deictic element is the second argument of the expression morphologically similar to the second argument of a subordinator
Subordinator
İhaleli sisteme geç-IŞ-IN ardından bu ihalelere katılmayan 10 kadar firma takibe alındı.
after the shift [shift-NOM-GEN]
After the shift [shift-Iş-GEN] to the bidding system, legal action was taken against circa 10 firms that did not undertake bidding processes.
Phrasal Expression
Bu-nun ardından … After this [this-GEN] …
Nominalizing suffixes
için
‘in order to’ – (goal-driven)
‘due to’ – (cause-driven)
V-NOM-(POSS)-(AGR)
-ME or –MAK
-DıK
NOM: nominalization, POSS: possessive, AGR: person agreement
The distribution of genres in METU Turkish Corpus and Turkish Discourse Bank
                     THE MTC          THE TDB
Genre                #       %        #       %
Novel              123   15.63       31   15.74
Story              114   14.49       28   14.21
Research/Survey     49    6.23       13    6.60
Article             38    4.83        9    4.57
Travel              19    2.41        5    2.54
Interview            7    0.89        2    1.02
Memoir              18    2.29        4    2.03
News               419   53.24      105   53.30
Total              787     100      197     100
Some examples of annotated connectives
Subordinators Phrasal expressions
Other (discourse adverbials, parallel connectives)
amaç- ‘goal’ (to this aim)
amacıyla, amaçla bu amaçla
dolayı- ‘due to; because’ (due to)
dolayı, dolayısıyla, dolayısı ile bundan dolayı, bu sebepten dolayı
neden- ‘reason’ (because)
nedeniyle, nedeni ile bu nedenle, o nedenle, bu nedenlerle, yukarıdaki nedenlerle
sonuç- ‘result’ (as a result)
sonucunda bunun sonucunda, bunların sonucunda,
sonuç olarak, sonuçta
zaman- ‘time’ (when; then)
zaman bir zamanda, aynı zamanda, o zaman,
bir zamanda, ne zaman… o zaman