Building the Turkish Discourse Bank

Deniz Zeyrek and Cem Bozşahin
Cognitive Science, Informatics Institute, METU, Ankara

PDTB 2012 Workshop, April 30-May 2 2012, IRCS, University of Pennsylvania

Joint work with Işın Demirşahin, Ayışığı B. Sevdik-Çallı, İhsan Yalçınkaya, Ümit Deniz Turan, Ruken Çakıcı, Hale Balaban-Ögel, Berfin Aktaş

Middle East Technical University

Outline

  PDTB-style annotation

  Which complex dependencies were discovered

  The prospects

4/30/12 2

PDTB style annotation (Prasad et al., 2008)

  Guidelines (a set of explicit connectives to annotate)

  Discourse use of connectives

  Abstract object criterion (Asher, 1993)

  predicate-argument structure (Arg1-Arg2-Conn)

  Minimality principle

Turkish Discourse Bank (TDB): first release February, 2011

  8483 annotations on a ~400,000-word corpus

  METU Turkish Corpus: ~2 million words

  Can be requested from: www.tdb.ii.metu.edu.tr.

  includes raw text files, annotation files, annotation guidelines, and a browser


The tagset

Conn         Connective
Arg1         First argument of the connective
Arg2         Second argument of the connective
Supp1        Supplement to the first argument
Supp2        Supplement to the second argument
Shared       The subject, object or adverbial phrase shared by a relation
Shared supp  Supplement for the shared material
Mod          Modifier of the connective or the modifier of the relation
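As a rough illustration of how the tagset fits together, the sketch below models one relation as a Python record with one field per tag. The class name `Relation`, the field names and the helper `filled_tags` are assumptions made for this example and are not part of the TDB distribution. The sample values follow one reading of the rağmen ‘despite’ sentence from the word-order slide, where the braces mark the Shared span.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class Relation:
    """One annotated relation; field names mirror the TDB tagset (hypothetical model)."""
    conn: str                          # Conn: the connective
    arg1: str                          # Arg1: first argument
    arg2: str                          # Arg2: second argument
    supp1: Optional[str] = None        # Supp1: supplement to Arg1
    supp2: Optional[str] = None        # Supp2: supplement to Arg2
    shared: Optional[str] = None       # Shared: material shared by both arguments
    shared_supp: Optional[str] = None  # Shared supp: supplement for the shared material
    mod: Optional[str] = None          # Mod: modifier of the connective or relation

    def filled_tags(self) -> List[str]:
        """Return the names of the tags that carry a value, in tagset order."""
        return [f.name for f in fields(self) if getattr(self, f.name) is not None]


# Assumed segmentation of the rağmen example (Arg2-Conn-Arg1 order in the text).
rel = Relation(
    conn="rağmen",
    arg1="sarayın değil, halkın demokratik ihtilalinin eseriydiler",
    arg2="Sınırlı olmasına",
    shared="bu devrimci kongreler",
)
print(rel.filled_tags())
```

Only Conn, Arg1 and Arg2 are required here; treating the remaining tags as optional is a modelling choice for the sketch, not a statement about the annotation guidelines.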

Annotation and browser tools

  Annotation tool

– produces XML files as annotation data

– uses stand-off annotation method

  Browser

– supports regexp search across files

– connective-arg visualization
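Stand-off annotation keeps the raw text untouched and stores only pointers into it. The sketch below shows the idea with Python's standard `xml.etree.ElementTree`. The element names (`Relation`, `Conn`, `Arg1`, `Arg2`) and the `begin`/`end` attributes are hypothetical and chosen for this example; the actual TDB file format is not shown in the slides.

```python
import xml.etree.ElementTree as ET

# Raw text stays in its own file; here it is a sentence from the corpus examples.
raw = "Zincirleri çözülmemişti, ama her an koparabilirlerdi."


def add_span(parent: ET.Element, tag: str, text: str, substring: str) -> ET.Element:
    """Append a child element that records character offsets of `substring` in `text`."""
    begin = text.index(substring)
    end = begin + len(substring)
    return ET.SubElement(parent, tag, begin=str(begin), end=str(end))


def resolve(element: ET.Element, text: str) -> str:
    """Recover the annotated string from the raw text using the stored offsets."""
    return text[int(element.get("begin")):int(element.get("end"))]


relation = ET.Element("Relation")
add_span(relation, "Conn", raw, "ama")
add_span(relation, "Arg1", raw, "Zincirleri çözülmemişti")
add_span(relation, "Arg2", raw, "her an koparabilirlerdi")

# The annotation file contains offsets only, never the text itself.
xml_text = ET.tostring(relation, encoding="unicode")
print(xml_text)
print(resolve(ET.fromstring(xml_text).find("Conn"), raw))
```

Because the XML holds offsets rather than copies of the text, several annotation layers can point at the same raw file without modifying it.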


The browser: search facility


The browser: advanced search facility


Markables

  Coordinating conjunctions

  Subordinators

– complex subordinators (için ‘for’, karşın ‘although, despite’)

– simplex subordinators, i.e. converbs (-IncA ‘when’, -ken ‘while, now that’)

  Discourse adverbials

A simplex subordinator: -dEn ‘since’

  Arg2-Conn-Arg1

Ne kadar üzüleceğini bil-DIĞ-İm-Den, yarına erteledim yolculuğumu.

How much you would worry I knew-Dık-Agr-Conn, I postponed my journey until tomorrow.

A complex subordinator: kadar ‘until’ (two-part connectives)

  Arg2-Conn-Arg1

Gergin gövdemi uykulu bir karanlığın içinde hafiflemiş, erimiş duy-Unca-Ya kadar bekledim.

I waited until I felt-Conn-Dat my stressed body relaxing and melting in darkness.


Word-order flexibility: the shared tag

  Arg2-Conn-Arg1

Sınırlı olmasına rağmen {bu devrimci kongreler}, sarayın değil, halkın demokratik ihtilalinin eseriydiler.

Despite the fact that they were limited, {these revolutionist congresses} were not a result of the empire but of the people’s democratic rebellion.

Annotation procedure

  The regexp search returns various uses of one root

– neden ‘reason’ returns:

•  subordinator tokens: V-mEsI neden-iy-le ‘reason+POSS+INS’ (‘V-NOM due to’)

•  phrasal expressions: neden-le ‘reason+INS’ (as in bu nedenle ‘due to this’)

POSS: possessive; INS: instrumental; NOM: nominalizing suffix
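The root-based search can be sketched with Python's `re` module. The pattern, the sample text and the three labels below are assumptions made for illustration. The slides do not give the actual search expressions used in the TDB browser, and in the project the sorting of hits was done by human annotators, not by rule.

```python
import re
from typing import List, Tuple

# Match any word built on the root "neden", optionally capturing the preceding word.
ROOT_SEARCH = re.compile(r"(?:(\w+)\s+)?\b(neden\w*)", re.IGNORECASE)

DEICTICS = {"bu", "o"}  # demonstratives that form phrasal expressions (assumed list)


def search_root(text: str) -> List[Tuple[str, str]]:
    """Return (surface form, rough use) pairs for every hit on the root."""
    hits = []
    for match in ROOT_SEARCH.finditer(text):
        previous = (match.group(1) or "").lower()
        form = match.group(2).lower()
        if form == "nedeniyle":
            use = "subordinator"
        elif form in {"nedenle", "nedenlerle"} and previous in DEICTICS:
            use = "phrasal expression"
        else:
            use = "other"
        hits.append((form, use))
    return hits


# Constructed sample text, not taken from the corpus.
sample = "Hava kötü olması nedeniyle maç ertelendi. Bu nedenle eve döndük. Neden geldin?"
print(search_root(sample))
```

The sketch ignores the two-word spelling nedeni ile and makes no attempt to separate discourse from non-discourse uses, which is the part that requires annotator judgment.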

  77 connective tokens searched

  143 connectives (types) annotated including phrasal expressions


Agreement statistics: Krippendorff’s alpha (word boundary)

Connective                      Type                 Arg1        Arg2
ama ‘but’                       coordinating conj.   0.95630455  0.84201956
ne .. ne ‘neither .. nor’       coordinating conj.   1           0.93903977
için ‘due to, in order to’      subordinator         0.16414972  0.80815035
amacıyla ‘for the purpose of’   subordinator         0.43203524  0.95423687
böylece ‘thus’                  discourse adverbial  0.7870529   0.99896985
tersine ‘on the contrary’       discourse adverbial  0.43093222  1
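For readers unfamiliar with the statistic, here is a minimal sketch of Krippendorff's alpha for two annotators and nominal labels. It assumes one possible reading of the word-boundary approach: every word is labelled 1 if it lies inside an annotator's argument span and 0 otherwise. The slides do not spell out the exact computation used for TDB, so the function names and the labelling scheme are illustrative only.

```python
from collections import Counter
from typing import List, Sequence


def span_to_labels(n_words: int, start: int, end: int) -> List[int]:
    """Label each word 1 if it falls inside the half-open span [start, end), else 0."""
    return [1 if start <= i < end else 0 for i in range(n_words)]


def krippendorff_alpha_nominal(labels_a: Sequence[int], labels_b: Sequence[int]) -> float:
    """Krippendorff's alpha for two annotators, nominal data, no missing values."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same units")

    # Coincidence matrix: each unit contributes its value pair in both orders.
    coincidences = Counter()
    for a, b in zip(labels_a, labels_b):
        coincidences[(a, b)] += 1
        coincidences[(b, a)] += 1

    n = 2 * len(labels_a)  # total number of labels assigned
    totals = Counter()
    for (c, _k), count in coincidences.items():
        totals[c] += count

    observed = sum(cnt for (c, k), cnt in coincidences.items() if c != k) / n
    expected = sum(
        totals[c] * totals[k] for c in totals for k in totals if c != k
    ) / (n * (n - 1))
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")
    return 1 - observed / expected


# Six words; annotator A marks words 0-2 as the argument, annotator B marks words 0-1.
a = span_to_labels(6, 0, 3)
b = span_to_labels(6, 0, 2)
print(krippendorff_alpha_nominal(a, b))
```

In this toy case the annotators disagree on one word out of six, which gives alpha = 1 - 11/35 = 24/35. Identical spans give alpha = 1, which matches the values of 1 reported in the table.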

Nominalizations as arguments

  clauses based on the factive nominalizer -DıK and the nonfactive nominalizer -ACAK,

  clauses based on the infinitives -MA or -MAK

  clauses formed on the nonfinite nominal marker -ıŞ (Csató, 1998:230).

Total number of annotations

                                      Single  Parallel  Total
Coordinating Conjunction   Tokens         15        12     27
                           Relations    4348       129   4477
Subordinating Conjunction  Tokens         31         1     32
                           Relations    2285         2   2287
Discourse Adverbial        Tokens         32        18     50
                           Relations    1152        73   1225
Phrasal Expression         Tokens         40         1     41
                           Relations     490         4    494
Total                      Tokens        118        32    150
                           Relations    8275       208   8483

Dependency configurations

  We found 8 dependency configurations between connectives and their arguments (Aktaş et al. 2010, à la Lee et al. 2006).

  One pattern (nested relation) seems novel.


Types of dependencies in discourse (Lee et al. 2006)

Tree-structured

  independent relations

  full embedding

Non-tree structured

  shared argument

  properly contained argument

  properly contained relation

  nested relations

  pure crossing

1. Independent Relations

{Köpekler} bahçenin öbür tarafında olmalarına rağmen (1) havlamaya başlamışlardı. Zincirleri çözülmemişti, ama (2) her an koparabilirlerdi.

Although (1) on the other side of the garden, {the dogs} had started to bark. (Their) chains were not loose, but (2) (they) could break (them) any minute now.

2. Full Embedding

İlgisizliğim seni şaşırtabilir ama (1) üvey babamı görmek istemediğim için (2) yıllardır o eve gitmiyorum.

My lack of interest might surprise you but (1) since (2) I don't want to see my step father, I haven't been to that house for years.

ama ‘but’ (1)

Arg1: my lack .. you Arg2: since .. years

için ‘since’ (2)

Arg1: I don’t .. father Arg2: I haven’t .. years

3. Shared Argument

Çevremde, o an duygularımdaki kargaşa yüzünden sahiplerini seçemediğim gülüşmeler oldu. Sonra (1) babam da güldü. Ama (2) onunkisi keyifli, neredeyse övünçlü bir gülüştü.

There were laughs whose owners I could not identify due to my emotional turmoil at the time. Then (1) my father laughed too. But (2) his was a joyful, almost a boastful laugh.

sonra ‘then’ (1)

Arg1: there were .. time Arg2: my father … laugh

ama ‘but’ (2)

Arg1: then my … too Arg2: his was .. laugh

4. Properly Contained Argument

Kapıdan girdi ve (1) “söyler misin, hiç etkilenmedin mi yazdıklarından?” dedi. “Tersine (2), çok etkilendim.”

She entered through the door and (1) asked “Tell me, are you not touched at all by what he wrote?”. “On the contrary (2), I am very much affected.”

ve ‘and’ (1)

Arg1: she entered ..door Arg2: asked … wrote?

tersine ‘on the contrary’ (2)

Arg1: are you ..wrote? Arg2: I am .. affected

5. Properly Contained Relation

Proleterya devrimi, kapitalizmin en fazla geliştiği (dolayısıyla (2) toplumsal çelişkilerin en keskin olduğu) yerde gündeme gelir ve (1) devrim tek ülkede değil Avrupa çapında olanaklıdır.

The proletarian revolution is likely to emerge in places where capitalism is highly developed (thus (2) social conflicts are the sharpest), and (1) the revolution is not likely to happen in a single country, but across Europe.

ve ‘and’ (1)

Arg1: the proletarian … sharpest Arg2: the revolution … Europe

dolayısıyla ‘thus’ (2)

Arg1: the proletarian … developed Arg2: the social … sharpest

6. Nested Relations

Ben de sahiplerim açısından şanslı kedilerden sayılırım. Hem iyi baktılar hem de (2) kediliğimden özveride bulunmamı, özgürlüğümü satmamı hiç istemediler. Ama (1) her kedi bizim kadar şanslı olmayabilir.

I am also one of the lucky cats in terms of my owners. They both took good care of me and (2) didn’t demand that I let go of my cathood and sell out my freedom. But (1) not all cats may be as lucky as we are.

ama ‘but’ (1)

Arg1: I am .. owners Arg2: not all … we are

hem ... hem de ‘both … and’ (2)

Arg1: they .. me Arg2: didn’t … freedom

7. Pure Crossing

Sonra ansızın sesler gelir. Ayak sesleri. Birilerinin ya işi vardır, aceleyle yürürler, ya koşarlar. O zaman (1) kız katılaşır ansızın. Oğlan da katılaşır ve (2) her koşunun gizli bir isteği var. Bunu biz bilemeyiz.

Then suddenly there are voices. Footsteps. People must be running errands, walking in a hurry or running. Then (1) all of a sudden the girl freezes. The boy also freezes, and (2) each run ought to have a secret wish. That we cannot know.

o zaman ‘then’ (1)

Arg1: then suddenly … voices Arg2: all of a … freezes

ve ‘and’ (2)

Arg1: people … running Arg2: each run … wish

8. Partial Overlap

Schily'nin Ankara'yı 20-21 Haziran'da ziyaret etmesi planlanmış, ancak Türkiye'de ölüm cezasının o tarihte henüz kaldırılmamış olması nedeniyle (1) sürdürülen teknik müzakerelerde ilerleme sağlanamamış ve (2) ziyaret ertelenmişti.

It was planned that Schily would visit Ankara on June 20-21, however, since (1) capital punishment had not been abolished then in Turkey, no progress could be made in the ongoing technical debates, and (2) the visit was postponed.

nedeniyle ‘since’ (1)

Arg1: capital … debates Arg2: no progress … debates

ve ‘and’ (2)

Arg1: it was planned … debates Arg2: the visit .. postponed
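Some of the configurations illustrated in the numbered examples can be told apart from argument spans alone. The sketch below is a simplified illustration and not the procedure of Aktaş et al. (2010) or Lee et al. (2006). It assumes each argument is one contiguous half-open span of positions, and it separates only independent relations, shared argument, nested relations and pure crossing. Every case in which arguments of the two relations overlap (full embedding, properly contained argument or relation, partial overlap) is lumped into a single bucket. The names `Relation`, `extent` and `classify` are made up for the example.

```python
from typing import NamedTuple, Tuple

Span = Tuple[int, int]  # half-open interval [start, end) over text positions


class Relation(NamedTuple):
    arg1: Span
    arg2: Span


def extent(r: Relation) -> Span:
    """Smallest span covering both arguments of a relation."""
    return (min(r.arg1[0], r.arg2[0]), max(r.arg1[1], r.arg2[1]))


def overlaps(a: Span, b: Span) -> bool:
    return a[0] < b[1] and b[0] < a[1]


def contains(outer: Span, inner: Span) -> bool:
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def classify(r1: Relation, r2: Relation) -> str:
    e1, e2 = extent(r1), extent(r2)
    if not overlaps(e1, e2):
        return "independent relations"
    if set(r1) & set(r2):
        return "shared argument"
    if any(overlaps(a, b) for a in r1 for b in r2):
        # Full embedding, proper containment and partial overlap all land here.
        return "containment or overlap"
    # Arguments are pairwise disjoint but the relations interleave.
    if contains(e1, e2) or contains(e2, e1):
        return "nested relations"
    return "pure crossing"


# Schematic spans modelled on examples 3, 6 and 7 (positions are invented).
shared = classify(Relation((0, 10), (10, 15)), Relation((10, 15), (15, 25)))
nested = classify(Relation((0, 10), (20, 30)), Relation((10, 14), (14, 20)))
crossing = classify(Relation((0, 10), (30, 40)), Relation((20, 30), (40, 50)))
print(shared, nested, crossing, sep="\n")
```

In the nested case the second relation sits entirely in the gap between Arg1 and Arg2 of the first. In the crossing case the arguments alternate in the order Arg1(1), Arg1(2), Arg2(1), Arg2(2).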

Summary

  TDB uses PDTB guidelines

  It annotates:

– explicit connectives
– nominalizations as arguments
– phrasal expressions
– shared tag

  Shared arguments, nested relations and partially overlapping argument configurations are quite frequent

  we cannot appeal to discourse notions like anaphora, attribution or a parenthetical

  genuine discourse structures?


Prospects

  the role of lexical relations

  frequency of types of complex dependencies

  genre-complex dependencies

  superimposed structures?

  centers (proved very useful for discourse use of pro-drop)

Ongoing & future work

  sense annotation

  data mining

  Thank you.


Bibliography

  Asher, N. (1993). Reference to Abstract Objects in Discourse. Kluwer Academic Publishers.

  Aktaş, B., Bozşahin, C., and Zeyrek, D. (2010). Discourse Relation Configurations in Turkish and an Annotation Environment. In Proceedings of the Fourth Linguistic Annotation Workshop (LAW IV).

  Csató, É. Á. (1998). Turkish. In Lars Johanson and Éva Á. Csató, editors, The Turkic Languages (Routledge Language Family Series), pages 203-235. Routledge, London.

  Lee, A., Prasad, R., Joshi, A., Dinesh, N., and Webber, B. (2006). Complexity of dependencies in discourse. In Proceedings of the 5th International Workshop on Treebanks and Linguistic Theories.

  Lehmann, C. (1988). Towards a typology of clause linkage. In John Haiman and Sandra A. Thompson, editors, Clause Combining in Grammar and Discourse, pages 181-225. John Benjamins Publishing, Amsterdam; Philadelphia.

  Prasad, R., Dinesh, N., Lee, A., Miltsakaki, E., Robaldo, L., Joshi, A., and Webber, B. (2008). The Penn Discourse Treebank 2.0. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08).

  Back-up slides


Breakdown


Krippendorff’s alpha (boundary approach)

Most frequent 5 explicit connectives in TDB and the frequency of their non-discourse use

                   Discourse connectives   Other uses      Total instances
Conn               #        %              #       %       #       %
ve ‘and’           2112     28.2           5389    71.8    7501    100.0
için ‘because’     1102     50.9           1063    49.1    2165    100.0
ama ‘but’          1024     90.6           106     9.4     1130    100.0
sonra ‘later’      713      56.7           544     43.3    1257    100.0
ancak ‘however’    419      79.1           111     20.9    530     100.0

The connective için

Goal driven
  inf (-mAk) için        510
  -mA-POSS-AGR için      239
  -mA için                 6
  -Iş-POSS-AGR için        6
  -Iş için                 2
  -Im-POSS-AGR için        7
  Goal total             770

Cause driven
  -dIğI-AGR için         276
  -(A)cAğI-AGR için       12
  Cause total            288

İçin total              1058

Phrasal expressions

  functionality of the annotation tool

  the deictic element is the second argument of the expression; it is morphologically similar to the second argument of a subordinator

Subordinator

İhaleli sisteme geç-IŞ-IN ardından bu ihalelere katılmayan 10 kadar firma takibe alındı.

After the shift [shift-Iş-GEN] to the bidding system, legal action was taken against circa 10 firms that did not undertake bidding processes.

geç-IŞ-IN ardından: after the shift [shift-NOM-GEN]

Phrasal Expression

Bu-nun ardından … After this [this-GEN] …

Nominalizing suffixes

  için

  ‘in order to’ – (goal-driven)

  ‘due to’ – (cause-driven)

  V-NOM-(POSS)-(AGR)

  -ME or -MAK

  -DıK

NOM: nominalization; POSS: possessive; AGR: person agreement
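The link between the nominalizing suffix and the reading of için can be sketched as a surface heuristic. This is an illustration only and not part of the TDB annotation procedure, which relied on human annotators. The patterns below are assumptions: they cover -mAk and -mAsI for the goal reading and -DIK and -(A)cAK plus a singular or first/second person plural agreement suffix for the cause reading. They expect lowercase input and will misfire on nouns that happen to end in the same letters (ekmek ‘bread’, for instance).

```python
import re

# Goal reading: infinitive -mAk, or -mA plus third person possessive (-mAsI).
GOAL_SUFFIX = re.compile(r"(?:m[ae]k|m[ae]s[ıi])$")

# Cause reading: factive -DIK or future -(A)cAK followed by possessive/agreement.
CAUSE_SUFFIX = re.compile(
    r"(?:[dt][ıiuü]ğ[ıiuü]|[ae]c[ae]ğ[ıi])(?:m|n|m[ıiuü]z|n[ıiuü]z)?$"
)


def classify_icin(previous_word: str) -> str:
    """Guess the reading of için from the word that precedes it (lowercase input)."""
    if CAUSE_SUFFIX.search(previous_word):
        return "cause"
    if GOAL_SUFFIX.search(previous_word):
        return "goal"
    return "unknown"


# "istemediğim için" comes from the full-embedding example, glossed there as 'since'.
for word in ["görmek", "gitmesi", "istemediğim", "gideceği", "bunun"]:
    print(word, "için ->", classify_icin(word))
```

A deictic such as bunun falls through to "unknown", which is consistent with the slides treating bunun için-type forms as phrasal expressions and not as subordinator uses.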

The distribution of genres in METU Turkish Corpus and Turkish Discourse Bank

                   MTC              TDB
Genre              #      %         #      %
Novel              123    15.63     31     15.74
Story              114    14.49     28     14.21
Research/Survey    49     6.23      13     6.60
Article            38     4.83      9      4.57
Travel             19     2.41      5      2.54
Interview          7      0.89      2      1.02
Memoir             18     2.29      4      2.03
News               419    53.24     105    53.30
Total              787    100       197    100

Some examples of annotated connectives

Subordinators; Phrasal expressions; Other (discourse adverbials, parallel connectives)

amaç- ‘goal’ (to this aim)

amacıyla, amaçla bu amaçla

dolayı- ‘due to; because’ (due to)

dolayı, dolayısıyla, dolayısı ile bundan dolayı, bu sebepten dolayı

neden- ‘reason’ (because)

nedeniyle, nedeni ile bu nedenle, o nedenle, bu nedenlerle, yukarıdaki nedenlerle

sonuç- ‘result’ (as a result)

sonucunda bunun sonucunda, bunların sonucunda,

sonuç olarak, sonuçta

zaman- ‘time’ (when; then)

zaman bir zamanda, aynı zamanda, o zaman,

bir zamanda, ne zaman… o zaman

