Proceedings of the ICGL12 / Πρακτικά του ICGL12
Thanasis Georgakopoulos, Theodossia-Soula Pavlidou, Miltos Pechlivanos, Artemis Alexiadou, Jannis Androutsopoulos, Alexis Kalokairinos, Stavros Skopeteas, Katerina Stathi (eds.)
Proceedings of the 12th International Conference on Greek Linguistics
Πρακτικά του 12ου Συνεδρίου Ελληνικής Γλωσσολογίας
Vol. 1
© 2017 Edition Romiosini/CeMoG, Freie Universität Berlin. All rights reserved.
Distribution and production: Epubli (www.epubli.de)
Typesetting and layout: Rea Papamichail, Center für Digitale Systeme, Freie Universität Berlin
Set in Minion Pro
Cover design: Thanasis Georgiou, Yorgos Konstantinou
Cover illustration: Yorgos Konstantinou
ISBN 978-3-946142-34-8
Printed in Germany
Online library of Edition Romiosini: www.edition-romiosini.de
In memory of Gaberell Drachman (†10.9.2014) and Angeliki Malikouti-Drachman (†4.5.2015),
for their immense contribution to Greek linguistics and their love for the Greek language
EDITORS' NOTE
The 12th International Conference on Greek Linguistics (ICGL12) took place at the Centre for Modern Greece of the Freie Universität Berlin (Centrum Modernes Griechenland, Freie Universität Berlin) on 16-19 September 2015, with the participation of approximately four hundred delegates from all over the world.
The Scientific Committee of ICGL12 consisted of Thanasis Georgakopoulos, Theodossia-Soula Pavlidou, Miltos Pechlivanos, Artemis Alexiadou, Dora Alexopoulou, Jannis Androutsopoulos, Amalia Arvaniti, Stavros Assimakopoulos, Alexandra Georgakopoulou, Kleanthes Grohmann, Sabine Iatridou, Mark Janse, Brian Joseph, Alexis Kalokairinos, Napoleon Katsos, Evangelia Kordoni, Amalia Moser, Eleni Butulussi, Kiki Nikiforidou, Angela Ralli, Anna Roussou, Athina Sioupi, Stavros Skopeteas, Katerina Stathi, Melita Stavrou, Arhonto Terzi, Nina Topintzi, Ianthi Tsimpli and Stavroula Tsiplakou.
The Organizing Committee of ICGL12 consisted of Thanasis Georgakopoulos, Alexis Kalokairinos, Kostas Kosmas, Theodossia-Soula Pavlidou and Miltos Pechlivanos.
The two volumes of the conference proceedings are the product of the work of the Editorial Committee, whose members were Thanasis Georgakopoulos, Theodossia-Soula Pavlidou, Miltos Pechlivanos, Artemis Alexiadou, Jannis Androutsopoulos, Alexis Kalokairinos, Stavros Skopeteas and Katerina Stathi.
Although the presentations at the conference were grouped by thematic strand, the papers are printed here in alphabetical order according to the Latin alphabet. The exception is the opening lectures, which appear at the beginning of the first volume.
The Organizing Committee of ICGL12
CONTENTS
Editors' note 7
Contents 9
Peter Mackridge: Some literary representations of spoken Greek before nationalism (1750-1801) 17
Μαρία Σηφιανού: Η έννοια της ευγένειας στα Ελληνικά 45
Σπυριδούλα Βαρλοκώστα: Syntactic comprehension in aphasia and its relationship to working memory deficits 75
Ευαγγελία Αχλάδη, Αγγελική Δούρη, Ευγενία Μαλικούτη & Χρυσάνθη Παρασχάκη-Μπαράν: Γλωσσικά λάθη τουρκόφωνων μαθητών της Ελληνικής ως ξένης/δεύτερης γλώσσας: Ανάλυση και διδακτική αξιοποίηση 109
Κατερίνα Αλεξανδρή: Η μορφή και η σημασία της διαβάθμισης στα επίθετα που δηλώνουν χρώμα 125
Eva Anastasi, Ageliki Logotheti, Stavri Panayiotou, Marilena Serafim & Charalambos Themistocleous: A Study of Standard Modern Greek and Cypriot Greek Stop Consonants: Preliminary Findings 141
Anna Anastassiadis-Symeonidis, Elisavet Kiourti & Maria Mitsiaki: Inflectional Morphology at the service of Lexicography: ΚΟΜΟΛεξ, A Cypriot Morphological Dictionary 157
Γεωργία Ανδρέου & Ματίνα Τασιούδη: Η ανάπτυξη του λεξιλογίου σε παιδιά με Σύνδρομο Απνοιών στον Ύπνο 175
Ανθούλα-Ελευθερία Ανδρεσάκη: Ιατρικές μεταφορές στον δημοσιογραφικό λόγο της κρίσης: Η οπτική γωνία των Γερμανών 187
Μαρία Ανδριά: Προσεγγίζοντας θέματα Διαγλωσσικής Επίδρασης μέσα από το πλαίσιο της Γνωσιακής Γλωσσολογίας: ένα παράδειγμα από την κατάκτηση της Ελληνικής ως Γ2 199
Spyros Armostis & Kakia Petinou: Mastering word-initial syllable onsets by Cypriot Greek toddlers with and without early language delay 215
Julia Bacskai-Atkari: Ambiguity and the Internal Structure of Comparative Complements in Greek 231
Costas Canakis: Talking about same-sex parenthood in contemporary Greece: Dynamic categorization and indexicality 243
Michael Chiou: The pragmatics of future tense in Greek 257
Maria Chondrogianni: The Pragmatics of the Modern Greek Segmental Markers 269
Katerina Christopoulou, George J. Xydopoulos & Anastasios Tsangalidis: Grammatical gender and offensiveness in Modern Greek slang vocabulary 291
Aggeliki Fotopoulou, Vasiliki Foufi, Tita Kyriacopoulou & Claude Martineau: Extraction of complex text segments in Modern Greek 307
Αγγελική Φωτοπούλου & Βούλα Γιούλη: Από την «Έκφραση» στο «Πολύτροπο»: σχεδιασμός και οργάνωση ενός εννοιολογικού λεξικού 327
Marianthi Georgalidou, Sofia Lampropoulou, Maria Gasouka, Apostolos Kostas & Xanthippi Foulidi: “Learn grammar”: Sexist language and ideology in a corpus of Greek Public Documents 341
Maria Giagkou, Giorgos Fragkakis, Dimitris Pappas & Harris Papageorgiou: Feature extraction and analysis in Greek L2 texts in view of automatic labeling for proficiency levels 357
Dionysis Goutsos, Georgia Fragaki, Irene Florou, Vasiliki Kakousi & Paraskevi Savvidou: The Diachronic Corpus of Greek of the 20th century: Design and compilation 369
Kleanthes K. Grohmann & Maria Kambanaros: Bilectalism, Comparative Bilingualism, and the Gradience of Multilingualism: A View from Cyprus 383
Günther S. Henrich: „Γεωγραφία νεωτερική“ στο Λίβιστρος και Ροδάμνη: μετατόπιση ονομάτων βαλτικών χωρών προς την Ανατολή 397
Noriyo Hoozawa-Arkenau & Christos Karvounis: Vergleichende Diglossie - Aspekte im Japanischen und Neugriechischen: Varietäten - Interferenz 405
Μαρία Ιακώβου, Ηριάννα Βασιλειάδη-Λιναρδάκη, Φλώρα Βλάχου, Όλγα Δήμα, Μαρία Καββαδία, Τατιάνα Κατσίνα, Μαρίνα Κουτσουμπού, Σοφία-Νεφέλη Κύτρου, Χριστίνα Κωστάκου, Φρόσω Παππά & Σταυριαλένα Περρέα: ΣΕΠΑΜΕ2: Μια καινούρια πηγή αναφοράς για την Ελληνική ως Γ2 419
Μαρία Ιακώβου & Θωμαΐς Ρουσουλιώτη: Βασικές αρχές σχεδιασμού και ανάπτυξης του νέου μοντέλου αναλυτικών προγραμμάτων για τη διδασκαλία της Ελληνικής ως δεύτερης/ξένης γλώσσας 433
Μαρία Καμηλάκη: «Μαζί μου ασχολείσαι πόσο μαλάκας είσαι»: Λέξεις-ταμπού και κοινωνιογλωσσικές ταυτότητες στο σύγχρονο ελληνόφωνο τραγούδι 449
Μαρία Καμηλάκη, Γεωργία Κατσούδα & Μαρία Βραχιονίδου: Η εννοιολογική μεταφορά σε λέξεις-ταμπού της ΝΕΚ και των νεοελληνικών διαλέκτων 465
Eleni Karantzola, Georgios Mikros & Anastassios Papaioannou: Lexico-grammatical variation and stylometric profile of autograph texts in Early Modern Greek 479
Sviatlana Karpava, Maria Kambanaros & Kleanthes K. Grohmann: Narrative Abilities: MAINing Russian–Greek Bilingual Children in Cyprus 493
Χρήστος Καρβούνης: Γλωσσικός εξαρχαϊσμός και «ιδεολογική» νόρμα: Ζητήματα γλωσσικής διαχείρισης στη νέα ελληνική 507
Demetra Katis & Kiki Nikiforidou: Spatial prepositions in early child Greek: Implications for acquisition, polysemy and historical change 525
Γεωργία Κατσούδα: Το επίθημα -ούνα στη ΝΕΚ και στις νεοελληνικές διαλέκτους και ιδιώματα 539
George Kotzoglou: Sub-extraction from subjects in Greek: Its existence, its locus and an open issue 555
Veranna Kyprioti: Narrative identity and age: the case of the bilingual in Greek and Turkish Muslim community of Rhodes, Greece 571
Χριστίνα Λύκου: Η Ελλάδα στην Ευρώπη της κρίσης: Αναπαραστάσεις στον ελληνικό δημοσιογραφικό λόγο 583
Nikos Liosis: Systems in disruption: Propontis Tsakonian 599
Katerina Magdou, Sam Featherston: Resumptive Pronouns can be more acceptable than gaps: Experimental evidence from Greek 613
Maria Margarita Makri: Opos identity comparatives in Greek: an experimental investigation 629
Volume 2
Contents 651
Vasiliki Makri: Gender assignment to Romance loans in Katoitaliótika: a case study of contact morphology 659
Evgenia Malikouti: Usage Labels of Turkish Loanwords in three Modern Greek Dictionaries 675
Persephone Mamoukari & Penelope Kambakis-Vougiouklis: Frequency and Effectiveness of Strategy Use in SILL questionnaire using an Innovative Electronic Application 693
Georgia Maniati, Voula Gotsoulia & Stella Markantonatou: Contrasting the Conceptual Lexicon of ILSP (CL-ILSP) with major lexicographic examples 709
Γεώργιος Μαρκόπουλος & Αθανάσιος Καρασίμος: Πολυεπίπεδη επισημείωση του Ελληνικού Σώματος Κειμένων Αφασικού Λόγου 725
Πωλίνα Μεσηνιώτη, Κατερίνα Πούλιου & Χριστόφορος Σουγανίδης: Μορφοσυντακτικά λάθη μαθητών Τάξεων Υποδοχής που διδάσκονται την Ελληνική ως Γ2 741
Stamatia Michalopoulou: Third Language Acquisition: The Pro-Drop-Parameter in the Interlanguage of Greek students of German 759
Vicky Nanousi & Arhonto Terzi: Non-canonical sentences in agrammatism: the case of Greek passives 773
Καλομοίρα Νικολού, Μαρία Ξεφτέρη & Νίτσα Παραχεράκη: Το φαινόμενο της σύνθεσης λέξεων στην κυκλαδοκρητική διαλεκτική ομάδα 789
Ελένη Παπαδάμου & Δώρης Κ. Κυριαζής: Μορφές διαβαθμιστικής αναδίπλωσης στην ελληνική και στις άλλες βαλκανικές γλώσσες 807
Γεράσιμος Σοφοκλής Παπαδόπουλος: Το δίπολο «Εμείς και οι Άλλοι» σε σχόλια αναγνωστών της Lifo σχετικά με τη Χρυσή Αυγή 823
Ελένη Παπαδοπούλου: Η συνδυαστικότητα υποκοριστικών επιθημάτων με β συνθετικό το επίθημα -άκι στον διαλεκτικό λόγο 839
Στέλιος Πιπερίδης, Πένυ Λαμπροπούλου & Μαρία Γαβριηλίδου: clarin:el: Υποδομή τεκμηρίωσης, διαμοιρασμού και επεξεργασίας γλωσσικών δεδομένων 851
Maria Pontiki: Opinion Mining and Target Extraction in Greek Review Texts 871
Anna Roussou: The duality of mipos 885
Stathis Selimis & Demetra Katis: Reference to static space in Greek: A cross-linguistic and developmental perspective of poster descriptions 897
Evi Sifaki & George Tsoulas: XP-V orders in Greek 911
Konstantinos Sipitanos: On desiderative constructions in Naousa dialect 923
Eleni Staraki: Future in Greek: A Degree Expression 935
Χριστίνα Τακούδα & Ευανθία Παπαευθυμίου: Συγκριτικές διδακτικές πρακτικές στη διδασκαλία της ελληνικής ως Γ2: από την κριτική παρατήρηση στην αναπλαισίωση 945
Alexandros Tantos, Giorgos Chatziioannidis, Katerina Lykou, Meropi Papatheohari, Antonia Samara & Kostas Vlachos: Corpus C58 and the interface between intra- and inter-sentential linguistic information 961
Arhonto Terzi & Vina Tsakali: The contribution of Greek SE in the development of locatives 977
Paraskevi Thomou: Conceptual and lexical aspects influencing metaphor realization in Modern Greek 993
Nina Topintzi & Stuart Davis: Features and Asymmetries of Edge Geminates 1007
Liana Tronci: At the lexicon-syntax interface: Ancient Greek constructions with ἔχειν and psychological nouns 1021
Βίλλυ Τσάκωνα: «Δημοκρατία είναι 4 λύκοι και 1 πρόβατο να ψηφίζουν για φαγητό»: Αναλύοντας τα ανέκδοτα για τους/τις πολιτικούς στην οικονομική κρίση 1035
Ειρήνη Τσαμαδού-Jacoberger & Μαρία Ζέρβα: Εκμάθηση ελληνικών στο Πανεπιστήμιο Στρασβούργου: κίνητρα και αναπαραστάσεις 1051
Stavroula Tsiplakou & Spyros Armostis: Do dialect variants (mis)behave? Evidence from the Cypriot Greek koine 1065
Αγγελική Τσόκογλου & Σύλα Κλειδή: Συζητώντας τις δομές σε -οντας 1077
Αλεξιάννα Τσότσου: Η μεθοδολογική προσέγγιση της εικόνας της Γερμανίας στις ελληνικές εφημερίδες 1095
Anastasia Tzilinis: Begründendes Handeln im neugriechischen Wissenschaftlichen Artikel: Die Situierung des eigenen Beitrags im Forschungszusammenhang 1109
Κυριακούλα Τζωρτζάτου, Αργύρης Αρχάκης, Άννα Ιορδανίδου & Γιώργος Ι. Ξυδόπουλος: Στάσεις απέναντι στην ορθογραφία της Κοινής Νέας Ελληνικής: Ζητήματα ερευνητικού σχεδιασμού 1123
Nicole Vassalou, Dimitris Papazachariou & Mark Janse: The Vowel System of Mišótika Cappadocian 1139
Marina Vassiliou, Angelos Georgaras, Prokopis Prokopidis & Haris Papageorgiou: Co-referring or not co-referring? Answer the question 1155
Jeroen Vis: The acquisition of Ancient Greek vocabulary 1171
Christos Vlachos: Mod(aliti)es of lifting wh-questions 1187
Ευαγγελία Βλάχου & Κατερίνα Φραντζή: Μελέτη της χρήσης των ποσοδεικτών λίγο-λιγάκι σε κείμενα πολιτικού λόγου 1201
Madeleine Voga: Τι μας διδάσκουν τα ρήματα της ΝΕ σχετικά με την επεξεργασία της μορφολογίας 1213
Werner Voigt: «Σεληνάκι μου λαμπρό, φέγγε μου να περπατώ …» oder warum es in dem bekannten Lied nicht so, sondern eben φεγγαράκι heißt und ngr. φεγγάρι 1227
Μαρία Βραχιονίδου: Υποκοριστικά επιρρήματα σε νεοελληνικές διαλέκτους και ιδιώματα 1241
Jeroen van de Weijer & Marina Tzakosta: The Status of Complex in Greek 1259
Theodoros Xioufis: The pattern of the metaphor within metonymy in the figurative language of romantic love in modern Greek 1275
FEATURE EXTRACTION AND ANALYSIS IN GREEK | 357
FEATURE EXTRACTION AND ANALYSIS IN GREEK L2 TEXTS IN VIEW OF AUTOMATIC LABELING FOR PROFICIENCY LEVELS
Maria Giagkou1, Giorgos Fragkakis, Dimitris Pappas1 & Harris Papageorgiou1
1Institute for Language and Speech Processing, RC ATHENA
mgiagkou@ilsp.gr, fragakis@sch.gr, dpappas@ilsp.gr, xaris@ilsp.gr
Abstract
This paper investigates a set of linguistic features of texts addressed to learners of Greek as an L2 and examines how these features relate to the proficiency level for which the texts are considered appropriate. The aim is to establish which features discriminate sufficiently between levels, so that they can be exploited in an approach to automatic classification by proficiency level. To this end, we use a corpus compiled from Greek L2 textbooks. The results highlight the significant effect of, among others, features that quantify the complexity of syntactic dependency trees, the genitive case and adjectival modifiers.
Keywords: L2 reading, text complexity, linguistic features, proficiency levels, automatic labelling
1 Introduction
The last two decades have seen increasing interest in modelling text difficulty, i.e. readability. Automatic readability estimation systems are intended to assess whether a text retrieved from a large collection, such as a repository or the web, is appropriate for a given group of readers according to their abilities in L1, or by taking into account the readers' special needs (e.g. learning difficulties). Readability estimation is particularly relevant for second language (L2) learners as well. From the L2 perspective, the aim is to automatically identify or retrieve a text given the proficiency level of the learner or group of learners.
To this end, recent studies attempt to grade L2 texts according to proficiency levels in order to facilitate reading in L2 or as an aid to the selection of assessment material (e.g. Centre for the Greek Language 2013; Tzimokas and Tantos 2014; François and Fairon 2012; Ott and Meurers 2010; Pilán et al. 2014; Vajjala and Meurers 2012). In a similar approach, the development of productive skills in L2 (mainly writing) is investigated in view of an automated evaluation of L2 writing (e.g. Lu 2010, 2011; Vyatkina 2012; Giagkou et al. 2015).
The long tradition of L1 readability assessment, dating back to the early 20th century (see DuBay 2006), has bequeathed readability formulas (e.g. Flesch Reading Ease Score, Flesch-Kincaid Grade Level, Fog index, SMOG, etc.) that assign a difficulty grade or level to a text by relying on surface linguistic features, such as sentence and word length, as simple proxies for syntactic complexity and vocabulary burden, respectively. More recently, advances in NLP have boosted readability research. New resources (electronically available texts) and new tools (taggers, parsers, semantic treebanks, etc.) have made it feasible to apply machine learning techniques to large training corpora and to quantify more thorough and linguistically sound text features. Semantic and discourse features are investigated, e.g. named entities (Barzilay & Lapata 2008) and lexical cohesion (Pitler & Nenkova 2008). Shallow syntactic complexity indicators, such as average sentence length, are combined with the height of syntactic trees (see also Heilman et al. 2008). Instead of simple proxies of vocabulary burden, n-gram language models (LM) are used for predicting the grade level of texts (Callan and Eskenazi 2007; Petersen & Ostendorf 2009; Schwarm and Ostendorf 2005).
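To make the notion of a surface-based readability formula concrete, the two Flesch measures named above can be sketched as follows. The coefficients are the standard published ones, which were calibrated on English text; the counts in the example are hypothetical and are not drawn from the corpus studied here.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease: higher scores indicate easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level: an approximate US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Hypothetical counts for a short English passage (illustration only).
counts = {"words": 100, "sentences": 5, "syllables": 150}
print(flesch_reading_ease(**counts))
print(flesch_kincaid_grade(**counts))
```

Both formulas use only average sentence length and average syllables per word, which is why they are described as simple proxies for syntactic complexity and vocabulary burden.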
In this paper, we present an investigation of linguistic features of texts addressed to learners of Greek as a second language (L2). The goal of this study is to identify the textual properties that indicate the development of reading skills in Greek L2, with the aim of employing these properties as parameters for automatic proficiency level labelling. The set of features investigated in the current study draws on traditional readability research combined with NLP-enabled features and machine learning techniques for text classification, as this merging was found to result in performance gain (François & Miltsakaki 2012).
The paper is organized as follows. Section 2 provides information on the corpus used and on the features identified, selected and computed in order to form the dataset for the analysis. In Section 3, the analysis applied to the features is presented and the results are discussed. We conclude with a summary of the main findings and their implications for future work on automatic proficiency level classification for Greek L2.
2 Datasets
2.1 Corpus
For the purposes of this investigation, a Greek L2 text set that is labelled for proficiency levels in an objective and qualified way, and can thus be considered a gold standard, was deemed necessary. Such a dataset was retrieved from the Greek L2 textbooks published by the Centre of Intercultural and Migration Studies (EDIAMME), which are freely available online. These textbooks are addressed to Greek migrants living abroad, from pre-schoolers (aged 6) to 18-year-olds, learning Greek as a second or foreign language. EDIAMME employs five proficiency levels aligned to the Greek educational system grades and to CEFR levels (Council of Europe 2001), as presented in Table 1.
Age | School grade | EDIAMME level | Language content | CEFR level alignment
6 | Preschool | 1 | Pre-reading, reading | A1
7 | 1 | | |
8 | 2 | | |
9 | 3 | 2 | Speaking and writing consolidation | A2
10 | 4 | | |
11 | 5 | 3 | Further practice in speaking and writing | B1
12 | 6 | | |
13 | 7 | 4 | Independent writing | B2 & C1
14 | 8 | | |
15 | 9 | | |
16 | 10 | 5 | Greek language and literature | C2
17 | 11 | | |
18 | 12 | | |

Table 1 | EDIAMME proficiency levels (Damanakis 2004: 76) and their alignment to CEFR levels (EDIAMME 2014)

Only prose texts were extracted from the textbooks, while poems, lyrics, exercises and guidelines to the exercises were excluded. The selected texts belong to different genres (mainly narrative, descriptive, expository and procedural) and types (letters, announcements, instructions, diary entries, etc.). Dialogues were also included, as they are very frequently used as educational material in L2 textbooks, though the role/name of the speaker was removed.

The final corpus employed in this investigation comprises 753 texts and a total of 112169 tokens (Table 2). Each individual text inherited the proficiency level assigned to the textbook it was retrieved from; e.g. a text drawn from a textbook labeled as level 5 was considered as addressed to level 5 learners.1

Grouped levels | EDIAMME levels | Texts | Sentences | Tokens
1 (CEFR A1-A2) | 1 | 24 | 136 | 720
 | 2 | 295 | 4552 | 33636
2 (CEFR B1-C1) | 3 | 108 | 1263 | 8780
 | 4 | 147 | 2305 | 19272
3 (CEFR C2) | 5 | 179 | 3356 | 49761
Totals | | 753 | 11612 | 112169

Table 2 | Corpus description

1 It should be noted that this decision introduces a degree of "noise" into the data: although a low-level textbook is not expected to include a text addressed to higher levels, the reverse is not equally unlikely. For example, certain texts retrieved from a level 5 textbook may actually address lower-level learners.
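The grouping of the five EDIAMME levels into three grouped levels can be checked with a few lines of Python. The sketch below copies the per-level counts from Table 2 and sums them by group; the variable and function names are ours and purely illustrative.

```python
# EDIAMME level -> grouped level, as listed in Table 2.
GROUP_OF_LEVEL = {1: 1, 2: 1, 3: 2, 4: 2, 5: 3}

# Per-level counts copied from Table 2: (texts, sentences, tokens).
CORPUS = {
    1: (24, 136, 720),
    2: (295, 4552, 33636),
    3: (108, 1263, 8780),
    4: (147, 2305, 19272),
    5: (179, 3356, 49761),
}


def totals_by_group(corpus, group_of_level):
    """Sum texts, sentences and tokens for each grouped level."""
    totals = {}
    for level, counts in corpus.items():
        group = group_of_level[level]
        previous = totals.get(group, (0, 0, 0))
        totals[group] = tuple(p + c for p, c in zip(previous, counts))
    return totals


group_totals = totals_by_group(CORPUS, GROUP_OF_LEVEL)
grand_total = tuple(sum(column) for column in zip(*CORPUS.values()))
print(group_totals)
print(grand_total)
```

The per-group sums show that the three grouped levels are unevenly represented, which is worth keeping in mind when reading the correlation results.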
The texts were automatically annotated for morphological types, syntactic dependencies and phrase structure using the Institute for Language and Speech Processing NLP tools pipeline (Prokopidis et al. 2011; Prokopidis and Papageorgiou 2014).
2.2 Feature selection and computation
The set of features investigated as indices of the proficiency level was selected on the basis of previous research on L1 and L2 readability assessment, as well as on second language acquisition and development. These features capture morphological, syntactic, lexical/semantic and other attributes of the text that are salient to the target proficiency level discrimination and prediction task.
In total, 303 text features were identified and computed. These fall roughly into the following categories:
a) Surface features: word and sentence length (e.g. average word length), number of characters, punctuation marks, numbers, etc.
b) Lexical/semantic: lexical density (i.e. the ratio of content to function words), lexical variation (e.g. type/token ratio, hapax/dis legomena), including noun and verb variation measures, text entropy, lexical richness, etc.
c) Morphological: frequencies and ratios of the different parts of speech, including their forms, e.g. ratio of passive verbs to verbs, ratio of nouns in the genitive case to nouns, ratio of 1st person personal pronouns to pronouns, etc.
d) Syntactic: frequencies and ratios of the different syntactic roles (e.g. subjects to verbs ratio), measures of the dependency trees (e.g. depth and height of syntactic trees), phrase structure (e.g. length of noun, verb and adjectival phrases), subordination and apposition (e.g. average number of coordinating and subordinating conjunctions per sentence), etc.
e) Discourse-based features: e.g. use of relative pronouns as an index of the degree of anaphora density, frequency of present and past tenses as indices of temporality and narrativity, etc.
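A few of the surface and lexical features in categories (a) and (b) can be approximated directly from raw text. The following is a minimal sketch, not the FeatExt implementation: the function name, the regular-expression tokenizer and the naive sentence splitter are our own assumptions, whereas the actual features were computed on automatically annotated text.

```python
import re


def surface_features(text: str) -> dict:
    """Compute a few surface/lexical features from non-empty raw text."""
    # Naive segmentation on terminal punctuation (';' is the Greek question mark).
    sentences = [s for s in re.split(r"[.!?;]+", text) if s.strip()]
    sentence_lengths = [len(re.findall(r"\w+", s)) for s in sentences]
    words = [w.lower() for w in re.findall(r"\w+", text)]
    long_sentences = sum(n > 10 for n in sentence_lengths)
    return {
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "avg_sentence_length": sum(sentence_lengths) / len(sentences),
        "type_token_ratio": len(set(words)) / len(words),
        "pct_sentences_over_10_words": 100 * long_sentences / len(sentences),
    }


# Hypothetical two-sentence sample (not from the corpus).
sample = "Η Μαρία διαβάζει ένα βιβλίο. Το βιβλίο είναι καλό."
print(surface_features(sample))
```

Features in categories (c) to (e) cannot be derived this way, since they presuppose part-of-speech, dependency and phrase-structure annotation.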
The defined features were computed with specialized software, the ILSP FeatExt tool, developed in Python. The input of FeatExt is any corpus of Greek texts automatically annotated for part of speech, syntactic dependencies and phrase structure. It calculates the values of raw surface features (frequencies of words, sentences, nouns, verbs, etc.) and computes their standardized values (i.e. meaningful ratios). In order to cater for zero values, a MinMaxScaler transformation is applied to all raw features. The output is a table of extracted feature values, preferably in CSV format. Settings can be modified through an optional configuration file to define, among others, the set of features to be computed, the corpus location, or additional feature-relevant data, such as a list of words to be counted (e.g. function words, basic vocabulary for a specific proficiency level or topic, etc.).
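The scaling and export steps can be illustrated with the standard library alone. The name MinMaxScaler matches a class in scikit-learn, but the paper does not state which implementation FeatExt relies on, so the sketch below simply applies the usual min-max formula, (x - min) / (max - min), to each column. The feature names and values are hypothetical.

```python
import csv
import io


def min_max_scale(values):
    """Rescale numbers to [0, 1]; a constant column maps to 0.0 throughout."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


# Hypothetical raw feature values for three texts (not from the paper).
raw = {
    "text_id": ["t1", "t2", "t3"],
    "n_sentences": [4, 10, 7],
    "n_passive_verbs": [0, 3, 6],
}

scaled = {name: min_max_scale(col) for name, col in raw.items() if name != "text_id"}

# Write one row per text to an in-memory CSV table.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["text_id", *scaled])
for i, text_id in enumerate(raw["text_id"]):
    writer.writerow([text_id, *(scaled[name][i] for name in scaled)])
csv_text = buffer.getvalue()
print(csv_text)
```

Writing to an in-memory buffer keeps the example self-contained; in practice the same writer would be pointed at a file.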
3 Analysis and results
In order to investigate the underlying associations of text features with the proficiency level, correlation analysis was applied between all the extracted features and the grouped proficiency levels. Table 3 reports the twenty features that exhibited the highest absolute values of Spearman's rho correlation coefficient, in descending order (p < 0.05).
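For readers unfamiliar with the statistic, Spearman's rho is the Pearson correlation computed on ranks, with tied observations sharing the mean of their ranks. The sketch below is a plain-Python illustration under that definition; it is not the software used for the analysis, and the example values are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: grouped level of six texts and one feature value per text.
levels = [1, 1, 2, 2, 3, 3]
feature = [0.2, 0.4, 0.3, 0.6, 0.8, 0.9]
print(spearman_rho(levels, feature))
```

Because the grouped level takes only three values, ties are unavoidable on that variable, which is why the average-rank treatment matters here.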
Among the best performing features, the average number of noun phrases in the genitive case per sentence was found to exhibit the highest correlation coefficient (rho = 0.542). The association of the genitive case with the text's level is also evidenced by the performance of two more features, i.e. the average number of adjectival phrases in the genitive case per sentence (rho = 0.473) and the average length of adjectival phrases in the genitive case (rho = 0.448). Looking at these results from a different angle, the influence of phrase structure, especially of the length and relative frequency of nominal phrases, is apparent: out of the 20 best performing features, six are indices of phrase structure (features in ranks 1, 6, 8, 12, 15 and 16 in Table 3). The frequency of use of modifiers, namely of adjectives, also seems to be highly correlated with the proficiency level: the more adjectives used in a text, the more likely it is that the text is addressed to higher-level learners. This is evidenced by the average number of adjectival phrases and of adjectives per sentence.
Another important finding is highlighted by the performance of features that attempt to quantify syntactic dependencies. These include the width and height of dependency trees (rho = 0.495 and 0.486, respectively) as well as the number of leaves and governor nodes (rho = 0.490 and 0.485, respectively). Their emergence in the top ranks of Table 3 qualifies them as key predictors of the proficiency level.
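The dependency-tree measures can be illustrated on a toy sentence. The paper does not spell out how each measure is defined, so the definitions below are assumptions: height is the longest root-to-leaf path counted in edges, width is the largest number of nodes at any single depth, leaves are tokens without dependents, and governors are tokens with at least one dependent. The input format (a list of head indices) and the function name are likewise hypothetical.

```python
from collections import Counter


def tree_measures(heads):
    """heads[i] is the index of token i's governor; the single root has -1."""
    children = {i: [] for i in range(len(heads))}
    root = None
    for token, head in enumerate(heads):
        if head == -1:
            root = token
        else:
            children[head].append(token)

    # Breadth-first walk from the root, recording the depth of every node.
    depth = {root: 0}
    queue = [root]
    for node in queue:
        for child in children[node]:
            depth[child] = depth[node] + 1
            queue.append(child)

    nodes_per_depth = Counter(depth.values())
    return {
        "height": max(depth.values()),
        "width": max(nodes_per_depth.values()),
        "leaves": sum(1 for c in children.values() if not c),
        "governors": sum(1 for c in children.values() if c),
    }


# "Η Μαρία διαβάζει ένα βιβλίο": the verb is the root; each article depends on its noun.
print(tree_measures([1, 2, -1, 4, 2]))
```

Averaging such per-sentence values over a text would give features of the kind listed in ranks 2, 3, 4 and 7 of Table 3, under the assumed definitions.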
Table 3 | Top-20 features highly correlated with EDIAMME grouped levels and post hoc multiple comparisons between level-pairs

Rank | Feature | Spearman's rho
1 | Av. # of noun phrases in gen. case per sentence | 0.542
2 | Av. width of dependency trees | 0.495
3 | Av. # of leaves in dependency trees | 0.490
4 | Av. height of dependency trees | 0.486
5 | Av. sentence length | 0.485
6 | Av. # of adjectival phrases per sentence | 0.485
7 | Av. # of governor nodes in dependency trees | 0.485
8 | Av. # of noun phrases per sentence | 0.480
9 | % of sentences with length > 20 words | 0.477
10 | Av. # of adjectives per sentence | 0.474
11 | Av. word length | 0.474
12 | Av. # of adjectival phrases in gen. case per sentence | 0.473
13 | % of sentences with length > 10 words | 0.470
14 | Terminal punctuation to total characters ratio | -0.461
15 | Av. length of adjectival phrases in gen. case | 0.448
16 | Av. # of adjectival phrases in acc. case per sentence | 0.446
17 | % of sentences with length > 30 words | 0.443
18 | Av. # of passive verbs per sentence | 0.442
19 | Relative pronouns to pronouns ratio | 0.439
20 | Av. # of prepositions per sentence | 0.438

[Columns for the EDIAMME grouped level-pairs (1 vs 2, 2 vs 3, 1 vs 3): cell entries not recoverable]
Different aspects of syntactic complexity are also highlighted by the average number of passive verbs and prepositions per sentence. As expected, passive constructions are rarely used in lower levels, while learners encounter them more and more frequently in textbooks as their reading skills develop. The same is true for prepositions, a feature indicating that higher proficiency level texts employ more complex-compound sentences.
The statistically significant correlation exhibited by the ratio of relative pronouns to pronouns (rho = 0.439) points to the role of anaphora. As anaphora resolution is considered a linguistically and cognitively demanding task during reading, anaphoric structures are rare in lower levels but significantly more frequent in upper levels. As a result, the use of relative pronouns can be considered a successful discriminator of proficiency levels.
The list of the best performing features also includes some more "traditional" indices of text complexity, such as word and sentence length. The average sentence length appears in rank 5 in Table 3 (rho = 0.485), while related features that quantify sentence length from a different perspective are also present (the percentage of sentences with more than 10, 20 and 30 words). Additionally, the presence of the ratio of terminal punctuation to total characters should also be interpreted as an inverse measure of sentence length. Regarding lexical features, it is noticeable that among the various features investigated (lexical diversity, density, etc.) only the average word length is present among the top performers (rho = 0.474).
A more thorough investigation of the above features employed one-way ANOVA for means comparison across levels, which resulted in statistically significant main effects for all of the 20 features. Since, however, this type of analysis cannot determine whether the mean values of a feature are statistically different between all possible level pairs, post hoc multiple comparisons (Bonferroni tests) were also applied. The results are presented in Table 3: statistically different means for each feature are indicated for each level combination separately. These comparisons indicate that all features can successfully discriminate group 3 (i.e. EDIAMME level 5, CEFR C2) from lower levels (both from group 2 and group 1). However, some of the features were not as successful in discriminating group 1 (i.e. EDIAMME levels 1 and 2, CEFR A1, A2) from group 2 (i.e. EDIAMME levels 3, 4, CEFR B1-C1). Poor performers in discriminating group 1 from group 2 were all the features relevant to sentence length, with the exception of the proportion of sentences with more than 20 words. This implies that a group 1 text is unlikely to include lengthier sentences, thus imposing a possible threshold for the transition from CEFR A2 to B1 level.
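The two steps of this procedure can be sketched in plain Python. The first function computes the one-way ANOVA F statistic from its textbook definition (between-group mean square over within-group mean square); the second gives the per-comparison significance threshold under a Bonferroni correction for all pairwise comparisons. Obtaining a p-value additionally requires the F distribution, which a statistics package such as SciPy (scipy.stats.f_oneway) provides. The feature values are hypothetical.

```python
from itertools import combinations


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric samples."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


def bonferroni_alpha(alpha, n_groups):
    """Per-comparison threshold when all pairs of groups are compared."""
    n_comparisons = len(list(combinations(range(n_groups), 2)))
    return alpha / n_comparisons


# Hypothetical values of one feature in the three grouped levels.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
print(one_way_anova_f(groups))
print(bonferroni_alpha(0.05, len(groups)))
```

With three grouped levels there are three pairwise comparisons (1 vs 2, 2 vs 3, 1 vs 3), so a nominal threshold of 0.05 becomes 0.05 / 3 for each pair.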
4 Conclusions and discussion
The current investigation highlighted a number of textual features automatically extracted from a morphologically and syntactically annotated Greek L2 corpus. With the aim of identifying indices of text difficulty that are directly associated with the proficiency level, we employed statistical analysis and put forward the best performing features. These can be regarded as potential predictors of the proficiency level of a previously unseen text in an automatic labelling/classification approach.
The results highlight the influence of syntactic features on the characterization of proficiency level: with the exception of average word length, the rest of the best performing features are directly or indirectly related to syntactic complexity. This finding is in line with previous research, where syntax-related features consistently appear in the best-performing prediction models (e.g. Pitler and Nenkova 2008; Schwarm and Ostendorf 2005; Callan and Eskenazi 2007; Kate et al. 2010; Kotani et al. 2008). The frequencies of the genitive case, of adjectives and of prepositions were additionally identified as successful discriminators. Surface features used in traditional readability formulas, such as sentence and word length, were found to be significantly correlated with proficiency levels. Similar recent research on Greek has also highlighted the influence of such surface features on proficiency level classification (Tzimokas and Tantos 2014). It is interesting to note that some of the features put forward by Georgatou (2016) as the most informative, i.e. sentence length, passive verbs and adjectives, are confirmed by the current study as well, thus qualifying them as reliable indices of the difficulty level of Greek texts.
When the best performing features were tested for their discriminatory power between all possible level pairs, they proved to be highly discriminative of the upper proficiency level. This finding implies a significant shift in L2 reading skills during the transition from C1 to C2 level, and this shift can successfully be measured by the features investigated herein. By contrast, the transition from A2 to B1 seems to go hand in hand with the acquisition of language skills that are not captured by the features that emerged from the current analysis.
It is true that the current investigation is subject to limitations imposed by the corpus at hand, which comprised texts drawn from textbooks of a single publisher. As such, the findings may be influenced by the publisher's choices regarding the types and topics of texts and the linguistic descriptors of proficiency levels the editor has adopted. To address this limitation, the work described herein is being continued and expanded in order to exploit a larger corpus of Greek L2 texts from different publishers. Proficiency level labelling for this expanded corpus does not rely exclusively on the publisher's labelling. Rather, three independent experts in Greek L2 teaching have judged each text to determine the CEFR proficiency level. The experts' judgements are treated as the dependent variable in a machine learning approach for the automatic labelling of previously unseen texts, which has already yielded significant results.
Reading comprehension is a key skill in L2 development, and reading is an integral part of L2 instruction and assessment. In this view, an automated approach to matching L2 learners to texts suitable for their proficiency level is expected to facilitate the selection of reading material both for learners and teachers. It is at the same time an anticipated aid in assessment procedures, by providing an objective measurement for the estimation of level-appropriateness of items included in diagnostic, placement or achievement language tests.
References
Barzilay, Regina, and Mirella Lapata. 2008. "Modeling Local Coherence: An Entity-based Approach." Computational Linguistics 34(1):1–34.
Centre for the Greek Language. 2013. "Logismiko Anagnosimotitas." Accessed March 1, 2017. httpwwwgreek-languagegrcertificationreadabi-lity
Council of Europe. 2001. Common European Framework of Reference for Languages: Learning, Teaching, Assessment (CEFR). www.coe.int/lang-CEFR
Damanakis, Michalis, ed. 2004. Theoritiko Plaisio kai Programmata Spoudon gia tin Ellinoglossi Ekpaideusi sti Diaspora. Rethymno: EDIAMME. httpwwwediammeedcuocgrdiaspora2indexphpid=23650010
DuBay, William H. 2006. The Classic Readability Studies. Costa Mesa, California: Impact Information.
EDIAMME. 2014. Epipeda Glossomatheias kai Ekpaideutiko Yliko. httpwwwediammeedcuocgrellinoglossiindexphpelekp-yliko-kepa
François, Thomas, and Cédrick Fairon. 2012. "An 'AI Readability' Formula for French as a Foreign Language." In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 466–477. Association for Computational Linguistics.
François, Thomas, and Eleni Miltsakaki. 2012. "Do NLP and Machine Learning Improve Traditional Readability Formulas?" In Proceedings of the First Workshop on Predicting and Improving Text Readability for Target Reader Populations, 49–57. Montréal.
Georgatou, Spyridoula. 2016. "Approaching Readability Features in Greek School Books." Master thesis, University of Tübingen.
Giagkou, Maria, Vicky Kantzou, Spyridoula Stamouli, and Maria Tzevelekou. 2015. "Discriminating CEFR Levels in Greek L2: A Corpus-based Study of Young Learners' Written Narratives." Bergen Language and Linguistics Studies 6:153–169.
Heilman, Michael, Kevyn Collins-Thompson, and Maxine Eskenazi. 2008. "An Analysis of Statistical Models and Features for Reading Difficulty Prediction." In The Third Workshop on Innovative Use of NLP for Building Educational Applications: Proceedings of the Workshop, 71–79. ACL.
Callan, Jamie, and Maxine Eskenazi. 2007. "Combining Lexical and Grammatical Features to Improve Readability Measures for First and Second Language Texts." In Proceedings of HLT-NAACL'07, 460–467. Association for Computational Linguistics.
Lu, Xiaofei. 2010. "Automatic Analysis of Syntactic Complexity in Second Language Writing." International Journal of Corpus Linguistics 15(4):474–496.
Lu, Xiaofei. 2011. "A Corpus-Based Evaluation of Syntactic Complexity Measures as Indices of College-Level ESL Writers' Language Development." TESOL Quarterly 45(1):36–62.
Ott, Niels, and Detmar Meurers. 2010. "Information Retrieval for Education: Making Search Engines Language Aware." Themes in Science and Technology Education 3(1–2):9–30.
Petersen, Sarah E., and Mari Ostendorf. 2009. "A Machine Learning Approach to Reading Level Assessment." Computer Speech and Language 23(1):89–106.
Pilán, Ildikó, Elena Volodina, and Richard Johansson. 2014. "Rule-Based and Machine Learning Approaches for Second Language Sentence-Level Readability." In Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications, 174–184. Association for Computational Linguistics.
Pitler, Emily, and Ani Nenkova. 2008. "Revisiting Readability: A Unified Framework for Predicting Text Quality." In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, 186–195. Honolulu: ACL.
Prokopidis, Prokopis, and Harris Papageorgiou. 2014. "Experiments for Dependency Parsing of Greek." In Proceedings of the First Joint Workshop on Statistical Parsing of Morphologically Rich Languages and Syntactic Analysis of Non-Canonical Languages, 90–96. Dublin, Ireland.
Prokopidis, Prokopis, Byron Georgantopoulos, and Harris Papageorgiou. 2011. "A Suite of NLP Tools for Greek." In The 10th International Conference of Greek Linguistics, 373–383. Komotini, Greece.
Schwarm, Sarah E., and Mari Ostendorf. 2005. "Reading Level Assessment Using Support Vector Machines and Statistical Language Models." In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), 523–530. Ann Arbor, Michigan.
Tzimokas, Dimitrios, and Sotirios Tantos. 2014. "Logismiko Anagnosimotitas Ellinikon Keimenon: Ena Neo Ergaleio gia ti Didaskalia tis Ellinikis os Ksenis/Deuteris Glossas kai tin Pistopoiisi Ellinomatheias." Paper presented at Diethnes Synedrio gia ti Didaskalia kai tin Pistopoiisi tis Ellinikis os Ksenis/Deuteris Glossas, Thessaloniki, October 25. httpspeakgreekgrelimagespdftzimwkaspdf
Vajjala, Sowmya, and Detmar Meurers. 2012. "On Improving the Accuracy of Readability Classification Using Insights from Second Language Acquisition." In Proceedings of the Seventh Workshop on Building Educational Applications Using NLP, 163–173. Association for Computational Linguistics.
Vyatkina, Nina. 2012. "The Development of Second Language Writing Complexity in Groups and Individuals: A Longitudinal Learner Corpus Study." The Modern Language Journal 96(4):576–598.
FEATURE EXTRACTION AND ANALYSIS IN GREEK L2 TEXTS IN VIEW OF AUTOMATIC LABELING FOR PROFICIENCY LEVELS
Maria Giagkou¹, Giorgos Fragkakis, Dimitris Pappas¹ & Harris Papageorgiou¹
¹Institute for Language and Speech Processing, RC ATHENA
mgiagkou@ilsp.gr, fragakis@sch.gr, dpappas@ilsp.gr, xaris@ilsp.gr
Abstract
This paper investigates a set of linguistic features of texts addressed to learners of Greek as a second language (L2) and examines how these features relate to the proficiency level for which the texts are considered appropriate. The aim is to determine which features discriminate sufficiently between levels, so that they can be exploited in an approach to automatic classification into proficiency levels. To this end, a corpus compiled from Greek L2 textbooks is used. The results highlight the significant effect of, among others, features that quantify the complexity of syntactic dependency trees, the genitive case and adjectival modifiers.
Keywords: L2 reading, text complexity, linguistic features, proficiency levels, automatic labelling
1 Introduction
The last two decades have seen increasing interest in modelling text difficulty, i.e. readability. Automatic readability estimation systems are intended to assess whether a text retrieved from a large collection, such as a repository or the web, is appropriate for a given group of readers, according to their abilities in L1 or by taking into account the readers' special needs (e.g. learning difficulties). Readability estimation is particularly relevant for second language (L2) learners as well. From the L2 perspective, the aim is to automatically identify or retrieve a text given the proficiency level of the learner or group of learners.
To this end, recent studies attempt to grade L2 texts according to proficiency levels in order to facilitate reading in L2 or as an aid to the selection of assessment material (e.g. Centre for the Greek Language 2013; Tzimokas and Tantos 2014; François and Fairon 2012; Ott and Meurers 2010; Pilán et al. 2014; Vajjala and Meurers 2012). In a similar approach, the development of productive skills in L2 (mainly writing) is investigated in view of an automated evaluation of L2 writing (e.g. Lu 2010, 2011; Vyatkina 2012; Giagkou et al. 2015).
The long tradition of L1 readability assessment, dating back to the early 20th century (see DuBay 2006), has bequeathed readability formulas (e.g. Flesch Reading Ease Score, Flesch-Kincaid Grade Level, Fog index, SMOG) that assign a difficulty grade or level to a text by relying on surface linguistic features such as sentence and word length, used as simple proxies for syntactic complexity and vocabulary burden respectively. More recently, advances in NLP have boosted readability research. New resources (electronically available texts) and new tools (taggers, parsers, semantic treebanks, etc.) have made it feasible to apply machine learning techniques to large training corpora and to quantify more thorough and linguistically sound text features. Semantic and discourse features are investigated, e.g. named entities (Barzilay & Lapata 2008) and lexical cohesion (Pitler & Nenkova 2008). Shallow syntactic complexity indicators, such as average sentence length, are combined with the height of syntactic trees (see also Heilman et al. 2008). Instead of simple proxies of vocabulary burden, N-gram Language Models (LM) are used for predicting the grade level of texts (Callan and Eskenazi 2007; Petersen & Ostendorf 2009; Schwarm and Ostendorf 2005).
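To make the idea of a surface-based readability formula concrete, the following minimal sketch implements the two Flesch formulas named above, using their standard published coefficients for English. The functions take pre-computed counts; the counts in the example are hypothetical, and the formulas are shown only as an illustration of the traditional approach, not as something applied to Greek in this study.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease for English; higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level for English; approximates a US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Hypothetical counts for a short passage: 100 words, 5 sentences, 150 syllables.
print(round(flesch_reading_ease(100, 5, 150), 3))
print(round(flesch_kincaid_grade(100, 5, 150), 2))
```

Both formulas depend only on average sentence length and average word length in syllables, which is exactly why they are described as simple proxies for syntactic complexity and vocabulary burden.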
In this paper we present an investigation of linguistic features of texts addressed to learners of Greek as a second language (L2). The goal of this study is to identify the textual properties that indicate the development of reading skills in Greek L2, with the aim of employing these properties as parameters for automatic proficiency level labelling. The set of features investigated in the current study draws on traditional readability research combined with NLP-enabled features and machine learning techniques for text classification, as this merging was found to result in a performance gain (François & Miltsakaki 2012).
The paper is organized as follows. Section 2 provides information on the corpus used and on the features identified, selected and computed in order to form the dataset for the analysis. In Section 3, the analysis applied to the features is presented and the results are discussed. We conclude with a summary of the main findings and their implications for future work on automatic proficiency level classification for Greek L2.
2 Datasets
2.1 Corpus
For the purposes of this investigation, a Greek L2 text set that is labelled for proficiency levels in an objective and qualified way, and can thus be considered a gold standard, was deemed necessary. Such a dataset was retrieved from the Greek L2 textbooks published by the Centre of Intercultural and Migration Studies (EDIAMME), which are freely available online. These textbooks are addressed to Greek migrants living abroad, from pre-schoolers (aged 6) to 18-year-olds, learning Greek as a second or foreign language. EDIAMME employs five proficiency levels aligned to the Greek educational system grades and to CEFR levels (Council of Europe 2001), as presented in Table 1.
Age | School grade | EDIAMME level | Language content | CEFR level alignment
6, 7, 8 | Preschool, 1, 2 | 1 | Pre-reading, reading | A1
9, 10 | 3, 4 | 2 | Speaking and writing consolidation | A2
11, 12 | 5, 6 | 3 | Further practice in speaking and writing | B1
13, 14, 15 | 7, 8, 9 | 4 | Independent writing | B2 & C1
16, 17, 18 | 10, 11, 12 | 5 | Greek language and literature | C2

Table 1 | EDIAMME proficiency levels (Damanakis 2004: 76) and their alignment to CEFR levels (EDIAMME 2014)

Only prose texts were extracted from the textbooks; poems, lyrics, exercises and guidelines to the exercises were excluded. The selected texts belong to different genres (mainly narrative, descriptive, expository and procedural) and types (letters, announcements, instructions, diary entries, etc.). Dialogues were also included, as they are very frequently used as educational material in L2 textbooks, though the role/name of the speaker was removed.

The final corpus employed in this investigation comprises 753 texts and a total of 112,169 tokens (Table 2). Each individual text inherited the proficiency level assigned to the textbook it was retrieved from; e.g. a text drawn from a textbook labelled as level 5 was considered as addressed to level 5 learners.¹

Grouped levels | EDIAMME levels | Texts | Sentences | Tokens
1 (CEFR A1-A2) | 1 | 24 | 136 | 720
1 (CEFR A1-A2) | 2 | 295 | 4552 | 33636
2 (CEFR B1-C1) | 3 | 108 | 1263 | 8780
2 (CEFR B1-C1) | 4 | 147 | 2305 | 19272
3 (CEFR C2) | 5 | 179 | 3356 | 49761
Totals | | 753 | 11612 | 112169

Table 2 | Corpus description

¹ It should be noted that this decision introduces a degree of "noise" into the data: although a low-level textbook is not expected to include a text addressed to higher levels, the reverse is not equally unlikely. For example, certain texts retrieved from a level 5 textbook may actually address lower-level learners.
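The grouping of the five EDIAMME levels into three grouped levels, as used in Table 2, can be sketched as follows. The counts are copied from Table 2; the variable and function names are illustrative only and are not taken from the authors' software.

```python
# Rows of Table 2: EDIAMME level -> (texts, sentences, tokens).
CORPUS = {
    1: (24, 136, 720),
    2: (295, 4552, 33636),
    3: (108, 1263, 8780),
    4: (147, 2305, 19272),
    5: (179, 3356, 49761),
}

# Grouping used in the analysis: EDIAMME level -> grouped level.
GROUP_OF = {1: 1, 2: 1, 3: 2, 4: 2, 5: 3}


def totals_by_group(corpus, group_of):
    """Sum texts, sentences and tokens within each grouped level."""
    out = {}
    for level, counts in corpus.items():
        group = group_of[level]
        previous = out.get(group, (0, 0, 0))
        out[group] = tuple(p + c for p, c in zip(previous, counts))
    return out


group_totals = totals_by_group(CORPUS, GROUP_OF)
grand_total = tuple(sum(column) for column in zip(*CORPUS.values()))
print(group_totals)
print(grand_total)  # (753, 11612, 112169), matching the totals row of Table 2
```

The per-group sums make the imbalance of the corpus visible: grouped level 3 consists of a single EDIAMME level but contributes the largest share of tokens.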
The texts were automatically annotated for morphological types, syntactic dependencies and phrase structure using the Institute for Language and Speech Processing NLP tools pipeline (Prokopidis et al. 2011; Prokopidis and Papageorgiou 2014).
2.2 Feature selection and computation
The set of features investigated as indices of the proficiency level was selected on the basis of previous research on L1 and L2 readability assessment, as well as on second language acquisition and development. These features capture morphological, syntactic, lexical/semantic and other attributes of the text that are salient to the target task of proficiency level discrimination and prediction.
In total, 303 text features were identified and computed. These fall broadly into the following categories:
a) Surface features: word and sentence length (e.g. average word length), number of characters, punctuation marks, numbers, etc.
b) Lexical/semantic: lexical density (i.e. content to functional words), lexical variation (e.g. type/token ratio, hapax/dis legomena), including noun and verb variation measures, text entropy, lexical richness, etc.
c) Morphological: frequencies and ratios of the different parts of speech, including their forms, e.g. ratio of passive verbs to verbs, ratio of nouns in the genitive case to nouns, ratio of 1st person personal pronouns to pronouns, etc.
d) Syntactic: frequencies and ratios of the different syntactic roles (e.g. subjects to verbs ratio), measures of the dependency trees (e.g. depth and height of syntactic trees), phrase structure (e.g. length of noun, verb and adjectival phrases), subordination and apposition (e.g. average number of coordinating and subordinating conjunctions per sentence), etc.
e) Discourse-based features: e.g. use of relative pronouns as an index of the degree of anaphora density, frequency of present and past tenses as indices of temporality and narrativity, etc.
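A few of the lexical measures in category (b) can be sketched as below. This is a minimal illustration under stated assumptions, not the FeatExt implementation: the function name, the returned keys and the tiny function-word list are hypothetical, and tokens are assumed to be already segmented.

```python
from collections import Counter


def lexical_features(tokens, function_words):
    """Compute a few lexical measures for one tokenized text."""
    counts = Counter(t.lower() for t in tokens)
    n_tokens = sum(counts.values())
    n_types = len(counts)
    n_function = sum(c for w, c in counts.items() if w in function_words)
    n_content = n_tokens - n_function
    return {
        "type_token_ratio": n_types / n_tokens,
        # Lexical density as described above: content words relative to function words.
        "content_to_function_ratio": n_content / n_function if n_function else float("inf"),
        "hapax_legomena": sum(1 for c in counts.values() if c == 1),
        "dis_legomena": sum(1 for c in counts.values() if c == 2),
        "avg_word_length": sum(len(w) * c for w, c in counts.items()) / n_tokens,
    }


# Hypothetical example sentence and function-word list.
tokens = ["το", "παιδί", "διαβάζει", "το", "βιβλίο", "και", "το", "παιδί", "γράφει"]
print(lexical_features(tokens, function_words={"το", "και"}))
```

In practice the function-word list would come from the part-of-speech annotation or from a word list supplied in the configuration, rather than being hard-coded as here.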
The defined features were computed with specialized software, the ILSP FeatExt tool, developed in Python. The input of FeatExt is any corpus of Greek texts automatically annotated for part of speech, syntactic dependencies and phrase structure. It calculates the values of raw surface features (frequencies of words, sentences, nouns, verbs, etc.) and computes their standardized values (i.e. meaningful ratios). In order to cater for zero values, a MinMaxScaler transformation is applied to all raw features. The output is a table of extracted feature values, preferably in CSV format. Settings can be modified through an optional configuration file to define, among others, the set of features to be computed, the corpus location, or additional feature-relevant data, such as a list of words to be counted (e.g. functional words, basic vocabulary for a specific proficiency level or topic, etc.).
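The scaling and export steps described above can be sketched with the standard library alone. The paper does not show FeatExt's code, so this is only an illustration: it assumes that "MinMaxScaler" refers to the usual min-max rescaling of each feature to the range [0, 1], and the feature names, text identifiers and CSV layout are hypothetical.

```python
import csv
import io


def min_max_scale(column):
    """Rescale a list of numbers to the [0, 1] range."""
    lo, hi = min(column), max(column)
    if hi == lo:                      # constant feature: map every value to 0.0
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


def to_csv(text_ids, features):
    """Write one row per text and one column per feature; return the CSV text."""
    names = sorted(features)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["text_id"] + names)
    for i, text_id in enumerate(text_ids):
        writer.writerow([text_id] + [f"{features[n][i]:.3f}" for n in names])
    return buffer.getvalue()


# Hypothetical raw counts for three texts.
raw = {"n_words": [120, 300, 210], "n_passive_verbs": [0, 6, 3]}
scaled = {name: min_max_scale(values) for name, values in raw.items()}
print(to_csv(["t1", "t2", "t3"], scaled))
```

Min-max scaling keeps zero counts well defined (they map to 0.0 whenever the feature varies at all), which is one plausible reading of why it is applied to the raw features.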
3 Analysis and results
In order to investigate the underlying associations of text features with the proficiency level, correlation analysis was applied between all the extracted features and the grouped proficiency levels. Table 3 reports the twenty features that exhibited the highest absolute values of Spearman's rho correlation coefficient, in descending order (p < 0.05).
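For readers unfamiliar with the statistic, Spearman's rho is the Pearson correlation computed on ranks, with tied values sharing the mean of their ranks. The sketch below is a minimal pure-Python illustration; it does not compute p-values, and the helper names are illustrative. With grouped levels as one variable, ties are the norm, so the tie handling matters.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical feature values for six texts and their grouped levels.
feature = [3.1, 4.0, 5.2, 4.8, 7.5, 6.9]
levels = [1, 1, 2, 2, 3, 3]
print(round(spearman_rho(feature, levels), 3))
```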
Among the best-performing features, the average number of noun phrases in the genitive case per sentence was found to exhibit the highest correlation coefficient (rho = 0.542). The association of the genitive case with the text's level is also evidenced by the performance of two more features, i.e. the average number of adjectival phrases in the genitive case per sentence (rho = 0.473) and the average length of adjectival phrases in the genitive case (rho = 0.448). Looking at these results from a different angle, the influence of phrase structure, especially of the length and relative frequency of nominal phrases, is apparent: out of the 20 best-performing features, six are indices of phrase structure (features in ranks 1, 6, 8, 12, 15 and 16 in Table 3). The frequency of use of modifiers, namely of adjectives, also seems to be highly correlated with the proficiency level: the more adjectives are used in a text, the more likely it is that the text is addressed to higher-level learners. This is evidenced by the average number of adjectival phrases and of adjectives per sentence.
Another important finding is highlighted by the performance of features that attempt to quantify syntactic dependencies. These include the width and height of dependency trees (rho = 0.495 and 0.486 respectively), as well as the number of leaves and governor nodes (rho = 0.490 and 0.485 respectively). Their emergence in the top ranks of Table 3 qualifies them as key predictors of the proficiency level.
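The dependency-tree measures mentioned above can be illustrated on a sentence encoded as a list of head indices. The paper does not define these measures formally, so the definitions in this sketch are assumptions: height is the longest root-to-leaf path in arcs, width is the largest number of nodes found at the same depth, leaves are tokens without dependents, and governor nodes are tokens with at least one dependent. The input encoding (1-based heads, 0 for the root) is likewise chosen for illustration.

```python
from collections import Counter


def tree_measures(heads):
    """heads[i] is the 1-based index of the governor of token i+1; 0 marks the root."""
    n = len(heads)
    depth = {}

    def node_depth(node):
        # Number of arcs between the node and the root, cached per node.
        if node not in depth:
            parent = heads[node - 1]
            depth[node] = 0 if parent == 0 else node_depth(parent) + 1
        return depth[node]

    for node in range(1, n + 1):
        node_depth(node)

    governors = {h for h in heads if h != 0}
    per_level = Counter(depth.values())
    return {
        "height": max(depth.values()),
        "width": max(per_level.values()),
        "leaves": n - len(governors),
        "governors": len(governors),
    }


# Hypothetical parse of "Το παιδί διαβάζει ένα βιβλίο":
# Το -> παιδί, παιδί -> διαβάζει, διαβάζει = root, ένα -> βιβλίο, βιβλίο -> διαβάζει
print(tree_measures([2, 3, 0, 5, 3]))
```

The sketch assumes a well-formed tree with a single root; per-text features such as the averages in Table 3 would then be means of these values over all sentences of a text.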
Table 3 | Top-20 features highly correlated with EDIAMME grouped levels and post hoc multiple comparisons between level pairs

Rank | Feature | Spearman's rho
1 | Av. # of noun phrases in gen. case per sentence | 0.542
2 | Av. width of dependency trees | 0.495
3 | Av. # of leaves in dependency trees | 0.490
4 | Av. height of dependency trees | 0.486
5 | Av. sentence length | 0.485
6 | Av. # of adjectival phrases per sentence | 0.485
7 | Av. # of governor nodes in dependency trees | 0.485
8 | Av. # of noun phrases per sentence | 0.480
9 | % of sentences with length > 20 words | 0.477
10 | Av. # of adjectives per sentence | 0.474
11 | Av. word length | 0.474
12 | Av. # of adjectival phrases in gen. case per sentence | 0.473
13 | % of sentences with length > 10 words | 0.470
14 | Terminal punctuation to total characters ratio | -0.461
15 | Av. length of adjectival phrases in gen. case | 0.448
16 | Av. # of adjectival phrases in acc. case per sentence | 0.446
17 | % of sentences with length > 30 words | 0.443
18 | Av. # of passive verbs per sentence | 0.442
19 | Relative pronouns to pronouns ratio | 0.439
20 | Av. # of prepositions per sentence | 0.438

[Three further columns, for the EDIAMME grouped level pairs 1 vs 2, 2 vs 3 and 1 vs 3, mark the pairs whose means differ significantly; the cell markers are not preserved in this copy.]
Different aspects of syntactic complexity are also highlighted by the average number of passive verbs and prepositions per sentence. As expected, passive constructions are rarely used in lower levels, while learners encounter them more and more frequently in textbooks as their reading skills develop. The same is true for prepositions, a feature which indicates that higher proficiency level texts employ more complex and compound sentences.
The statistically significant correlation exhibited by the ratio of relative pronouns to pronouns (rho = 0.439) signifies the role of anaphora. As anaphora resolution is considered a linguistically and cognitively demanding task during reading, anaphoric structures are rare in lower levels but significantly more frequent in upper levels. As a result, the use of relative pronouns can be considered a successful discriminator of proficiency levels.
The list of the best-performing features also includes some more "traditional" indices of text complexity, such as word and sentence length. The average sentence length appears in rank 5 in Table 3 (rho = 0.485), while related features that quantify sentence length from a different perspective are also present (the percentage of sentences with more than 10, 20 and 30 words). Additionally, the ratio of terminal punctuation to total characters should be interpreted as an inverse measure of sentence length. Regarding lexical features, it is noticeable that among the various features investigated (lexical diversity, density, etc.) only the average word length is present among the top performers (rho = 0.474).
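The sentence-length features discussed here, and the reason the punctuation ratio behaves as their inverse, can be sketched as follows. This is an illustration only: the function names are hypothetical, and the set of terminal punctuation marks (full stop, exclamation mark and the Greek question mark written as a semicolon) as well as the choice to count all characters, spaces included, are assumptions rather than the FeatExt definitions.

```python
def sentence_length_features(sentences, thresholds=(10, 20, 30)):
    """sentences: list of token lists, one list per sentence."""
    lengths = [len(s) for s in sentences]
    n = len(lengths)
    features = {"avg_sentence_length": sum(lengths) / n}
    for t in thresholds:
        longer = sum(1 for length in lengths if length > t)
        features[f"pct_sentences_longer_than_{t}"] = 100 * longer / n
    return features


def terminal_punctuation_ratio(text, marks=".!;"):
    """Share of characters that are sentence-final punctuation marks (assumed set)."""
    return sum(1 for ch in text if ch in marks) / len(text)


# Hypothetical text with four sentences of 5, 12, 25 and 31 tokens.
sentences = [["w"] * 5, ["w"] * 12, ["w"] * 25, ["w"] * 31]
print(sentence_length_features(sentences))

# Short sentences yield a higher ratio than one long sentence of similar size.
print(terminal_punctuation_ratio("Ab. Cd. Ef."))
print(terminal_punctuation_ratio("Ab cd ef gh."))
```

Because every sentence contributes roughly one terminal mark, a text made of longer sentences has fewer such marks per character, which is consistent with the negative coefficient reported for this feature in Table 3.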
A more thorough investigation of the above features employed one-way ANOVA for the comparison of means across levels, which resulted in statistically significant main effects for all 20 features. Since, however, this type of analysis cannot determine whether the mean values of a feature differ significantly between all possible level pairs, post hoc multiple comparisons (Bonferroni tests) were also applied. The results are presented in Table 3: statistically different means for each feature are indicated for each level combination separately. These comparisons indicate that all features can successfully discriminate group 3 (i.e. EDIAMME level 5, CEFR C2) from the lower levels (both from group 2 and from group 1). However, some of the features were not as successful in discriminating group 1 (i.e. EDIAMME levels 1 and 2, CEFR A1, A2) from group 2 (i.e. EDIAMME levels 3 and 4, CEFR B1-C1). The poor performers in discriminating group 1 from group 2 were all the features related to sentence length, with the exception of the proportion of sentences with more than 20 words. This implies that a group 1 text is unlikely to include lengthier sentences, thus suggesting a possible threshold for the transition from CEFR A2 to B1 level.
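The two-step procedure described above can be illustrated with a minimal sketch in plain Python. It is not the statistical package used in the study: the function names and the toy values are assumptions made for illustration, the sketch computes only the F statistic (a p-value would additionally require the F distribution), and the Bonferroni step is shown simply as the per-comparison significance level obtained by dividing alpha by the number of level pairs.

```python
from itertools import combinations


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


def bonferroni_threshold(alpha, n_groups):
    """Per-comparison significance level when all group pairs are compared."""
    n_pairs = len(list(combinations(range(n_groups), 2)))
    return alpha / n_pairs


# Hypothetical average sentence lengths for texts of the three grouped levels.
group1 = [6.0, 7.0, 8.0]
group2 = [9.0, 10.0, 11.0]
group3 = [15.0, 16.0, 17.0]

f_stat, df_b, df_w = one_way_anova_f([group1, group2, group3])
per_pair_alpha = bonferroni_threshold(0.05, 3)
```

With three grouped levels there are three level pairs, so under this reading of the Bonferroni correction each pairwise comparison would be tested at alpha/3.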
FEATURE EXTRACTIoN AND ANAlYSIS IN GREEK | 365
4 Conclusions and discussion
The current investigation highlighted a number of textual features automatically extracted from a morphologically and syntactically annotated Greek L2 corpus. With the aim of identifying indices of text difficulty that are directly associated with the proficiency level, we employed statistical analysis and put forward the best performing features. These can be regarded as potential predictors of the proficiency level of a previously unseen text in an automatic labelling/classification approach.
The results highlight the influence of syntactic features on the characterization of proficiency level: with the exception of average word length, the rest of the best performing features are directly or indirectly related to syntactic complexity. This finding is in line with previous research, where syntax-related features consistently appear in the best-performing prediction models (e.g. Pitler and Nenkova 2008; Schwarm and Ostendorf 2005; Callan and Eskenazi 2007; Kate et al. 2010; Kotani et al. 2008). The frequencies of the genitive case, of adjectives and of prepositions were additionally identified as successful discriminators. Surface features used in traditional readability formulas, such as sentence and word length, were found to be significantly correlated with proficiency levels. Similar recent research on Greek has also highlighted the influence of such surface features on proficiency level classification (Tzimokas and Tantos 2014). It is interesting to notice that some of the features put forward by Georgatou (2016) as the most informative, i.e. sentence length, passive verbs and adjectives, are confirmed by the current study as well, thus qualifying them as reliable indices of the difficulty level of Greek texts.
When the best performing features were tested for their discriminatory power between all possible level pairs, they proved to be highly discriminative of the upper proficiency level. This finding implies a significant shift in L2 reading skills during the transition from C1 to C2 level, and this shift can successfully be measured by the features investigated herein. On the contrary, the transition from A2 to B1 seems to go hand in hand with the acquisition of language skills not captured by the features that emerged from the current analysis.
It is true that the current investigation is subject to limitations imposed by the corpus at hand, which comprised texts drawn from the textbooks of a single publisher. As such, the findings may be influenced by the publisher's choices regarding the types and topics of texts and by the linguistic descriptors of proficiency levels the editor has adopted. To cater for this limitation, the work described herein is being continued and expanded in order to exploit a larger corpus of Greek L2 texts from different publishers. Proficiency level labelling for this expanded corpus does not rely exclusively on the publisher's labelling. Rather, three independent experts in Greek L2 teaching have judged each text to determine its CEFR proficiency level. The experts' judgements are treated as the dependent variable in a machine learning approach for the automatic labelling of previously unseen texts, which has already yielded significant results.
Reading comprehension is a key skill in L2 development, and reading is an integral part of L2 instruction and assessment. In this view, an automated approach to matching L2 learners to texts suitable for their proficiency level is expected to facilitate the selection of reading material both for learners and for teachers. It is at the same time an anticipated aid in assessment procedures, providing an objective measurement for estimating the level-appropriateness of items included in diagnostic, placement or achievement language tests.
References
Barzilay, Regina, and Mirella Lapata. 2008. “Modeling Local Coherence: An Entity-based Approach.” Computational Linguistics 34(1): 1–34.
Centre for the Greek Language. 2013. “Logismiko Anagnosimotitas.” Accessed March 1, 2017. httpwwwgreek-languagegrcertificationreadabi-lity
Council of Europe. 2001. Common European Framework of Reference for Languages: Learning, Teaching, Assessment (CEFR). wwwcoeintlang-CEFR
Damanakis, Michalis, ed. 2004. Theoritiko Plaisio kai Programmata Spoudon gia tin Ellinoglossi Ekpaideusi sti Diaspora. Rethymno: EDIAMME. httpwwwediammeedcuocgrdiaspora2indexphpid=23650010
DuBay, William H. 2006. The Classic Readability Studies. Costa Mesa, California: Impact Information.
EDIAMME. 2014. Epipeda Glossomatheias kai Ekpaideutiko Yliko. httpwwwediammeedcuocgrellinoglossiindexphpelekp-yliko-kepa
François, Thomas, and Cédrick Fairon. 2012. “An “AI readability” Formula for French as a Foreign Language.” In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 466–477. Association for Computational Linguistics.
François, Thomas, and Eleni Miltsakaki. 2012. “Do NLP and Machine Learning Improve Traditional Readability Formulas?” In Proceedings of the First Workshop on Predicting and Improving Text Readability for Target Reader Populations, 49–57. Montréal.
Georgatou, Spyridoula. 2016. “Approaching Readability Features in Greek School Books.” Master thesis, University of Tübingen.
Giagkou, Maria, Vicky Kantzou, Spyridoula Stamouli, and Maria Tzevelekou. 2015. “Discriminating CEFR Levels in Greek L2: A Corpus-based Study of Young Learners' Written Narratives.” Bergen Language and Linguistics Studies 6: 153–169.
Heilman, Michael, Kevyn Collins-Thompson, and Maxine Eskenazi. 2008. “An Analysis of Statistical Models and Features for Reading Difficulty Prediction.” In The Third Workshop on Innovative Use of NLP for Building Educational Applications: Proceedings of the Workshop, 71–79. ACL.
Callan, Jamie, and Maxine Eskenazi. 2007. “Combining Lexical and Grammatical Features to Improve Readability Measures for First and Second Language Texts.” In Proceedings of HLT-NAACL'07, 460–467. Association for Computational Linguistics.
Lu, Xiaofei. 2010. “Automatic Analysis of Syntactic Complexity in Second Language Writing.” International Journal of Corpus Linguistics 15(4): 474–496.
Lu, Xiaofei. 2011. “A Corpus-Based Evaluation of Syntactic Complexity Measures as Indices of College-Level ESL Writers' Language Development.” TESOL Quarterly 45(1): 36–62.
Ott, Niels, and Detmar Meurers. 2010. “Information Retrieval for Education: Making Search Engines Language Aware.” Themes in Science and Technology Education 3(1–2): 9–30.
Petersen, Sarah E., and Mari Ostendorf. 2009. “A Machine Learning Approach to Reading Level Assessment.” Computer Speech and Language 23(1): 89–106.
Pilán, Ildikó, Elena Volodina, and Richard Johansson. 2014. “Rule-Based and Machine Learning Approaches for Second Language Sentence-Level Readability.” In Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications, 174–184. Association for Computational Linguistics.
Pitler, Emily, and Ani Nenkova. 2008. “Revisiting Readability: A Unified Framework for Predicting Text Quality.” In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, 186–195. Honolulu: ACL.
Prokopidis, Prokopis, and Harris Papageorgiou. 2014. “Experiments for Dependency Parsing of Greek.” In Proceedings of the First Joint Workshop on Statistical Parsing of Morphologically Rich Languages and Syntactic Analysis of Non-Canonical Languages, 90–96. Dublin, Ireland.
Prokopidis, Prokopis, Byron Georgantopoulos, and Harris Papageorgiou. 2011. “A Suite of NLP Tools for Greek.” In The 10th International Conference of Greek Linguistics, 373–383. Komotini, Greece.
Schwarm, Sarah E., and Mari Ostendorf. 2005. “Reading Level Assessment Using Support Vector Machines and Statistical Language Models.” In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), 523–530. Ann Arbor, Michigan.
Tzimokas, Dimitrios, and Sotirios Tantos. 2014. “Logismiko Anagnosimotitas Ellinikon Keimenon: Ena Neo Ergaleio gia ti Didaskalia tis Ellinikis os Ksenis/Deuteris Glossas kai tin Pistopoiisi Ellinomatheias.” Paper presented at Diethnes Synedrio gia ti Didaskalia kai tin Pistopoiisi tis Ellinikis os Ksenis/Deuteris Glossas, Thessaloniki, October 25. httpspeakgreekgrelimagespdftzimwkaspdf
Vajjala, Sowmya, and Detmar Meurers. 2012. “On Improving the Accuracy of Readability Classification Using Insights from Second Language Acquisition.” In Proceedings of the Seventh Workshop on Building Educational Applications Using NLP, 163–173. Association for Computational Linguistics.
Vyatkina, Nina. 2012. “The Development of Second Language Writing Complexity in Groups and Individuals: A Longitudinal Learner Corpus Study.” The Modern Language Journal 96(4): 576–598.
The two volumes of the conference proceedings are the product of the work of the Editorial Committee, whose members were Thanasis Georgakopoulos, Theodossia-Soula Pavlidou, Miltos Pechlivanos, Artemis Alexiadou, Jannis Androutsopoulos, Alexis Kalokairinos, Stavros Skopeteas and Katerina Stathi.
Although at the conference the presentations were grouped by thematic strand, the papers are presented here in alphabetical order according to the Latin alphabet; the exception is the opening lectures, which are placed at the beginning of the first volume.
The Organizing Committee of ICGL12
FEATURE EXTRACTION AND ANALYSIS IN GREEK L2 TEXTS IN VIEW OF AUTOMATIC LABELING FOR PROFICIENCY LEVELS
Maria Giagkou¹, Giorgos Fragkakis, Dimitris Pappas¹ & Harris Papageorgiou¹
¹Institute for Language and Speech Processing, R.C. ATHENA
mgiagkou@ilsp.gr, fragakis@sch.gr, dpappas@ilsp.gr, xaris@ilsp.gr
Abstract
This paper investigates a set of linguistic features of texts addressed to learners of Greek as a second language (L2) and examines how these features relate to the proficiency level for which the texts are considered appropriate. The aim is to determine which features discriminate adequately between levels, so that they can be exploited in an approach to automatic classification by proficiency level. To this end, a corpus compiled from Greek L2 textbooks is used. The results highlight the significant influence of, among others, features that quantify the complexity of syntactic dependency trees, the genitive case and adjectival modifiers.
Keywords: L2 reading, text complexity, linguistic features, proficiency levels, automatic labelling
1 Introduction
The last two decades have seen increasing interest in modelling text difficulty, i.e. readability. Automatic readability estimation systems are intended to assess whether a text retrieved from a large collection, such as a repository or the web, is appropriate for a given group of readers according to their abilities in L1 or by taking into account the readers' special needs (e.g. learning difficulties). Readability estimation is particularly relevant for second language (L2) learners as well. From the L2 perspective, the aim is to automatically identify or retrieve a text given the proficiency level of the learner or group of learners.
To this end, recent studies attempt to grade L2 texts according to proficiency levels in order to facilitate reading in L2 or as an aid to the selection of assessment material (e.g. Centre for the Greek Language 2013; Tzimokas and Tantos 2014; François and Fairon 2012; Ott and Meurers 2010; Pilán et al. 2014; Vajjala and Meurers 2012). In a similar approach, the development of productive skills in L2 (mainly writing) is investigated in view of an automated evaluation of L2 writing (e.g. Lu 2010, 2011; Vyatkina 2012; Giagkou et al. 2015).
The long tradition of L1 readability assessment, dating back to the early 20th century (see DuBay 2006), has bequeathed readability formulas (e.g. Flesch Reading Ease Score, Flesch-Kincaid Grade Level, Fog index, SMOG etc.) that assign a difficulty grade or level to a text by relying on surface linguistic features, such as sentence and word length, as simple proxies for syntactic complexity and vocabulary burden respectively. More recently, advances in NLP have boosted readability research. That is, new resources (electronically available texts) and new tools (taggers, parsers, semantic treebanks etc.) have made it feasible to apply machine learning techniques to large training corpora and to quantify more thorough and linguistically sound text features. Semantic and discourse features are investigated, e.g. named entities (Barzilay & Lapata 2008) and lexical cohesion (Pitler & Nenkova 2008). Shallow syntactic complexity indicators, such as average sentence length, are combined with the height of syntactic trees (see also Heilman et al. 2008). Instead of simple proxies of vocabulary burden, N-gram Language Models (LM) are used for predicting the grade level of texts (Callan and Eskenazi 2007; Petersen & Ostendorf 2009; Schwarm and Ostendorf 2005).
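To make concrete how such formulas rely on sentence and word length alone, the following minimal sketch implements the standard published forms of two of the formulas named above. The constants were calibrated for English and are shown only for illustration; nothing here is calibrated for Greek or used in the present study, and the function names and the example counts are assumptions.

```python
def flesch_reading_ease(n_words, n_sentences, n_syllables):
    """Flesch Reading Ease: higher scores indicate easier text."""
    avg_sentence_length = n_words / n_sentences
    avg_syllables_per_word = n_syllables / n_words
    return 206.835 - 1.015 * avg_sentence_length - 84.6 * avg_syllables_per_word


def flesch_kincaid_grade(n_words, n_sentences, n_syllables):
    """Flesch-Kincaid Grade Level: an estimated (US) school grade."""
    avg_sentence_length = n_words / n_sentences
    avg_syllables_per_word = n_syllables / n_words
    return 0.39 * avg_sentence_length + 11.8 * avg_syllables_per_word - 15.59


# Hypothetical counts for a short text: 100 words, 5 sentences, 150 syllables.
ease = flesch_reading_ease(100, 5, 150)
grade = flesch_kincaid_grade(100, 5, 150)
```

Both functions take only three surface counts as input, which is exactly why such formulas are described above as simple proxies for syntactic complexity and vocabulary burden.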
In this paper we present an investigation of linguistic features of texts addressed to learners of Greek as a second language (L2). The goal of this study is to identify the textual properties that indicate the development of reading skills in Greek L2, with the aim of employing these properties as parameters for automatic proficiency level labelling. The set of features investigated in the current study draws on traditional readability research combined with NLP-enabled features and machine learning techniques for text classification, as this merging was found to result in performance gains (François & Miltsakaki 2012).
The paper is organized as follows. Section 2 provides information on the corpus used and on the features identified, selected and computed in order to form the dataset for the analysis. In Section 3, the analysis applied to the features is presented and the results are discussed. We conclude with a summary of the main findings and their implications for future work in view of automatic proficiency level classification for Greek L2.
2 Datasets
2.1 Corpus
For the purposes of this investigation, a Greek L2 text set that is labelled for proficiency levels in an objective and qualified way, and can thus be considered a gold standard, was deemed necessary. Such a dataset was retrieved from the Greek L2 textbooks published by the Centre of Intercultural and Migration Studies (EDIAMME) and freely available online. These textbooks are addressed to Greek migrants living abroad, from pre-schoolers (aged 6) to 18-year-olds, learning Greek as a second or foreign language. EDIAMME employs five proficiency levels aligned to the grades of the Greek educational system and to CEFR levels (Council of Europe 2001), as presented in Table 1.
Age | School grade | EDIAMME level | Language content | CEFR level alignment
6-8 | Preschool, 1, 2 | 1 | Pre-reading, reading | A1
9-10 | 3, 4 | 2 | Speaking and writing consolidation | A2
11-12 | 5, 6 | 3 | Further practice in speaking and writing | B1
13-15 | 7, 8, 9 | 4 | Independent writing | B2 & C1
Table 1 | EDIAMME proficiency levels (Damanakis 2004: 76) and their alignment to CEFR levels (EDIAMME 2014)
Only prose texts were extracted from the textbooks, while poems, lyrics, exercises and guidelines to the exercises were excluded. The selected texts belong to different genres (mainly narrative, descriptive, expository and procedural) and types (letters, announcements, instructions, diary entries etc.). Dialogues were also included, as they are very frequently used as educational material in L2 textbooks, though the role/name of the speaker was removed.
The final corpus employed in this investigation comprises 753 texts and a total of 112,169 tokens (Table 2). Each individual text inherited the proficiency level assigned to the textbook it was retrieved from, e.g. a text drawn from a textbook labelled as level 5 was considered as addressed to level 5 learners.¹
Grouped level | EDIAMME level | Texts | Sentences | Tokens
1 (CEFR A1-A2) | 1 | 24 | 136 | 720
1 (CEFR A1-A2) | 2 | 295 | 4,552 | 33,636
2 (CEFR B1-C1) | 3 | 108 | 1,263 | 8,780
2 (CEFR B1-C1) | 4 | 147 | 2,305 | 19,272
3 (CEFR C2) | 5 | 179 | 3,356 | 49,761
Totals | | 753 | 11,612 | 112,169
Table 2 | Corpus description
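The grouping of the five EDIAMME levels into three grouped levels can be expressed, and the totals of Table 2 cross-checked, with a minimal sketch such as the following. The per-level counts are copied from Table 2; the variable and function names are illustrative assumptions and are not part of any tool described in the paper.

```python
# Per-level counts copied from Table 2: (texts, sentences, tokens).
EDIAMME_COUNTS = {
    1: (24, 136, 720),
    2: (295, 4552, 33636),
    3: (108, 1263, 8780),
    4: (147, 2305, 19272),
    5: (179, 3356, 49761),
}

# Mapping of EDIAMME levels to the grouped levels used in the analysis.
LEVEL_TO_GROUP = {1: 1, 2: 1, 3: 2, 4: 2, 5: 3}


def counts_per_group(counts, level_to_group):
    """Sum the (texts, sentences, tokens) triples of the levels in each group."""
    totals = {}
    for level, row in counts.items():
        group = level_to_group[level]
        previous = totals.get(group, (0, 0, 0))
        totals[group] = tuple(p + r for p, r in zip(previous, row))
    return totals


group_totals = counts_per_group(EDIAMME_COUNTS, LEVEL_TO_GROUP)
corpus_totals = tuple(sum(column) for column in zip(*EDIAMME_COUNTS.values()))
```

Summing by group makes the imbalance of the corpus visible: group 3 consists of a single EDIAMME level but contributes the largest share of tokens.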
¹ It should be noted that this decision introduces a degree of “noise” into the data: although a low-level textbook is not expected to include a text addressed to higher levels, the reverse is not equally unlikely. E.g. certain texts retrieved from a level 5 textbook may actually address lower-level learners.
Table 1 (continued)
16-18 | 10, 11, 12 | 5 | Greek language and literature | C2
The texts were automatically annotated for morphological types, syntactic dependencies and phrase structure using the Institute for Language and Speech Processing NLP tools pipeline (Prokopidis et al. 2011; Prokopidis and Papageorgiou 2014).
2.2 Feature selection and computation
The set of features investigated as indices of the proficiency level was selected on the basis of previous research on L1 and L2 readability assessment, as well as on second language acquisition and development. These features capture morphological, syntactic, lexical/semantic and other attributes of the text that are salient to the target proficiency level discrimination and prediction task.
In total, 303 text features were identified and computed. These fall roughly into the following categories:
a) Surface features: word and sentence length (e.g. average word length), number of characters, punctuation marks, numbers etc.
b) Lexical/semantic: lexical density (i.e. content to function words), lexical variation (e.g. type/token ratio, hapax/dis legomena), including noun and verb variation measures, text entropy, lexical richness etc.
c) Morphological: frequencies and ratios of the different parts of speech, including their forms, e.g. ratio of passive verbs to verbs, ratio of nouns in the genitive case to nouns, ratio of 1st person personal pronouns to pronouns etc.
d) Syntactic: frequencies and ratios of the different syntactic roles (e.g. subjects to verbs ratio), measures of the dependency trees (e.g. depth and height of syntactic trees), phrase structure (e.g. length of noun, verb and adjectival phrases), subordination and apposition (e.g. average number of coordinating and subordinating conjunctions per sentence) etc.
e) Discourse-based features: e.g. use of relative pronouns as an index of the degree of anaphora density, frequency of present and past tenses as indices of temporality and narrativity etc.
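A few of the surface and lexical features listed above can be sketched as follows. This is an illustration only, not the FeatExt implementation: the function name, the transliterated toy tokens and the two-item function-word list are assumptions, and lexical density is computed here as the ratio of content to function words, following the description in (b).

```python
def lexical_features(sentences, function_words):
    """Compute a few surface/lexical features from tokenized sentences.

    sentences: list of sentences, each a list of word tokens (no punctuation).
    function_words: set of lower-cased function words.
    """
    tokens = [token.lower() for sentence in sentences for token in sentence]
    content = [token for token in tokens if token not in function_words]
    n_function = len(tokens) - len(content)
    return {
        "avg_sentence_length": len(tokens) / len(sentences),
        "avg_word_length": sum(len(token) for token in tokens) / len(tokens),
        "type_token_ratio": len(set(tokens)) / len(tokens),
        # Undefined when a text has no function words; reported as infinity here.
        "content_to_function_ratio": (
            len(content) / n_function if n_function else float("inf")
        ),
    }


# Toy input: two transliterated Greek sentences and a tiny, purely
# illustrative function-word list.
sentences = [["to", "paidi", "diavazei", "to", "vivlio"], ["i", "gata", "koimatai"]]
function_words = {"to", "i"}
features = lexical_features(sentences, function_words)
```

In a real setting the function-word list would come from the part-of-speech annotation rather than from a hand-written set.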
The defined features were computed with specialized software, the ILSP FeatExt tool, developed in Python. The input of FeatExt is any corpus of Greek texts automatically annotated for part of speech, syntactic dependencies and phrase structure. It calculates the values of raw surface features (frequencies of words, sentences, nouns, verbs etc.) and computes their standardized values (i.e. meaningful ratios). In order to cater for zero values, a MinMaxScaler transformation is applied to all raw features. The output is a table of extracted feature values, preferably in CSV format. Settings can be modified through an optional configuration file to define, among others, the set of features to be computed, the corpus location, or additional feature-relevant data such as a list of words to be counted (e.g. function words, basic vocabulary for a specific proficiency level or topic etc.).
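The min-max transformation mentioned above rescales each raw feature to a fixed range. The following is a minimal re-implementation of the usual min-max formula in plain Python, given only as an illustration; it is not the FeatExt code, the function name and the example counts are assumptions, and the handling of constant features (mapped to the lower bound) is a choice made for this sketch.

```python
def min_max_scale(values, feature_range=(0.0, 1.0)):
    """Rescale values linearly so that min -> lower bound and max -> upper bound."""
    lower, upper = feature_range
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    if span == 0:
        # A constant feature carries no information; map it to the lower bound.
        return [lower for _ in values]
    return [lower + (v - v_min) / span * (upper - lower) for v in values]


# Hypothetical raw counts of one feature across four texts.
raw_counts = [0, 5, 10, 20]
scaled = min_max_scale(raw_counts)
```

After scaling, features with very different raw magnitudes (e.g. token counts and phrase counts) share a common range, while zero counts remain valid values at the lower bound.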
3 Analysis and results
In order to investigate the underlying associations of text features with the proficiency level, correlation analysis was applied between all the extracted features and the grouped proficiency levels. Table 3 reports the twenty features that exhibited the highest absolute values of Spearman's rho correlation coefficient, in descending order (p < 0.05).
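As an illustration of this ranking step, the sketch below computes Spearman's rho as the Pearson correlation of average ranks and sorts features by absolute rho. The feature names and values are invented for the example, and the significance test reported in the paper (p < 0.05) is not reproduced here.

```python
def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: grouped level of six texts and two feature columns.
levels = [1, 1, 2, 2, 3, 3]
feature_table = {
    "avg_sentence_length": [6.0, 7.5, 9.0, 8.0, 14.0, 12.0],
    "terminal_punct_ratio": [0.040, 0.035, 0.030, 0.032, 0.020, 0.022],
}

# Rank features by the absolute value of rho, highest first.
ranked = sorted(
    ((name, spearman_rho(vals, levels)) for name, vals in feature_table.items()),
    key=lambda item: abs(item[1]),
    reverse=True,
)
```

Sorting on the absolute value keeps strongly negative correlates, such as a punctuation ratio that falls as sentences grow longer, alongside the positive ones.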
Among the best performing features, the average number of noun phrases in the genitive case per sentence was found to exhibit the highest correlation coefficient (rho = 0.542). The association of the genitive case with the text's level is also evidenced by the performance of two more features, i.e. the average number of adjectival phrases in the genitive case per sentence (rho = 0.473) and the average length of adjectival phrases in the genitive case (rho = 0.448). Looking at these results from a complementary angle, the influence of phrase structure, especially of the length and relative frequency of nominal phrases, is apparent: out of the 20 best performing features, six are indices of phrase structure (features in ranks 1, 6, 8, 12, 15 and 16 in Table 3). The frequency of use of modifiers, namely of adjectives, also seems to be highly correlated with the proficiency level: the more adjectives used in a text, the more likely it is that the text is addressed to higher level learners. This is evidenced by the average number of adjectival phrases and of adjectives per sentence.
Another important finding is highlighted by the performance of features that attempt to quantify syntactic dependencies. These include the width and height of dependency trees (rho = 0.495 and 0.486, respectively), as well as the number of leaves and governor nodes (rho = 0.490 and 0.485, respectively). Their emergence in the top ranks of Table 3 qualifies them as key predictors of the proficiency level.
FEATURE EXTRACTIoN AND ANAlYSIS IN GREEK | 363
Table 3 | Top-20 features highly correlated with EDIAMME grouped levels, and post hoc multiple comparisons between level-pairs

                                                                                EDIAMME grouped level-pairs
 #   Feature                                                   Spearman's rho   1 vs 2   2 vs 3   1 vs 3
 1   Av. no. of noun phrases in gen. case per sentence              0.542
 2   Av. width of dependency trees                                  0.495
 3   Av. no. of leaves in dependency trees                          0.490
 4   Av. height of dependency trees                                 0.486
 5   Av. sentence length                                            0.485
 6   Av. no. of adjectival phrases per sentence                     0.485
 7   Av. no. of governor nodes in dependency trees                  0.485
 8   Av. no. of noun phrases per sentence                           0.480
 9   % of sentences with length > 20 words                          0.477
10   Av. no. of adjectives per sentence                             0.474
11   Av. word length                                                0.474
12   Av. no. of adjectival phrases in gen. case per sentence        0.473
13   % of sentences with length > 10 words                          0.470
14   Terminal punctuation to total characters ratio                -0.461
15   Av. length of adjectival phrases in gen. case                  0.448
16   Av. no. of adjectival phrases in acc. case per sentence        0.446
17   % of sentences with length > 30 words                          0.443
18   Av. no. of passive verbs per sentence                          0.442
19   Relative pronouns to pronouns ratio                            0.439
20   Av. no. of prepositions per sentence                           0.438
Different aspects of syntactic complexity are also highlighted by the average number of passive verbs and prepositions per sentence. As expected, passive constructions are rarely used in lower levels, while learners encounter them more and more frequently in textbooks as their reading skills develop. The same is true for prepositions, a feature indicating that higher proficiency level texts employ more complex-compound sentences.
The statistically significant correlation exhibited by the ratio of relative pronouns to pronouns (rho = 0.439) signifies the role of anaphora. As anaphora resolution is considered a linguistically and cognitively demanding task during reading, anaphoric structures are rare in lower levels but significantly more frequent in upper levels. As a result, the use of relative pronouns can be considered a successful discriminator of proficiency levels.
The list of the best performing features also includes some more "traditional" indices of text complexity, such as word and sentence length. The average sentence length appears in rank 5 in Table 3 (rho = 0.485), while related features that quantify sentence length from a different perspective are also present (the percentage of sentences with more than 10, 20 and 30 words). Additionally, the presence of the ratio of terminal punctuation to total characters should also be interpreted as an inverse of sentence length. Regarding lexical features, it is noticeable that among the various features investigated (lexical diversity, density, etc.) only the average word length is present among the top performers (rho = 0.474).
A more thorough investigation of the above features employed one-way ANOVA for comparison of means across levels, which resulted in statistically significant main effects for all 20 features. Since, however, this type of analysis cannot determine whether the mean values of a feature are statistically different between all possible level pairs, post hoc multiple comparisons (Bonferroni tests) were also applied. The results are presented in Table 3; statistically different means for each feature are indicated for each level combination separately. These comparisons indicate that all features can successfully discriminate group 3 (i.e. EDIAMME level 5, CEFR C2) from lower levels (both from group 2 and group 1). However, some of the features were not as successful in discriminating group 1 (i.e. EDIAMME levels 1 and 2, CEFR A1, A2) from group 2 (i.e. EDIAMME levels 3 and 4, CEFR B1-C1). Poor performers in discriminating group 1 from group 2 were all the features relevant to sentence length, with the exception of the proportion of sentences with more than 20 words. This implies that a group 1 text is unlikely to include lengthier sentences, thus suggesting a possible threshold for the transition from CEFR A2 to B1 level.
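The logic of this two-stage analysis can be sketched in a few lines. The code below computes only the one-way ANOVA F statistic and the Bonferroni-adjusted significance threshold for the three level pairs; obtaining p-values would additionally require the F and t distributions from a statistics package. The data and function names are hypothetical and serve illustration only.

```python
from itertools import combinations


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )
    # Variation of the observations around their own group mean.
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison significance threshold under Bonferroni correction."""
    return alpha / n_comparisons


# Hypothetical values of one feature in the three grouped levels.
by_group = {1: [1.0, 2.0, 3.0], 2: [2.0, 3.0, 4.0], 3: [5.0, 6.0, 7.0]}

f_stat, df_b, df_w = one_way_anova_f(list(by_group.values()))

# The three pairwise comparisons: 1 vs 2, 1 vs 3, 2 vs 3.
pairs = list(combinations(sorted(by_group), 2))
per_test_alpha = bonferroni_alpha(0.05, len(pairs))
```

A significant F only says that at least one group mean differs; the pairwise tests at the stricter per-comparison threshold are what show which level pairs a feature actually separates.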
4 Conclusions and discussion
The current investigation highlighted a number of textual features automatically extracted from a morphologically and syntactically annotated Greek L2 corpus. With the aim of identifying indices of text difficulty that are directly associated with the proficiency level, we employed statistical analysis and put forward the best performing features. These can be regarded as potential predictors of the proficiency level of a previously unseen text in an automatic labelling/classification approach.
The results highlight the influence of syntactic features on the characterization of proficiency level: with the exception of average word length, the rest of the best performing features are directly or indirectly related to syntactic complexity. This finding is in line with previous research, where syntax-related features consistently appear in the best-performing prediction models (e.g. Pitler and Nenkova 2008; Schwarm and Ostendorf 2005; Callan and Eskenazi 2007; Kate et al. 2010; Kotani et al. 2008). The frequencies of the genitive case, of adjectives and of prepositions were additionally identified as successful discriminators. Surface features used in traditional readability formulas, such as sentence and word length, were found to be significantly correlated with proficiency levels. Similar recent research in Greek has also highlighted the influence of such surface features on proficiency level classification (Tzimokas and Tantos 2014). It is interesting to note that some of the features put forward by Georgatou (2016) as the most informative, i.e. sentence length, passive verbs and adjectives, are confirmed by the current study as well, thus qualifying them as reliable indices of the difficulty level of Greek texts.
When the best performing features were tested for their discriminatory power between all possible level pairs, they proved to be highly discriminative of the upper proficiency level. This finding implies a significant shift in L2 reading skills during the transition from C1 to C2 level, and this shift can successfully be measured by the features investigated herein. On the contrary, the transition from A2 to B1 seems to go hand in hand with the acquisition of language skills not captured by the features that emerged from the current analysis.
It is true that the current investigation is subject to limitations imposed by the corpus at hand, which comprised texts drawn from textbooks of a single publisher. As such, the findings may be influenced by the publisher's choices regarding the types and topics of texts and the linguistic descriptors of proficiency levels the editor has adopted. To cater for this limitation, the work described herein is being continued and expanded in order to exploit a larger corpus of Greek L2 texts from different publishers. Proficiency level labelling for this expanded corpus does not rely exclusively on the publisher's labelling. Rather, three independent experts in Greek L2 teaching have judged each text to determine the CEFR proficiency level. The experts' judgements are treated as the dependent variable in a machine learning approach for the automatic labelling of previously unseen texts, which has already yielded significant results.
Reading comprehension is a key skill in L2 development, and reading is an integral part of L2 instruction and assessment. In this view, an automated approach to matching L2 learners to texts suitable for their proficiency level is expected to facilitate selection of reading material both for learners and teachers. It is at the same time an anticipated aid in assessment procedures, by providing an objective measurement for the estimation of level-appropriateness of items included in diagnostic, placement or achievement language tests.
References

Barzilay, Regina, and Mirella Lapata. 2008. "Modeling Local Coherence: An Entity-based Approach." Computational Linguistics 34(1):1–34.

Centre for the Greek Language. 2013. "Logismiko Anagnosimotitas." Accessed March 1, 2017. http://www.greek-language.gr/certification/readability

Council of Europe. 2001. Common European Framework of Reference for Languages: Learning, Teaching, Assessment (CEFR). www.coe.int/lang-CEFR

Damanakis, Michalis, ed. 2004. Theoritiko Plaisio kai Programmata Spoudon gia tin Ellinoglossi Ekpaideusi sti Diaspora. Rethymno: EDIAMME. http://www.ediamme.edc.uoc.gr/diaspora2/index.php?id=23650010

DuBay, William H. 2006. The Classic Readability Studies. Costa Mesa, California: Impact Information.

EDIAMME. 2014. Epipeda Glossomatheias kai Ekpaideutiko Yliko. http://www.ediamme.edc.uoc.gr/ellinoglossi/index.php/el/ekp-yliko-kepa

François, Thomas, and Cédrick Fairon. 2012. "An 'AI Readability' Formula for French as a Foreign Language." In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 466–477. Association for Computational Linguistics.

François, Thomas, and Eleni Miltsakaki. 2012. "Do NLP and Machine Learning Improve Traditional Readability Formulas?" In Proceedings of the First Workshop on Predicting and Improving Text Readability for Target Reader Populations, 49–57. Montréal.

Georgatou, Spyridoula. 2016. "Approaching Readability Features in Greek School Books." Master thesis, University of Tübingen.

Giagkou, Maria, Vicky Kantzou, Spyridoula Stamouli, and Maria Tzevelekou. 2015. "Discriminating CEFR Levels in Greek L2: A Corpus-based Study of Young Learners' Written Narratives." Bergen Language and Linguistics Studies 6:153–169.

Heilman, Michael, Kevyn Collins-Thompson, and Maxine Eskenazi. 2008. "An Analysis of Statistical Models and Features for Reading Difficulty Prediction." In The Third Workshop on Innovative Use of NLP for Building Educational Applications: Proceedings of the Workshop, 71–79. ACL.

Callan, Jamie, and Maxine Eskenazi. 2007. "Combining Lexical and Grammatical Features to Improve Readability Measures for First and Second Language Texts." In Proceedings of HLT-NAACL'07, 460–467. Association for Computational Linguistics.

Lu, Xiaofei. 2010. "Automatic Analysis of Syntactic Complexity in Second Language Writing." International Journal of Corpus Linguistics 15(4):474–496.

Lu, Xiaofei. 2011. "A Corpus-Based Evaluation of Syntactic Complexity Measures as Indices of College-Level ESL Writers' Language Development." TESOL Quarterly 45(1):36–62.

Ott, Niels, and Detmar Meurers. 2010. "Information Retrieval for Education: Making Search Engines Language Aware." Themes in Science and Technology Education 3(1–2):9–30.

Petersen, Sarah E., and Mari Ostendorf. 2009. "A Machine Learning Approach to Reading Level Assessment." Computer Speech and Language 23(1):89–106.

Pilán, Ildikó, Elena Volodina, and Richard Johansson. 2014. "Rule-Based and Machine Learning Approaches for Second Language Sentence-Level Readability." In Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications, 174–184. Association for Computational Linguistics.

Pitler, Emily, and Ani Nenkova. 2008. "Revisiting Readability: A Unified Framework for Predicting Text Quality." In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, 186–195. Honolulu: ACL.

Prokopidis, Prokopis, and Harris Papageorgiou. 2014. "Experiments for Dependency Parsing of Greek." In Proceedings of the First Joint Workshop on Statistical Parsing of Morphologically Rich Languages and Syntactic Analysis of Non-Canonical Languages, 90–96. Dublin, Ireland.

Prokopidis, Prokopis, Byron Georgantopoulos, and Harris Papageorgiou. 2011. "A Suite of NLP Tools for Greek." In The 10th International Conference of Greek Linguistics, 373–383. Komotini, Greece.

Schwarm, Sarah E., and Mari Ostendorf. 2005. "Reading Level Assessment Using Support Vector Machines and Statistical Language Models." In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), 523–530. Ann Arbor, Michigan.

Tzimokas, Dimitrios, and Sotirios Tantos. 2014. "Logismiko Anagnosimotitas Ellinikon Keimenon: Ena Neo Ergaleio gia ti Didaskalia tis Ellinikis os Ksenis/Deuteris Glossas kai tin Pistopoiisi Ellinomatheias." Paper presented at Diethnes Synedrio gia ti Didaskalia kai tin Pistopoiisi tis Ellinikis os Ksenis/Deuteris Glossas, Thessaloniki, October 25. http://speakgreek.gr/el/images/pdf/tzimwkas.pdf

Vajjala, Sowmya, and Detmar Meurers. 2012. "On Improving the Accuracy of Readability Classification Using Insights from Second Language Acquisition." In Proceedings of the Seventh Workshop on Building Educational Applications Using NLP, 163–173. Association for Computational Linguistics.

Vyatkina, Nina. 2012. "The Development of Second Language Writing Complexity in Groups and Individuals: A Longitudinal Learner Corpus Study." The Modern Language Journal 96(4):576–598.
The two volumes of the conference proceedings are the product of the work of the Editorial Committee, whose members were Thanasis Georgakopoulos, Theodossia-Soula Pavlidou, Miltos Pechlivanos, Artemis Alexiadou, Jannis Androutsopoulos, Alexis Kalokairinos, Stavros Skopeteas and Katerina Stathi.
Although at the conference the papers had been grouped according to thematic areas, the texts of the papers are presented here in alphabetical order according to the Latin alphabet; the exception is the opening talks, which are placed at the beginning of the first volume.
The Organizing Committee of ICGL12
FEATURE EXTRACTION AND ANALYSIS IN GREEK L2 TEXTS IN VIEW OF AUTOMATIC LABELING FOR PROFICIENCY LEVELS
Maria Giagkou¹, Giorgos Fragkakis, Dimitris Pappas¹ & Harris Papageorgiou¹
¹Institute for Language and Speech Processing, RC ATHENA
mgiagkou@ilsp.gr, fragakis@sch.gr, dpappas@ilsp.gr, xaris@ilsp.gr
Abstract

The article investigates a set of linguistic features of texts addressed to learners of Greek as an L2 and examines the relationship between these features and the proficiency level for which the texts are considered appropriate. The aim is to determine which features show sufficient discriminatory power between levels, so that they can be exploited in an approach to automatic classification into proficiency levels. To this end, a corpus compiled from Greek L2 textbooks is used. The results highlight the significant effect of, among others, features that quantify the complexity of syntactic dependency trees, the genitive case and adjectival modifiers.
Keywords: L2 reading, text complexity, linguistic features, proficiency levels, automatic labelling
1 Introduction
The last two decades have seen increasing interest in modelling text difficulty, i.e. readability. Automatic readability estimation systems are intended to assess whether a text retrieved from a large collection, such as a repository or the web, is appropriate for a given group of readers according to their abilities in L1 or by taking into account the readers' special needs (e.g. learning difficulties). Readability estimation is particularly relevant for second language (L2) learners as well. From the L2 perspective, the aim is to automatically identify or retrieve a text given the proficiency level of the learner or group of learners.
To this end, recent studies attempt to grade L2 texts according to proficiency levels, in order to facilitate reading in L2 or as an aid to the selection of assessment material (e.g. Centre for the Greek Language 2013; Tzimokas and Tantos 2014; François and Fairon 2012; Ott and Meurers 2010; Pilán et al. 2014; Vajjala and Meurers 2012). In a similar approach, the development of productive skills in L2 (mainly writing) is investigated in view of an automated evaluation of L2 writing (e.g. Lu 2010, 2011; Vyatkina 2012; Giagkou et al. 2015).
The long tradition of L1 readability assessment, dating back to the early 20th century (see DuBay 2006), has bequeathed readability formulas (e.g. Flesch Reading Ease Score, Flesch-Kincaid Grade Level, Fog index, SMOG, etc.) that assign a difficulty grade or level to a text by relying on surface linguistic features, such as sentence and word length, as simple proxies for syntactic complexity and vocabulary burden, respectively. More recently, advances in NLP have boosted readability research. That is, new resources (electronically available texts) and new tools (taggers, parsers, semantic treebanks, etc.) have made it feasible to apply machine learning techniques to large training corpora and to quantify more thorough and linguistically sound text features. Semantic and discourse features are investigated, e.g. named entities (Barzilay & Lapata 2008) and lexical cohesion (Pitler & Nenkova 2008). Shallow syntactic complexity indicators, such as average sentence length, are combined with the height of syntactic trees (see also Heilman et al. 2008). Instead of simple proxies of vocabulary burden, N-gram Language Models (LM) are used for predicting the grade level of texts (Callan and Eskenazi 2007; Petersen & Ostendorf 2009; Schwarm and Ostendorf 2005).
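To illustrate how such traditional formulas rely only on surface counts, the sketch below implements two of them with their standard published coefficients for English. Both were calibrated on English and are shown here purely as background; the function names and the example counts are invented for illustration and play no role in the study itself.

```python
def flesch_reading_ease(n_words, n_sentences, n_syllables):
    """Flesch Reading Ease: higher scores indicate easier text."""
    return (
        206.835
        - 1.015 * (n_words / n_sentences)
        - 84.6 * (n_syllables / n_words)
    )


def flesch_kincaid_grade(n_words, n_sentences, n_syllables):
    """Flesch-Kincaid Grade Level: an estimated US school grade."""
    return (
        0.39 * (n_words / n_sentences)
        + 11.8 * (n_syllables / n_words)
        - 15.59
    )


# Hypothetical text: 100 words, 5 sentences, 150 syllables.
ease = flesch_reading_ease(100, 5, 150)
grade = flesch_kincaid_grade(100, 5, 150)
```

Both formulas use average sentence length as the proxy for syntactic complexity and syllables per word as the proxy for vocabulary burden, which is exactly the simplification that the NLP-enabled features discussed here aim to move beyond.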
In this paper, we present an investigation of linguistic features of texts addressed to learners of Greek as a second language (L2). The goal of this study is to identify the textual properties that indicate the development of reading skills in Greek L2, with the aim of employing these properties as parameters for automatic proficiency level labelling. The set of features investigated in the current study draws on traditional readability research, combined with NLP-enabled features and machine learning techniques for text classification, as this merging was found to result in performance gain (François & Miltsakaki 2012).
The paper is organized as follows: Section 2 provides information on the corpus used and the features identified, selected and computed in order to form the dataset for the analysis. In Section 3, the analysis applied to the features is presented and the results are analyzed. We conclude with a summary of the main findings and their implications for the directions of future work in view of automatic proficiency level classification for Greek L2.
2 Datasets
2.1 Corpus
For the purposes of this investigation, a Greek L2 text set that is labelled for proficiency levels in an objective and qualified way, and can thus be considered a gold standard, was deemed necessary. Such a dataset was retrieved from the Greek L2 textbooks published by the Centre of Intercultural and Migration Studies (EDIAMME) and freely available online. These textbooks are addressed to Greek migrants living abroad, from preschoolers (aged 6) to 18-year-olds, learning Greek as a second or foreign language. EDIAMME employs five proficiency levels aligned to the Greek educational system grades and to CEFR levels (Council of Europe 2001), as presented in Table 1.
Age | School grade | EDIAMME level | Language content | CEFR level alignment
6–8 | Preschool, 1, 2 | 1 | Pre-reading, reading | A1
9–10 | 3, 4 | 2 | Speaking and writing consolidation | A2
11–12 | 5, 6 | 3 | Further practice in speaking and writing | B1
13–15 | 7, 8, 9 | 4 | Independent writing | B2 & C1
360 | GIAGKoU ET Al
Table 1 | EDIAMME proficiency levels (Damanakis 2004: 76) and their alignment to CEFR levels (EDIAMME 2014)
Only prose texts were extracted from the textbooks, while poems, lyrics, exercises and guidelines to the exercises were excluded. The selected texts belong to different genres (mainly narrative, descriptive, expository and procedural) and types (letters, announcements, instructions, diary entries, etc.). Dialogues were also included, as they are very frequently used as educational material in L2 textbooks, though the role/name of the speaker was removed.
The final corpus employed in this investigation comprises 753 texts and a total of 112,169 tokens (Table 2). Each individual text inherited the proficiency level assigned to the textbook it was retrieved from, e.g. a text drawn from a textbook labelled as level 5 was considered as addressed to level 5 learners.[1]
Grouped level | EDIAMME level | Texts | Sentences | Tokens
1 (CEFR A1-A2) | 1 | 24 | 136 | 720
1 (CEFR A1-A2) | 2 | 295 | 4552 | 33636
2 (CEFR B1-C1) | 3 | 108 | 1263 | 8780
2 (CEFR B1-C1) | 4 | 147 | 2305 | 19272
3 (CEFR C2) | 5 | 179 | 3356 | 49761
Totals | | 753 | 11612 | 112169
Table 2 | Corpus description
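The grouping shown in Table 2 can be written down as a simple lookup. The sketch below aggregates the per-level counts of Table 2 into the three grouped levels used in the analysis; the variable and function names are illustrative assumptions and are not taken from the study's software.

```python
# EDIAMME level -> grouped level, as in Table 2.
GROUPED_LEVEL = {1: 1, 2: 1, 3: 2, 4: 2, 5: 3}

# (texts, sentences, tokens) per EDIAMME level, copied from Table 2.
CORPUS = {
    1: (24, 136, 720),
    2: (295, 4552, 33636),
    3: (108, 1263, 8780),
    4: (147, 2305, 19272),
    5: (179, 3356, 49761),
}


def grouped_totals(corpus, grouping):
    """Sum the (texts, sentences, tokens) counts within each grouped level."""
    totals = {}
    for level, counts in corpus.items():
        group = grouping[level]
        previous = totals.get(group, (0, 0, 0))
        totals[group] = tuple(p + c for p, c in zip(previous, counts))
    return totals


totals = grouped_totals(CORPUS, GROUPED_LEVEL)
for group in sorted(totals):
    print(group, totals[group])
```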
[1] It should be noted that this decision imposes a degree of “noise” on the data: although a low-level textbook is not expected to include a text addressed to higher levels, the reverse is not equally unlikely. For example, certain texts retrieved from a level 5 textbook may actually address lower-level learners.
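Table 1 (continued) | Age 16–18 | School grade 10, 11, 12 | EDIAMME level 5 | Language content: Greek language and literature | CEFR level alignment: C2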
The texts were automatically annotated for morphological types, syntactic dependencies and phrase structure using the Institute for Language and Speech Processing NLP tools pipeline (Prokopidis et al. 2011; Prokopidis and Papageorgiou 2014).
2.2 Feature selection and computation
The set of features investigated as indices of the proficiency level was selected on the basis of previous research on L1 and L2 readability assessment, as well as on second language acquisition and development. These features capture morphological, syntactic, lexical/semantic and other attributes of the text that are salient to the target proficiency level discrimination and prediction task.
In total, 303 text features were identified and computed. These fall broadly into the following categories:
a) Surface features: word and sentence length (e.g. average word length), number of characters, punctuation marks, numbers, etc.
b) Lexical/semantic: lexical density (i.e. ratio of content to function words), lexical variation (e.g. type/token ratio, hapax/dis legomena), including noun and verb variation measures, text entropy, lexical richness, etc.
c) Morphological: frequencies and ratios of the different parts of speech, including their forms, e.g. ratio of passive verbs to verbs, ratio of nouns in the genitive case to nouns, ratio of 1st person personal pronouns to pronouns, etc.
d) Syntactic: frequencies and ratios of the different syntactic roles (e.g. subjects to verbs ratio), measures of the dependency trees (e.g. depth and height of syntactic trees), phrase structure (e.g. length of noun, verb and adjectival phrases), subordination and apposition (e.g. average number of coordinating and subordinating conjunctions per sentence), etc.
e) Discourse-based features: e.g. use of relative pronouns as an index of the degree of anaphora density, frequency of present and past tenses as indices of temporality and narrativity, etc.
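A few of the surface and lexical variation features listed above can be sketched as follows. This is a simplified illustration under stated assumptions: sentence splitting and tokenization are done with naive regular expressions rather than with the annotation pipeline used in the study, and the function and key names are hypothetical.

```python
import re
from collections import Counter


def surface_lexical_features(text: str) -> dict:
    """Compute a few surface and lexical variation features for one text."""
    # Naive sentence split on terminal punctuation; ';' is the Greek question mark.
    sentences = [s for s in re.split(r"[.!?;]+", text) if s.strip()]
    # \w+ matches Unicode letters, so Greek tokens are handled as well.
    tokens = re.findall(r"\w+", text.lower())
    if not sentences or not tokens:
        raise ValueError("text must contain at least one token")
    type_counts = Counter(tokens)
    return {
        "avg_sentence_length": len(tokens) / len(sentences),
        "avg_word_length": sum(len(t) for t in tokens) / len(tokens),
        "type_token_ratio": len(type_counts) / len(tokens),
        "hapax_legomena": sum(1 for c in type_counts.values() if c == 1),
        "dis_legomena": sum(1 for c in type_counts.values() if c == 2),
    }


# Placeholder English example; any plain-text input works the same way.
features = surface_lexical_features("The cat sleeps. The cat eats fish.")
print(features)
```

Morphological and syntactic features (categories c and d) additionally require part-of-speech and dependency annotation and are therefore not covered by this sketch.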
The defined features were computed with specialized software, the ILSP FeatExt tool, developed in Python. The input of FeatExt is any corpus of Greek texts automatically annotated for part of speech, syntactic dependencies and phrase structure. It calculates the values of raw surface features (frequencies of words, sentences, nouns, verbs, etc.) and computes their standardized values (i.e. meaningful ratios). In order to cater for zero values, a MinMaxScaler transformation is applied to all raw features. The output is a table of extracted feature values, preferably in CSV format. Settings can be modified through an optional configuration file to define, among others, the set of features to be computed, the corpus location, or additional feature-relevant data, such as a list of words to be counted (e.g. function words, basic vocabulary for a specific proficiency level or topic, etc.).
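The scaling and export steps can be sketched as follows. This is not the FeatExt implementation, which is not shown in the paper; it is a minimal pure-Python illustration that rescales each raw feature to the range [0, 1] (the default behaviour of scikit-learn's MinMaxScaler) and writes the result as CSV. All names and the example counts are hypothetical.

```python
import csv
import io


def min_max_scale(values):
    """Rescale a list of numbers linearly to the range [0, 1]."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # A constant feature carries no information; map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


def scaled_feature_table(raw):
    """raw maps text id -> {feature name: raw value}; returns CSV text."""
    text_ids = sorted(raw)
    feature_names = sorted({name for feats in raw.values() for name in feats})
    columns = {
        name: min_max_scale([raw[t][name] for t in text_ids])
        for name in feature_names
    }
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["text_id"] + feature_names)
    for row, text_id in enumerate(text_ids):
        writer.writerow([text_id] + [columns[name][row] for name in feature_names])
    return buffer.getvalue()


# Hypothetical raw counts for three texts.
raw_counts = {
    "t1": {"n_tokens": 10, "n_passive_verbs": 0},
    "t2": {"n_tokens": 20, "n_passive_verbs": 2},
    "t3": {"n_tokens": 30, "n_passive_verbs": 4},
}
table = scaled_feature_table(raw_counts)
print(table)
```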
3 Analysis and results
In order to investigate the underlying associations of text features with the proficiency level, correlation analysis was applied between all the extracted features and the grouped proficiency levels. Table 3 reports the twenty features that exhibited the highest absolute values of Spearman’s rho correlation coefficient, in descending order (p < 0.05).
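For reference, Spearman's rho is the Pearson correlation of the rank-transformed variables, with tied values receiving the average of their ranks. Ties are unavoidable here, since the grouped level takes only three values. The sketch below is a minimal pure-Python illustration with hypothetical data; it does not compute significance levels.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / (sd_x * sd_y)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical feature values (e.g. average sentence length) and grouped levels.
feature_values = [6.1, 7.4, 9.8, 8.3, 12.5, 14.0]
grouped_levels = [1, 1, 2, 2, 3, 3]
print(spearman_rho(feature_values, grouped_levels))
```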
Among the best performing features, the average number of noun phrases in the genitive case per sentence was found to exhibit the highest correlation coefficient (rho = 0.542). The association of the genitive case with the text’s level is also evidenced by the performance of two more features, i.e. the average number of adjectival phrases in the genitive case per sentence (rho = 0.473) and the average length of adjectival phrases in the genitive case (rho = 0.448). Looking at these results from a different angle, the influence of phrase structure, especially of the length and relative frequency of nominal phrases, is apparent: out of the 20 best performing features, six are indices of phrase structure (features in ranks 1, 6, 8, 12, 15 and 16 in Table 3). The frequency of use of modifiers, namely of adjectives, also seems to be highly correlated with the proficiency level: the more adjectives used in a text, the more likely it is that the text is addressed to higher level learners. This is evidenced by the average number of adjectival phrases and of adjectives per sentence.
Another important finding is highlighted by the performance of features that attempt to quantify syntactic dependencies. These include the width and height of dependency trees (rho = 0.495 and 0.486 respectively), as well as the number of leaves and governor nodes (rho = 0.490 and 0.485 respectively). Their emergence in the top ranks of Table 3 qualifies them as key predictors of the proficiency level.
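The paper does not spell out how these tree measures are defined, so the sketch below adopts common definitions as explicit assumptions: height is the largest number of arcs from the root to any token, width is the largest number of tokens found at the same depth, governors are tokens with at least one dependent, and leaves are tokens with none. The input format (a list of 1-based head indices with 0 for the root, as in CoNLL-style annotation) and the function name are likewise illustrative.

```python
from collections import Counter


def tree_measures(heads):
    """heads[i] is the 1-based index of the governor of token i+1; 0 marks the root.

    Assumes a well-formed tree (exactly one root, no cycles).
    """
    n = len(heads)

    def depth(token):
        # Number of arcs on the path from the root down to this token.
        d = 0
        while heads[token - 1] != 0:
            token = heads[token - 1]
            d += 1
        return d

    depths = [depth(t) for t in range(1, n + 1)]
    governors = {h for h in heads if h != 0}
    nodes_per_depth = Counter(depths)
    return {
        "height": max(depths),
        "width": max(nodes_per_depth.values()),
        "governors": len(governors),
        "leaves": n - len(governors),
    }


# "The cat eats fish": The -> cat, cat -> eats, eats = root, fish -> eats.
measures = tree_measures([2, 3, 0, 3])
print(measures)
```

Per-text features such as the average height of dependency trees would then be obtained by averaging these values over all sentences of a text.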
Table 3 | Top-20 features highly correlated with EDIAMME grouped levels and post hoc multiple comparisons between level pairs
EDIAMME grouped level pairs: 1 vs 2, 2 vs 3, 1 vs 3
Rank | Feature | Spearman’s rho
1 | Av. no. of noun phrases in gen. case per sentence | 0.542
2 | Av. width of dependency trees | 0.495
3 | Av. no. of leaves in dependency trees | 0.490
4 | Av. height of dependency trees | 0.486
5 | Av. sentence length | 0.485
6 | Av. no. of adjectival phrases per sentence | 0.485
7 | Av. no. of governor nodes in dependency trees | 0.485
8 | Av. no. of noun phrases per sentence | 0.480
9 | % of sentences with length > 20 words | 0.477
10 | Av. no. of adjectives per sentence | 0.474
11 | Av. word length | 0.474
12 | Av. no. of adjectival phrases in gen. case per sentence | 0.473
13 | % of sentences with length > 10 words | 0.470
14 | Terminal punctuation to total characters ratio | -0.461
15 | Av. length of adjectival phrases in gen. case | 0.448
16 | Av. no. of adjectival phrases in acc. case per sentence | 0.446
17 | % of sentences with length > 30 words | 0.443
18 | Av. no. of passive verbs per sentence | 0.442
19 | Relative pronouns to pronouns ratio | 0.439
20 | Av. no. of prepositions per sentence | 0.438
Different aspects of syntactic complexity are also highlighted by the average number of passive verbs and prepositions per sentence. As expected, passive constructions are rarely used in lower levels, while learners encounter them more and more frequently in textbooks as their reading skills develop. The same is true for prepositions, a feature which indicates that higher proficiency level texts employ more complex and compound sentences.
The statistically significant correlation exhibited by the ratio of relative pronouns to pronouns (rho = 0.439) signifies the role of anaphora. As anaphora resolution is considered a linguistically and cognitively demanding task during reading, anaphoric structures are rare in lower levels but significantly more frequent in upper levels. As a result, the use of relative pronouns can be considered a successful discriminator of proficiency levels.
The list of the best performing features also includes some more “traditional” indices of text complexity, such as word and sentence length. The average sentence length appears in rank 5 in Table 3 (rho = 0.485), while related features that quantify sentence length from a different perspective are also present (the percentage of sentences with more than 10, 20 and 30 words). Additionally, the ratio of terminal punctuation to total characters (rho = -0.461) should be interpreted as an inverse of sentence length. Regarding lexical features, it is noticeable that among the various features investigated (lexical diversity, density, etc.) only the average word length is present among the top performers (rho = 0.474).
A more thorough investigation of the above features employed one-way ANOVA for means comparison across levels, which resulted in statistically significant main effects for all of the 20 features. Since, however, this type of analysis cannot determine whether the mean values of a feature are statistically different between all possible level pairs, post hoc multiple comparisons (Bonferroni tests) were also applied. The results are presented in Table 3: statistically different means for each feature are indicated for each level combination separately. These comparisons indicate that all features can successfully discriminate group 3 (i.e. EDIAMME level 5, CEFR C2) from lower levels (both from group 2 and group 1). However, some of the features were not as successful in discriminating group 1 (i.e. EDIAMME levels 1 and 2, CEFR A1-A2) from group 2 (i.e. EDIAMME levels 3 and 4, CEFR B1-C1). Poor performers in discriminating group 1 from group 2 were all the features relevant to sentence length, with the exception of the proportion of sentences with more than 20 words. This implies that a group 1 text is unlikely to include lengthier sentences, thus suggesting a possible threshold for the transition from CEFR A2 to B1 level.
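The two steps of this analysis can be sketched as follows: a one-way ANOVA F statistic for one feature across the three grouped levels, and the Bonferroni-adjusted significance threshold for the three pairwise comparisons. The data are hypothetical and the function names are illustrative; obtaining p-values additionally requires the F and t distributions, for which a statistics library (for instance `scipy.stats.f_oneway`) would normally be used.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of observations."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = 0.0
    ss_within = 0.0
    for g in groups:
        group_mean = sum(g) / len(g)
        ss_between += len(g) * (group_mean - grand_mean) ** 2
        ss_within += sum((v - group_mean) ** 2 for v in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison significance threshold under the Bonferroni correction."""
    return alpha / n_comparisons


# Hypothetical values of one feature for texts in grouped levels 1, 2 and 3.
levels = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, df_b, df_w = one_way_anova_f(levels)
# Three level pairs are compared: 1 vs 2, 2 vs 3 and 1 vs 3.
threshold = bonferroni_alpha(0.05, 3)
print(f_stat, df_b, df_w, threshold)
```

A significant F only indicates that at least one group mean differs; the pairwise tests against the adjusted threshold are what identify which level pairs a feature separates.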
4 Conclusions and discussion
The current investigation highlighted a number of textual features automatically extracted from a morphologically and syntactically annotated Greek L2 corpus. With the aim of identifying indices of text difficulty that are directly associated with the proficiency level, we employed statistical analysis and put forward the best performing features. These can be regarded as potential predictors of the proficiency level of a previously unseen text in an automatic labelling/classification approach.
The results highlight the influence of syntactic features on the characterization of proficiency level: with the exception of average word length, the rest of the best performing features are directly or indirectly related to syntactic complexity. This finding is in line with previous research, where syntax-related features consistently appear in the best-performing prediction models (e.g. Pitler and Nenkova 2008; Schwarm and Ostendorf 2005; Callan and Eskenazi 2007; Kate et al. 2010; Kotani et al. 2008). The frequencies of the genitive case, of adjectives and of prepositions were additionally identified as successful discriminators. Surface features used in traditional readability formulas, such as sentence and word length, were found to be significantly correlated with proficiency levels. Similar recent research in Greek has also highlighted the influence of such surface features on proficiency level classification (Tzimokas and Tantos 2014). It is interesting to note that some of the features put forward by Georgatou (2016) as the most informative, i.e. sentence length, passive verbs and adjectives, are confirmed by the current study as well, thus qualifying them as reliable indices of the difficulty level of Greek texts.
When the best performing features were tested for their discriminatory power between all possible level pairs, they proved to be highly discriminative of the upper proficiency level. This finding implies a significant shift in L2 reading skills during the transition from C1 to C2 level, and this shift can successfully be measured by the features investigated herein. On the contrary, the transition from A2 to B1 seems to go hand in hand with the acquisition of language skills not captured by the features that emerged from the current analysis.
It is true that the current investigation is subject to limitations imposed by the corpus at hand, which comprised texts drawn from textbooks of a single publisher. As such, the findings may be influenced by the publisher’s choices regarding the types and topics of texts and the linguistic descriptors of proficiency levels the editor has adopted. To address this limitation, the work described herein is being continued and expanded in order to exploit a larger corpus of Greek L2 texts from different publishers. Proficiency level labelling for this expanded corpus does not rely exclusively on the publisher’s labelling. Rather, three independent experts in Greek L2 teaching have judged each text to determine the CEFR proficiency level. The experts’ judgements are treated as the dependent variable in a machine learning approach for the automatic labelling of previously unseen texts, which has already yielded significant results.
Reading comprehension is a key skill in L2 development, and reading is an integral part of L2 instruction and assessment. In this view, an automated approach to matching L2 learners to texts suitable for their proficiency level is expected to facilitate the selection of reading material for both learners and teachers. It is at the same time an anticipated aid in assessment procedures, by providing an objective measurement for estimating the level-appropriateness of items included in diagnostic, placement or achievement language tests.
References
Barzilay, Regina, and Mirella Lapata. 2008. “Modeling Local Coherence: An Entity-based Approach.” Computational Linguistics 34(1): 1–34.
Centre for the Greek Language. 2013. “Logismiko Anagnosimotitas.” Accessed March 1, 2017. http://www.greek-language.gr/certification/readability
Council of Europe. 2001. Common European Framework of Reference for Languages: Learning, Teaching, Assessment (CEFR). www.coe.int/lang-CEFR
Damanakis, Michalis, ed. 2004. Theoritiko Plaisio kai Programmata Spoudon gia tin Ellinoglossi Ekpaideusi sti Diaspora. Rethymno: EDIAMME. httpwwwediammeedcuocgrdiaspora2indexphpid=23650010
DuBay, William H. 2006. The Classic Readability Studies. Costa Mesa, California: Impact Information.
EDIAMME. 2014. Epipeda Glossomatheias kai Ekpaideutiko Yliko. httpwwwediammeedcuocgrellinoglossiindexphpelekp-yliko-kepa
François, Thomas, and Cédrick Fairon. 2012. “An ‘AI readability’ Formula for French as a Foreign Language.” In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 466–477. Association for Computational Linguistics.
François, Thomas, and Eleni Miltsakaki. 2012. “Do NLP and Machine Learning Improve Traditional Readability Formulas?” In Proceedings of the First Workshop on Predicting and Improving Text Readability for Target Reader Populations, 49–57. Montréal.
Georgatou, Spyridoula. 2016. “Approaching Readability Features in Greek School Books.” Master’s thesis, University of Tübingen.
Giagkou, Maria, Vicky Kantzou, Spyridoula Stamouli, and Maria Tzevelekou. 2015. “Discriminating CEFR Levels in Greek L2: A Corpus-based Study of Young Learners’ Written Narratives.” Bergen Language and Linguistics Studies 6: 153–169.
Heilman, Michael, Kevyn Collins-Thompson, and Maxine Eskenazi. 2008. “An Analysis of Statistical Models and Features for Reading Difficulty Prediction.” In The Third Workshop on Innovative Use of NLP for Building Educational Applications: Proceedings of the Workshop, 71–79. ACL.
Callan, Jamie, and Maxine Eskenazi. 2007. “Combining Lexical and Grammatical Features to Improve Readability Measures for First and Second Language Texts.” In Proceedings of HLT-NAACL’07, 460–467. Association for Computational Linguistics.
Lu, Xiaofei. 2010. “Automatic Analysis of Syntactic Complexity in Second Language Writing.” International Journal of Corpus Linguistics 15(4): 474–496.
Lu, Xiaofei. 2011. “A Corpus-Based Evaluation of Syntactic Complexity Measures as Indices of College-Level ESL Writers’ Language Development.” TESOL Quarterly 45(1): 36–62.
Ott, Niels, and Detmar Meurers. 2010. “Information Retrieval for Education: Making Search Engines Language Aware.” Themes in Science and Technology Education 3(1–2): 9–30.
Petersen, Sarah E., and Mari Ostendorf. 2009. “A Machine Learning Approach to Reading Level Assessment.” Computer Speech and Language 23(1): 89–106.
Pilán, Ildikó, Elena Volodina, and Richard Johansson. 2014. “Rule-Based and Machine Learning Approaches for Second Language Sentence-Level Readability.” In Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications, 174–184. Association for Computational Linguistics.
Pitler, Emily, and Ani Nenkova. 2008. “Revisiting Readability: A Unified Framework for Predicting Text Quality.” In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, 186–195. Honolulu: ACL.
Prokopidis, Prokopis, and Harris Papageorgiou. 2014. “Experiments for Dependency Parsing of Greek.” In Proceedings of the First Joint Workshop on Statistical Parsing of Morphologically Rich Languages and Syntactic Analysis of Non-Canonical Languages, 90–96. Dublin, Ireland.
Prokopidis, Prokopis, Byron Georgantopoulos, and Harris Papageorgiou. 2011. “A Suite of NLP Tools for Greek.” In The 10th International Conference of Greek Linguistics, 373–383. Komotini, Greece.
Schwarm, Sarah E., and Mari Ostendorf. 2005. “Reading Level Assessment Using Support Vector Machines and Statistical Language Models.” In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), 523–530. Ann Arbor, Michigan.
Tzimokas, Dimitrios, and Sotirios Tantos. 2014. “Logismiko Anagnosimotitas Ellinikon Keimenon: Ena Neo Ergaleio gia ti Didaskalia tis Ellinikis os Ksenis/Deuteris Glossas kai tin Pistopoiisi Ellinomatheias.” Paper presented at Diethnes Synedrio gia ti Didaskalia kai tin Pistopoiisi tis Ellinikis os Ksenis/Deuteris Glossas, Thessaloniki, October 25. http://speakgreek.gr/el/images/pdf/tzimwkas.pdf
Vajjala, Sowmya, and Detmar Meurers. 2012. “On Improving the Accuracy of Readability Classification Using Insights from Second Language Acquisition.” In Proceedings of the Seventh Workshop on Building Educational Applications Using NLP, 163–173. Association for Computational Linguistics.
Vyatkina, Nina. 2012. “The Development of Second Language Writing Complexity in Groups and Individuals: A Longitudinal Learner Corpus Study.” The Modern Language Journal 96(4): 576–598.
The two volumes of the conference proceedings are the product of the work of the Editorial Committee, whose members were Thanasis Georgakopoulos, Theodossia-Soula Pavlidou, Miltos Pechlivanos, Artemis Alexiadou, Jannis Androutsopoulos, Alexis Kalokairinos, Stavros Skopeteas and Katerina Stathi.
Although the papers at the conference were grouped by thematic area, the texts of the papers are presented here in alphabetical order according to the Latin alphabet; the exceptions are the keynote speeches, which appear at the beginning of the first volume.
The Organizing Committee of ICGL12
CONTENTS
Editors’ note 7
Contents 9
Peter MackridgeSome literary representations of spoken Greek before nationalism(1750-1801) 17
Μαρία ΣηφιανούΗ έννοια της ευγένειας στα Eλληνικά 45
Σπυριδούλα Βαρλοκώστα Syntactic comprehension in aphasia and its relationship to working memory deficits 75
Ευαγγελία Αχλάδη Αγγελική Δούρη Ευγενία Μαλικούτη amp χρυσάνθη Παρασχάκη-ΜπαράνΓλωσσικά λάθη τουρκόφωνων μαθητών της Ελληνικής ως ξένηςδεύτερης γλώσσας Ανάλυση και διδακτική αξιοποίηση 109
Κατερίνα ΑλεξανδρήΗ μορφή και η σημασία της διαβάθμισης στα επίθετα που δηλώνουν χρώμα 125
Eva Anastasi Ageliki logotheti Stavri Panayiotou Marilena Serafim amp Charalambos Themistocleous A Study of Standard Modern Greek and Cypriot Greek Stop Consonants Preliminary Findings 141
Anna Anastassiadis-Symeonidis Elisavet Kiourti amp Maria MitsiakiInflectional Morphology at the service of Lexicography ΚΟΜOΛεξ A Cypriot Mοrphological Dictionary 157
Γεωργία Ανδρέου amp Ματίνα ΤασιούδηΗ ανάπτυξη του λεξιλογίου σε παιδιά με Σύνδρομο Απνοιών στον Ύπνο 175
Ανθούλα- Ελευθερία Ανδρεσάκη Ιατρικές μεταφορές στον δημοσιογραφικό λόγο της κρίσης Η οπτική γωνία των Γερμανών 187
Μαρία ΑνδριάΠροσεγγίζοντας θέματα Διαγλωσσικής Επίδρασης μέσα από το πλαίσιο της Γνωσιακής Γλωσσολογίας ένα παράδειγμα από την κατάκτηση της Ελληνικής ως Γ2 199
Spyros Armostis amp Kakia PetinouMastering word-initial syllable onsets by Cypriot Greek toddlers with and without early language delay 215
Julia Bacskai-AtkariAmbiguity and the Internal Structure of Comparative Complements in Greek 231
Costas CanakisTalking about same-sex parenthood in contemporary Greece Dynamic categorization and indexicality 243
Michael ChiouThe pragmatics of future tense in Greek 257
Maria Chondrogianni The Pragmatics of the Modern Greek Segmental Μarkers 269
Katerina Christopoulou George J Xydopoulos ampAnastasios TsangalidisGrammatical gender and offensiveness in Modern Greek slang vocabulary 291
Aggeliki Fotopoulou vasiliki Foufi Tita Kyriacopoulou amp Claude Martineau Extraction of complex text segments in Modern Greek 307
Aγγελική Φωτοπούλου amp Βούλα ΓιούληΑπό την laquoΈκφρασηraquo στο laquoΠολύτροποraquo σχεδιασμός και οργάνωση ενός εννοιολογικού λεξικού 327
Marianthi Georgalidou Sofia lampropoulou Maria Gasouka Apostolos Kostas amp Xan-thippi FoulidildquoLearn grammarrdquo Sexist language and ideology in a corpus of Greek Public Documents 341
Maria Giagkou Giorgos Fragkakis Dimitris Pappas amp Harris PapageorgiouFeature extraction and analysis in Greek L2 texts in view of automatic labeling for proficiency levels 357
Dionysis Goutsos Georgia Fragaki Irene Florou vasiliki Kakousi amp Paraskevi SavvidouThe Diachronic Corpus of Greek of the 20th century Design and compilation 369
Kleanthes K Grohmann amp Maria KambanarosBilectalism Comparative Bilingualism and theGradience of Multilingualism A View from Cyprus 383
Guumlnther S Henrich bdquoΓεωγραφία νεωτερικήldquo στο Λίβιστρος και Ροδάμνη μετατόπιση ονομάτων βαλτικών χωρών προς την Ανατολή 397
Noriyo Hoozawa-Arkenau amp Christos KarvounisVergleichende Diglossie - Aspekte im Japanischen und Neugriechischen Verietaumlten - Interferenz 405
Μαρία Ιακώβου Ηριάννα Βασιλειάδη-Λιναρδάκη Φλώρα Βλάχου Όλγα Δήμα Μαρία Καββαδία Τατιάνα Κατσίνα Μαρίνα Κουτσουμπού Σοφία-Νεφέλη Κύτρου χριστίνα Κωστάκου Φρόσω Παππά amp Σταυριαλένα ΠερρέαΣΕΠΑΜΕ2 Μια καινούρια πηγή αναφοράς για την Ελληνική ως Γ2 419
Μαρία Ιακώβου amp Θωμαΐς ΡουσουλιώτηΒασικές αρχές σχεδιασμού και ανάπτυξης του νέου μοντέλου αναλυτικών προγραμμάτων για τη διδασκαλία της Eλληνικής ως δεύτερηςξένης γλώσσας 433
Μαρία Καμηλάκη laquoΜαζί μου ασχολείσαι πόσο μαλάκας είσαιraquo Λέξεις-ταμπού και κοινωνιογλωσσικές ταυτότητες στο σύγχρονο ελληνόφωνο τραγούδι 449
Μαρία Καμηλάκη Γεωργία Κατσούδα amp Μαρία Βραχιονίδου Η εννοιολογική μεταφορά σε λέξεις-ταμπού της ΝΕΚ και των νεοελληνικών διαλέκτων 465
Eleni Karantzola Georgios Mikros amp Anastassios Papaioannou Lexico-grammatical variation and stylometric profile of autograph texts in Early Modern Greek 479
Sviatlana Karpava Maria Kambanaros amp Kleanthes K GrohmannNarrative Abilities MAINing RussianndashGreek Bilingual Children in Cyprus 493
χρήστος Καρβούνης Γλωσσικός εξαρχαϊσμός και laquoιδεολογικήraquo νόρμα Ζητήματα γλωσσικής διαχείρισης στη νέα ελληνική 507
Demetra Katis amp Kiki Nikiforidou Spatial prepositions in early child GreekImplications for acquisition polysemy and historical change 525
Γεωργία Κατσούδα Το επίθημα -ούνα στη ΝΕΚ και στις νεοελληνικές διαλέκτους και ιδιώματα 539
George Kotzoglou Sub-extraction from subjects in Greek Its existence its locus and an open issue 555
veranna KypriotiNarrative identity and age the case of the bilingual in Greek and Turkish Muslim community of Rhodes Greece 571
χριστίνα Λύκου Η Ελλάδα στην Ευρώπη της κρίσης Αναπαραστάσεις στον ελληνικό δημοσιογραφικό λόγο 583
Nikos liosis Systems in disruption Propontis Tsakonian 599
Katerina Magdou Sam Featherston Resumptive Pronouns can be more acceptable than gaps Experimental evidence from Greek 613
Maria Margarita Makri Opos identity comparatives in Greek an experimental investigation 629
Volume 2
Contents 651
vasiliki Makri Gender assignment to Romance loans in Katoitalioacutetika a case study of contact morphology 659
Evgenia Malikouti Usage Labels of Turkish Loanwords in three Modern Greek Dictionaries 675
Persephone Mamoukari amp Penelope Kambakis-vougiouklis Frequency and Effectiveness of Strategy Use in SILL questionnaire using an Innovative Electronic Application 693
Georgia Maniati voula Gotsoulia amp Stella Markantonatou Contrasting the Conceptual Lexicon of ILSP (CL-ILSP) with major lexicographic examples 709
Γεώργιος Μαρκόπουλος amp Αθανάσιος Καρασίμος Πολυεπίπεδη επισημείωση του Ελληνικού Σώματος Κειμένων Αφασικού Λόγου 725
Πωλίνα Μεσηνιώτη Κατερίνα Πούλιου amp χριστόφορος Σουγανίδης Μορφοσυντακτικά λάθη μαθητών Τάξεων Υποδοχής που διδάσκονται την Ελληνική ως Γ2 741
Stamatia Michalopoulou Third Language Acquisition The Pro-Drop-Parameter in the Interlanguage of Greek students of German 759
vicky Nanousi amp Arhonto Terzi Non-canonical sentences in agrammatism the case of Greek passives 773
Καλομοίρα Νικολού Μαρία Ξεφτέρη amp Νίτσα Παραχεράκη Τo φαινόμενο της σύνθεσης λέξεων στην κυκλαδοκρητική διαλεκτική ομάδα 789
Ελένη Παπαδάμου amp Δώρης Κ Κυριαζής Μορφές διαβαθμιστικής αναδίπλωσης στην ελληνική και στις άλλες βαλκανικές γλώσσες 807
Γεράσιμος Σοφοκλής Παπαδόπουλος Το δίπολο laquoΕμείς και οι Άλλοιraquo σε σχόλια αναγνωστών της Lifo σχετικά με τη Χρυσή Αυγή 823
Ελένη Παπαδοπούλου Η συνδυαστικότητα υποκοριστικών επιθημάτων με β συνθετικό το επίθημα -άκι στον διαλεκτικό λόγο 839
Στέλιος Πιπερίδης Πένυ Λαμπροπούλου amp Μαρία Γαβριηλίδου clarinel Υποδομή τεκμηρίωσης διαμοιρασμού και επεξεργασίας γλωσσικών δεδομένων 851
Maria Pontiki Opinion Mining and Target Extraction in Greek Review Texts 871
Anna Roussou The duality of mipos 885
Stathis Selimis amp Demetra Katis Reference to static space in Greek A cross-linguistic and developmental perspective of poster descriptions 897
Evi Sifaki amp George Tsoulas XP-V orders in Greek 911
Konstantinos Sipitanos On desiderative constructions in Naousa dialect 923
Eleni Staraki Future in Greek A Degree Expression 935
χριστίνα Τακούδα amp Ευανθία Παπαευθυμίου Συγκριτικές διδακτικές πρακτικές στη διδασκαλία της ελληνικής ως Γ2 από την κριτική παρατήρηση στην αναπλαισίωση 945
Alexandros Tantos Giorgos Chatziioannidis Katerina lykou Meropi Papatheohari Antonia Samara amp Kostas vlachos Corpus C58 and the interface between intra- and inter-sentential linguistic information 961
Arhonto Terzi amp vina TsakaliΤhe contribution of Greek SE in the development of locatives 977
Paraskevi ThomouConceptual and lexical aspects influencing metaphor realization in Modern Greek 993
Nina Topintzi amp Stuart Davis Features and Asymmetries of Edge Geminates 1007
liana Tronci At the lexicon-syntax interface Ancient Greek constructions with ἔχειν and psychological nouns 1021
Βίλλυ Τσάκωνα laquoΔημοκρατία είναι 4 λύκοι και 1 πρόβατο να ψηφίζουν για φαγητόraquoΑναλύοντας τα ανέκδοτα για τουςτις πολιτικούς στην οικονομική κρίση 1035
Ειρήνη Τσαμαδού- Jacoberger amp Μαρία ΖέρβαΕκμάθηση ελληνικών στο Πανεπιστήμιο Στρασβούργου κίνητρα και αναπαραστάσεις 1051
Stavroula Tsiplakou amp Spyros Armostis Do dialect variants (mis)behave Evidence from the Cypriot Greek koine 1065
Αγγελική Τσόκογλου amp Σύλα Κλειδή Συζητώντας τις δομές σε -οντας 1077
Αλεξιάννα Τσότσου Η μεθοδολογική προσέγγιση της εικόνας της Γερμανίας στις ελληνικές εφημερίδες 1095
Anastasia Tzilinis Begruumlndendes Handeln im neugriechischen Wissenschaftlichen Artikel Die Situierung des eigenen Beitrags im Forschungszusammenhang 1109
Kυριακούλα Τζωρτζάτου Aργύρης Αρχάκης Άννα Ιορδανίδου amp Γιώργος Ι Ξυδόπουλος Στάσεις απέναντι στην ορθογραφία της Κοινής Νέας Ελληνικής Ζητήματα ερευνητικού σχεδιασμού 1123
Nicole vassalou Dimitris Papazachariou amp Mark Janse The Vowel System of Mišoacutetika Cappadocian 1139
Marina vassiliou Angelos Georgaras Prokopis Prokopidis amp Haris Papageorgiou Co-referring or not co-referring Answer the question 1155
Jeroen vis The acquisition of Ancient Greek vocabulary 1171
Christos vlachos Mod(aliti)es of lifting wh-questions 1187
Ευαγγελία Βλάχου amp Κατερίνα Φραντζή Μελέτη της χρήσης των ποσοδεικτών λίγο-λιγάκι σε κείμενα πολιτικού λόγου 1201
Madeleine voga Τι μας διδάσκουν τα ρήματα της ΝΕ σχετικά με την επεξεργασία της μορφολογίας 1213
Werner voigtlaquoΣεληνάκι μου λαμπρό φέγγε μου να περπατώ hellipraquo oder warum es in dem bekannten lied nicht so sondern eben φεγγαράκι heiszligt und ngr φεγγάρι 1227
Μαρία Βραχιονίδου Υποκοριστικά επιρρήματα σε νεοελληνικές διαλέκτους και ιδιώματα 1241
Jeroen van de Weijer amp Marina TzakostaThe Status of Complex in Greek 1259
Theodoros Xioufis The pattern of the metaphor within metonymy in the figurative language of romantic love in modern Greek 1275
FEATURE EXTR ACTIoN AND ANAlYSIS IN GREEK l2 TEXT S IN vIEW oF AUToMATIC l ABElING FoR
PRoFICIENCY lEvElSMaria Giagkou1 Giorgos Fragkakis Dimitris Pappas1 amp Harris Papageorgiou1
1Institute for language and Speech Processing RC ATHENAmgiagkouilspgr fragakisschgr dpappasilspgr xarisilspgr
Περίληψη
Στο άρθρο διερευνάται ένα σύνολο γλωσσικών χαρακτηριστικών κειμένων που απευθύνο-νται σε μαθητές της Ελληνικής ως Γ2 και εξετάζεται η σχέση των εν λόγω χαρακτηριστικών με το επίπεδο γλωσσομάθειας για το οποίο θεωρούνται κατάλληλα τα κείμενα αυτά Στόχος είναι να διερευνηθεί ποια χαρακτηριστικά παρουσιάζουν επαρκή διακριτική ικανότητα μετα-ξύ των επιπέδων ώστε να αξιοποιηθούν σε μια προσέγγιση αυτόματης κατηγοριοποίησης σε επίπεδα γλωσσομάθειας Προς αυτό το σκοπό αξιοποιείται ένα σώμα κειμένων που συγκρο-τήθηκε από εγχειρίδια της Ελληνικής ως Γ2 Τα αποτελέσματα αναδεικνύουν τη σημαντική επίδραση μεταξύ άλλων χαρακτηριστικών που ποσοτικοποιούν την περιπλοκότητα των συντακτικών δέντρων εξαρτήσεων της γενικής πτώσης και των επιθετικών προσδιορισμών
Keywords: L2 reading, text complexity, linguistic features, proficiency levels, automatic labelling
1 Introduction
The last two decades have seen increasing interest in modelling text difficulty, i.e. readability. Automatic readability estimation systems are intended to assess whether a text retrieved from a large collection, such as a repository or the web, is appropriate for a given group of readers, according to their abilities in L1 or by taking into account the readers' special needs (e.g. learning difficulties). Readability estimation is particularly relevant for second language (L2) learners as well. From the L2 perspective, the aim is to automatically identify or retrieve a text given the proficiency level of the learner or group of learners.
To this end, recent studies attempt to grade L2 texts according to proficiency levels, in order to facilitate reading in L2 or as an aid to the selection of assessment material (e.g. Centre for the Greek Language 2013; Tzimokas and Tantos 2014; François and Fairon 2012; Ott and Meurers 2010; Pilán et al. 2014; Vajjala and Meurers 2012). In a similar approach, the development of productive skills in L2 (mainly writing) is investigated in view of an automated evaluation of L2 writing (e.g. Lu 2010, 2011; Vyatkina 2012; Giagkou et al. 2015).
The long tradition of L1 readability assessment, dating back to the early 20th century (see DuBay 2006), has bequeathed readability formulas (e.g. Flesch Reading Ease Score, Flesch-Kincaid Grade Level, Fog index, SMOG etc.) that assign a difficulty grade or level to a text by relying on surface linguistic features, such as sentence and word length, as simple proxies for syntactic complexity and vocabulary burden respectively. More recently, advances in NLP have boosted readability research: new resources (electronically available texts) and new tools (taggers, parsers, semantic treebanks etc.) have made it feasible to apply machine learning techniques to large training corpora and to quantify more thorough and linguistically sound text features. Semantic and discourse features are investigated, e.g. named entities (Barzilay & Lapata 2008) and lexical cohesion (Pitler & Nenkova 2008). Shallow syntactic complexity indicators, such as average sentence length, are combined with the height of syntactic trees (see also Heilman et al. 2008). Instead of simple proxies of vocabulary burden, N-gram Language Models (LM) are used for predicting the grade level of texts (Callan and Eskenazi 2007; Petersen & Ostendorf 2009; Schwarm and Ostendorf 2005).
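As an illustration of how such traditional formulas reduce a text to surface counts, the two Flesch measures can be sketched as follows. The coefficients are the standard published ones, calibrated for English; they are not claimed to be valid for Greek, and the function names are chosen here for illustration only.

```python
def flesch_reading_ease(n_words: int, n_sentences: int, n_syllables: int) -> float:
    """Flesch Reading Ease: higher scores indicate easier text (English calibration)."""
    if n_words <= 0 or n_sentences <= 0:
        raise ValueError("the text must contain at least one word and one sentence")
    words_per_sentence = n_words / n_sentences      # proxy for syntactic complexity
    syllables_per_word = n_syllables / n_words      # proxy for vocabulary burden
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


def flesch_kincaid_grade(n_words: int, n_sentences: int, n_syllables: int) -> float:
    """Flesch-Kincaid Grade Level: maps the same two ratios to a US school grade."""
    if n_words <= 0 or n_sentences <= 0:
        raise ValueError("the text must contain at least one word and one sentence")
    words_per_sentence = n_words / n_sentences
    syllables_per_word = n_syllables / n_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
```

Both formulas use only two ratios, which is why they are regarded as simple proxies rather than linguistically informed measures.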
In this paper we present an investigation of linguistic features of texts addressed to learners of Greek as a second language (L2). The goal of this study is to identify the textual properties that indicate the development of reading skills in Greek L2, with the aim of employing these properties as parameters for automatic proficiency level labelling. The set of features investigated in the current study draws on traditional readability research, combined with NLP-enabled features and machine learning techniques for text classification, as this merging was found to result in performance gain (François & Miltsakaki 2012).
The paper is organized as follows. Section 2 provides information on the corpus used and the features identified, selected and computed in order to form the dataset for the analysis. In Section 3, the analysis applied to the features is presented and the results are discussed. We conclude with a summary of the main findings and their implications for future work on automatic proficiency level classification for Greek L2.
2 Datasets
2.1 Corpus
For the purposes of this investigation, a Greek L2 text set that is labelled for proficiency levels in an objective and qualified way, and can thus be considered a gold standard, was deemed necessary. Such a dataset was retrieved from the Greek L2 textbooks published by the Centre of Intercultural and Migration Studies (EDIAMME) and freely available online. These textbooks are addressed to Greek migrants living abroad, from preschoolers (aged 6) to 18-year-olds learning Greek as a second or foreign language. EDIAMME employs five proficiency levels, aligned to the Greek educational system grades and to CEFR levels (Council of Europe 2001), as presented in Table 1.
Age | School grade | EDIAMME level | Language content | CEFR level alignment
6 | Preschool | 1 | Pre-reading, reading | A1
7 | 1 | 1 | Pre-reading, reading | A1
8 | 2 | 1 | Pre-reading, reading | A1
9 | 3 | 2 | Speaking and writing consolidation | A2
10 | 4 | 2 | Speaking and writing consolidation | A2
11 | 5 | 3 | Further practice in speaking and writing | B1
12 | 6 | 3 | Further practice in speaking and writing | B1
13 | 7 | 4 | Independent writing | B2 & C1
14 | 8 | 4 | Independent writing | B2 & C1
15 | 9 | 4 | Independent writing | B2 & C1
16 | 10 | 5 | Greek language and literature | C2
17 | 11 | 5 | Greek language and literature | C2
18 | 12 | 5 | Greek language and literature | C2

Table 1 | EDIAMME proficiency levels (Damanakis 2004: 76) and their alignment to CEFR levels (EDIAMME 2014)

Only prose texts were extracted from the textbooks, while poems, lyrics, exercises and guidelines to the exercises were excluded. The selected texts belong to different genres (mainly narrative, descriptive, expository and procedural) and types (letters, announcements, instructions, diary entries etc.). Dialogues were also included, as they are very frequently used as educational material in L2 textbooks, though the role/name of the speaker was removed.

The final corpus employed in this investigation comprises 753 texts and a total of 112,169 tokens (Table 2). Each individual text inherited the proficiency level assigned to the textbook it was retrieved from, e.g. a text drawn from a textbook labeled as level 5 was considered as addressed to level 5 learners.¹

Grouped levels | EDIAMME levels | Texts | Sentences | Tokens
1 (CEFR A1-A2) | 1 | 24 | 136 | 720
1 (CEFR A1-A2) | 2 | 295 | 4552 | 33636
2 (CEFR B1-C1) | 3 | 108 | 1263 | 8780
2 (CEFR B1-C1) | 4 | 147 | 2305 | 19272
3 (CEFR C2) | 5 | 179 | 3356 | 49761
Totals | | 753 | 11612 | 112169

Table 2 | Corpus description

¹ It should be noted that this decision imposes a degree of “noise” on the data: although a low level textbook is not expected to include a text addressed to higher levels, the reverse is not equally unlikely. E.g. certain texts retrieved from a level 5 textbook can actually address lower level learners.
The texts were automatically annotated for morphological types, syntactic dependencies and phrase structure using the Institute for Language and Speech Processing NLP tools pipeline (Prokopidis et al. 2011; Prokopidis and Papageorgiou 2014).
2.2 Feature selection and computation
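The grouping of EDIAMME levels into three classes, as reported in the corpus description of Section 2.1, can be sketched as a simple aggregation. The counts below are copied from Table 2; the names `CORPUS`, `LEVEL_TO_GROUP` and `aggregate_by_group` are introduced here for illustration only and are not part of the authors' tooling.

```python
# Per-level counts copied from Table 2: EDIAMME level -> (texts, sentences, tokens)
CORPUS = {
    1: (24, 136, 720),
    2: (295, 4552, 33636),
    3: (108, 1263, 8780),
    4: (147, 2305, 19272),
    5: (179, 3356, 49761),
}

# Grouped levels as given in Table 2: 1 = CEFR A1-A2, 2 = CEFR B1-C1, 3 = CEFR C2
LEVEL_TO_GROUP = {1: 1, 2: 1, 3: 2, 4: 2, 5: 3}


def aggregate_by_group(corpus, level_to_group):
    """Sum (texts, sentences, tokens) over the EDIAMME levels of each grouped level."""
    totals = {}
    for level, counts in corpus.items():
        group = level_to_group[level]
        previous = totals.get(group, (0, 0, 0))
        totals[group] = tuple(p + c for p, c in zip(previous, counts))
    return totals


if __name__ == "__main__":
    for group, (texts, sentences, tokens) in sorted(aggregate_by_group(CORPUS, LEVEL_TO_GROUP).items()):
        print(group, texts, sentences, tokens)
```

Aggregating in this way shows that the three grouped classes differ in size, which is worth keeping in mind when the grouped level is later used as the target variable.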
The set of features investigated as indices of the proficiency level was selected on the basis of previous research on L1 and L2 readability assessment, as well as on second language acquisition and development. These features capture morphological, syntactic, lexical/semantic and other attributes of the text that are salient to the target proficiency level discrimination and prediction task.
In total, 303 text features were identified and computed. These fall roughly into the following categories:
a) Surface features: word and sentence length (e.g. average word length), number of characters, punctuation marks, numbers etc.
b) Lexical/semantic: lexical density (i.e. content to functional words), lexical variation (e.g. type/token ratio, hapax/dis legomena), including noun and verb variation measures, text entropy, lexical richness etc.
c) Morphological: frequencies and ratios of the different parts of speech, including their forms, e.g. ratio of passive verbs to verbs, ratio of nouns in the genitive case to nouns, ratio of 1st person personal pronouns to pronouns etc.
d) Syntactic: frequencies and ratios of the different syntactic roles (e.g. subjects to verbs ratio), measures of the dependency trees (e.g. depth and height of syntactic trees), phrase structure (e.g. length of noun, verb and adjectival phrases), subordination and apposition (e.g. average number of coordinating and subordinating conjunctions per sentence) etc.
e) Discourse-based features: e.g. use of relative pronouns as an index of the degree of anaphora density, frequency of present and past tenses as indices of temporality and narrativity etc.
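A few of the surface and lexical features listed above can be sketched as follows. This is a minimal illustration, assuming the input is already tokenized into sentences of word tokens with punctuation removed; it is not the FeatExt implementation, and the function and key names are hypothetical.

```python
from collections import Counter


def surface_lexical_features(sentences):
    """Compute a few surface/lexical features.

    `sentences` is a list of sentences, each a list of word tokens
    (punctuation is assumed to have been removed beforehand).
    """
    tokens = [token.lower() for sentence in sentences for token in sentence]
    if not tokens:
        raise ValueError("the text contains no tokens")
    counts = Counter(tokens)
    n_tokens = len(tokens)
    hapax = sum(1 for c in counts.values() if c == 1)  # types occurring exactly once
    return {
        "avg_sentence_length": n_tokens / len(sentences),
        "avg_word_length": sum(len(t) for t in tokens) / n_tokens,
        "type_token_ratio": len(counts) / n_tokens,
        "hapax_ratio": hapax / n_tokens,
    }


if __name__ == "__main__":
    text = [["το", "παιδί", "διαβάζει"], ["το", "βιβλίο", "είναι", "καλό"]]
    for name, value in surface_lexical_features(text).items():
        print(name, round(value, 3))
```

Morphological and syntactic ratios from the list follow the same pattern, with counts taken from the part-of-speech and dependency annotations instead of the raw tokens.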
The defined features were computed with specialized software, the ILSP FeatExt tool, developed in Python. The input of FeatExt is any corpus of Greek texts automatically annotated for Part of Speech, syntactic dependencies and phrase structure. It calculates the values of raw surface features (frequencies of words, sentences, nouns, verbs etc.) and computes their standardized values (i.e. meaningful ratios). In order to cater for zero values, a MinMaxScaler transformation is applied to all raw features. The output is a table of extracted feature values, preferably in CSV format. Settings can be modified through an optional configuration file to define, among others, the set of features to be computed, the corpus location, or additional feature-relevant data, such as a list of words to be counted (e.g. functional words, basic vocabulary for a specific proficiency level or topic etc.).
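The scaling and export steps can be sketched as follows. This is only an approximation under stated assumptions: the min-max transformation is written out by hand (it mirrors what a MinMaxScaler with the default [0, 1] range computes), constant columns are mapped to 0.0 as a choice made for this sketch, and the column name `text_id` and all function names are hypothetical rather than taken from FeatExt.

```python
import csv
import io


def min_max_scale(values):
    """Rescale a list of numbers linearly to the range [0, 1]."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # Constant feature: no spread to rescale; mapped to 0.0 in this sketch.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


def scale_table(rows, feature_names):
    """Apply min-max scaling column by column; `rows` is a list of dicts."""
    scaled = {name: min_max_scale([row[name] for row in rows]) for name in feature_names}
    return [
        {"text_id": row["text_id"], **{name: scaled[name][i] for name in feature_names}}
        for i, row in enumerate(rows)
    ]


def to_csv(rows, feature_names):
    """Serialize the feature table as CSV text, one row per text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["text_id", *feature_names])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    raw = [
        {"text_id": "t1", "n_words": 20, "n_nouns": 0},
        {"text_id": "t2", "n_words": 60, "n_nouns": 9},
        {"text_id": "t3", "n_words": 100, "n_nouns": 18},
    ]
    print(to_csv(scale_table(raw, ["n_words", "n_nouns"]), ["n_words", "n_nouns"]))
```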
3 Analysis and results
In order to investigate the underlying associations of text features with the proficiency level, correlation analysis was applied between all the extracted features and the grouped proficiency levels. Table 3 reports the twenty features that exhibited the highest absolute values of Spearman's rho correlation coefficient, in descending order (p < 0.05).
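The ranking procedure described above can be sketched as follows. Spearman's rho is computed here as the Pearson correlation of average ranks, which handles the many ties that arise when the target takes only three values. Significance testing (the p < 0.05 filter) is not included, and all function names are illustrative assumptions rather than the authors' code.

```python
def average_ranks(values):
    """Return 1-based ranks, assigning tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sd_x * sd_y)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


def rank_features(feature_columns, levels, top_n=20):
    """Sort features by the absolute value of their correlation with the levels."""
    scored = [(name, spearman_rho(column, levels)) for name, column in feature_columns.items()]
    scored.sort(key=lambda item: abs(item[1]), reverse=True)
    return scored[:top_n]
```

Sorting by absolute value keeps negatively correlated features, such as the punctuation ratio in Table 3, in the same ranking as positively correlated ones.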
Among the best performing features, the average number of noun phrases in the genitive case per sentence was found to exhibit the highest correlation coefficient (rho = 0.542). The association of the genitive case with the text's level is also evidenced by the performance of two more features, i.e. the average number of adjectival phrases in the genitive case per sentence (rho = 0.473) and the average length of adjectival phrases in the genitive case (rho = 0.448). Complementing these results and looking at them from a different angle, the influence of phrase structure, especially of the length and relative frequency of nominal phrases, is apparent: out of the 20 best performing features, six are indices of phrase structure (features in ranks 1, 6, 8, 12, 15 and 16 in Table 3). The frequency of use of modifiers, namely of adjectives, also seems to be highly correlated to the proficiency level: the more adjectives used in a text, the more likely it is that the text is addressed to higher level learners. This is evidenced by the average number of adjectival phrases and of adjectives per sentence.
Another important finding is highlighted by the performance of features that attempt to quantify syntactic dependencies. These include the width and height of dependency trees (rho = 0.495 and 0.486 respectively), as well as the number of leaves and governor nodes (rho = 0.490 and 0.485 respectively). Their emergence in the top ranks of Table 3 qualifies them as key predictors of the proficiency level.
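The dependency tree measures mentioned above can be sketched as follows. The paper does not define them formally, so the definitions used here are assumptions: height is the number of edges on the longest path from the root to a leaf, width is the largest number of nodes found at any single depth, leaves are tokens without dependents, and governors are tokens with at least one dependent. The input format (a list of 1-based head indices, with 0 marking the root) is likewise an assumption in the style of CoNLL annotations.

```python
from collections import defaultdict


def tree_measures(heads):
    """Compute height, width, leaf and governor counts of one dependency tree.

    `heads[i]` is the 1-based index of the governor of token i + 1; 0 marks the root.
    A single-rooted, well-formed tree is assumed.
    """
    children = defaultdict(list)
    root = None
    for token, head in enumerate(heads, start=1):
        if head == 0:
            root = token
        else:
            children[head].append(token)
    if root is None:
        raise ValueError("no root token found")

    # Breadth-first traversal, recording how many nodes sit at each depth.
    level_sizes = []
    frontier = [root]
    while frontier:
        level_sizes.append(len(frontier))
        frontier = [child for node in frontier for child in children[node]]

    governors = sum(1 for token in range(1, len(heads) + 1) if children[token])
    return {
        "height": len(level_sizes) - 1,
        "width": max(level_sizes),
        "leaves": len(heads) - governors,
        "governors": governors,
    }


if __name__ == "__main__":
    # "Το παιδί διαβάζει ένα βιβλίο": Το -> παιδί, παιδί -> διαβάζει (root),
    # ένα -> βιβλίο, βιβλίο -> διαβάζει
    print(tree_measures([2, 3, 0, 5, 3]))
```

Averaging these values over the sentences of a text would give per-text features of the kind listed in Table 3.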
Table 3 | Top-20 features highly correlated with EDIAMME grouped levels and post hoc multiple comparisons between level-pairs

Rank | Feature | Spearman's rho | EDIAMME grouped level-pairs (1 vs 2, 2 vs 3, 1 vs 3)
1 | Av. number of noun phrases in gen. case per sentence | 0.542 |
2 | Av. width of dependency trees | 0.495 |
3 | Av. number of leaves in dependency trees | 0.490 |
4 | Av. height of dependency trees | 0.486 |
5 | Av. sentence length | 0.485 |
6 | Av. number of adjectival phrases per sentence | 0.485 |
7 | Av. number of governor nodes in dependency trees | 0.485 |
8 | Av. number of noun phrases per sentence | 0.480 |
9 | % of sentences with length > 20 words | 0.477 |
10 | Av. number of adjectives per sentence | 0.474 |
11 | Av. word length | 0.474 |
12 | Av. number of adjectival phrases in gen. case per sentence | 0.473 |
13 | % of sentences with length > 10 words | 0.470 |
14 | Terminal punctuation to total characters ratio | -0.461 |
15 | Av. length of adjectival phrases in gen. case | 0.448 |
16 | Av. number of adjectival phrases in acc. case per sentence | 0.446 |
17 | % of sentences with length > 30 words | 0.443 |
18 | Av. number of passive verbs per sentence | 0.442 |
19 | Relative pronouns to pronouns ratio | 0.439 |
20 | Av. number of prepositions per sentence | 0.438 |
Different aspects of syntactic complexity are also highlighted by the average number of passive verbs and prepositions per sentence. As expected, passive constructions are rarely used in lower levels, while learners encounter them more and more frequently in textbooks as their reading skills develop. The same is true for prepositions, a feature that indicates that higher proficiency level texts employ more complex-compound sentences.
The statistically significant correlation exhibited by the ratio of relative pronouns to pronouns (rho = 0.439) signifies the role of anaphora. As anaphora resolution is considered a linguistically and cognitively demanding task during reading, anaphoric structures are rare in lower levels but significantly more frequent in upper levels. As a result, the use of relative pronouns can be considered a successful discriminator of proficiency levels.
The list of the best performing features also includes some more “traditional” indices of text complexity, such as word and sentence length. The average sentence length appears in rank 5 in Table 3 (rho = 0.485), while relevant features that quantify sentence length from a different perspective are also present (the percentage of sentences with more than 10, 20 and 30 words). Additionally, the presence of the ratio of terminal punctuation to total characters should also be interpreted as an inverse index of sentence length. Regarding lexical features, it is noticeable that among the various features investigated (lexical diversity, density etc.) only the average word length is present in the top performers (rho = 0.474).
A more thorough investigation of the above features employed one-way ANOVA for means comparison across levels, which resulted in statistically significant main effects for all of the 20 features. Since, however, this type of analysis cannot determine whether the mean values of a feature are statistically different between all possible level pairs, post hoc multiple comparisons (Bonferroni tests) were also applied. The results are presented in Table 3: statistically different means for each feature are indicated for each level combination separately. These comparisons indicate that all features can successfully discriminate group 3 (i.e. EDIAMME level 5, CEFR C2) from lower levels (both from group 2 and group 1). However, some of the features were not as successful in discriminating group 1 (i.e. EDIAMME levels 1 and 2, CEFR A1-A2) from group 2 (i.e. EDIAMME levels 3 and 4, CEFR B1-C1). The poor performers in discriminating group 1 from group 2 were all the features relevant to sentence length, with the exception of the proportion of sentences with more than 20 words. This implies that a group 1 text is unlikely to include lengthier sentences, thus suggesting a possible threshold for the transition from CEFR A2 to B1 level.
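The logic of this two-step comparison can be sketched as follows. The sketch computes only the one-way ANOVA F statistic with its degrees of freedom and the Bonferroni-adjusted significance threshold for the pairwise comparisons; obtaining p-values would additionally require the F and t distributions from a statistics library, which is left out here. Function names and the default alpha of 0.05 are assumptions made for illustration.

```python
from itertools import combinations


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of numeric values."""
    n_total = sum(len(group) for group in groups)
    k = len(groups)
    grand_mean = sum(sum(group) for group in groups) / n_total
    means = [sum(group) / len(group) for group in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(group) * (mean - grand_mean) ** 2 for group, mean in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean) ** 2 for group, mean in zip(groups, means) for x in group)
    df_between, df_within = k - 1, n_total - k
    f_statistic = (ss_between / df_between) / (ss_within / df_within)
    return f_statistic, df_between, df_within


def bonferroni_alpha(n_groups, alpha=0.05):
    """Per-comparison significance threshold when all group pairs are compared."""
    n_pairs = len(list(combinations(range(n_groups), 2)))
    return alpha / n_pairs


if __name__ == "__main__":
    # Toy values of one feature for three grouped levels (invented for illustration).
    toy = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
    print(one_way_anova_f(toy))
    print(bonferroni_alpha(len(toy)))
```

With three grouped levels there are three pairs (1 vs 2, 2 vs 3, 1 vs 3), so each pairwise test is held to one third of the overall alpha.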
4 Conclusions and discussion
The current investigation highlighted a number of textual features automatically extracted from a morphologically and syntactically annotated Greek L2 corpus. With the aim of identifying indices of text difficulty that are directly associated with the proficiency level, we employed statistical analysis and put forward the best performing features. These can be regarded as potential predictors of the proficiency level of a previously unseen text in an automatic labelling/classification approach.
The results highlight the influence of syntactic features on the characterization of proficiency level: with the exception of average word length, the rest of the best performing features are directly or indirectly related to syntactic complexity. This finding is in line with previous research, where syntax-related features consistently appear in the best-performing prediction models (e.g. Pitler and Nenkova 2008; Schwarm and Ostendorf 2005; Callan and Eskenazi 2007; Kate et al. 2010; Kotani et al. 2008). The frequencies of the genitive case, of adjectives and of prepositions were additionally identified as successful discriminators. Surface features used in traditional readability formulas, such as sentence and word length, were found to be significantly correlated to proficiency levels. Similar recent research in Greek has also highlighted the influence of such surface features on proficiency level classification (Tzimokas and Tantos 2014). It is interesting to notice that some of the features put forward by Georgatou (2016) as the most informative, i.e. sentence length, passive verbs and adjectives, are confirmed by the current study as well, thus qualifying them as reliable indices of the difficulty level of Greek texts.
When the best performing features were tested for their discriminatory power between all possible level pairs, they proved to be highly discriminative of the upper proficiency level. This finding implies a significant shift in L2 reading skills during the transition from C1 to C2 level, and this shift can successfully be measured by the features investigated herein. On the contrary, the transition from A2 to B1 seems to go hand in hand with the acquisition of language skills not captured by the features that emerged from the current analysis.
It is true that the current investigation is subject to limitations imposed by the corpus at hand, which comprised texts drawn from textbooks of a single publisher. As such, the findings may be influenced by the publisher's choices regarding the types and topics of texts and by the linguistic descriptors of proficiency levels the editor has adopted. To address this limitation, the work described herein is being continued and expanded in order to exploit a larger corpus of Greek L2 texts from different publishers. Proficiency level labelling for this expanded corpus does not rely exclusively on the publisher's labelling. Rather, three independent experts in Greek L2 teaching have judged each text to determine the CEFR proficiency level. The experts' judgements are treated as the dependent variable in a machine learning approach for the automatic labelling of previously unseen texts, which has already yielded significant results.
Reading comprehension is a key skill in L2 development, and reading is an integral part of L2 instruction and assessment. In this view, an automated approach to matching L2 learners to texts suitable for their proficiency level is expected to facilitate selection of reading material, both for learners and teachers. It is at the same time an anticipated aid in assessment procedures, by providing an objective measurement for the estimation of level-appropriateness of items included in diagnostic, placement or achievement language tests.
References
Barzilay, Regina, and Mirella Lapata. 2008. “Modeling Local Coherence: An Entity-based Approach.” Computational Linguistics 34(1):1–34.
Centre for the Greek Language. 2013. “Logismiko Anagnosimotitas.” Accessed March 1, 2017. http://www.greek-language.gr/certification/readability
Council of Europe. 2001. Common European Framework of Reference for Languages: Learning, Teaching, Assessment (CEFR). www.coe.int/lang-CEFR
Damanakis, Michalis, ed. 2004. Theoritiko Plaisio kai Programmata Spoudon gia tin Ellinoglossi Ekpaideusi sti Diaspora. Rethymno: EDIAMME. http://www.ediamme.edc.uoc.gr/diaspora2/index.php?id=23650010
DuBay, William H. 2006. The Classic Readability Studies. Impact Information, Costa Mesa, California.
EDIAMME. 2014. Epipeda Glossomatheias kai Ekpaideutiko Yliko. http://www.ediamme.edc.uoc.gr/ellinoglossi/index.php/el/ekp-yliko-kepa
François, Thomas, and Cédrick Fairon. 2012. “An ‘AI Readability’ Formula for French as a Foreign Language.” In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 466–477. Association for Computational Linguistics.
François, Thomas, and Eleni Miltsakaki. 2012. “Do NLP and Machine Learning Improve Traditional Readability Formulas?” In Proceedings of the First Workshop on Predicting and Improving Text Readability for Target Reader Populations, 49–57. Montréal.
Georgatou, Spyridoula. 2016. “Approaching Readability Features in Greek School Books.” Master thesis, University of Tübingen.
Giagkou, Maria, Vicky Kantzou, Spyridoula Stamouli, and Maria Tzevelekou. 2015. “Discriminating CEFR Levels in Greek L2: A Corpus-based Study of Young Learners' Written Narratives.” Bergen Language and Linguistics Studies 6:153–169.
Heilman, Michael, Kevyn Collins-Thompson, and Maxine Eskenazi. 2008. “An Analysis of Statistical Models and Features for Reading Difficulty Prediction.” In The Third Workshop on Innovative Use of NLP for Building Educational Applications: Proceedings of the Workshop, 71–79. ACL.
Callan, Jamie, and Maxine Eskenazi. 2007. “Combining Lexical and Grammatical Features to Improve Readability Measures for First and Second Language Texts.” In Proceedings of HLT-NAACL'07, 460–467. Association for Computational Linguistics.
Lu, Xiaofei. 2010. “Automatic Analysis of Syntactic Complexity in Second Language Writing.” International Journal of Corpus Linguistics 15(4):474–496.
Lu, Xiaofei. 2011. “A Corpus-Based Evaluation of Syntactic Complexity Measures as Indices of College-level ESL Writers' Language Development.” TESOL Quarterly 45(1):36–62.
Ott, Niels, and Detmar Meurers. 2010. “Information Retrieval for Education: Making Search Engines Language Aware.” Themes in Science and Technology Education 3(1–2):9–30.
Petersen, Sarah E., and Mari Ostendorf. 2009. “A Machine Learning Approach to Reading Level Assessment.” Computer Speech and Language 23(1):89–106.
Pilán, Ildikó, Elena Volodina, and Richard Johansson. 2014. “Rule-Based and Machine Learning Approaches for Second Language Sentence-level Readability.” In Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications, 174–184. Association for Computational Linguistics.
Pitler, Emily, and Ani Nenkova. 2008. “Revisiting Readability: A Unified Framework for Predicting Text Quality.” In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, 186–195. Honolulu: ACL.
Prokopidis, Prokopis, and Harris Papageorgiou. 2014. “Experiments for Dependency Parsing of Greek.” In Proceedings of the First Joint Workshop on Statistical Parsing of Morphologically Rich Languages and Syntactic Analysis of Non-Canonical Languages, 90–96. Dublin, Ireland.
Prokopidis, Prokopis, Byron Georgantopoulos, and Harris Papageorgiou. 2011. “A Suite of NLP Tools for Greek.” In The 10th International Conference of Greek Linguistics, 373–383. Komotini, Greece.
Schwarm, Sarah E., and Mari Ostendorf. 2005. “Reading Level Assessment Using Support Vector Machines and Statistical Language Models.” In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), 523–530. Ann Arbor, Michigan.
Tzimokas, Dimitrios, and Sotirios Tantos. 2014. “Logismiko Anagnosimotitas Ellinikon Keimenon: Ena Neo Ergaleio gia ti Didaskalia tis Ellinikis os Ksenis/Deuteris Glossas kai tin Pistopoiisi Ellinomatheias.” Paper presented at Diethnes Synedrio gia ti Didaskalia kai tin Pistopoiisi tis Ellinikis os Ksenis/Deuteris Glossas, Thessaloniki, October 25. http://speakgreek.gr/el/images/pdf/tzimwkas.pdf
Vajjala, Sowmya, and Detmar Meurers. 2012. “On Improving the Accuracy of Readability Classification Using Insights from Second Language Acquisition.” In Proceedings of the Seventh Workshop on Building Educational Applications Using NLP, 163–173. Association for Computational Linguistics.
Vyatkina, Nina. 2012. “The Development of Second Language Writing Complexity in Groups and Individuals: A Longitudinal Learner Corpus Study.” The Modern Language Journal 96(4):576–598.
Marina vassiliou Angelos Georgaras Prokopis Prokopidis amp Haris Papageorgiou Co-referring or not co-referring Answer the question 1155
Jeroen vis The acquisition of Ancient Greek vocabulary 1171
Christos vlachos Mod(aliti)es of lifting wh-questions 1187
Ευαγγελία Βλάχου amp Κατερίνα Φραντζή Μελέτη της χρήσης των ποσοδεικτών λίγο-λιγάκι σε κείμενα πολιτικού λόγου 1201
Madeleine voga Τι μας διδάσκουν τα ρήματα της ΝΕ σχετικά με την επεξεργασία της μορφολογίας 1213
Werner voigtlaquoΣεληνάκι μου λαμπρό φέγγε μου να περπατώ hellipraquo oder warum es in dem bekannten lied nicht so sondern eben φεγγαράκι heiszligt und ngr φεγγάρι 1227
Μαρία Βραχιονίδου Υποκοριστικά επιρρήματα σε νεοελληνικές διαλέκτους και ιδιώματα 1241
Jeroen van de Weijer amp Marina TzakostaThe Status of Complex in Greek 1259
Theodoros Xioufis The pattern of the metaphor within metonymy in the figurative language of romantic love in modern Greek 1275
FEATURE EXTRACTION AND ANALYSIS IN GREEK | 357
FEATURE EXTRACTION AND ANALYSIS IN GREEK L2 TEXTS IN VIEW OF AUTOMATIC LABELING FOR PROFICIENCY LEVELS
Maria Giagkou¹, Giorgos Fragkakis, Dimitris Pappas¹ & Harris Papageorgiou¹
¹Institute for Language and Speech Processing, RC ATHENA
mgiagkou@ilsp.gr, fragakis@sch.gr, dpappas@ilsp.gr, xaris@ilsp.gr
Abstract
This paper investigates a set of linguistic features of texts addressed to learners of Greek as an L2 and examines the relationship between these features and the proficiency level for which the texts are considered appropriate. The aim is to determine which features discriminate sufficiently between levels, so that they can be exploited in an approach to automatic classification into proficiency levels. To this end, a corpus compiled from Greek L2 textbooks is used. The results highlight the significant effect of, among others, features that quantify the complexity of syntactic dependency trees, the genitive case and adjectival modifiers.
Keywords: L2 reading, text complexity, linguistic features, proficiency levels, automatic labelling
1 Introduction
The last two decades have seen increasing interest in modelling text difficulty, i.e. readability. Automatic readability estimation systems are intended to assess whether a text retrieved from a large collection, such as a repository or the web, is appropriate for a given group of readers according to their abilities in L1, or by taking into account the readers' special needs (e.g. learning difficulties). Readability estimation is particularly relevant for second language (L2) learners as well. From the L2 perspective, the aim is to automatically identify or retrieve a text given the proficiency level of the learner or group of learners.
To this end, recent studies attempt to grade L2 texts according to proficiency levels, in order to facilitate reading in L2 or as an aid to the selection of assessment material (e.g. Centre for the Greek Language 2013; Tzimokas and Tantos 2014; François and Fairon 2012; Ott and Meurers 2010; Pilán et al. 2014; Vajjala and Meurers 2012). In a similar approach, the development of productive skills in L2 (mainly writing) is investigated in view of an automated evaluation of L2 writing (e.g. Lu 2010, 2011; Vyatkina 2012; Giagkou et al. 2015).
The long tradition of L1 readability assessment, dating back to the early 20th century (see DuBay 2006), has bequeathed readability formulas (e.g. Flesch Reading Ease Score, Flesch-Kincaid Grade Level, Fog index, SMOG etc.) that assign a difficulty grade or level to a text by relying on surface linguistic features, such as sentence and word length, as simple proxies for syntactic complexity and vocabulary burden respectively. More recently, advances in NLP have boosted readability research. That is, new resources (electronically available texts) and new tools (taggers, parsers, semantic treebanks etc.) have made it feasible to apply machine learning techniques to large training corpora and to quantify more thorough and linguistically sound text features. Semantic and discourse features are investigated, e.g. named entities (Barzilay & Lapata 2008) and lexical cohesion (Pitler & Nenkova 2008). Shallow syntactic complexity indicators, such as average sentence length, are combined with the height of syntactic trees (see also Heilman et al. 2008). Instead of simple proxies of vocabulary burden, N-gram Language Models (LM) are used for predicting the grade level of texts (Callan and Eskenazi 2007; Petersen & Ostendorf 2009; Schwarm and Ostendorf 2005).
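To illustrate how the traditional formulas mentioned above rely only on surface counts, the following minimal Python sketch computes the Flesch Reading Ease score and the Flesch-Kincaid Grade Level from word, sentence and syllable counts. The coefficients are the published ones for English; they are not calibrated for Greek and play no part in the present study. The function names and the example counts are illustrative assumptions.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease (English coefficients); higher scores mean easier text."""
    avg_sentence_length = words / sentences      # words per sentence
    avg_syllables_per_word = syllables / words   # syllables per word
    return 206.835 - 1.015 * avg_sentence_length - 84.6 * avg_syllables_per_word


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level (English coefficients)."""
    avg_sentence_length = words / sentences
    avg_syllables_per_word = syllables / words
    return 0.39 * avg_sentence_length + 11.8 * avg_syllables_per_word - 15.59


# Illustrative counts: a 100-word passage with 5 sentences and 150 syllables
print(round(flesch_reading_ease(100, 5, 150), 3))
print(round(flesch_kincaid_grade(100, 5, 150), 2))
```

Both scores depend solely on two ratios, which is why such formulas are described as simple proxies for syntactic complexity and vocabulary burden.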
In this paper we present an investigation of linguistic features of texts addressed to learners of Greek as a second language (L2). The goal of this study is to identify the textual properties that indicate the development of reading skills in Greek L2, with the aim of employing these properties as parameters for automatic proficiency level labelling. The set of features investigated in the current study draws on traditional readability research combined with NLP-enabled features and machine learning techniques for text classification, as this merging was found to result in performance gain (François & Miltsakaki 2012).
The paper is organized as follows. Section 2 provides information on the corpus used and on the features identified, selected and computed in order to form the dataset for the analysis. In Section 3, the analysis applied to the features is presented and the results are discussed. We conclude with a summary of the main findings and their implications for future work on automatic proficiency level classification for Greek L2.
2 Datasets
2.1 Corpus
For the purposes of this investigation, a Greek L2 text set that is labelled for proficiency levels in an objective and qualified way, and can thus be considered as gold-standard, was deemed necessary. Such a dataset was retrieved from the Greek L2 textbooks published by the Centre of Intercultural and Migration Studies (EDIAMME) and freely available online. These textbooks are addressed to Greek migrants living abroad, from pre-schoolers (aged 6) to 18-year-olds learning Greek as a second or foreign language. EDIAMME employs five proficiency levels aligned to the Greek educational system grades and to CEFR levels (Council of Europe 2001), as presented in Table 1.
Age | School grade | EDIAMME level | Language content | CEFR level alignment
6 | Preschool | 1 | Pre-reading, reading | A1
7 | 1 | 1 | Pre-reading, reading | A1
8 | 2 | 1 | Pre-reading, reading | A1
9 | 3 | 2 | Speaking and writing consolidation | A2
10 | 4 | 2 | Speaking and writing consolidation | A2
11 | 5 | 3 | Further practice in speaking and writing | B1
12 | 6 | 3 | Further practice in speaking and writing | B1
13 | 7 | 4 | Independent writing | B2 & C1
14 | 8 | 4 | Independent writing | B2 & C1
15 | 9 | 4 | Independent writing | B2 & C1
16 | 10 | 5 | Greek language and literature | C2
17 | 11 | 5 | Greek language and literature | C2
18 | 12 | 5 | Greek language and literature | C2
Table 1 | EDIAMME proficiency levels (Damanakis 2004, 76) and their alignment to CEFR levels (EDIAMME 2014)
Only prose texts were extracted from the textbooks, while poems, lyrics, exercises and guidelines to the exercises were excluded. The selected texts belong to different genres (mainly narrative, descriptive, expository and procedural) and types (letters, announcements, instructions, diary entries etc.). Dialogues were also included, as they are very frequently used as educational material in L2 textbooks, though the role/name of the speaker was removed.
The final corpus employed in this investigation comprises 753 texts and a total of 112169 tokens (Table 2). Each individual text inherited the proficiency level assigned to the textbook it was retrieved from, e.g. a text drawn from a textbook labeled as level 5 was considered as addressed to level 5 learners.¹
Grouped levels | EDIAMME levels | Texts | Sentences | Tokens
1 (CEFR A1-A2) | 1 | 24 | 136 | 720
1 (CEFR A1-A2) | 2 | 295 | 4552 | 33636
2 (CEFR B1-C1) | 3 | 108 | 1263 | 8780
2 (CEFR B1-C1) | 4 | 147 | 2305 | 19272
3 (CEFR C2) | 5 | 179 | 3356 | 49761
Totals | | 753 | 11612 | 112169
Table 2 | Corpus description
¹ It should be noted that this decision introduces a degree of “noise” into the data: although a low level textbook is not expected to include a text addressed to higher levels, the reverse is not equally unlikely. For example, certain texts retrieved from a level 5 textbook may actually address lower level learners.
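The grouping of the five EDIAMME levels into three grouped levels can be made explicit with a short aggregation over the per-level counts of Table 2. The sketch below is illustrative only: the data structure and function name are assumptions, and the per-group mean sentence length is a derived figure that is not reported in the paper.

```python
# Rows copied from Table 2:
# (EDIAMME level, grouped level, texts, sentences, tokens)
TABLE_2 = [
    (1, 1, 24, 136, 720),
    (2, 1, 295, 4552, 33636),
    (3, 2, 108, 1263, 8780),
    (4, 2, 147, 2305, 19272),
    (5, 3, 179, 3356, 49761),
]


def aggregate_by_group(rows):
    """Sum texts, sentences and tokens for each grouped level."""
    totals = {}
    for _level, group, texts, sentences, tokens in rows:
        running = totals.setdefault(group, [0, 0, 0])
        running[0] += texts
        running[1] += sentences
        running[2] += tokens
    return {group: tuple(values) for group, values in totals.items()}


groups = aggregate_by_group(TABLE_2)
for group, (texts, sentences, tokens) in sorted(groups.items()):
    # Last column: mean sentence length in tokens (derived, not from the paper)
    print(group, texts, sentences, tokens, round(tokens / sentences, 2))
```

The per-group sums add up to the totals row of Table 2 (753 texts, 11612 sentences, 112169 tokens), which serves as a quick consistency check on the reconstructed table.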
The texts were automatically annotated for morphological types, syntactic dependencies and phrase structure using the Institute for Language and Speech Processing NLP tools pipeline (Prokopidis et al. 2011; Prokopidis and Papageorgiou 2014).
2.2 Feature selection and computation
The set of features investigated as indices of the proficiency level was selected on the basis of previous research on L1 and L2 readability assessment, as well as on second language acquisition and development. These features capture morphological, syntactic, lexical/semantic and other attributes of the text that are salient to the target proficiency level discrimination and prediction task.
In total, 303 text features were identified and computed. These fall roughly into the following categories:
a) Surface features: word and sentence length (e.g. average word length), number of characters, punctuation marks, numbers etc.
b) Lexical/semantic: lexical density (i.e. content to functional words), lexical variation (e.g. type/token ratio, hapax/dis legomena), including noun and verb variation measures, text entropy, lexical richness etc.
c) Morphological: frequencies and ratios of the different parts of speech, including their forms, e.g. ratio of passive verbs to verbs, ratio of nouns in the genitive case to nouns, ratio of 1st person personal pronouns to pronouns etc.
d) Syntactic: frequencies and ratios of the different syntactic roles (e.g. subjects to verbs ratio), measures of the dependency trees (e.g. depth and height of syntactic trees), phrase structure (e.g. length of noun, verb and adjectival phrases), subordination and apposition (e.g. average number of coordinating and subordinating conjunctions per sentence) etc.
e) Discourse-based features: e.g. use of relative pronouns as an index of the degree of anaphora density, frequency of present and past tenses as indices of temporality and narrativity etc.
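A few of the surface and lexical features listed above can be sketched in a handful of lines. The following is a minimal illustration under stated assumptions, not the FeatExt implementation: the input format (a list of sentences, each a list of word tokens with punctuation already removed), the lowercasing step and the feature names are all hypothetical choices.

```python
from collections import Counter


def surface_and_lexical_features(sentences):
    """Compute a few surface and lexical features.

    `sentences` is a list of sentences, each a list of word tokens
    (punctuation already removed); this input format is an assumption.
    """
    tokens = [tok.lower() for sent in sentences for tok in sent]
    counts = Counter(tokens)
    n_tokens = len(tokens)
    return {
        "avg_sentence_length": n_tokens / len(sentences),
        "avg_word_length": sum(len(t) for t in tokens) / n_tokens,
        "type_token_ratio": len(counts) / n_tokens,
        # share of tokens whose word form occurs exactly once in the text
        "hapax_legomena_ratio": sum(1 for c in counts.values() if c == 1) / n_tokens,
    }


example = [
    ["η", "Μαρία", "διαβάζει", "ένα", "βιβλίο"],
    ["το", "βιβλίο", "είναι", "μεγάλο"],
]
print(surface_and_lexical_features(example))
```

Morphological and syntactic features would additionally require the part-of-speech and dependency annotation produced by the NLP pipeline, which this sketch does not model.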
The defined features were computed with specialized software, the ILSP FeatExt tool, developed in Python. The input of FeatExt is any corpus of Greek texts automatically annotated for part of speech, syntactic dependencies and phrase structure. It calculates the values of raw surface features (frequencies of words, sentences, nouns, verbs etc.) and computes their standardized values (i.e. meaningful ratios). In order to cater for zero values, a MinMaxScaler transformation is applied to all raw features. The output is a table of extracted feature values, preferably in CSV format. Settings can be modified through an optional configuration file to define, among others, the set of features to be computed, the corpus location, or additional feature-relevant data such as a list of words to be counted (e.g. functional words, basic vocabulary for a specific proficiency level or topic etc.).
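The text names a MinMaxScaler transformation without further detail. Assuming this refers to the usual min-max rescaling to the range [0, 1], x' = (x - min) / (max - min), applied per feature, the operation can be sketched in plain Python as follows. The function name, the handling of constant features and the example values are assumptions made for illustration.

```python
def min_max_scale(column):
    """Rescale a list of raw feature values linearly to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Constant feature: no spread to rescale, map every value to 0.0
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


# Raw counts of passive verbs in five hypothetical texts (illustrative values)
passive_counts = [0, 2, 5, 0, 10]
print(min_max_scale(passive_counts))
```

After rescaling, the smallest raw value of a feature maps to 0 and the largest to 1, so features measured on very different scales become directly comparable.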
3 Analysis and results
In order to investigate the underlying associations of text features with the proficiency level, correlation analysis was applied between all the extracted features and the grouped proficiency levels. Table 3 reports the twenty features that exhibited the highest absolute values of Spearman's rho correlation coefficient, in descending order (p<0.05).
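For reference, Spearman's rho is the Pearson correlation of the rank-transformed variables, with tied values receiving the mean of their ranks; ties are unavoidable here because the grouped level takes only three values. A minimal pure-Python sketch follows. The function names and the example data are hypothetical, and the sketch assumes that neither variable is constant.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical feature values (e.g. average sentence length) and grouped levels
feature = [6.1, 7.4, 7.0, 9.8, 8.2, 14.9, 13.5]
level = [1, 1, 2, 2, 2, 3, 3]
print(round(spearman_rho(feature, level), 3))
```

A positive rho indicates that the feature tends to increase with the grouped level, a negative one that it tends to decrease.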
Among the best performing features, the average number of noun phrases in the genitive case per sentence was found to exhibit the highest correlation coefficient (rho=0.542). The association of the genitive case with the text's level is also evidenced by the performance of two more features, i.e. the average number of adjectival phrases in the genitive case per sentence (rho=0.473) and the average length of adjectival phrases in the genitive case (rho=0.448). Looking at these results from a different angle, the influence of phrase structure, especially of the length and relative frequency of nominal phrases, is apparent: out of the 20 best performing features, six are indices of phrase structure (features in ranks 1, 6, 8, 12, 15 and 16 in Table 3). The frequency of use of modifiers, namely of adjectives, also seems to be highly correlated with the proficiency level: the more adjectives used in a text, the more likely it is that the text is addressed to higher level learners. This is evidenced by the average number of adjectival phrases and of adjectives per sentence.
Another important finding is highlighted by the performance of features that attempt to quantify syntactic dependencies. These include the width and height of dependency trees (rho=0.495 and 0.486 respectively), as well as the number of leaves and governor nodes (rho=0.490 and 0.485 respectively). Their emergence in the top ranks of Table 3 qualifies them as key predictors of the proficiency level.
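The paper does not spell out how these tree measures are defined. The sketch below therefore adopts common definitions as assumptions: height is the longest root-to-leaf path counted in arcs, width is the largest number of nodes found at any single depth, leaves are tokens with no dependents, and governor nodes are tokens with at least one dependent. The input encoding (a list of 1-based head indices, with 0 for the root) is likewise an assumed convention.

```python
from collections import Counter


def tree_measures(heads):
    """Measures of one dependency tree.

    `heads[i]` is the 1-based index of the governor of token i + 1,
    with 0 marking the root (a CoNLL-style convention, assumed here).
    """
    children = Counter(h for h in heads if h != 0)

    def depth(token):
        # Number of arcs between the token and the root
        d = 0
        while heads[token - 1] != 0:
            token = heads[token - 1]
            d += 1
        return d

    token_ids = range(1, len(heads) + 1)
    depths = [depth(t) for t in token_ids]
    nodes_per_depth = Counter(depths)
    return {
        "height": max(depths),                    # longest root-to-leaf path
        "width": max(nodes_per_depth.values()),   # most nodes at a single depth
        "leaves": sum(1 for t in token_ids if t not in children),
        "governors": len(children),               # tokens with >= 1 dependent
    }


# "Η Μαρία διαβάζει ένα μεγάλο βιβλίο": the verb (token 3) is the root
heads = [2, 3, 0, 6, 6, 3]
print(tree_measures(heads))
```

Per-text features such as the average width or height of dependency trees would then be obtained by averaging these values over all sentences of a text.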
Table 3 | Top-20 features highly correlated with EDIAMME grouped levels and post hoc multiple comparisons between level-pairs
Rank | Feature | Spearman's rho | EDIAMME grouped level-pairs: 1vs2 | 2vs3 | 1vs3
1 | Av. number of Noun Phrases in gen. case per sentence | 0.542 | | |
2 | Av. width of dependency trees | 0.495 | | |
3 | Av. number of leaves in dependency trees | 0.490 | | |
4 | Av. height of dependency trees | 0.486 | | |
5 | Av. sentence length | 0.485 | | |
6 | Av. number of Adjectival Phrases per sentence | 0.485 | | |
7 | Av. number of governor nodes in dependency trees | 0.485 | | |
8 | Av. number of Noun Phrases per sentence | 0.480 | | |
9 | % of sentences with length > 20 words | 0.477 | | |
10 | Av. number of Adjectives per sentence | 0.474 | | |
11 | Av. word length | 0.474 | | |
12 | Av. number of Adjectival Phrases in gen. case per sentence | 0.473 | | |
13 | % of sentences with length > 10 words | 0.470 | | |
14 | Terminal punctuation to total characters ratio | -0.461 | | |
15 | Av. length of adjectival phrases in gen. case | 0.448 | | |
16 | Av. number of Adjectival Phrases in acc. case per sentence | 0.446 | | |
17 | % of sentences with length > 30 words | 0.443 | | |
18 | Av. number of Passive verbs per sentence | 0.442 | | |
19 | Relative pronouns to Pronouns ratio | 0.439 | | |
20 | Av. number of prepositions per sentence | 0.438 | | |
Different aspects of syntactic complexity are also highlighted by the average number of passive verbs and prepositions per sentence. As expected, passive constructions are rarely used in lower levels, while learners encounter them more and more frequently in textbooks as their reading skills develop. The same is true for prepositions, a feature indicating that higher proficiency level texts employ more complex-compound sentences.
The statistically significant correlation exhibited by the ratio of relative pronouns to pronouns (rho=0.439) signifies the role of anaphora. As anaphora resolution is considered a linguistically and cognitively demanding task during reading, anaphoric structures are rare in lower levels but significantly more frequent in upper levels. As a result, the use of relative pronouns can be considered a successful discriminator of proficiency levels.
The list of the best performing features also includes some more “traditional” indices of text complexity, such as word and sentence length. The average sentence length appears in rank 5 in Table 3 (rho=0.485), while related features that quantify sentence length from a different perspective are also present (the percentage of sentences with more than 10, 20 and 30 words). Additionally, the ratio of terminal punctuation to total characters should be interpreted as an inverse of sentence length. Regarding lexical features, it is noticeable that among the various features investigated (lexical diversity, density etc.) only the average word length is present among the top performers (rho=0.474).
A more thorough investigation of the above features employed one-way ANOVA for means comparison across levels, which resulted in statistically significant main effects for all of the 20 features. Since, however, this type of analysis cannot determine whether the mean values of a feature are statistically different between all possible level pairs, post-hoc multiple comparisons (Bonferroni tests) were also applied. The results are presented in Table 3: statistically different means for each feature are indicated for each level combination separately. These comparisons indicate that all features can successfully discriminate group 3 (i.e. EDIAMME level 5, CEFR C2) from lower levels (both from group 2 and group 1). However, some of the features were not as successful in discriminating group 1 (i.e. EDIAMME levels 1 and 2, CEFR A1, A2) from group 2 (i.e. EDIAMME levels 3, 4, CEFR B1-C1). Poor performers in discriminating group 1 from group 2 were all the features relevant to sentence length, with the exception of the proportion of sentences with more than 20 words. This implies that a group 1 text is unlikely to include lengthier sentences, thus imposing a possible threshold for the transition from CEFR A2 to B1 level.
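The two steps described above can be sketched as follows: the one-way ANOVA F statistic compares between-group with within-group variance for one feature, and the Bonferroni correction divides the significance threshold by the number of pairwise comparisons (three pairs for three grouped levels). This is a simplified illustration only: the sample values are hypothetical, the function names are assumptions, and the conversion of test statistics into p-values, which requires the F and t distributions from a statistics library, is omitted.

```python
from itertools import combinations


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


def bonferroni_alpha(n_groups, alpha=0.05):
    """Per-comparison significance threshold for all pairwise tests."""
    n_pairs = len(list(combinations(range(n_groups), 2)))
    return alpha / n_pairs


# Hypothetical average sentence lengths for texts of the three grouped levels
samples = [[6.0, 7.0, 8.0], [7.0, 8.0, 9.0], [13.0, 15.0, 17.0]]
print(one_way_anova_f(samples))
print(bonferroni_alpha(len(samples)))
```

In this toy example groups 1 and 2 lie close together while group 3 stands apart, which mirrors the pattern reported above: a significant main effect does not by itself show that every pair of levels differs.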
4 Conclusions and discussion
The current investigation highlighted a number of textual features automatically extracted from a morphologically and syntactically annotated Greek L2 corpus. With the aim of identifying indices of text difficulty that are directly associated with the proficiency level, we employed statistical analysis and put forward the best performing features. These can be regarded as potential predictors of the proficiency level of a previously unseen text in an automatic labelling/classification approach.
The results highlight the influence of syntactic features on the characterization of proficiency level: with the exception of average word length, the rest of the best performing features are directly or indirectly related to syntactic complexity. This finding is in line with previous research, where syntax-related features consistently appear in the best-performing prediction models (e.g. Pitler and Nenkova 2008; Schwarm and Ostendorf 2005; Callan and Eskenazi 2007; Kate et al. 2010; Kotani et al. 2008). The frequencies of the genitive case, of adjectives and of prepositions were additionally identified as successful discriminators. Surface features used in traditional readability formulas, such as sentence and word length, were found to be significantly correlated with proficiency levels. Similar recent research in Greek has also highlighted the influence of such surface features on proficiency level classification (Tzimokas and Tantos 2014). It is interesting to notice that some of the features put forward by Georgatou (2016) as the most informative, i.e. sentence length, passive verbs and adjectives, are confirmed by the current study as well, thus qualifying them as reliable indices of the difficulty level of Greek texts.
When the best performing features were tested for their discriminatory power between all possible level pairs, they proved to be highly discriminative of the upper proficiency level. This finding implies a significant shift in L2 reading skills during the transition from C1 to C2 level, and this shift can successfully be measured by the features investigated herein. By contrast, the transition from A2 to B1 seems to go hand in hand with the acquisition of language skills not captured by the features that emerged from the current analysis.
It is true that the current investigation is subject to limitations imposed by the corpus at hand, which comprised texts drawn from textbooks of a single publisher. As such, the findings may be influenced by the publisher's choices regarding the types and topics of texts and the linguistic descriptors of proficiency levels the editor has adopted. To address this limitation, the work described herein is being continued and expanded in order to exploit a larger corpus of Greek L2 texts from different publishers. Proficiency level labelling for this expanded corpus does not rely exclusively on the publisher's labelling. Rather, three independent experts in Greek L2 teaching have judged each text to determine the CEFR proficiency level. The experts' judgements are treated as the dependent variable in a machine learning approach for the automatic labelling of previously unseen texts, which has already yielded significant results.
Reading comprehension is a key skill in L2 development, and reading is an integral part of L2 instruction and assessment. In this view, an automated approach to matching L2 learners to texts suitable for their proficiency level is expected to facilitate selection of reading material, both for learners and teachers. It is at the same time an anticipated aid in assessment procedures, by providing an objective measurement for the estimation of level-appropriateness of items included in diagnostic, placement or achievement language tests.
References
Barzilay, Regina, and Mirella Lapata. 2008. “Modeling Local Coherence: An Entity-based Approach.” Computational Linguistics 34(1):1–34.
Centre for the Greek Language. 2013. “Logismiko Anagnosimotitas.” Accessed March 1, 2017. httpwwwgreek-languagegrcertificationreadability
Council of Europe. 2001. Common European Framework of Reference for Languages: Learning, Teaching, Assessment (CEFR). wwwcoeintlang-CEFR
Damanakis, Michalis, ed. 2004. Theoritiko Plaisio kai Programmata Spoudon gia tin Ellinoglossi Ekpaideusi sti Diaspora. Rethymno: EDIAMME. httpwwwediammeedcuocgrdiaspora2indexphpid=23650010
DuBay, William H. 2006. The Classic Readability Studies. Impact Information, Costa Mesa, California.
EDIAMME. 2014. Epipeda Glossomatheias kai Ekpaideutiko Yliko. httpwwwediammeedcuocgrellinoglossiindexphpelekp-yliko-kepa
François, Thomas, and Cédrick Fairon. 2012. “An ‘AI Readability’ Formula for French as a Foreign Language.” In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 466–477. Association for Computational Linguistics.
François, Thomas, and Eleni Miltsakaki. 2012. “Do NLP and Machine Learning Improve Traditional Readability Formulas?” In Proceedings of the First Workshop on Predicting and Improving Text Readability for Target Reader Populations, 49–57. Montréal.
Georgatou, Spyridoula. 2016. “Approaching Readability Features in Greek School Books.” Master thesis, University of Tübingen.
Giagkou, Maria, Kantzou, Vicky, Stamouli, Spyridoula, and Maria Tzevelekou. 2015. “Discriminating CEFR Levels in Greek L2: A Corpus-based Study of Young Learners’ Written Narratives.” Bergen Language and Linguistics Studies 6:153–169.
Heilman, Michael, Kevyn Collins-Thompson, and Maxine Eskenazi. 2008. “An Analysis of Statistical Models and Features for Reading Difficulty Prediction.” In The Third Workshop on Innovative Use of NLP for Building Educational Applications: Proceedings of the Workshop, 71–79. ACL.
Callan, Jamie, and Maxine Eskenazi. 2007. “Combining Lexical and Grammatical Features to Improve Readability Measures for First and Second Language Texts.” In Proceedings of HLT-NAACL’07, 460–467. Association for Computational Linguistics.
Lu, Xiaofei. 2010. “Automatic Analysis of Syntactic Complexity in Second Language Writing.” International Journal of Corpus Linguistics 15(4):474–496.
Lu, Xiaofei. 2011. “A Corpus-Based Evaluation of Syntactic Complexity Measures as Indices of College-level ESL Writers’ Language Development.” TESOL Quarterly 45(1):36–62.
Ott, Niels, and Detmar Meurers. 2010. “Information Retrieval for Education: Making Search Engines Language Aware.” Themes in Science and Technology Education 3(1–2):9–30.
Petersen, Sarah E., and Mari Ostendorf. 2009. “A Machine Learning Approach to Reading Level Assessment.” Computer Speech and Language 23(1):89–106.
Pilán, Ildikó, Volodina, Elena, and Richard Johansson. 2014. “Rule-Based and Machine Learning Approaches for Second Language Sentence-level Readability.” In Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications, 174–184. Association for Computational Linguistics.
Pitler, Emily, and Ani Nenkova. 2008. “Revisiting Readability: A Unified Framework for Predicting Text Quality.” In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, 186–195. Honolulu: ACL.
Prokopidis, Prokopis, and Harris Papageorgiou. 2014. “Experiments for Dependency Parsing of Greek.” In Proceedings of the First Joint Workshop on Statistical Parsing of Morphologically Rich Languages and Syntactic Analysis of Non-Canonical Languages, 90–96. Dublin, Ireland.
Prokopidis, Prokopis, Georgantopoulos, Byron, and Harris Papageorgiou. 2011. “A Suite of NLP Tools for Greek.” In The 10th International Conference of Greek Linguistics, 373–383. Komotini, Greece.
Schwarm, Sarah E., and Mari Ostendorf. 2005. “Reading Level Assessment Using Support Vector Machines and Statistical Language Models.” In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), 523–530. Ann Arbor, Michigan.
Tzimokas, Dimitrios, and Sotirios Tantos. 2014. “Logismiko Anagnosimotitas Ellinikon Keimenon: Ena Neo Ergaleio gia ti Didaskalia tis Ellinikis os Ksenis/Deuteris Glossas kai tin Pistopoiisi Ellinomatheias.” Paper presented at Diethnes Synedrio gia ti Didaskalia kai tin Pistopoiisi tis Ellinikis os Ksenis/Deuteris Glossas, Thessaloniki, October 25. httpspeakgreekgrelimagespdftzimwkaspdf
Vajjala, Sowmya, and Detmar Meurers. 2012. “On Improving the Accuracy of Readability Classification Using Insights from Second Language Acquisition.” In Proceedings of the Seventh Workshop on Building Educational Applications Using NLP, 163–173. Association for Computational Linguistics.
Vyatkina, Nina. 2012. “The Development of Second Language Writing Complexity in Groups and Individuals: A Longitudinal Learner Corpus Study.” The Modern Language Journal 96(4):576–598.
Γεωργία Ανδρέου amp Ματίνα ΤασιούδηΗ ανάπτυξη του λεξιλογίου σε παιδιά με Σύνδρομο Απνοιών στον Ύπνο 175
Ανθούλα- Ελευθερία Ανδρεσάκη Ιατρικές μεταφορές στον δημοσιογραφικό λόγο της κρίσης Η οπτική γωνία των Γερμανών 187
Μαρία ΑνδριάΠροσεγγίζοντας θέματα Διαγλωσσικής Επίδρασης μέσα από το πλαίσιο της Γνωσιακής Γλωσσολογίας ένα παράδειγμα από την κατάκτηση της Ελληνικής ως Γ2 199
Spyros Armostis amp Kakia PetinouMastering word-initial syllable onsets by Cypriot Greek toddlers with and without early language delay 215
Julia Bacskai-AtkariAmbiguity and the Internal Structure of Comparative Complements in Greek 231
Costas CanakisTalking about same-sex parenthood in contemporary Greece Dynamic categorization and indexicality 243
Michael ChiouThe pragmatics of future tense in Greek 257
Maria Chondrogianni The Pragmatics of the Modern Greek Segmental Μarkers 269
Katerina Christopoulou George J Xydopoulos ampAnastasios TsangalidisGrammatical gender and offensiveness in Modern Greek slang vocabulary 291
Aggeliki Fotopoulou vasiliki Foufi Tita Kyriacopoulou amp Claude Martineau Extraction of complex text segments in Modern Greek 307
Aγγελική Φωτοπούλου amp Βούλα ΓιούληΑπό την laquoΈκφρασηraquo στο laquoΠολύτροποraquo σχεδιασμός και οργάνωση ενός εννοιολογικού λεξικού 327
Marianthi Georgalidou Sofia lampropoulou Maria Gasouka Apostolos Kostas amp Xan-thippi FoulidildquoLearn grammarrdquo Sexist language and ideology in a corpus of Greek Public Documents 341
Maria Giagkou Giorgos Fragkakis Dimitris Pappas amp Harris PapageorgiouFeature extraction and analysis in Greek L2 texts in view of automatic labeling for proficiency levels 357
Dionysis Goutsos Georgia Fragaki Irene Florou vasiliki Kakousi amp Paraskevi SavvidouThe Diachronic Corpus of Greek of the 20th century Design and compilation 369
Kleanthes K Grohmann amp Maria KambanarosBilectalism Comparative Bilingualism and theGradience of Multilingualism A View from Cyprus 383
Guumlnther S Henrich bdquoΓεωγραφία νεωτερικήldquo στο Λίβιστρος και Ροδάμνη μετατόπιση ονομάτων βαλτικών χωρών προς την Ανατολή 397
Noriyo Hoozawa-Arkenau amp Christos KarvounisVergleichende Diglossie - Aspekte im Japanischen und Neugriechischen Verietaumlten - Interferenz 405
Μαρία Ιακώβου Ηριάννα Βασιλειάδη-Λιναρδάκη Φλώρα Βλάχου Όλγα Δήμα Μαρία Καββαδία Τατιάνα Κατσίνα Μαρίνα Κουτσουμπού Σοφία-Νεφέλη Κύτρου χριστίνα Κωστάκου Φρόσω Παππά amp Σταυριαλένα ΠερρέαΣΕΠΑΜΕ2 Μια καινούρια πηγή αναφοράς για την Ελληνική ως Γ2 419
Μαρία Ιακώβου amp Θωμαΐς ΡουσουλιώτηΒασικές αρχές σχεδιασμού και ανάπτυξης του νέου μοντέλου αναλυτικών προγραμμάτων για τη διδασκαλία της Eλληνικής ως δεύτερηςξένης γλώσσας 433
Μαρία Καμηλάκη laquoΜαζί μου ασχολείσαι πόσο μαλάκας είσαιraquo Λέξεις-ταμπού και κοινωνιογλωσσικές ταυτότητες στο σύγχρονο ελληνόφωνο τραγούδι 449
Μαρία Καμηλάκη Γεωργία Κατσούδα amp Μαρία Βραχιονίδου Η εννοιολογική μεταφορά σε λέξεις-ταμπού της ΝΕΚ και των νεοελληνικών διαλέκτων 465
Eleni Karantzola Georgios Mikros amp Anastassios Papaioannou Lexico-grammatical variation and stylometric profile of autograph texts in Early Modern Greek 479
Sviatlana Karpava Maria Kambanaros amp Kleanthes K GrohmannNarrative Abilities MAINing RussianndashGreek Bilingual Children in Cyprus 493
χρήστος Καρβούνης Γλωσσικός εξαρχαϊσμός και laquoιδεολογικήraquo νόρμα Ζητήματα γλωσσικής διαχείρισης στη νέα ελληνική 507
Demetra Katis amp Kiki Nikiforidou Spatial prepositions in early child GreekImplications for acquisition polysemy and historical change 525
Γεωργία Κατσούδα Το επίθημα -ούνα στη ΝΕΚ και στις νεοελληνικές διαλέκτους και ιδιώματα 539
George Kotzoglou Sub-extraction from subjects in Greek Its existence its locus and an open issue 555
Veranna Kyprioti Narrative identity and age the case of the bilingual in Greek and Turkish Muslim community of Rhodes Greece 571
Χριστίνα Λύκου Η Ελλάδα στην Ευρώπη της κρίσης Αναπαραστάσεις στον ελληνικό δημοσιογραφικό λόγο 583
Nikos Liosis Systems in disruption Propontis Tsakonian 599
Katerina Magdou & Sam Featherston Resumptive Pronouns can be more acceptable than gaps Experimental evidence from Greek 613
Maria Margarita Makri Opos identity comparatives in Greek an experimental investigation 629
Volume 2
Contents 651
Vasiliki Makri Gender assignment to Romance loans in Katoitaliótika a case study of contact morphology 659
Evgenia Malikouti Usage Labels of Turkish Loanwords in three Modern Greek Dictionaries 675
Persephone Mamoukari & Penelope Kambakis-Vougiouklis Frequency and Effectiveness of Strategy Use in SILL questionnaire using an Innovative Electronic Application 693
Georgia Maniati, Voula Gotsoulia & Stella Markantonatou Contrasting the Conceptual Lexicon of ILSP (CL-ILSP) with major lexicographic examples 709
Γεώργιος Μαρκόπουλος & Αθανάσιος Καρασίμος Πολυεπίπεδη επισημείωση του Ελληνικού Σώματος Κειμένων Αφασικού Λόγου 725
Πωλίνα Μεσηνιώτη, Κατερίνα Πούλιου & Χριστόφορος Σουγανίδης Μορφοσυντακτικά λάθη μαθητών Τάξεων Υποδοχής που διδάσκονται την Ελληνική ως Γ2 741
Stamatia Michalopoulou Third Language Acquisition The Pro-Drop-Parameter in the Interlanguage of Greek students of German 759
Vicky Nanousi & Arhonto Terzi Non-canonical sentences in agrammatism the case of Greek passives 773
Καλομοίρα Νικολού, Μαρία Ξεφτέρη & Νίτσα Παραχεράκη Το φαινόμενο της σύνθεσης λέξεων στην κυκλαδοκρητική διαλεκτική ομάδα 789
Ελένη Παπαδάμου & Δώρης Κ. Κυριαζής Μορφές διαβαθμιστικής αναδίπλωσης στην ελληνική και στις άλλες βαλκανικές γλώσσες 807
Γεράσιμος Σοφοκλής Παπαδόπουλος Το δίπολο «Εμείς και οι Άλλοι» σε σχόλια αναγνωστών της Lifo σχετικά με τη Χρυσή Αυγή 823
Ελένη Παπαδοπούλου Η συνδυαστικότητα υποκοριστικών επιθημάτων με β συνθετικό το επίθημα -άκι στον διαλεκτικό λόγο 839
Στέλιος Πιπερίδης, Πένυ Λαμπροπούλου & Μαρία Γαβριηλίδου clarinel Υποδομή τεκμηρίωσης διαμοιρασμού και επεξεργασίας γλωσσικών δεδομένων 851
Maria Pontiki Opinion Mining and Target Extraction in Greek Review Texts 871
Anna Roussou The duality of mipos 885
Stathis Selimis & Demetra Katis Reference to static space in Greek A cross-linguistic and developmental perspective of poster descriptions 897
Evi Sifaki & George Tsoulas XP-V orders in Greek 911
Konstantinos Sipitanos On desiderative constructions in Naousa dialect 923
Eleni Staraki Future in Greek A Degree Expression 935
Χριστίνα Τακούδα & Ευανθία Παπαευθυμίου Συγκριτικές διδακτικές πρακτικές στη διδασκαλία της ελληνικής ως Γ2 από την κριτική παρατήρηση στην αναπλαισίωση 945
Alexandros Tantos, Giorgos Chatziioannidis, Katerina Lykou, Meropi Papatheohari, Antonia Samara & Kostas Vlachos Corpus C58 and the interface between intra- and inter-sentential linguistic information 961
Arhonto Terzi & Vina Tsakali The contribution of Greek SE in the development of locatives 977
Paraskevi Thomou Conceptual and lexical aspects influencing metaphor realization in Modern Greek 993
Nina Topintzi & Stuart Davis Features and Asymmetries of Edge Geminates 1007
Liana Tronci At the lexicon-syntax interface Ancient Greek constructions with ἔχειν and psychological nouns 1021
Βίλλυ Τσάκωνα «Δημοκρατία είναι 4 λύκοι και 1 πρόβατο να ψηφίζουν για φαγητό» Αναλύοντας τα ανέκδοτα για τους/τις πολιτικούς στην οικονομική κρίση 1035
Ειρήνη Τσαμαδού-Jacoberger & Μαρία Ζέρβα Εκμάθηση ελληνικών στο Πανεπιστήμιο Στρασβούργου κίνητρα και αναπαραστάσεις 1051
Stavroula Tsiplakou & Spyros Armostis Do dialect variants (mis)behave Evidence from the Cypriot Greek koine 1065
Αγγελική Τσόκογλου & Σύλα Κλειδή Συζητώντας τις δομές σε -οντας 1077
Αλεξιάννα Τσότσου Η μεθοδολογική προσέγγιση της εικόνας της Γερμανίας στις ελληνικές εφημερίδες 1095
Anastasia Tzilinis Begründendes Handeln im neugriechischen Wissenschaftlichen Artikel Die Situierung des eigenen Beitrags im Forschungszusammenhang 1109
Κυριακούλα Τζωρτζάτου, Αργύρης Αρχάκης, Άννα Ιορδανίδου & Γιώργος Ι. Ξυδόπουλος Στάσεις απέναντι στην ορθογραφία της Κοινής Νέας Ελληνικής Ζητήματα ερευνητικού σχεδιασμού 1123
Nicole Vassalou, Dimitris Papazachariou & Mark Janse The Vowel System of Mišótika Cappadocian 1139
Marina Vassiliou, Angelos Georgaras, Prokopis Prokopidis & Haris Papageorgiou Co-referring or not co-referring Answer the question 1155
Jeroen Vis The acquisition of Ancient Greek vocabulary 1171
Christos Vlachos Mod(aliti)es of lifting wh-questions 1187
Ευαγγελία Βλάχου & Κατερίνα Φραντζή Μελέτη της χρήσης των ποσοδεικτών λίγο-λιγάκι σε κείμενα πολιτικού λόγου 1201
Madeleine Voga Τι μας διδάσκουν τα ρήματα της ΝΕ σχετικά με την επεξεργασία της μορφολογίας 1213
Werner Voigt «Σεληνάκι μου λαμπρό φέγγε μου να περπατώ …» oder warum es in dem bekannten Lied nicht so sondern eben φεγγαράκι heißt und ngr φεγγάρι 1227
Μαρία Βραχιονίδου Υποκοριστικά επιρρήματα σε νεοελληνικές διαλέκτους και ιδιώματα 1241
Jeroen van de Weijer & Marina Tzakosta The Status of Complex in Greek 1259
Theodoros Xioufis The pattern of the metaphor within metonymy in the figurative language of romantic love in modern Greek 1275
FEATURE EXTRACTION AND ANALYSIS IN GREEK | 357
FEATURE EXTRACTION AND ANALYSIS IN GREEK L2 TEXTS IN VIEW OF AUTOMATIC LABELING FOR PROFICIENCY LEVELS
Maria Giagkou1, Giorgos Fragkakis, Dimitris Pappas1 & Harris Papageorgiou1
1 Institute for Language and Speech Processing, RC ATHENA
mgiagkou@ilsp.gr, fragakis@sch.gr, dpappas@ilsp.gr, xaris@ilsp.gr
Abstract
This article investigates a set of linguistic features of texts addressed to learners of Greek as an L2 and examines the relationship between these features and the proficiency level for which the texts are considered appropriate. The aim is to determine which features discriminate sufficiently between levels so that they can be used in an approach to automatic classification by proficiency level. To this end, a corpus compiled from Greek L2 textbooks is used. The results highlight the significant effect of, among others, features that quantify the complexity of syntactic dependency trees, the genitive case, and adjectival modifiers.
Keywords: L2 reading, text complexity, linguistic features, proficiency levels, automatic labelling
1 Introduction
The last two decades have seen increasing interest in modelling text difficulty, i.e. readability. Automatic readability estimation systems are intended to assess whether a text retrieved from a large collection, such as a repository or the web, is appropriate for a given group of readers according to their abilities in L1 or by taking into account the readers' special needs (e.g. learning difficulties). Readability estimation is particularly relevant for second language (L2) learners as well. From the L2 perspective, the aim is to automatically identify or retrieve a text given the proficiency level of the learner or group of learners.
To this end, recent studies attempt to grade L2 texts according to proficiency levels in order to facilitate reading in L2 or as an aid to the selection of assessment material (e.g. Centre for the Greek Language 2013; Tzimokas and Tantos 2014; François and Fairon 2012; Ott and Meurers 2010; Pilán et al. 2014; Vajjala and Meurers 2012). In a similar approach, the development of productive skills in L2 (mainly writing) is investigated in view of an automated evaluation of L2 writing (e.g. Lu 2010, 2011; Vyatkina 2012; Giagkou et al. 2015).
The long tradition of L1 readability assessment, dating back to the early 20th century (see DuBay 2006), has bequeathed readability formulas (e.g. Flesch Reading Ease Score, Flesch-Kincaid Grade Level, Fog index, SMOG, etc.) that assign a difficulty grade or level to a text by relying on surface linguistic features, such as sentence and word length, as simple proxies for syntactic complexity and vocabulary burden respectively. More recently, advances in NLP have boosted readability research. New resources (electronically available texts) and new tools (taggers, parsers, semantic treebanks, etc.) have made it feasible to apply machine learning techniques to large training corpora and to quantify more thorough and linguistically sound text features. Semantic and discourse features are investigated, e.g. named entities (Barzilay and Lapata 2008) and lexical cohesion (Pitler and Nenkova 2008). Shallow syntactic complexity indicators, such as average sentence length, are combined with the height of syntactic trees (see also Heilman et al. 2008). Instead of simple proxies of vocabulary burden, n-gram Language Models (LMs) are used for predicting the grade level of texts (Callan and Eskenazi 2007; Petersen and Ostendorf 2009; Schwarm and Ostendorf 2005).
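As an illustration of the traditional formulas mentioned above, the following minimal Python sketch computes the Flesch Reading Ease Score and the Flesch-Kincaid Grade Level from raw counts. The coefficients are the standard published ones for English; they are not calibrated for Greek and are not part of the feature set of this study. The function and parameter names are illustrative, and syllable counting is assumed to be done elsewhere.

```python
def flesch_reading_ease(n_words: int, n_sentences: int, n_syllables: int) -> float:
    """Flesch Reading Ease: higher scores indicate easier text (English coefficients)."""
    words_per_sentence = n_words / n_sentences      # proxy for syntactic complexity
    syllables_per_word = n_syllables / n_words      # proxy for vocabulary burden
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


def flesch_kincaid_grade(n_words: int, n_sentences: int, n_syllables: int) -> float:
    """Flesch-Kincaid Grade Level: approximate US school grade (English coefficients)."""
    words_per_sentence = n_words / n_sentences
    syllables_per_word = n_syllables / n_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
```

Both formulas rely on the same two surface ratios, which is why they are described above as simple proxies.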
In this paper we present an investigation of linguistic features of texts addressed to learners of Greek as a second language (L2). The goal of this study is to identify the textual properties that indicate the development of reading skills in Greek L2, with the aim of employing these properties as parameters for automatic proficiency level labelling. The set of features investigated in the current study draws on traditional readability research combined with NLP-enabled features and machine learning techniques for text classification, as this merging was found to result in a performance gain (François and Miltsakaki 2012).
The paper is organized as follows. Section 2 provides information on the corpus used and the features identified, selected and computed in order to form the dataset for the analysis. In Section 3 the analysis applied to the features is presented and the results are discussed. We conclude with a summary of the main findings and their implications for future work on automatic proficiency level classification for Greek L2.
2 Datasets
2.1 Corpus
For the purposes of this investigation, a Greek L2 text set that is labelled for proficiency levels in an objective and qualified way, and can thus be considered a gold standard, was deemed necessary. Such a dataset was retrieved from the Greek L2 textbooks published by the Centre of Intercultural and Migration Studies (EDIAMME) and freely available online. These textbooks are addressed to Greek migrants living abroad, from preschoolers (aged 6) to 18-year-olds, learning Greek as a second or foreign language. EDIAMME employs five proficiency levels aligned to the Greek educational system grades and to CEFR levels (Council of Europe 2001), as presented in Table 1.
Age | School grade | EDIAMME level | Language content | CEFR level alignment
6-8 | Preschool, 1-2 | 1 | Pre-reading, reading | A1
9-10 | 3-4 | 2 | Speaking and writing consolidation | A2
11-12 | 5-6 | 3 | Further practice in speaking and writing | B1
13-15 | 7-9 | 4 | Independent writing | B2 & C1
16-18 | 10-12 | 5 | Greek language and literature | C2
Table 1 | EDIAMME proficiency levels (Damanakis 2004: 76) and their alignment to CEFR levels (EDIAMME 2014)
Only prose texts were extracted from the textbooks, while poems, lyrics, exercises and guidelines to the exercises were excluded. The selected texts belong to different genres (mainly narrative, descriptive, expository and procedural) and types (letters, announcements, instructions, diary entries, etc.). Dialogues were also included, as they are very frequently used as educational material in L2 textbooks, though the role/name of the speaker was removed.
The final corpus employed in this investigation comprises 753 texts and a total of 112,169 tokens (Table 2). Each individual text inherited the proficiency level assigned to the textbook it was retrieved from, e.g. a text drawn from a textbook labelled as level 5 was considered as addressed to level 5 learners.1
Grouped levels | EDIAMME levels | Texts | Sentences | Tokens
1 (CEFR A1-A2) | 1 | 24 | 136 | 720
1 (CEFR A1-A2) | 2 | 295 | 4552 | 33636
2 (CEFR B1-C1) | 3 | 108 | 1263 | 8780
2 (CEFR B1-C1) | 4 | 147 | 2305 | 19272
3 (CEFR C2) | 5 | 179 | 3356 | 49761
Totals | | 753 | 11612 | 112169
Table 2 | Corpus description
1 It should be noted that this decision introduces a degree of “noise” into the data: although a low-level textbook is not expected to include a text addressed to higher levels, the reverse is not equally unlikely. For example, certain texts retrieved from a level 5 textbook may actually address lower-level learners.
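The label inheritance and the grouping of EDIAMME levels described in this section can be sketched as follows. The mapping and the counts are copied from Table 2; the names GROUPED_LEVEL, CORPUS, label_text and totals_by_group are hypothetical and do not come from the authors' code.

```python
# EDIAMME level -> grouped level, following Table 2.
GROUPED_LEVEL = {1: 1, 2: 1, 3: 2, 4: 2, 5: 3}

# Per-level counts copied from Table 2: (texts, sentences, tokens).
CORPUS = {
    1: (24, 136, 720),
    2: (295, 4552, 33636),
    3: (108, 1263, 8780),
    4: (147, 2305, 19272),
    5: (179, 3356, 49761),
}


def label_text(textbook_level: int) -> int:
    """A text inherits its textbook's EDIAMME level, mapped to a grouped level."""
    return GROUPED_LEVEL[textbook_level]


def totals_by_group(corpus):
    """Sum texts, sentences and tokens within each grouped level."""
    totals = {}
    for level, counts in corpus.items():
        group = label_text(level)
        previous = totals.get(group, (0, 0, 0))
        totals[group] = tuple(p + c for p, c in zip(previous, counts))
    return totals
```

Summing the per-level rows this way reproduces the totals row of Table 2 and shows that the three grouped levels differ considerably in size.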
The texts were automatically annotated for morphological types, syntactic dependencies and phrase structure using the Institute for Language and Speech Processing NLP tools pipeline (Prokopidis et al. 2011; Prokopidis and Papageorgiou 2014).
2.2 Feature selection and computation
The set of features investigated as indices of the proficiency level was selected on the basis of previous research on L1 and L2 readability assessment, as well as on second language acquisition and development. These features capture morphological, syntactic, lexical/semantic and other attributes of the text that are salient to the target proficiency level discrimination and prediction task.
In total, 303 text features were identified and computed. These fall roughly into the following categories:
a) Surface features: word and sentence length (e.g. average word length), number of characters, punctuation marks, numbers, etc.
b) Lexical/semantic: lexical density (i.e. content to functional words), lexical variation (e.g. type/token ratio, hapax/dis legomena), including noun and verb variation measures, text entropy, lexical richness, etc.
c) Morphological: frequencies and ratios of the different parts of speech, including their forms, e.g. ratio of passive verbs to verbs, ratio of nouns in the genitive case to nouns, ratio of 1st person personal pronouns to pronouns, etc.
d) Syntactic: frequencies and ratios of the different syntactic roles (e.g. subjects to verbs ratio), measures of the dependency trees (e.g. depth and height of syntactic trees), phrase structure (e.g. length of noun, verb and adjectival phrases), subordination and apposition (e.g. average number of coordinating and subordinating conjunctions per sentence), etc.
e) Discourse-based features: e.g. use of relative pronouns as an index of the degree of anaphora density, frequency of present and past tenses as indices of temporality and narrativity, etc.
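To make a few of the categories above concrete, the sketch below computes some surface and lexical/semantic features from a tokenised, POS-tagged text. It is not the FeatExt implementation: the input format, the tag names (NOUN, VERB, ADJ, ADV, PUNCT) and the function name are assumptions made for illustration only.

```python
CONTENT_TAGS = {"NOUN", "VERB", "ADJ", "ADV"}  # assumed tag names, not the ILSP tagset


def surface_and_lexical_features(sentences, long_threshold=20):
    """sentences: list of sentences, each a list of (token, pos_tag) pairs."""
    # Punctuation is excluded from all word-based counts.
    words = [(tok.lower(), tag) for sent in sentences for tok, tag in sent if tag != "PUNCT"]
    n_words = len(words)
    sent_lengths = [sum(1 for _, tag in sent if tag != "PUNCT") for sent in sentences]
    n_content = sum(1 for _, tag in words if tag in CONTENT_TAGS)
    n_function = n_words - n_content
    n_long = sum(1 for n in sent_lengths if n > long_threshold)
    return {
        "avg_word_length": sum(len(tok) for tok, _ in words) / n_words,
        "avg_sentence_length": n_words / len(sentences),
        "type_token_ratio": len({tok for tok, _ in words}) / n_words,
        # Lexical density as the ratio of content to function words.
        "content_to_function_ratio": n_content / n_function if n_function else float("inf"),
        "pct_long_sentences": 100.0 * n_long / len(sentences),
    }
```

Each value is a ratio rather than a raw count, so texts of different lengths remain comparable.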
The defined features were computed with specialized software, the ILSP FeatExt tool, developed in Python. The input of FeatExt is any corpus of Greek texts automatically annotated for part of speech, syntactic dependencies and phrase structure. It calculates the values of raw surface features (frequencies of words, sentences, nouns, verbs, etc.) and computes their standardized values (i.e. meaningful ratios). In order to cater for zero values, a MinMaxScaler transformation is applied to all raw features. The output is a table of extracted feature values, preferably in CSV format. Settings can be modified through an optional configuration file to define, among others, the set of features to be computed, the corpus location, or additional feature-relevant data such as a list of words to be counted (e.g. functional words, basic vocabulary for a specific proficiency level or topic, etc.).
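The scaling step mentioned above can be illustrated with a plain-Python min-max rescaling to the interval [0, 1]. This is a sketch of the general technique rather than of FeatExt itself; the function name and the treatment of constant columns (mapped to 0.0) are assumptions.

```python
def min_max_scale(column):
    """Rescale a list of numbers linearly to the interval [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Constant feature: there is no spread to rescale (assumed convention).
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]
```

For example, raw counts of 0, 5 and 10 become 0.0, 0.5 and 1.0, so features measured on very different scales end up in a common range.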
3 Analysis and results
In order to investigate the underlying associations of text features with the proficiency level, correlation analysis was applied between all the extracted features and the grouped proficiency levels. Table 3 reports the twenty features that exhibited the highest absolute values of Spearman's rho correlation coefficient, in descending order (p < 0.05).
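A minimal, dependency-free sketch of this ranking step is given below: Spearman's rho is computed as the Pearson correlation of average ranks, and features are sorted by the absolute value of rho. The significance filter (p < 0.05) is omitted, and all function names are hypothetical; in practice a statistics library would normally be used.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def rank_features(features, levels, top_k=20):
    """features: dict mapping feature name -> list of values, one per text."""
    scored = [(name, spearman_rho(vals, levels)) for name, vals in features.items()]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)[:top_k]
```

Sorting on the absolute value keeps strongly negative correlations, such as the terminal punctuation ratio in Table 3, among the top-ranked features.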
Among the best performing features, the average number of noun phrases in the genitive case per sentence was found to exhibit the highest correlation coefficient (rho=0.542). The association of the genitive case with the text's level is also evidenced by the performance of two more features, i.e. the average number of adjectival phrases in the genitive case per sentence (rho=0.473) and the average length of adjectival phrases in the genitive case (rho=0.448). Looking at these results from a different angle, the influence of phrase structure, especially of the length and relative frequency of nominal phrases, is apparent: out of the 20 best performing features, six are indices of phrase structure (features in ranks 1, 6, 8, 12, 15 and 16 in Table 3). The frequency of use of modifiers, namely of adjectives, also seems to be highly correlated with the proficiency level: the more adjectives used in a text, the more likely it is that the text is addressed to higher-level learners. This is evidenced by the average number of adjectival phrases and of adjectives per sentence.
Another important finding is highlighted by the performance of features that attempt to quantify syntactic dependencies. These include the width and height of dependency trees (rho=0.495 and 0.486, respectively) as well as the number of leaves and governor nodes (rho=0.490 and 0.485, respectively). Their emergence in the top ranks of Table 3 qualifies them as key predictors of the proficiency level.
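The tree-based measures can be illustrated on a dependency tree encoded as a list of head indices. The exact definitions used by FeatExt are not given in the text, so the sketch assumes the following: height is the longest path from the root in arcs, width is the largest number of nodes at the same depth, leaves are nodes without dependents, and governor nodes are nodes with at least one dependent. The input is assumed to be a well-formed tree with a single root.

```python
from collections import Counter


def tree_measures(heads):
    """heads[i] is the 1-based index of the governor of token i+1; 0 marks the root."""
    n = len(heads)

    def depth(node):
        # Number of arcs between the node and the root.
        d = 0
        while heads[node - 1] != 0:
            node = heads[node - 1]
            d += 1
        return d

    depths = [depth(node) for node in range(1, n + 1)]
    governors = {h for h in heads if h != 0}   # tokens with at least one dependent
    return {
        "height": max(depths),
        "width": max(Counter(depths).values()),
        "leaves": n - len(governors),
        "governors": len(governors),
    }
```

For a four-token sentence with heads [2, 3, 0, 3] (the third token is the root), the sketch yields a height of 2, a width of 2, two leaves and two governor nodes. Averaging these values over the sentences of a text gives per-text features of the kind listed in Table 3.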
Table 3 | Top-20 features highly correlated with EDIAMME grouped levels and post hoc multiple comparisons between level-pairs
Rank | Feature | Spearman's rho | EDIAMME grouped level-pair 1 vs 2 | 2 vs 3 | 1 vs 3
1 | Av. number of noun phrases in gen. case per sentence | 0.542 | | |
2 | Av. width of dependency trees | 0.495 | | |
3 | Av. number of leaves in dependency trees | 0.490 | | |
4 | Av. height of dependency trees | 0.486 | | |
5 | Av. sentence length | 0.485 | | |
6 | Av. number of adjectival phrases per sentence | 0.485 | | |
7 | Av. number of governor nodes in dependency trees | 0.485 | | |
8 | Av. number of noun phrases per sentence | 0.480 | | |
9 | % of sentences with length > 20 words | 0.477 | | |
10 | Av. number of adjectives per sentence | 0.474 | | |
11 | Av. word length | 0.474 | | |
12 | Av. number of adjectival phrases in gen. case per sentence | 0.473 | | |
13 | % of sentences with length > 10 words | 0.470 | | |
14 | Terminal punctuation to total characters ratio | -0.461 | | |
15 | Av. length of adjectival phrases in gen. case | 0.448 | | |
16 | Av. number of adjectival phrases in acc. case per sentence | 0.446 | | |
17 | % of sentences with length > 30 words | 0.443 | | |
18 | Av. number of passive verbs per sentence | 0.442 | | |
19 | Relative pronouns to pronouns ratio | 0.439 | | |
20 | Av. number of prepositions per sentence | 0.438 | | |
Different aspects of syntactic complexity are also highlighted by the average number of passive verbs and prepositions per sentence. As expected, passive constructions are rarely used in lower levels, while learners encounter them more and more frequently in textbooks as their reading skills develop. The same is true for prepositions, a feature that indicates that higher proficiency level texts employ more complex-compound sentences.
The statistically significant correlation exhibited by the ratio of relative pronouns to pronouns (rho=0.439) signifies the role of anaphora. As anaphora resolution is considered a linguistically and cognitively demanding task during reading, anaphoric structures are rare in lower levels but significantly more frequent in upper levels. As a result, the use of relative pronouns can be considered a successful discriminator of proficiency levels.
The list of the best performing features also includes some more “traditional” indices of text complexity, such as word and sentence length. The average sentence length appears in rank 5 in Table 3 (rho=0.485), while related features that quantify sentence length from a different perspective are also present (the percentage of sentences with more than 10, 20 and 30 words). Additionally, the presence of the ratio of terminal punctuation to total characters should also be interpreted as an inverse of sentence length. Regarding lexical features, it is noticeable that among the various features investigated (lexical diversity, density, etc.) only the average word length is present among the top performers (rho=0.474).
A more thorough investigation of the above features employed one-way ANOVA for means comparison across levels, which resulted in statistically significant main effects for all of the 20 features. Since, however, this type of analysis cannot determine whether the mean values of a feature are statistically different between all possible level pairs, post-hoc multiple comparisons (Bonferroni tests) were also applied. The results are presented in Table 3: statistically different means for each feature are indicated for each level combination separately. These comparisons indicate that all features can successfully discriminate group 3 (i.e. EDIAMME level 5, CEFR C2) from lower levels (both from group 2 and group 1). However, some of the features were not as successful in discriminating group 1 (i.e. EDIAMME levels 1 and 2, CEFR A1-A2) from group 2 (i.e. EDIAMME levels 3 and 4, CEFR B1-C1). Poor performers in discriminating group 1 from group 2 were all the features relevant to sentence length, with the exception of the proportion of sentences with more than 20 words. This implies that a group 1 text is unlikely to include lengthier sentences, thus suggesting a possible threshold for the transition from CEFR A2 to B1 level.
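The comparison of means can be sketched as follows. The code computes only the one-way ANOVA F statistic and the Bonferroni-adjusted significance threshold for the three level pairs; p-values, which require the F and t distributions, are left to a statistics library. Function names and the example data are illustrative and are not the study's data.

```python
from itertools import combinations

# The three grouped levels give three pairwise comparisons: (1, 2), (1, 3), (2, 3).
LEVEL_PAIRS = list(combinations((1, 2, 3), 2))


def one_way_anova_f(groups):
    """Return the F statistic for a list of groups (each a list of numbers)."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison significance threshold under the Bonferroni correction."""
    return alpha / n_comparisons
```

With three level pairs and an overall alpha of 0.05, each pairwise test is evaluated against a threshold of about 0.0167, which makes the pairwise comparisons stricter than the overall test.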
4 Conclusions and discussion
The current investigation highlighted a number of textual features automatically extracted from a morphologically and syntactically annotated Greek L2 corpus. With the aim of identifying indices of text difficulty that are directly associated with the proficiency level, we employed statistical analysis and put forward the best performing features. These can be regarded as potential predictors of the proficiency level of a previously unseen text in an automatic labelling/classification approach.
The results highlight the influence of syntactic features on the characterization of proficiency level: with the exception of average word length, the rest of the best performing features are directly or indirectly related to syntactic complexity. This finding is in line with previous research, where syntax-related features consistently appear in the best-performing prediction models (e.g. Pitler and Nenkova 2008; Schwarm and Ostendorf 2005; Callan and Eskenazi 2007; Kate et al. 2010; Kotani et al. 2008). The frequencies of the genitive case, of adjectives and of prepositions were additionally identified as successful discriminators. Surface features used in traditional readability formulas, such as sentence and word length, were found to be significantly correlated with proficiency levels. Similar recent research in Greek has also highlighted the influence of such surface features on proficiency level classification (Tzimokas and Tantos 2014). It is interesting to notice that some of the features put forward by Georgatou (2016) as the most informative, i.e. sentence length, passive verbs and adjectives, are confirmed by the current study as well, thus qualifying them as reliable indices of the difficulty level of Greek texts.
When the best performing features were tested for their discriminatory power between all possible level pairs, they proved to be highly discriminative of the upper proficiency level. This finding implies a significant shift in L2 reading skills during the transition from C1 to C2 level, and this shift can successfully be measured by the features investigated herein. On the contrary, the transition from A2 to B1 seems to go hand in hand with the acquisition of language skills not captured by the features that emerged from the current analysis.
It is true that the current investigation is subject to limitations imposed by the corpus at hand, which comprised texts drawn from textbooks of a single publisher. As such, the findings may be influenced by the publisher's choices regarding the types and topics of texts and the linguistic descriptors of proficiency levels the editor has adopted. To address this limitation, the work described herein is being continued and expanded in order to exploit a larger corpus of Greek L2 texts from different publishers. Proficiency level labelling for this expanded corpus does not rely exclusively on the publisher's labelling. Rather, three independent experts in Greek L2 teaching have judged each text to determine the CEFR proficiency level. The experts' judgements are treated as the dependent variable in a machine learning approach for the automatic labelling of previously unseen texts, which has already yielded significant results.
Reading comprehension is a key skill in L2 development, and reading is an integral part of L2 instruction and assessment. In this view, an automated approach to matching L2 learners to texts suitable for their proficiency level is expected to facilitate the selection of reading material for both learners and teachers. It is at the same time an anticipated aid in assessment procedures, providing an objective measurement for the estimation of the level-appropriateness of items included in diagnostic, placement or achievement language tests.
References
Barzilay, Regina, and Mirella Lapata. 2008. “Modeling Local Coherence: An Entity-based Approach.” Computational Linguistics 34(1):1–34.
Centre for the Greek Language. 2013. “Logismiko Anagnosimotitas.” Accessed March 1, 2017. httpwwwgreek-languagegrcertificationreadabi-lity
Council of Europe. 2001. Common European Framework of Reference for Languages: Learning, Teaching, Assessment (CEFR). wwwcoeintlang-CEFR
Damanakis, Michalis, ed. 2004. Theoritiko Plaisio kai Programmata Spoudon gia tin Ellinoglossi Ekpaideusi sti Diaspora. Rethymno: EDIAMME. httpwwwediammeedcuocgrdiaspora2indexphpid=23650010
DuBay, William H. 2006. The Classic Readability Studies. Costa Mesa, California: Impact Information.
EDIAMME. 2014. Epipeda Glossomatheias kai Ekpaideutiko Yliko. httpwwwediammeedcuocgrellinoglossiindexphpelekp-yliko-kepa
François, Thomas, and Cédrick Fairon. 2012. “An “AI readability” Formula for French as a Foreign Language.” In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 466–477. Association for Computational Linguistics.
François, Thomas, and Eleni Miltsakaki. 2012. “Do NLP and Machine Learning Improve Traditional Readability Formulas?” In Proceedings of the First Workshop on Predicting and Improving Text Readability for Target Reader Populations, 49–57. Montréal.
Georgatou, Spyridoula. 2016. “Approaching Readability Features in Greek School Books.” Master's thesis, University of Tübingen.
Giagkou, Maria, Vicky Kantzou, Spyridoula Stamouli, and Maria Tzevelekou. 2015. “Discriminating CEFR Levels in Greek L2: A Corpus-based Study of Young Learners' Written Narratives.” Bergen Language and Linguistics Studies 6:153–169.
Heilman, Michael, Kevyn Collins-Thompson, and Maxine Eskenazi. 2008. “An Analysis of Statistical Models and Features for Reading Difficulty Prediction.” In The Third Workshop on Innovative Use of NLP for Building Educational Applications: Proceedings of the Workshop, 71–79. ACL.
Callan, Jamie, and Maxine Eskenazi. 2007. “Combining Lexical and Grammatical Features to Improve Readability Measures for First and Second Language Texts.” In Proceedings of HLT-NAACL'07, 460–467. Association for Computational Linguistics.
Lu, Xiaofei. 2010. “Automatic Analysis of Syntactic Complexity in Second Language Writing.” International Journal of Corpus Linguistics 15(4):474–496.
Lu, Xiaofei. 2011. “A Corpus-Based Evaluation of Syntactic Complexity Measures as Indices of College-Level ESL Writers' Language Development.” TESOL Quarterly 45(1):36–62.
Ott, Niels, and Detmar Meurers. 2010. “Information Retrieval for Education: Making Search Engines Language Aware.” Themes in Science and Technology Education 3(1–2):9–30.
Petersen, Sarah E., and Mari Ostendorf. 2009. “A Machine Learning Approach to Reading Level Assessment.” Computer Speech and Language 23(1):89–106.
Pilán, Ildikó, Elena Volodina, and Richard Johansson. 2014. “Rule-Based and Machine Learning Approaches for Second Language Sentence-Level Readability.” In Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications, 174–184. Association for Computational Linguistics.
Pitler, Emily, and Ani Nenkova. 2008. “Revisiting Readability: A Unified Framework for Predicting Text Quality.” In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, 186–195. Honolulu: ACL.
Prokopidis, Prokopis, and Harris Papageorgiou. 2014. “Experiments for Dependency Parsing of Greek.” In Proceedings of the First Joint Workshop on Statistical Parsing of Morphologically Rich Languages and Syntactic Analysis of Non-Canonical Languages, 90–96. Dublin, Ireland.
Prokopidis, Prokopis, Byron Georgantopoulos, and Harris Papageorgiou. 2011. “A Suite of NLP Tools for Greek.” In The 10th International Conference of Greek Linguistics, 373–383. Komotini, Greece.
Schwarm, Sarah E., and Mari Ostendorf. 2005. “Reading Level Assessment Using Support Vector Machines and Statistical Language Models.” In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), 523–530. Ann Arbor, Michigan.
Tzimokas, Dimitrios, and Sotirios Tantos. 2014. “Logismiko Anagnosimotitas Ellinikon Keimenon: Ena Neo Ergaleio gia ti Didaskalia tis Ellinikis os Ksenis/Deuteris Glossas kai tin Pistopoiisi Ellinomatheias.” Paper presented at Diethnes Synedrio gia ti Didaskalia kai tin Pistopoiisi tis Ellinikis os Ksenis/Deuteris Glossas, Thessaloniki, October 25. httpspeakgreekgrelimagespdftzimwkaspdf
Vajjala, Sowmya, and Detmar Meurers. 2012. “On Improving the Accuracy of Readability Classification Using Insights from Second Language Acquisition.” In Proceedings of the Seventh Workshop on Building Educational Applications Using NLP, 163–173. Association for Computational Linguistics.
Vyatkina, Nina. 2012. “The Development of Second Language Writing Complexity in Groups and Individuals: A Longitudinal Learner Corpus Study.” The Modern Language Journal 96(4):576–598.
Dionysis Goutsos, Georgia Fragaki, Irene Florou, Vasiliki Kakousi & Paraskevi Savvidou The Diachronic Corpus of Greek of the 20th century Design and compilation 369
Kleanthes K. Grohmann & Maria Kambanaros Bilectalism Comparative Bilingualism and the Gradience of Multilingualism A View from Cyprus 383
Günther S. Henrich „Γεωγραφία νεωτερική“ στο Λίβιστρος και Ροδάμνη μετατόπιση ονομάτων βαλτικών χωρών προς την Ανατολή 397
Noriyo Hoozawa-Arkenau & Christos Karvounis Vergleichende Diglossie - Aspekte im Japanischen und Neugriechischen Verietäten - Interferenz 405
Μαρία Ιακώβου, Ηριάννα Βασιλειάδη-Λιναρδάκη, Φλώρα Βλάχου, Όλγα Δήμα, Μαρία Καββαδία, Τατιάνα Κατσίνα, Μαρίνα Κουτσουμπού, Σοφία-Νεφέλη Κύτρου, Χριστίνα Κωστάκου, Φρόσω Παππά & Σταυριαλένα Περρέα ΣΕΠΑΜΕ2 Μια καινούρια πηγή αναφοράς για την Ελληνική ως Γ2 419
Μαρία Ιακώβου & Θωμαΐς Ρουσουλιώτη Βασικές αρχές σχεδιασμού και ανάπτυξης του νέου μοντέλου αναλυτικών προγραμμάτων για τη διδασκαλία της Ελληνικής ως δεύτερης/ξένης γλώσσας 433
Μαρία Καμηλάκη «Μαζί μου ασχολείσαι πόσο μαλάκας είσαι» Λέξεις-ταμπού και κοινωνιογλωσσικές ταυτότητες στο σύγχρονο ελληνόφωνο τραγούδι 449
Μαρία Καμηλάκη, Γεωργία Κατσούδα & Μαρία Βραχιονίδου Η εννοιολογική μεταφορά σε λέξεις-ταμπού της ΝΕΚ και των νεοελληνικών διαλέκτων 465
Eleni Karantzola, Georgios Mikros & Anastassios Papaioannou Lexico-grammatical variation and stylometric profile of autograph texts in Early Modern Greek 479
Sviatlana Karpava, Maria Kambanaros & Kleanthes K. Grohmann Narrative Abilities MAINing Russian–Greek Bilingual Children in Cyprus 493
Χρήστος Καρβούνης Γλωσσικός εξαρχαϊσμός και «ιδεολογική» νόρμα Ζητήματα γλωσσικής διαχείρισης στη νέα ελληνική 507
Demetra Katis & Kiki Nikiforidou Spatial prepositions in early child Greek Implications for acquisition polysemy and historical change 525
Γεωργία Κατσούδα Το επίθημα -ούνα στη ΝΕΚ και στις νεοελληνικές διαλέκτους και ιδιώματα 539
George Kotzoglou Sub-extraction from subjects in Greek Its existence its locus and an open issue 555
veranna KypriotiNarrative identity and age the case of the bilingual in Greek and Turkish Muslim community of Rhodes Greece 571
χριστίνα Λύκου Η Ελλάδα στην Ευρώπη της κρίσης Αναπαραστάσεις στον ελληνικό δημοσιογραφικό λόγο 583
Nikos liosis Systems in disruption Propontis Tsakonian 599
Katerina Magdou Sam Featherston Resumptive Pronouns can be more acceptable than gaps Experimental evidence from Greek 613
Maria Margarita Makri Opos identity comparatives in Greek an experimental investigation 629
2ος Τόμος
Περιεχόμενα 651
Vasiliki Makri: Gender assignment to Romance loans in Katoitaliótika: a case study of contact morphology 659
Evgenia Malikouti: Usage Labels of Turkish Loanwords in three Modern Greek Dictionaries 675
Persephone Mamoukari & Penelope Kambakis-Vougiouklis: Frequency and Effectiveness of Strategy Use in SILL questionnaire using an Innovative Electronic Application 693
Georgia Maniati, Voula Gotsoulia & Stella Markantonatou: Contrasting the Conceptual Lexicon of ILSP (CL-ILSP) with major lexicographic examples 709
Γεώργιος Μαρκόπουλος & Αθανάσιος Καρασίμος: Πολυεπίπεδη επισημείωση του Ελληνικού Σώματος Κειμένων Αφασικού Λόγου 725
Πωλίνα Μεσηνιώτη, Κατερίνα Πούλιου & Χριστόφορος Σουγανίδης: Μορφοσυντακτικά λάθη μαθητών Τάξεων Υποδοχής που διδάσκονται την Ελληνική ως Γ2 741
Stamatia Michalopoulou: Third Language Acquisition: The Pro-Drop-Parameter in the Interlanguage of Greek students of German 759
Vicky Nanousi & Arhonto Terzi: Non-canonical sentences in agrammatism: the case of Greek passives 773
Καλομοίρα Νικολού, Μαρία Ξεφτέρη & Νίτσα Παραχεράκη: Το φαινόμενο της σύνθεσης λέξεων στην κυκλαδοκρητική διαλεκτική ομάδα 789
Ελένη Παπαδάμου & Δώρης Κ. Κυριαζής: Μορφές διαβαθμιστικής αναδίπλωσης στην ελληνική και στις άλλες βαλκανικές γλώσσες 807
Γεράσιμος Σοφοκλής Παπαδόπουλος: Το δίπολο «Εμείς και οι Άλλοι» σε σχόλια αναγνωστών της Lifo σχετικά με τη Χρυσή Αυγή 823
Ελένη Παπαδοπούλου: Η συνδυαστικότητα υποκοριστικών επιθημάτων με β συνθετικό το επίθημα -άκι στον διαλεκτικό λόγο 839
Στέλιος Πιπερίδης, Πένυ Λαμπροπούλου & Μαρία Γαβριηλίδου: clarinel Υποδομή τεκμηρίωσης διαμοιρασμού και επεξεργασίας γλωσσικών δεδομένων 851
Maria Pontiki: Opinion Mining and Target Extraction in Greek Review Texts 871
Anna Roussou: The duality of mipos 885
Stathis Selimis & Demetra Katis: Reference to static space in Greek: A cross-linguistic and developmental perspective of poster descriptions 897
Evi Sifaki & George Tsoulas: XP-V orders in Greek 911
Konstantinos Sipitanos: On desiderative constructions in Naousa dialect 923
Eleni Staraki: Future in Greek: A Degree Expression 935
Χριστίνα Τακούδα & Ευανθία Παπαευθυμίου: Συγκριτικές διδακτικές πρακτικές στη διδασκαλία της ελληνικής ως Γ2: από την κριτική παρατήρηση στην αναπλαισίωση 945
Alexandros Tantos, Giorgos Chatziioannidis, Katerina Lykou, Meropi Papatheohari, Antonia Samara & Kostas Vlachos: Corpus C58 and the interface between intra- and inter-sentential linguistic information 961
Arhonto Terzi & Vina Tsakali: The contribution of Greek SE in the development of locatives 977
Paraskevi Thomou: Conceptual and lexical aspects influencing metaphor realization in Modern Greek 993
Nina Topintzi & Stuart Davis: Features and Asymmetries of Edge Geminates 1007
Liana Tronci: At the lexicon-syntax interface: Ancient Greek constructions with ἔχειν and psychological nouns 1021
Βίλλυ Τσάκωνα: «Δημοκρατία είναι 4 λύκοι και 1 πρόβατο να ψηφίζουν για φαγητό»: Αναλύοντας τα ανέκδοτα για τους/τις πολιτικούς στην οικονομική κρίση 1035
Ειρήνη Τσαμαδού-Jacoberger & Μαρία Ζέρβα: Εκμάθηση ελληνικών στο Πανεπιστήμιο Στρασβούργου: κίνητρα και αναπαραστάσεις 1051
Stavroula Tsiplakou & Spyros Armostis: Do dialect variants (mis)behave? Evidence from the Cypriot Greek koine 1065
Αγγελική Τσόκογλου & Σύλα Κλειδή: Συζητώντας τις δομές σε -οντας 1077
Αλεξιάννα Τσότσου: Η μεθοδολογική προσέγγιση της εικόνας της Γερμανίας στις ελληνικές εφημερίδες 1095
Anastasia Tzilinis: Begründendes Handeln im neugriechischen Wissenschaftlichen Artikel: Die Situierung des eigenen Beitrags im Forschungszusammenhang 1109
Κυριακούλα Τζωρτζάτου, Αργύρης Αρχάκης, Άννα Ιορδανίδου & Γιώργος Ι. Ξυδόπουλος: Στάσεις απέναντι στην ορθογραφία της Κοινής Νέας Ελληνικής: Ζητήματα ερευνητικού σχεδιασμού 1123
Nicole Vassalou, Dimitris Papazachariou & Mark Janse: The Vowel System of Mišótika Cappadocian 1139
Marina Vassiliou, Angelos Georgaras, Prokopis Prokopidis & Haris Papageorgiou: Co-referring or not co-referring? Answer the question 1155
Jeroen Vis: The acquisition of Ancient Greek vocabulary 1171
Christos Vlachos: Mod(aliti)es of lifting wh-questions 1187
Ευαγγελία Βλάχου & Κατερίνα Φραντζή: Μελέτη της χρήσης των ποσοδεικτών λίγο-λιγάκι σε κείμενα πολιτικού λόγου 1201
Madeleine Voga: Τι μας διδάσκουν τα ρήματα της ΝΕ σχετικά με την επεξεργασία της μορφολογίας 1213
Werner Voigt: «Σεληνάκι μου λαμπρό, φέγγε μου να περπατώ …» oder warum es in dem bekannten Lied nicht so, sondern eben φεγγαράκι heißt und ngr. φεγγάρι 1227
Μαρία Βραχιονίδου: Υποκοριστικά επιρρήματα σε νεοελληνικές διαλέκτους και ιδιώματα 1241
Jeroen van de Weijer & Marina Tzakosta: The Status of Complex in Greek 1259
Theodoros Xioufis: The pattern of the metaphor within metonymy in the figurative language of romantic love in modern Greek 1275
FEATURE EXTRACTIoN AND ANAlYSIS IN GREEK | 357
FEATURE EXTRACTION AND ANALYSIS IN GREEK L2 TEXTS IN VIEW OF AUTOMATIC LABELING FOR PROFICIENCY LEVELS
Maria Giagkou1, Giorgos Fragkakis, Dimitris Pappas1 & Harris Papageorgiou1
1Institute for Language and Speech Processing, RC ATHENA
mgiagkou@ilsp.gr, fragakis@sch.gr, dpappas@ilsp.gr, xaris@ilsp.gr
Abstract
This paper investigates a set of linguistic features of texts addressed to learners of Greek as an L2 and examines how these features relate to the proficiency level for which the texts are considered appropriate. The aim is to determine which features discriminate sufficiently between levels, so that they can be exploited in an approach to automatic classification into proficiency levels. To this end, a corpus compiled from Greek L2 textbooks is used. The results highlight the significant effect of, among others, features that quantify the complexity of syntactic dependency trees, the genitive case and adjectival modifiers.
Keywords: L2 reading, text complexity, linguistic features, proficiency levels, automatic labelling
1 Introduction
The last two decades have seen increasing interest in modelling text difficulty, i.e. readability. Automatic readability estimation systems are intended to assess whether a text retrieved from a large collection, such as a repository or the web, is appropriate for a given group of readers, either according to their abilities in L1 or by taking into account the readers' special needs (e.g. learning difficulties). Readability estimation is particularly relevant for second language (L2) learners as well. From the L2 perspective, the aim is to automatically identify or retrieve a text given the proficiency level of the learner or group of learners.
To this end, recent studies attempt to grade L2 texts according to proficiency levels in order to facilitate reading in L2 or as an aid to the selection of assessment material (e.g. Centre for the Greek Language 2013; Tzimokas and Tantos 2014; François and Fairon 2012; Ott and Meurers 2010; Pilán et al. 2014; Vajjala and Meurers 2012). In a similar approach, the development of productive skills in L2 (mainly writing) is investigated in view of an automated evaluation of L2 writing (e.g. Lu 2010, 2011; Vyatkina 2012; Giagkou et al. 2015).
The long tradition of L1 readability assessment, dating back to the early 20th century (see DuBay 2006), has bequeathed readability formulas (e.g. Flesch Reading Ease Score, Flesch-Kincaid Grade Level, Fog index, SMOG, etc.) that assign a difficulty grade or level to a text by relying on surface linguistic features, such as sentence and word length, as simple proxies for syntactic complexity and vocabulary burden respectively. More recently, advances in NLP have boosted readability research. New resources (electronically available texts) and new tools (taggers, parsers, semantic treebanks, etc.) have made it feasible to apply machine learning techniques to large training corpora and to quantify more thorough and linguistically sound text features. Semantic and discourse features are investigated, e.g. named entities (Barzilay & Lapata 2008) and lexical cohesion (Pitler & Nenkova 2008). Shallow syntactic complexity indicators, such as average sentence length, are combined with the height of syntactic trees (see also Heilman et al. 2008). Instead of simple proxies of vocabulary burden, n-gram Language Models (LMs) are used for predicting the grade level of texts (Callan and Eskenazi 2007; Petersen & Ostendorf 2009; Schwarm and Ostendorf 2005).
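To illustrate how such traditional formulas rely on surface proxies, the following sketch computes the Flesch Reading Ease score from precomputed counts. The function name and the example counts are hypothetical and serve only as an illustration; the formula is calibrated for English and is not one of the measures used in this study.

```python
def flesch_reading_ease(n_words: int, n_sentences: int, n_syllables: int) -> float:
    """Flesch Reading Ease for English: higher scores mean easier text."""
    if n_words <= 0 or n_sentences <= 0:
        raise ValueError("word and sentence counts must be positive")
    words_per_sentence = n_words / n_sentences   # proxy for syntactic complexity
    syllables_per_word = n_syllables / n_words   # proxy for vocabulary burden
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Hypothetical counts for a short passage (not taken from the corpus).
score = flesch_reading_ease(n_words=100, n_sentences=5, n_syllables=150)
```

Both terms are subtracted, so longer sentences and longer words lower the score, which is why the formula treats them as indicators of difficulty.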
In this paper, we present an investigation of linguistic features of texts addressed to learners of Greek as a second language (L2). The goal of this study is to identify the textual properties that indicate the development of reading skills in Greek L2, with the aim of employing these properties as parameters for automatic proficiency level labelling. The set of features investigated in the current study draws on traditional readability research combined with NLP-enabled features and machine learning techniques for text classification, as this merging was found to result in performance gain (François & Miltsakaki 2012).
The paper is organized as follows. Section 2 provides information on the corpus used and on the features identified, selected and computed in order to form the dataset for the analysis. In Section 3, the analysis applied to the features is presented and the results are discussed. We conclude with a summary of the main findings and their implications for future work on automatic proficiency level classification for Greek L2.
2 Datasets
2.1 Corpus
For the purposes of this investigation, a Greek L2 text set that is labelled for proficiency levels in an objective and qualified way, and can thus be considered a gold standard, was deemed necessary. Such a dataset was retrieved from the Greek L2 textbooks published by the Centre of Intercultural and Migration Studies (EDIAMME), which are freely available online. These textbooks are addressed to Greek migrants living abroad, from pre-schoolers (aged 6) to 18-year-olds, who are learning Greek as a second or foreign language. EDIAMME employs five proficiency levels aligned to the Greek educational system grades and to CEFR levels (Council of Europe 2001), as presented in Table 1.
Age | School grade | EDIAMME level | Language content | CEFR level alignment
6 | Preschool | 1 | Pre-reading, reading | A1
7 | 1 | 1 | Pre-reading, reading | A1
8 | 2 | 1 | Pre-reading, reading | A1
9 | 3 | 2 | Speaking and writing consolidation | A2
10 | 4 | 2 | Speaking and writing consolidation | A2
11 | 5 | 3 | Further practice in speaking and writing | B1
12 | 6 | 3 | Further practice in speaking and writing | B1
13 | 7 | 4 | Independent writing | B2 & C1
14 | 8 | 4 | Independent writing | B2 & C1
15 | 9 | 4 | Independent writing | B2 & C1
16 | 10 | 5 | Greek language and literature | C2
17 | 11 | 5 | Greek language and literature | C2
18 | 12 | 5 | Greek language and literature | C2
Table 1 | EDIAMME proficiency levels (Damanakis 2004: 76) and their alignment to CEFR levels (EDIAMME 2014)
Only prose texts were extracted from the textbooks, while poems, lyrics, exercises and guidelines to the exercises were excluded. The selected texts belong to different genres (mainly narrative, descriptive, expository and procedural) and types (letters, announcements, instructions, diary entries, etc.). Dialogues were also included, as they are very frequently used as educational material in L2 textbooks, though the role or name of the speaker was removed.
The final corpus employed in this investigation comprises 753 texts and a total of 112,169 tokens (Table 2). Each individual text inherited the proficiency level assigned to the textbook it was retrieved from, e.g. a text drawn from a textbook labelled as level 5 was considered as addressed to level 5 learners.1
Grouped levels | EDIAMME levels | Texts | Sentences | Tokens
1 (CEFR A1-A2) | 1 | 24 | 136 | 720
1 (CEFR A1-A2) | 2 | 295 | 4552 | 33636
2 (CEFR B1-C1) | 3 | 108 | 1263 | 8780
2 (CEFR B1-C1) | 4 | 147 | 2305 | 19272
3 (CEFR C2) | 5 | 179 | 3356 | 49761
Totals | | 753 | 11612 | 112169
Table 2 | Corpus description
1 It should be noted that this decision introduces a degree of "noise" into the data: although a low-level textbook is not expected to include a text addressed to higher levels, the reverse is not equally unlikely. For example, certain texts retrieved from a level 5 textbook may actually address lower-level learners.
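The grouping of EDIAMME levels shown in Table 2 and the inheritance of labels from textbooks to texts can be sketched as a simple lookup. The names `GROUPED_LEVEL` and `label_texts` and the record layout are hypothetical and are not part of the tools used in this study.

```python
# Grouping of the five EDIAMME levels into the three grouped levels of Table 2.
GROUPED_LEVEL = {1: 1, 2: 1, 3: 2, 4: 2, 5: 3}


def label_texts(textbook_level, texts):
    """Give every text the EDIAMME and grouped level of its source textbook."""
    if textbook_level not in GROUPED_LEVEL:
        raise ValueError(f"unknown EDIAMME level: {textbook_level}")
    return [
        {
            "text": text,
            "ediamme_level": textbook_level,
            "grouped_level": GROUPED_LEVEL[textbook_level],
        }
        for text in texts
    ]


# Two placeholder texts drawn from a hypothetical level 5 textbook.
labelled = label_texts(5, ["κείμενο Α", "κείμενο Β"])
```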
The texts were automatically annotated for morphological types, syntactic dependencies and phrase structure using the Institute for Language and Speech Processing NLP tools pipeline (Prokopidis et al. 2011; Prokopidis and Papageorgiou 2014).
2.2 Feature selection and computation
The set of features investigated as indices of the proficiency level was selected on the basis of previous research on L1 and L2 readability assessment, as well as on second language acquisition and development. These features capture morphological, syntactic, lexical/semantic and other attributes of the text that are salient to the target proficiency level discrimination and prediction task.
In total, 303 text features were identified and computed. These fall roughly into the following categories:
a) Surface features: word and sentence length (e.g. average word length), number of characters, punctuation marks, numbers, etc.
b) Lexical/semantic: lexical density (i.e. content to functional words), lexical variation (e.g. type/token ratio, hapax/dis legomena), including noun and verb variation measures, text entropy, lexical richness, etc.
c) Morphological: frequencies and ratios of the different parts of speech, including their forms, e.g. ratio of passive verbs to verbs, ratio of nouns in the genitive case to nouns, ratio of 1st person personal pronouns to pronouns, etc.
d) Syntactic: frequencies and ratios of the different syntactic roles (e.g. subjects to verbs ratio), measures of the dependency trees (e.g. depth and height of syntactic trees), phrase structure (e.g. length of noun, verb and adjectival phrases), subordination and apposition (e.g. average number of coordinating and subordinating conjunctions per sentence), etc.
e) Discourse-based features: e.g. use of relative pronouns as an index of the degree of anaphora density, frequency of present and past tenses as indices of temporality and narrativity, etc.
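A few of the surface and lexical features listed above can be sketched as follows. This is a minimal illustration on tokenized text, not the implementation used in this study; the function name, the toy sentences and the function-word list are assumptions.

```python
from collections import Counter


def surface_and_lexical_features(sentences, function_words):
    """Compute a few illustrative features from tokenized sentences.

    sentences: list of sentences, each a list of word tokens (no punctuation).
    function_words: set of lower-cased function words (an assumed word list).
    """
    tokens = [tok.lower() for sent in sentences for tok in sent]
    if not tokens:
        raise ValueError("empty text")
    counts = Counter(tokens)
    n_function = sum(1 for tok in tokens if tok in function_words)
    n_content = len(tokens) - n_function
    return {
        "avg_sentence_length": len(tokens) / len(sentences),
        "avg_word_length": sum(len(tok) for tok in tokens) / len(tokens),
        "type_token_ratio": len(counts) / len(tokens),
        "hapax_legomena": sum(1 for c in counts.values() if c == 1),
        # Lexical density as content-to-function-word ratio; None if undefined.
        "lexical_density": n_content / n_function if n_function else None,
    }


toy_text = [
    ["το", "παιδί", "διαβάζει", "ένα", "βιβλίο"],
    ["το", "βιβλίο", "είναι", "μεγάλο"],
]
features = surface_and_lexical_features(toy_text, function_words={"το", "ένα", "είναι"})
```

In practice the distinction between content and function words would come from the part-of-speech annotation rather than from a fixed word list.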
The defined features were computed with specialized software, the ILSP FeatExt tool, developed in Python. The input of FeatExt is any corpus of Greek texts automatically annotated for part of speech, syntactic dependencies and phrase structure. It calculates the values of raw surface features (frequencies of words, sentences, nouns, verbs, etc.) and computes their standardized values (i.e. meaningful ratios). In order to cater for zero values, a MinMaxScaler transformation is applied to all raw features. The output is a table of extracted feature values, preferably in CSV format. Settings can be modified through an optional configuration file to define, among others, the set of features to be computed, the corpus location, or additional feature-relevant data such as a list of words to be counted (e.g. functional words, basic vocabulary for a specific proficiency level or topic, etc.).
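The two normalization steps mentioned above, ratios and min-max scaling, can be sketched in plain Python. This is not the FeatExt code: the function names and counts are hypothetical, the scaling follows the usual min-max definition (rescaling to the range [0, 1]), and mapping a constant feature to 0.0 is our own choice.

```python
def min_max_scale(values):
    """Rescale a list of raw feature values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant feature: map every value to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def ratio(numerator, denominator):
    """A ratio such as passive verbs to verbs; 0.0 when the denominator is 0."""
    return numerator / denominator if denominator else 0.0


# Hypothetical raw counts of passive verbs and of all verbs in three texts.
passive = [0, 2, 6]
verbs = [10, 20, 24]
passive_ratio = [ratio(p, v) for p, v in zip(passive, verbs)]
scaled_counts = min_max_scale(passive)
```

Ratios make texts of different lengths comparable, while min-max scaling brings features measured on different scales into a common range.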
3 Analysis and results
In order to investigate the underlying associations of text features with the proficiency level, correlation analysis was applied between all the extracted features and the grouped proficiency levels. Table 3 reports the twenty features that exhibited the highest absolute values of Spearman's rho correlation coefficient, in descending order (p < 0.05).
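For readers unfamiliar with the measure, Spearman's rho is the Pearson correlation computed on ranks, with tied values sharing the mean of their rank positions. The sketch below is a plain-Python illustration with invented data; it is not the statistical software used for the analysis, and it assumes that neither variable is constant.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical data: grouped level of six texts and one feature value per text.
levels = [1, 1, 2, 2, 3, 3]
avg_sentence_length = [6.0, 7.5, 9.0, 8.0, 14.0, 12.5]
rho = spearman_rho(avg_sentence_length, levels)
```

Because the grouped level takes only three values, ties are unavoidable on that variable, which is why the rank function averages tied positions.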
Among the best performing features, the average number of noun phrases in the genitive case per sentence was found to exhibit the highest correlation coefficient (rho = 0.542). The association of the genitive case with the text's level is also evidenced by the performance of two more features, i.e. the average number of adjectival phrases in the genitive case per sentence (rho = 0.473) and the average length of adjectival phrases in the genitive case (rho = 0.448). Looking at these results from a different angle, the influence of phrase structure, especially of the length and relative frequency of nominal phrases, is apparent: out of the 20 best performing features, six are indices of phrase structure (features in ranks 1, 6, 8, 12, 15 and 16 in Table 3). The frequency of use of modifiers, namely of adjectives, also seems to be highly correlated with the proficiency level: the more adjectives used in a text, the more likely it is that the text is addressed to higher-level learners. This is evidenced by the average number of adjectival phrases and of adjectives per sentence.
Another important finding is highlighted by the performance of features that attempt to quantify syntactic dependencies. These include the width and height of dependency trees (rho = 0.495 and 0.486 respectively), as well as the number of leaves and governor nodes (rho = 0.490 and 0.485 respectively). Their emergence in the top ranks of Table 3 qualifies them as key predictors of the proficiency level.
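The dependency-tree measures mentioned above can be sketched for a single sentence as follows. The exact definitions used in the study are not spelled out here, so the ones below are assumptions: height is the largest number of arcs between a token and the root, width is the largest number of tokens at the same depth, leaves are tokens without dependents, and governors are tokens with at least one dependent. The input is assumed to be a well-formed tree.

```python
from collections import Counter


def tree_measures(heads):
    """Measures of one dependency tree.

    heads[i] is the 1-based index of the head of token i + 1; 0 marks the root.
    """
    def node_depth(i):
        # Number of arcs between token i (0-based) and the root.
        depth = 0
        while heads[i] != 0:
            i = heads[i] - 1
            depth += 1
        return depth

    depths = [node_depth(i) for i in range(len(heads))]
    governors = {h for h in heads if h != 0}  # tokens with dependents
    per_level = Counter(depths)               # number of tokens at each depth
    return {
        "height": max(depths),
        "width": max(per_level.values()),
        "leaves": len(heads) - len(governors),
        "governors": len(governors),
    }


# "Το παιδί διαβάζει ένα μεγάλο βιβλίο": hand-made heads, for illustration only.
measures = tree_measures([2, 3, 0, 6, 6, 3])
```

Averaging each measure over the sentences of a text would give per-text features of the kind reported in Table 3.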
Table 3 | Top-20 features highly correlated with EDIAMME grouped levels and post hoc multiple comparisons between level-pairs
Rank | Feature | Spearman's rho | EDIAMME grouped level-pairs (1 vs 2, 2 vs 3, 1 vs 3)
1 | Av. number of noun phrases in genitive case per sentence | 0.542
2 | Av. width of dependency trees | 0.495
3 | Av. number of leaves in dependency trees | 0.490
4 | Av. height of dependency trees | 0.486
5 | Av. sentence length | 0.485
6 | Av. number of adjectival phrases per sentence | 0.485
7 | Av. number of governor nodes in dependency trees | 0.485
8 | Av. number of noun phrases per sentence | 0.480
9 | % of sentences with length > 20 words | 0.477
10 | Av. number of adjectives per sentence | 0.474
11 | Av. word length | 0.474
12 | Av. number of adjectival phrases in genitive case per sentence | 0.473
13 | % of sentences with length > 10 words | 0.470
14 | Terminal punctuation to total characters ratio | -0.461
15 | Av. length of adjectival phrases in genitive case | 0.448
16 | Av. number of adjectival phrases in accusative case per sentence | 0.446
17 | % of sentences with length > 30 words | 0.443
18 | Av. number of passive verbs per sentence | 0.442
19 | Relative pronouns to pronouns ratio | 0.439
20 | Av. number of prepositions per sentence | 0.438
Different aspects of syntactic complexity are also highlighted by the average number of passive verbs and prepositions per sentence. As expected, passive constructions are rarely used in lower levels, while learners encounter them more and more frequently in textbooks as their reading skills develop. The same is true for prepositions, a feature which indicates that higher proficiency level texts employ more complex-compound sentences.
The statistically significant correlation exhibited by the ratio of relative pronouns to pronouns (rho = 0.439) signifies the role of anaphora. As anaphora resolution is considered a linguistically and cognitively demanding task during reading, anaphoric structures are rare in lower levels but significantly more frequent in upper levels. As a result, the use of relative pronouns can be considered a successful discriminator of proficiency levels.
The list of the best performing features also includes some more "traditional" indices of text complexity, such as word and sentence length. The average sentence length appears in rank 5 in Table 3 (rho = 0.485), while related features that quantify sentence length from a different perspective are also present (the percentage of sentences with more than 10, 20 and 30 words). Additionally, the ratio of terminal punctuation to total characters, which is negatively correlated with the level, should be interpreted as an inverse measure of sentence length. Regarding lexical features, it is noticeable that among the various features investigated (lexical diversity, density, etc.) only the average word length is present among the top performers (rho = 0.474).
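The inverse relation between terminal punctuation and sentence length can be made concrete with a small sketch: the shorter the sentences, the larger the share of characters that are sentence-final marks. The function name, the set of marks and the example strings are assumptions made for illustration.

```python
# Sentence-final marks; this particular set is an assumption for illustration.
# ";" and "\u037e" both cover the Greek question mark.
TERMINAL_PUNCTUATION = {".", "!", ";", "\u037e", "\u2026"}


def terminal_punctuation_ratio(text):
    """Share of characters in the text that are terminal punctuation marks."""
    if not text:
        raise ValueError("empty text")
    terminal = sum(1 for ch in text if ch in TERMINAL_PUNCTUATION)
    return terminal / len(text)


short_sentences = "Έλα. Κάτσε. Φάε."
one_long_sentence = "Έλα να κάτσεις και να φας μαζί μας."
ratio_short = terminal_punctuation_ratio(short_sentences)
ratio_long = terminal_punctuation_ratio(one_long_sentence)
```

Here `ratio_short` is larger than `ratio_long`, which matches the negative sign of this feature's correlation with the proficiency level.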
A more thorough investigation of the above features employed one-way ANOVA for means comparison across levels, which resulted in statistically significant main effects for all of the 20 features. Since, however, this type of analysis cannot determine whether the mean values of a feature are statistically different between all possible level pairs, post hoc multiple comparisons (Bonferroni tests) were also applied. The results are presented in Table 3: statistically different means for each feature are indicated for each level combination separately. These comparisons indicate that all features can successfully discriminate group 3 (i.e. EDIAMME level 5, CEFR C2) from lower levels (both from group 2 and group 1). However, some of the features were not as successful in discriminating group 1 (i.e. EDIAMME levels 1 and 2, CEFR A1-A2) from group 2 (i.e. EDIAMME levels 3 and 4, CEFR B1-C1). Poor performers in discriminating group 1 from group 2 were all the features related to sentence length, with the exception of the proportion of sentences with more than 20 words. This implies that a group 1 text is unlikely to include lengthier sentences, thus suggesting a possible threshold for the transition from CEFR A2 to B1 level.
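The logic of this two-step procedure can be sketched in plain Python: a one-way ANOVA F statistic compares between-group with within-group variance, and the Bonferroni correction divides the significance threshold by the number of pairwise comparisons. The data and function names are hypothetical, and the sketch stops short of computing p-values, which require the F and t distributions provided by a statistics package.

```python
from itertools import combinations


def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA: between-group over within-group variance."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


def bonferroni_alpha(alpha, n_groups):
    """Per-comparison significance threshold for all pairwise comparisons."""
    n_pairs = len(list(combinations(range(n_groups), 2)))
    return alpha / n_pairs


# Hypothetical feature values (e.g. average sentence length) per grouped level.
group1 = [6.0, 7.0, 8.0]
group2 = [8.0, 9.0, 10.0]
group3 = [13.0, 14.0, 15.0]
f_stat = one_way_anova_f([group1, group2, group3])
threshold = bonferroni_alpha(0.05, n_groups=3)
```

With three grouped levels there are three level pairs (1 vs 2, 2 vs 3, 1 vs 3), so each pairwise test is evaluated against 0.05 / 3 rather than 0.05.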
4 Conclusions and discussion
The current investigation highlighted a number of textual features automatically extracted from a morphologically and syntactically annotated Greek L2 corpus. With the aim of identifying indices of text difficulty that are directly associated with the proficiency level, we employed statistical analysis and put forward the best performing features. These can be regarded as potential predictors of the proficiency level of a previously unseen text in an automatic labelling/classification approach.
The results highlight the influence of syntactic features on the characterization of proficiency level: with the exception of average word length, the rest of the best performing features are directly or indirectly related to syntactic complexity. This finding is in line with previous research, where syntax-related features consistently appear in the best-performing prediction models (e.g. Pitler and Nenkova 2008; Schwarm and Ostendorf 2005; Callan and Eskenazi 2007; Kate et al. 2010; Kotani et al. 2008). The frequencies of the genitive case, of adjectives and of prepositions were additionally identified as successful discriminators. Surface features used in traditional readability formulas, such as sentence and word length, were found to be significantly correlated with proficiency levels. Similar recent research in Greek has also highlighted the influence of such surface features on proficiency level classification (Tzimokas and Tantos 2014). It is interesting to notice that some of the features put forward by Georgatou (2016) as the most informative, i.e. sentence length, passive verbs and adjectives, are confirmed by the current study as well, thus qualifying them as reliable indices of the difficulty level of Greek texts.
When the best performing features were tested for their discriminatory power between all possible level pairs, they proved to be highly discriminative of the upper proficiency level. This finding implies a significant shift in L2 reading skills during the transition from C1 to C2 level, and this shift can successfully be measured by the features investigated herein. On the contrary, the transition from A2 to B1 seems to go hand in hand with the acquisition of language skills not captured by the features that emerged from the current analysis.
It is true that the current investigation is subject to limitations imposed by the corpus at hand, which comprised texts drawn from textbooks of a single publisher. As such, the findings may be influenced by the publisher's choices regarding the types and topics of texts and the linguistic descriptors of proficiency levels the editor has adopted. To address this limitation, the work described herein is being continued and expanded in order to exploit a larger corpus of Greek L2 texts from different publishers. Proficiency level labelling for this expanded corpus does not rely exclusively on the publisher's labelling. Rather, three independent experts in Greek L2 teaching have judged each text to determine the CEFR proficiency level. The experts' judgements are treated as the dependent variable in a machine learning approach for the automatic labelling of previously unseen texts, which has already yielded significant results.
Reading comprehension is a key skill in L2 development, and reading is an integral part of L2 instruction and assessment. In this view, an automated approach to matching L2 learners to texts suitable for their proficiency level is expected to facilitate the selection of reading material both for learners and teachers. It is at the same time an anticipated aid in assessment procedures, providing an objective measurement for estimating the level-appropriateness of items included in diagnostic, placement or achievement language tests.
References
Barzilay, Regina, and Mirella Lapata. 2008. "Modeling Local Coherence: An Entity-based Approach." Computational Linguistics 34(1): 1–34.
Centre for the Greek Language. 2013. "Logismiko Anagnosimotitas." Accessed March 1, 2017. httpwwwgreek-languagegrcertificationreadabi-lity
Council of Europe. 2001. Common European Framework of Reference for Languages: Learning, Teaching, Assessment (CEFR). www.coe.int/lang-CEFR
Damanakis, Michalis, ed. 2004. Theoritiko Plaisio kai Programmata Spoudon gia tin Ellinoglossi Ekpaideusi sti Diaspora. Rethymno: EDIAMME. httpwwwediammeedcuocgrdiaspora2indexphpid=23650010
DuBay, William H. 2006. The Classic Readability Studies. Costa Mesa, California: Impact Information.
EDIAMME. 2014. Epipeda Glossomatheias kai Ekpaideutiko Yliko. httpwwwediammeedcuocgrellinoglossiindexphpelekp-yliko-kepa
François, Thomas, and Cédrick Fairon. 2012. "An "AI Readability" Formula for French as a Foreign Language." In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 466–477. Association for Computational Linguistics.
François, Thomas, and Eleni Miltsakaki. 2012. "Do NLP and Machine Learning Improve Traditional Readability Formulas?" In Proceedings of the First Workshop on Predicting and Improving Text Readability for Target Reader Populations, 49–57. Montréal.
Georgatou, Spyridoula. 2016. "Approaching Readability Features in Greek School Books." Master's thesis, University of Tübingen.
Giagkou, Maria, Vicky Kantzou, Spyridoula Stamouli, and Maria Tzevelekou. 2015. "Discriminating CEFR Levels in Greek L2: A Corpus-based Study of Young Learners' Written Narratives." Bergen Language and Linguistics Studies 6: 153–169.
Heilman, Michael, Kevyn Collins-Thompson, and Maxine Eskenazi. 2008. "An Analysis of Statistical Models and Features for Reading Difficulty Prediction." In The Third Workshop on Innovative Use of NLP for Building Educational Applications: Proceedings of the Workshop, 71–79. ACL.
Callan, Jamie, and Maxine Eskenazi. 2007. "Combining Lexical and Grammatical Features to Improve Readability Measures for First and Second Language Texts." In Proceedings of HLT-NAACL'07, 460–467. Association for Computational Linguistics.
Lu, Xiaofei. 2010. "Automatic Analysis of Syntactic Complexity in Second Language Writing." International Journal of Corpus Linguistics 15(4): 474–496.
Lu, Xiaofei. 2011. "A Corpus-Based Evaluation of Syntactic Complexity Measures as Indices of College-level ESL Writers' Language Development." TESOL Quarterly 45(1): 36–62.
Ott, Niels, and Detmar Meurers. 2010. "Information Retrieval for Education: Making Search Engines Language Aware." Themes in Science and Technology Education 3(1–2): 9–30.
Petersen, Sarah E., and Mari Ostendorf. 2009. "A Machine Learning Approach to Reading Level Assessment." Computer Speech and Language 23(1): 89–106.
Pilán, Ildikó, Elena Volodina, and Richard Johansson. 2014. "Rule-Based and Machine Learning Approaches for Second Language Sentence-level Readability." In Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications, 174–184. Association for Computational Linguistics.
Pitler, Emily, and Ani Nenkova. 2008. "Revisiting Readability: A Unified Framework for Predicting Text Quality." In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, 186–195. Honolulu: ACL.
Prokopidis, Prokopis, and Harris Papageorgiou. 2014. "Experiments for Dependency Parsing of Greek." In Proceedings of the First Joint Workshop on Statistical Parsing of Morphologically Rich Languages and Syntactic Analysis of Non-Canonical Languages, 90–96. Dublin, Ireland.
Prokopidis, Prokopis, Byron Georgantopoulos, and Harris Papageorgiou. 2011. "A Suite of NLP Tools for Greek." In The 10th International Conference of Greek Linguistics, 373–383. Komotini, Greece.
Schwarm, Sarah E., and Mari Ostendorf. 2005. "Reading Level Assessment Using Support Vector Machines and Statistical Language Models." In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), 523–530. Ann Arbor, Michigan.
Tzimokas, Dimitrios, and Sotirios Tantos. 2014. "Logismiko Anagnosimotitas Ellinikon Keimenon: Ena Neo Ergaleio gia ti Didaskalia tis Ellinikis os Ksenis/Deuteris Glossas kai tin Pistopoiisi Ellinomatheias." Paper presented at Diethnes Synedrio gia ti Didaskalia kai tin Pistopoiisi tis Ellinikis os Ksenis/Deuteris Glossas, Thessaloniki, October 25. httpspeakgreekgrelimagespdftzimwkaspdf
Vajjala, Sowmya, and Detmar Meurers. 2012. "On Improving the Accuracy of Readability Classification Using Insights from Second Language Acquisition." In Proceedings of the Seventh Workshop on Building Educational Applications Using NLP, 163–173. Association for Computational Linguistics.
Vyatkina, Nina. 2012. "The Development of Second Language Writing Complexity in Groups and Individuals: A Longitudinal Learner Corpus Study." The Modern Language Journal 96(4): 576–598.
FEATURE EXTRACTION AND ANALYSIS IN GREEK | 357
FEATURE EXTRACTION AND ANALYSIS IN GREEK L2 TEXTS IN VIEW OF AUTOMATIC LABELING FOR PROFICIENCY LEVELS
Maria Giagkou1, Giorgos Fragkakis, Dimitris Pappas1 & Harris Papageorgiou1
1Institute for Language and Speech Processing, RC ATHENA
mgiagkou@ilsp.gr, fragakis@sch.gr, dpappas@ilsp.gr, xaris@ilsp.gr
Abstract
This paper investigates a set of linguistic features of texts addressed to learners of Greek as an L2 and examines how these features relate to the proficiency level for which the texts are considered appropriate. The aim is to determine which features discriminate sufficiently between levels, so that they can be exploited in an approach to automatic classification into proficiency levels. To this end, a corpus compiled from textbooks of Greek as an L2 is used. The results highlight the significant effect of, among others, features that quantify the complexity of syntactic dependency trees, the genitive case and adjectival modifiers.
Keywords: L2 reading, text complexity, linguistic features, proficiency levels, automatic labelling
1 Introduction
The last two decades have seen increasing interest in modelling text difficulty, i.e. readability. Automatic readability estimation systems are intended to assess whether a text retrieved from a large collection, such as a repository or the web, is appropriate for a given group of readers according to their abilities in L1, or by taking into account the readers' special needs (e.g. learning difficulties). Readability estimation is particularly relevant for second language (L2) learners as well. From the L2 perspective, the aim is to automatically identify or retrieve a text given the proficiency level of the learner or group of learners.
358 | GIAGKOU ET AL.
To this end, recent studies attempt to grade L2 texts according to proficiency levels in order to facilitate reading in L2 or as an aid to the selection of assessment material (e.g. Centre for the Greek Language 2013; Tzimokas and Tantos 2014; François and Fairon 2012; Ott and Meurers 2010; Pilán et al. 2014; Vajjala and Meurers 2012). In a similar approach, the development of productive skills in L2 (mainly writing) is investigated in view of an automated evaluation of L2 writing (e.g. Lu 2010, 2011; Vyatkina 2012; Giagkou et al. 2015).
The long tradition of L1 readability assessment, dating back to the early 20th century (see DuBay 2006), has bequeathed readability formulas (e.g. Flesch Reading Ease Score, Flesch-Kincaid Grade Level, Fog index, SMOG, etc.) that assign a difficulty grade or level to a text by relying on surface linguistic features, such as sentence and word length, as simple proxies for syntactic complexity and vocabulary burden respectively. More recently, advances in NLP have boosted readability research. New resources (electronically available texts) and new tools (taggers, parsers, semantic treebanks, etc.) have made it feasible to apply machine learning techniques to large training corpora and to quantify more thorough and linguistically sound text features. Semantic and discourse features are investigated, e.g. named entities (Barzilay & Lapata 2008) and lexical cohesion (Pitler & Nenkova 2008). Shallow syntactic complexity indicators, such as average sentence length, are combined with the height of syntactic trees (see also Heilman et al. 2008). Instead of simple proxies of vocabulary burden, N-gram Language Models (LM) are used for predicting the grade level of texts (Callan and Eskenazi 2007; Petersen & Ostendorf 2009; Schwarm and Ostendorf 2005).
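To make the notion of a traditional readability formula concrete, the two Flesch measures mentioned above can be sketched as follows. The constants are the published English-language ones and are not calibrated for Greek; the function names and the choice to pass pre-computed counts (rather than counting syllables automatically) are assumptions made for this illustration.

```python
def flesch_reading_ease(n_words: int, n_sentences: int, n_syllables: int) -> float:
    """Flesch Reading Ease (English constants): higher scores mean easier text."""
    words_per_sentence = n_words / n_sentences   # proxy for syntactic complexity
    syllables_per_word = n_syllables / n_words   # proxy for vocabulary burden
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


def flesch_kincaid_grade(n_words: int, n_sentences: int, n_syllables: int) -> float:
    """Flesch-Kincaid Grade Level (English constants): approximate US school grade."""
    words_per_sentence = n_words / n_sentences
    syllables_per_word = n_syllables / n_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


if __name__ == "__main__":
    # Hypothetical counts for a 100-word passage.
    print(flesch_reading_ease(100, 5, 150))
    print(flesch_kincaid_grade(100, 5, 150))
```

Both formulas use only sentence length and word length, which is why they are described here as relying on surface features.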
In this paper we present an investigation of linguistic features of texts addressed to learners of Greek as a second language (L2). The goal of this study is to identify the textual properties that indicate the development of reading skills in Greek L2, with the aim of employing these properties as parameters for automatic proficiency level labelling. The set of features investigated in the current study draws on traditional readability research, combined with NLP-enabled features and machine learning techniques for text classification, as this merging was found to result in performance gains (François & Miltsakaki 2012).
The paper is organized as follows. Section 2 provides information on the corpus used and the features identified, selected and computed in order to form the dataset for the analysis. In Section 3 the analysis applied to the features is presented and the results are analyzed. We conclude with a summary of the main findings and their implications for future work in view of automatic proficiency level classification for Greek L2.
2 Datasets
2.1 Corpus
For the purposes of this investigation, a Greek L2 text set that is labelled for proficiency levels in an objective and qualified way, and can thus be considered a gold standard, was deemed necessary. Such a dataset was retrieved from the Greek L2 textbooks published by the Centre of Intercultural and Migration Studies (EDIAMME) and freely available online. These textbooks are addressed to Greek migrants living abroad, from pre-schoolers (aged 6) to 18-year-olds, learning Greek as a second or foreign language. EDIAMME employs five proficiency levels aligned to the Greek educational system grades and to CEFR levels (Council of Europe 2001), as presented in Table 1.
Age | School grade | EDIAMME level | Language content | CEFR level alignment
6 | Preschool | 1 | Pre-reading, reading | A1
7 | 1 | | |
8 | 2 | | |
9 | 3 | 2 | Speaking and writing consolidation | A2
10 | 4 | | |
11 | 5 | 3 | Further practice in speaking and writing | B1
12 | 6 | | |
13 | 7 | 4 | Independent writing | B2 & C1
14 | 8 | | |
15 | 9 | | |
16 | 10 | 5 | Greek language and literature | C2
17 | 11 | | |
18 | 12 | | |

Table 1 | EDIAMME proficiency levels (Damanakis 2004: 76) and their alignment to CEFR levels (EDIAMME 2014)

Only prose texts were extracted from the textbooks, while poems, lyrics, exercises and guidelines to the exercises were excluded. The selected texts belong to different genres (mainly narrative, descriptive, expository and procedural) and types (letters, announcements, instructions, diary entries, etc.). Dialogues were also included, as they are very frequently used as educational material in L2 textbooks, though the role/name of the speaker was removed.

The final corpus employed in this investigation comprises 753 texts and a total of 112,169 tokens (Table 2). Each individual text inherited the proficiency level assigned to the textbook it was retrieved from, e.g. a text drawn from a textbook labelled as level 5 was considered as addressed to level 5 learners.1

Grouped levels | EDIAMME levels | Texts | Sentences | Tokens
1 (CEFR A1-A2) | 1 | 24 | 136 | 720
 | 2 | 295 | 4552 | 33636
2 (CEFR B1-C1) | 3 | 108 | 1263 | 8780
 | 4 | 147 | 2305 | 19272
3 (CEFR C2) | 5 | 179 | 3356 | 49761
Totals | | 753 | 11612 | 112169

Table 2 | Corpus description

1 It should be noted that this decision imposes a degree of "noise" on the data: although a low-level textbook is not expected to include a text addressed to higher levels, the reverse is not equally unlikely. E.g. certain texts retrieved from a level 5 textbook can actually address lower level learners.
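The label inheritance and the grouping of EDIAMME levels reported in Table 2 can be sketched as a simple lookup. The dictionary and function names below are hypothetical and chosen only for this illustration; the level-to-group mapping itself is the one given in Table 2.

```python
# Grouping of the five EDIAMME levels into the three grouped levels of Table 2.
EDIAMME_TO_GROUP = {1: 1, 2: 1, 3: 2, 4: 2, 5: 3}
GROUP_TO_CEFR = {1: "A1-A2", 2: "B1-C1", 3: "C2"}


def label_text(text_id: str, textbook_level: int) -> dict:
    """Assign to a text the level of the textbook it was retrieved from."""
    if textbook_level not in EDIAMME_TO_GROUP:
        raise ValueError(f"unknown EDIAMME level: {textbook_level}")
    group = EDIAMME_TO_GROUP[textbook_level]
    return {
        "text_id": text_id,
        "ediamme_level": textbook_level,  # inherited from the textbook
        "grouped_level": group,           # level used in the analysis
        "cefr": GROUP_TO_CEFR[group],
    }


if __name__ == "__main__":
    print(label_text("text_001", 5))
```

Because the label comes from the textbook rather than from the text itself, any text that is easier than its textbook level carries the "noise" discussed in the footnote.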
The texts were automatically annotated for morphological types, syntactic dependencies and phrase structure using the Institute for Language and Speech Processing NLP tools pipeline (Prokopidis et al. 2011; Prokopidis and Papageorgiou 2014).
2.2 Feature selection and computation
The set of features investigated as indices of the proficiency level was selected on the basis of previous research on L1 and L2 readability assessment, as well as on second language acquisition and development. These features capture morphological, syntactic, lexical/semantic and other attributes of the text that are salient to the target proficiency level discrimination and prediction task.
In total, 303 text features were identified and computed. These fall roughly into the following categories:
a) Surface features: word and sentence length (e.g. average word length), number of characters, punctuation marks, numbers, etc.
b) Lexical/semantic: lexical density (i.e. content to functional words), lexical variation (e.g. type/token ratio, hapax/dis legomena), including noun and verb variation measures, text entropy, lexical richness, etc.
c) Morphological: frequencies and ratios of the different parts of speech, including their forms, e.g. ratio of passive verbs to verbs, ratio of nouns in the genitive case to nouns, ratio of 1st person personal pronouns to pronouns, etc.
d) Syntactic: frequencies and ratios of the different syntactic roles (e.g. subjects to verbs ratio), measures of the dependency trees (e.g. depth and height of syntactic trees), phrase structure (e.g. length of noun, verb and adjectival phrases), subordination and apposition (e.g. average number of coordinating and subordinating conjunctions per sentence), etc.
e) Discourse-based features: e.g. use of relative pronouns as an index of the degree of anaphora density, frequency of present and past tenses as indices of temporality and narrativity, etc.
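A few of the surface and lexical features listed in (a) and (b) can be computed from raw text alone. The following is a minimal sketch, not the ILSP implementation: the regular-expression tokenizer, the sentence splitter and the function name are simplifying assumptions, whereas the study relies on an annotated corpus.

```python
import re
from collections import Counter


def surface_lexical_features(text: str) -> dict:
    """Compute a handful of surface and lexical-variation features."""
    # Naive sentence split on terminal punctuation (";" and U+037E cover
    # the Greek question mark); a real pipeline would use a trained splitter.
    sentences = [s for s in re.split(r"[.!?;\u037e]+", text) if s.strip()]
    tokens = re.findall(r"\w+", text.lower())
    if not tokens or not sentences:
        raise ValueError("text contains no tokens")
    counts = Counter(tokens)
    return {
        "avg_word_length": sum(len(t) for t in tokens) / len(tokens),
        "avg_sentence_length": len(tokens) / len(sentences),
        "type_token_ratio": len(counts) / len(tokens),
        "hapax_legomena": sum(1 for c in counts.values() if c == 1),
        "dis_legomena": sum(1 for c in counts.values() if c == 2),
    }


if __name__ == "__main__":
    print(surface_lexical_features("The cat sat. The dog sat down."))
```

Features in categories (c) to (e) cannot be derived this way, since they require part-of-speech, dependency and phrase-structure annotation.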
The defined features were computed with specialized software, the ILSP FeatExt tool, developed in Python. The input of FeatExt is any corpus of Greek texts automatically annotated for Part of Speech, syntactic dependencies and phrase structure. It calculates the values of raw surface features (frequencies of words, sentences, nouns, verbs, etc.) and computes their standardized values (i.e. meaningful ratios). In order to cater for zero values, a MinMaxScaler transformation is applied to all raw features. The output is a table of extracted feature values, preferably in CSV format. Settings can be modified through an optional configuration file to define, among others, the set of features to be computed, the corpus location, or additional feature-relevant data, such as a list of words to be counted (e.g. functional words, basic vocabulary for a specific proficiency level or topic, etc.).
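The scaling and export steps described above can be illustrated with a short sketch. It is not the FeatExt code: the text names a MinMaxScaler transformation without giving details, so the sketch implements the standard min-max formula by hand, and the function names, column names and three-decimal output format are assumptions.

```python
import csv
import io


def min_max_scale(values):
    """Rescale values linearly to [0, 1]; constant columns map to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def features_to_csv(rows, feature_names):
    """Scale each raw feature column and return the table as CSV text.

    rows: list of dicts holding a 'text_id' and one raw value per feature.
    """
    scaled = {
        name: min_max_scale([row[name] for row in rows]) for name in feature_names
    }
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["text_id", *feature_names])
    for i, row in enumerate(rows):
        writer.writerow(
            [row["text_id"], *(f"{scaled[name][i]:.3f}" for name in feature_names)]
        )
    return buffer.getvalue()


if __name__ == "__main__":
    demo = [
        {"text_id": "t1", "n_words": 10, "n_sentences": 2},
        {"text_id": "t2", "n_words": 30, "n_sentences": 3},
    ]
    print(features_to_csv(demo, ["n_words", "n_sentences"]))
```

Min-max scaling preserves the rank order of the values, so it does not affect a rank-based statistic such as Spearman's rho.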
3 Analysis and results
In order to investigate the underlying associations of text features with the proficiency level, correlation analysis was applied between all the extracted features and the grouped proficiency levels. Table 3 reports the twenty features that exhibited the highest absolute values of Spearman's rho correlation coefficient, in descending order (p < 0.05).
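For readers unfamiliar with the statistic, Spearman's rho is the Pearson correlation computed on ranks, with tied values sharing their average rank. This matters here because the grouped level takes only three values and therefore produces many ties. The following self-contained sketch illustrates the computation; the function names and sample values are hypothetical, and the significance test (p-value) is not covered.

```python
def average_ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(feature_values, levels):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    return pearson(average_ranks(feature_values), average_ranks(levels))


if __name__ == "__main__":
    # Hypothetical feature values for four texts and their grouped levels.
    print(spearman_rho([0.1, 0.4, 0.35, 0.8], [1, 1, 2, 3]))
```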
Among the best performing features, the average number of noun phrases in the genitive case per sentence was found to exhibit the highest correlation coefficient (rho = 0.542). The association of the genitive case with the text's level is also evidenced by the performance of two more features, i.e. the average number of adjectival phrases in the genitive case per sentence (rho = 0.473) and the average length of adjectival phrases in the genitive case (rho = 0.448). Looking at these results from a different angle, the influence of phrase structure, especially of the length and relative frequency of nominal phrases, is apparent: out of the 20 best performing features, six are indices of phrase structure (features in ranks 1, 6, 8, 12, 15 and 16 in Table 3). The frequency of use of modifiers, namely of adjectives, also seems to be highly correlated with the proficiency level: the more adjectives used in a text, the more likely it is that the text is addressed to higher level learners. This is evidenced by the average number of adjectival phrases and of adjectives per sentence.
Another important finding is highlighted by the performance of features that attempt to quantify syntactic dependencies. These include the width and height of dependency trees (rho = 0.495 and 0.486 respectively), as well as the number of leaves and governor nodes (rho = 0.490 and 0.485 respectively). Their emergence in the top ranks of Table 3 qualifies them as key predictors of the proficiency level.
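The dependency tree measures discussed above can be illustrated with a small sketch. The paper does not define these measures formally, so the definitions used here are assumptions: height is the number of edges on the longest path from the root, width is the largest number of nodes at any single depth, leaves are tokens without dependents, and governors are tokens with at least one dependent. The input encoding (a list of 0-based head indices, with -1 marking the root) and the function name are also illustrative choices.

```python
from collections import deque


def tree_measures(heads):
    """Compute height, width, leaf count and governor count of one tree.

    heads[i] is the 0-based index of the head of token i, or -1 for the root.
    """
    children = {tok: [] for tok in range(len(heads))}
    roots = []
    for tok, head in enumerate(heads):
        if head == -1:
            roots.append(tok)
        else:
            children[head].append(tok)
    if len(roots) != 1:
        raise ValueError("expected exactly one root")

    # Breadth-first traversal, counting the nodes found at each depth.
    level_sizes = []
    queue = deque([(roots[0], 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == len(level_sizes):
            level_sizes.append(0)
        level_sizes[depth] += 1
        for child in children[node]:
            queue.append((child, depth + 1))

    governors = sum(1 for tok in children if children[tok])
    return {
        "height": len(level_sizes) - 1,
        "width": max(level_sizes),
        "leaves": len(heads) - governors,
        "governors": governors,
    }


if __name__ == "__main__":
    # "The cat sat on the mat", with "sat" as root (hypothetical analysis).
    print(tree_measures([1, 2, -1, 5, 5, 2]))
```

Per-text features such as the average height of dependency trees would then be obtained by averaging these values over all sentences of a text.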
Table 3 | Top-20 features highly correlated with EDIAMME grouped levels and post hoc multiple comparisons between level-pairs

Rank | Feature | Spearman's rho | 1 vs 2 | 2 vs 3 | 1 vs 3
1 | Av. no. of Noun Phrases in gen. case per sentence | 0.542 | | |
2 | Av. Width of dependency trees | 0.495 | | |
3 | Av. no. of leaves in dependency trees | 0.490 | | |
4 | Av. Height of dependency trees | 0.486 | | |
5 | Av. Sentence length | 0.485 | | |
6 | Av. no. of Adjectival Phrases per sentence | 0.485 | | |
7 | Av. no. of governor nodes in dependency trees | 0.485 | | |
8 | Av. no. of Noun Phrases per sentence | 0.480 | | |
9 | % of sentences with length > 20 words | 0.477 | | |
10 | Av. no. of Adjectives per sentence | 0.474 | | |
11 | Av. Word length | 0.474 | | |
12 | Av. no. of Adjectival Phrases in gen. case per sentence | 0.473 | | |
13 | % of sentences with length > 10 words | 0.470 | | |
14 | Terminal punctuation to total characters ratio | -0.461 | | |
15 | Av. length of adjectival phrases in gen. case | 0.448 | | |
16 | Av. no. of Adjectival Phrases in acc. case per sentence | 0.446 | | |
17 | % of sentences with length > 30 words | 0.443 | | |
18 | Av. no. of Passive verbs per sentence | 0.442 | | |
19 | Relative pronouns to Pronouns ratio | 0.439 | | |
20 | Av. no. of prepositions per sentence | 0.438 | | |
Different aspects of syntactic complexity are also highlighted by the average number of passive verbs and prepositions per sentence. As expected, passive constructions are rarely used in lower levels, while learners encounter them more and more frequently in textbooks as their reading skills develop. The same is true for prepositions, a feature that indicates that higher proficiency level texts employ more complex-compound sentences.
The statistically significant correlation exhibited by the ratio of relative pronouns to pronouns (rho = 0.439) signifies the role of anaphora. As anaphora resolution is considered a linguistically and cognitively demanding task during reading, anaphoric structures are rare in lower levels but significantly more frequent in upper levels. As a result, the use of relative pronouns can be considered a successful discriminator of proficiency levels.
The list of the best performing features also includes some more "traditional" indices of text complexity, such as word and sentence length. The average sentence length appears in rank 5 in Table 3 (rho = 0.485), while relevant features that quantify sentence length from a different perspective are also present (the percentage of sentences with more than 10, 20 and 30 words). Additionally, the presence of the ratio of terminal punctuation to total characters (rho = -0.461) should also be interpreted as an inverse of sentence length. Regarding lexical features, it is noticeable that among the various features investigated (lexical diversity, density, etc.) only the average word length is present in the top performers (rho = 0.474).
A more thorough investigation of the above features employed one-way ANOVA for means comparison across levels, which resulted in statistically significant main effects for all of the 20 features. Since, however, this type of analysis cannot determine whether the mean values of a feature are statistically different between all possible level pairs, post hoc multiple comparisons (Bonferroni tests) were also applied. The results are presented in Table 3: statistically different means for each feature are indicated for each level combination separately. These comparisons indicate that all features can successfully discriminate group 3 (i.e. EDIAMME level 5, CEFR C2) from lower levels (both from group 2 and group 1). However, some of the features were not as successful in discriminating group 1 (i.e. EDIAMME levels 1 and 2, CEFR A1-A2) from group 2 (i.e. EDIAMME levels 3 and 4, CEFR B1-C1). Poor performers in discriminating group 1 from group 2 were all the features relevant to sentence length, with the exception of the proportion of sentences with more than 20 words. This implies that a group 1 text is unlikely to include lengthier sentences, thus suggesting a possible threshold for the transition from CEFR A2 to B1 level.
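The logic of this two-step analysis can be sketched as follows. The code computes the one-way ANOVA F statistic for one feature across the three grouped levels, then lists the level pairs together with the Bonferroni-adjusted significance threshold (alpha divided by the number of comparisons). It is an illustration only: the function names and sample values are hypothetical, and turning the F statistic or the pairwise tests into p-values requires the F and t distributions from a statistics library, which is omitted here.

```python
from itertools import combinations


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of value lists."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    means = [sum(g) / len(g) for g in groups]
    grand_mean = sum(sum(g) for g in groups) / n
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


def bonferroni_pairs(level_labels, alpha=0.05):
    """List all level pairs and the Bonferroni-adjusted threshold."""
    pairs = list(combinations(level_labels, 2))
    return pairs, alpha / len(pairs)


if __name__ == "__main__":
    # Hypothetical values of one feature for texts in grouped levels 1, 2, 3.
    by_level = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
    print(one_way_anova_f(by_level))
    print(bonferroni_pairs([1, 2, 3]))
```

With three grouped levels there are three pairwise comparisons, which correspond to the three level-pair columns of Table 3.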
4 Conclusions and discussion
The current investigation highlighted a number of textual features automatically extracted from a morphologically and syntactically annotated Greek L2 corpus. With the aim of identifying indices of text difficulty that are directly associated with the proficiency level, we employed statistical analysis and put forward the best performing features. These can be regarded as potential predictors of the proficiency level of a previously unseen text in an automatic labelling/classification approach.
The results highlight the influence of syntactic features on the characterization of proficiency level: with the exception of average word length, the rest of the best performing features are directly or indirectly related to syntactic complexity. This finding is in line with previous research, where syntax-related features consistently appear in the best-performing prediction models (e.g. Pitler and Nenkova 2008; Schwarm and Ostendorf 2005; Callan and Eskenazi 2007; Kate et al. 2010; Kotani et al. 2008). The frequencies of the genitive case, of adjectives and of prepositions were additionally identified as successful discriminators. Surface features used in traditional readability formulas, such as sentence and word length, were found to be significantly correlated with proficiency levels. Similar recent research in Greek has also highlighted the influence of such surface features on proficiency level classification (Tzimokas and Tantos 2014). It is interesting to notice that some of the features put forward by Georgatou (2016) as the most informative, i.e. sentence length, passive verbs and adjectives, are confirmed by the current study as well, thus qualifying them as reliable indices of the difficulty level of Greek texts.
When the best performing features were tested for their discriminatory power between all possible level pairs, they proved to be highly discriminative of the upper proficiency level. This finding implies a significant shift in L2 reading skills during the transition from C1 to C2 level, and this shift can successfully be measured by the features investigated herein. On the contrary, the transition from A2 to B1 seems to go hand in hand with the acquisition of language skills not captured by the features that emerged from the current analysis.
It is true that the current investigation is subject to limitations imposed by the corpus at hand, which comprised texts drawn from textbooks of a single publisher. As such, the findings may be influenced by the publisher's choices regarding the types and topics of texts and the linguistic descriptors of proficiency levels the editor has adopted. To cater for this limitation, the work described herein is being continued and expanded in order to exploit a larger corpus of Greek L2 texts from different publishers. Proficiency level labelling for this expanded corpus does not rely exclusively on the publisher's labelling. Rather, three independent experts in Greek L2 teaching have judged each text to determine the CEFR proficiency level. The experts' judgements are treated as the dependent variable in a machine learning approach for the automatic labelling of previously unseen texts, which has already yielded significant results.
Reading comprehension is a key skill in L2 development, and reading is an integral part of L2 instruction and assessment. In this view, an automated approach to matching L2 learners to texts suitable for their proficiency level is expected to facilitate the selection of reading material both for learners and teachers. It is at the same time an anticipated aid in assessment procedures, providing an objective measurement for the estimation of the level-appropriateness of items included in diagnostic, placement or achievement language tests.
References
Barzilay, Regina and Mirella Lapata. 2008. "Modeling Local Coherence: An Entity-based Approach." Computational Linguistics 34(1): 1–34.
Centre for the Greek Language. 2013. "Logismiko Anagnosimotitas." Accessed March 1, 2017. http://www.greek-language.gr/certification/readability
Council of Europe. 2001. Common European Framework of Reference for Languages: Learning, Teaching, Assessment (CEFR). www.coe.int/lang-CEFR
Damanakis, Michalis, ed. 2004. Theoritiko Plaisio kai Programmata Spoudon gia tin Ellinoglossi Ekpaideusi sti Diaspora. Rethymno: EDIAMME. httpwwwediammeedcuocgrdiaspora2indexphpid=23650010
DuBay, William H. 2006. The Classic Readability Studies. Impact Information, Costa Mesa, California.
EDIAMME. 2014. Epipeda Glossomatheias kai Ekpaideutiko Yliko. httpwwwediammeedcuocgrellinoglossiindexphpelekp-yliko-kepa
François, Thomas and Cédrick Fairon. 2012. "An "AI readability" Formula for French as a Foreign Language." In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 466–477. Association for Computational Linguistics.
François, Thomas and Eleni Miltsakaki. 2012. "Do NLP and Machine Learning Improve Traditional Readability Formulas?" In Proceedings of the First Workshop on Predicting and Improving Text Readability for Target Reader Populations, 49–57. Montréal.
Georgatou, Spyridoula. 2016. "Approaching Readability Features in Greek School Books." Master thesis, University of Tübingen.
Giagkou, Maria, Kantzou Vicky, Stamouli Spyridoula and Maria Tzevelekou. 2015. "Discriminating CEFR Levels in Greek L2: A Corpus-based Study of Young Learners' Written Narratives." Bergen Language and Linguistics Studies 6: 153–169.
Heilman, Michael, Kevyn Collins-Thompson and Maxine Eskenazi. 2008. "An Analysis of Statistical Models and Features for Reading Difficulty Prediction." In The Third Workshop on Innovative Use of NLP for Building Educational Applications: Proceedings of the Workshop, 71–79. ACL.
Callan, Jamie and Maxine Eskenazi. 2007. "Combining Lexical and Grammatical Features to Improve Readability Measures for First and Second Language Texts." In Proceedings of HLT-NAACL'07, 460–467. Association for Computational Linguistics.
Lu, Xiaofei. 2010. "Automatic Analysis of Syntactic Complexity in Second Language Writing." International Journal of Corpus Linguistics 15(4): 474–496.
Lu, Xiaofei. 2011. "A Corpus-Based Evaluation of Syntactic Complexity Measures as Indices of College-Level ESL Writers' Language Development." TESOL Quarterly 45(1): 36–62.
Ott, Niels and Detmar Meurers. 2010. "Information Retrieval for Education: Making Search Engines Language Aware." Themes in Science and Technology Education 3(1–2): 9–30.
Petersen, Sarah E. and Mari Ostendorf. 2009. "A Machine Learning Approach to Reading Level Assessment." Computer Speech and Language 23(1): 89–106.
Pilán, Ildikó, Volodina Elena and Richard Johansson. 2014. "Rule-Based and Machine Learning Approaches for Second Language Sentence-Level Readability." In Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications, 174–184. Association for Computational Linguistics.
Pitler, Emily and Ani Nenkova. 2008. "Revisiting Readability: A Unified Framework for Predicting Text Quality." In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, 186–195. Honolulu: ACL.
Prokopidis, Prokopis and Harris Papageorgiou. 2014. "Experiments for Dependency Parsing of Greek." In Proceedings of the First Joint Workshop on Statistical Parsing of Morphologically Rich Languages and Syntactic Analysis of Non-Canonical Languages, 90–96. Dublin, Ireland.
Prokopidis, Prokopis, Georgantopoulos Byron and Harris Papageorgiou. 2011. "A suite of NLP tools for Greek." In The 10th International Conference of Greek Linguistics, 373–383. Komotini, Greece.
Schwarm, Sarah E. and Mari Ostendorf. 2005. "Reading Level Assessment Using Support Vector Machines and Statistical Language Models." In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), 523–530. Ann Arbor, Michigan.
Tzimokas, Dimitrios and Sotirios Tantos. 2014. "Logismiko Anagnosimotitas Ellinikon Keimenon: Ena Neo Ergaleio gia ti Didaskalia tis Ellinikis os Ksenis/Deuteris Glossas kai tin Pistopoiisi Ellinomatheias." Paper presented at Diethnes Synedrio gia ti Didaskalia kai tin Pistopoiisi tis Ellinikis os Ksenis/Deuteris Glossas, Thessaloniki, October 25. httpspeakgreekgrelimagespdftzimwkaspdf
Vajjala, Sowmya and Detmar Meurers. 2012. "On Improving the Accuracy of Readability Classification Using Insights from Second Language Acquisition." In Proceedings of the Seventh Workshop on Building Educational Applications Using NLP, 163–173. Association for Computational Linguistics.
Vyatkina, Nina. 2012. "The Development of Second Language Writing Complexity in Groups and Individuals: A Longitudinal Learner Corpus Study." The Modern Language Journal 96(4): 576–598.
Stamatia Michalopoulou Third Language Acquisition The Pro-Drop-Parameter in the Interlanguage of Greek students of German 759
vicky Nanousi amp Arhonto Terzi Non-canonical sentences in agrammatism the case of Greek passives 773
Καλομοίρα Νικολού Μαρία Ξεφτέρη amp Νίτσα Παραχεράκη Τo φαινόμενο της σύνθεσης λέξεων στην κυκλαδοκρητική διαλεκτική ομάδα 789
Ελένη Παπαδάμου amp Δώρης Κ Κυριαζής Μορφές διαβαθμιστικής αναδίπλωσης στην ελληνική και στις άλλες βαλκανικές γλώσσες 807
Γεράσιμος Σοφοκλής Παπαδόπουλος Το δίπολο laquoΕμείς και οι Άλλοιraquo σε σχόλια αναγνωστών της Lifo σχετικά με τη Χρυσή Αυγή 823
Ελένη Παπαδοπούλου Η συνδυαστικότητα υποκοριστικών επιθημάτων με β συνθετικό το επίθημα -άκι στον διαλεκτικό λόγο 839
Στέλιος Πιπερίδης Πένυ Λαμπροπούλου amp Μαρία Γαβριηλίδου clarinel Υποδομή τεκμηρίωσης διαμοιρασμού και επεξεργασίας γλωσσικών δεδομένων 851
Maria Pontiki Opinion Mining and Target Extraction in Greek Review Texts 871
Anna Roussou The duality of mipos 885
Stathis Selimis amp Demetra Katis Reference to static space in Greek A cross-linguistic and developmental perspective of poster descriptions 897
Evi Sifaki amp George Tsoulas XP-V orders in Greek 911
Konstantinos Sipitanos On desiderative constructions in Naousa dialect 923
Eleni Staraki Future in Greek A Degree Expression 935
χριστίνα Τακούδα amp Ευανθία Παπαευθυμίου Συγκριτικές διδακτικές πρακτικές στη διδασκαλία της ελληνικής ως Γ2 από την κριτική παρατήρηση στην αναπλαισίωση 945
Alexandros Tantos Giorgos Chatziioannidis Katerina lykou Meropi Papatheohari Antonia Samara amp Kostas vlachos Corpus C58 and the interface between intra- and inter-sentential linguistic information 961
Arhonto Terzi amp vina TsakaliΤhe contribution of Greek SE in the development of locatives 977
Paraskevi ThomouConceptual and lexical aspects influencing metaphor realization in Modern Greek 993
Nina Topintzi amp Stuart Davis Features and Asymmetries of Edge Geminates 1007
liana Tronci At the lexicon-syntax interface Ancient Greek constructions with ἔχειν and psychological nouns 1021
Βίλλυ Τσάκωνα laquoΔημοκρατία είναι 4 λύκοι και 1 πρόβατο να ψηφίζουν για φαγητόraquoΑναλύοντας τα ανέκδοτα για τουςτις πολιτικούς στην οικονομική κρίση 1035
Ειρήνη Τσαμαδού- Jacoberger amp Μαρία ΖέρβαΕκμάθηση ελληνικών στο Πανεπιστήμιο Στρασβούργου κίνητρα και αναπαραστάσεις 1051
Stavroula Tsiplakou amp Spyros Armostis Do dialect variants (mis)behave Evidence from the Cypriot Greek koine 1065
Αγγελική Τσόκογλου amp Σύλα Κλειδή Συζητώντας τις δομές σε -οντας 1077
Αλεξιάννα Τσότσου Η μεθοδολογική προσέγγιση της εικόνας της Γερμανίας στις ελληνικές εφημερίδες 1095
Anastasia Tzilinis Begruumlndendes Handeln im neugriechischen Wissenschaftlichen Artikel Die Situierung des eigenen Beitrags im Forschungszusammenhang 1109
Kυριακούλα Τζωρτζάτου Aργύρης Αρχάκης Άννα Ιορδανίδου amp Γιώργος Ι Ξυδόπουλος Στάσεις απέναντι στην ορθογραφία της Κοινής Νέας Ελληνικής Ζητήματα ερευνητικού σχεδιασμού 1123
Nicole vassalou Dimitris Papazachariou amp Mark Janse The Vowel System of Mišoacutetika Cappadocian 1139
Marina vassiliou Angelos Georgaras Prokopis Prokopidis amp Haris Papageorgiou Co-referring or not co-referring Answer the question 1155
Jeroen vis The acquisition of Ancient Greek vocabulary 1171
Christos vlachos Mod(aliti)es of lifting wh-questions 1187
Ευαγγελία Βλάχου amp Κατερίνα Φραντζή Μελέτη της χρήσης των ποσοδεικτών λίγο-λιγάκι σε κείμενα πολιτικού λόγου 1201
Madeleine voga Τι μας διδάσκουν τα ρήματα της ΝΕ σχετικά με την επεξεργασία της μορφολογίας 1213
Werner voigtlaquoΣεληνάκι μου λαμπρό φέγγε μου να περπατώ hellipraquo oder warum es in dem bekannten lied nicht so sondern eben φεγγαράκι heiszligt und ngr φεγγάρι 1227
Μαρία Βραχιονίδου Υποκοριστικά επιρρήματα σε νεοελληνικές διαλέκτους και ιδιώματα 1241
Jeroen van de Weijer amp Marina TzakostaThe Status of Complex in Greek 1259
Theodoros Xioufis The pattern of the metaphor within metonymy in the figurative language of romantic love in modern Greek 1275
FEATURE EXTRACTIoN AND ANAlYSIS IN GREEK | 357
FEATURE EXTR ACTIoN AND ANAlYSIS IN GREEK l2 TEXT S IN vIEW oF AUToMATIC l ABElING FoR
PRoFICIENCY lEvElSMaria Giagkou1 Giorgos Fragkakis Dimitris Pappas1 amp Harris Papageorgiou1
1Institute for language and Speech Processing RC ATHENAmgiagkouilspgr fragakisschgr dpappasilspgr xarisilspgr
Περίληψη
Στο άρθρο διερευνάται ένα σύνολο γλωσσικών χαρακτηριστικών κειμένων που απευθύνο-νται σε μαθητές της Ελληνικής ως Γ2 και εξετάζεται η σχέση των εν λόγω χαρακτηριστικών με το επίπεδο γλωσσομάθειας για το οποίο θεωρούνται κατάλληλα τα κείμενα αυτά Στόχος είναι να διερευνηθεί ποια χαρακτηριστικά παρουσιάζουν επαρκή διακριτική ικανότητα μετα-ξύ των επιπέδων ώστε να αξιοποιηθούν σε μια προσέγγιση αυτόματης κατηγοριοποίησης σε επίπεδα γλωσσομάθειας Προς αυτό το σκοπό αξιοποιείται ένα σώμα κειμένων που συγκρο-τήθηκε από εγχειρίδια της Ελληνικής ως Γ2 Τα αποτελέσματα αναδεικνύουν τη σημαντική επίδραση μεταξύ άλλων χαρακτηριστικών που ποσοτικοποιούν την περιπλοκότητα των συντακτικών δέντρων εξαρτήσεων της γενικής πτώσης και των επιθετικών προσδιορισμών
Keywords: L2 reading, text complexity, linguistic features, proficiency levels, automatic labelling
1 Introduction
The last two decades have seen increasing interest in modelling text difficulty, i.e. readability. Automatic readability estimation systems are intended to assess whether a text retrieved from a large collection, such as a repository or the web, is appropriate for a given group of readers according to their abilities in L1 or by taking into account the readers' special needs (e.g. learning difficulties). Readability estimation is particularly relevant for second language (L2) learners as well. From the L2 perspective, the aim is to automatically identify or retrieve a text given the proficiency level of the learner or group of learners.
To this end, recent studies attempt to grade L2 texts according to proficiency levels in order to facilitate reading in L2 or as an aid to the selection of assessment material (e.g. Centre for the Greek Language 2013; Tzimokas and Tantos 2014; François and Fairon 2012; Ott and Meurers 2010; Pilán et al. 2014; Vajjala and Meurers 2012). In a similar approach, the development of productive skills in L2 (mainly writing) is investigated in view of an automated evaluation of L2 writing (e.g. Lu 2010, 2011; Vyatkina 2012; Giagkou et al. 2015).
The long tradition of L1 readability assessment, dating back to the early 20th century (see DuBay 2006), has bequeathed readability formulas (e.g. Flesch Reading Ease Score, Flesch-Kincaid Grade Level, Fog index, SMOG, etc.) that assign a difficulty grade or level to a text by relying on surface linguistic features, such as sentence and word length, as simple proxies for syntactic complexity and vocabulary burden respectively. More recently, advances in NLP have boosted readability research. That is, new resources (electronically available texts) and new tools (taggers, parsers, semantic treebanks, etc.) have made it feasible to apply machine learning techniques to large training corpora and to quantify more thorough and linguistically sound text features. Semantic and discourse features are investigated, e.g. named entities (Barzilay & Lapata 2008) and lexical cohesion (Pitler & Nenkova 2008). Shallow syntactic complexity indicators, such as average sentence length, are combined with the height of syntactic trees (see also Heilman et al. 2008). Instead of simple proxies of vocabulary burden, N-gram Language Models (LM) are used for predicting the grade level of texts (Callan and Eskenazi 2007; Petersen & Ostendorf 2009; Schwarm and Ostendorf 2005).
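To make the notion of surface proxies concrete, the following is a minimal sketch of the Flesch Reading Ease Score, one of the formulas named above. It uses the standard published constants for English and is shown only as an illustration of how such formulas combine sentence length and word length; it is not a measure used in the present study, and the counts in the example are invented.

```python
def flesch_reading_ease(n_words: int, n_sentences: int, n_syllables: int) -> float:
    """Flesch Reading Ease for English text, computed from three surface counts."""
    if n_words <= 0 or n_sentences <= 0:
        raise ValueError("need at least one word and one sentence")
    avg_sentence_length = n_words / n_sentences      # proxy for syntactic complexity
    avg_syllables_per_word = n_syllables / n_words   # proxy for vocabulary burden
    return 206.835 - 1.015 * avg_sentence_length - 84.6 * avg_syllables_per_word


# Invented counts for illustration: 100 words, 5 sentences, 150 syllables.
score = flesch_reading_ease(n_words=100, n_sentences=5, n_syllables=150)
print(round(score, 3))
```

Longer sentences and longer words both lower the score, which is exactly the kind of shallow relationship that the NLP-enabled features discussed next are meant to refine.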
In this paper we present an investigation of linguistic features of texts addressed to learners of Greek as a second language (L2). The goal of this study is to identify the textual properties that indicate the development of reading skills in Greek L2, with the aim of employing these properties as parameters for automatic proficiency level labelling. The set of features investigated in the current study draws on traditional readability research combined with NLP-enabled features and machine learning techniques for text classification, as this merging was found to result in performance gain (François & Miltsakaki 2012).
The paper is organized as follows: Section 2 provides information on the corpus used and the features identified, selected and computed in order to form the dataset for the analysis. In Section 3 the analysis applied to the features is presented and the results are analyzed. We conclude with a summary of the main findings and their implications for the directions of future work in view of automatic proficiency level classification for Greek L2.
2 Datasets
2.1 Corpus
For the purposes of this investigation, a Greek L2 text set that is labelled for proficiency levels in an objective and qualified way, and can thus be considered a gold standard, was deemed necessary. Such a dataset was retrieved from the Greek L2 textbooks published by the Centre of Intercultural and Migration Studies (EDIAMME) and freely available online. These textbooks are addressed to Greek migrants living abroad, from pre-schoolers (aged 6) to 18-year-olds, learning Greek as a second or foreign language. EDIAMME employs five proficiency levels aligned to the Greek educational system grades and to CEFR levels (Council of Europe 2001), as presented in Table 1.
Age   School grade   EDIAMME level   Language content                            CEFR level alignment
6     Preschool      1               Pre-reading, reading                        A1
7     1              1               Pre-reading, reading                        A1
8     2              1               Pre-reading, reading                        A1
9     3              2               Speaking and writing consolidation          A2
10    4              2               Speaking and writing consolidation          A2
11    5              3               Further practice in speaking and writing    B1
12    6              3               Further practice in speaking and writing    B1
13    7              4               Independent writing                         B2 & C1
14    8              4               Independent writing                         B2 & C1
15    9              4               Independent writing                         B2 & C1
16    10             5               Greek language and literature               C2
17    11             5               Greek language and literature               C2
18    12             5               Greek language and literature               C2
Table 1 | EDIAMME proficiency levels (Damanakis 2004: 76) and their alignment to CEFR levels (EDIAMME 2014)
Only prose texts were extracted from the textbooks, while poems, lyrics, exercises and guidelines to the exercises were excluded. The selected texts belong to different genres (mainly narrative, descriptive, expository and procedural) and types (letters, announcements, instructions, diary entries, etc.). Dialogues were also included, as they are very frequently used as educational material in L2 textbooks, though the role/name of the speaker was removed.
The final corpus employed in this investigation comprises 753 texts and a total of 112,169 tokens (Table 2). Each individual text inherited the proficiency level assigned to the textbook it was retrieved from, e.g. a text drawn from a textbook labeled as level 5 was considered as addressed to level 5 learners.¹
Grouped levels     EDIAMME levels   Texts   Sentences   Tokens
1 (CEFR A1-A2)     1                24      136         720
                   2                295     4,552       33,636
2 (CEFR B1-C1)     3                108     1,263       8,780
                   4                147     2,305       19,272
3 (CEFR C2)        5                179     3,356       49,761
Totals                              753     11,612      112,169
Table 2 | Corpus description
¹ It should be noted that this decision imposes a degree of "noise" on the data: although a low level textbook is not expected to include a text addressed to higher levels, the reverse is not equally unlikely. E.g. certain texts retrieved from a level 5 textbook can actually address lower level learners.
The texts were automatically annotated for morphological types, syntactic dependencies and phrase structure using the Institute for Language and Speech Processing NLP tools pipeline (Prokopidis et al. 2011; Prokopidis and Papageorgiou 2014).
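The labelling scheme of this section, where each text inherits the level of its textbook and the five EDIAMME levels are merged into three groups, can be sketched as follows. The per-level counts are copied from Table 2; the names `CORPUS`, `GROUP_OF_LEVEL`, `label_text`, `totals` and `group_totals` are hypothetical and serve only to illustrate the bookkeeping, not to describe the authors' own code.

```python
# Per-level counts copied from Table 2: (texts, sentences, tokens).
CORPUS = {
    1: (24, 136, 720),
    2: (295, 4552, 33636),
    3: (108, 1263, 8780),
    4: (147, 2305, 19272),
    5: (179, 3356, 49761),
}

# EDIAMME level -> grouped level used in the analysis (Table 2).
GROUP_OF_LEVEL = {1: 1, 2: 1, 3: 2, 4: 2, 5: 3}


def label_text(textbook_level: int) -> int:
    """A text inherits its textbook's EDIAMME level; return the grouped level."""
    return GROUP_OF_LEVEL[textbook_level]


def totals(corpus):
    """Column-wise sums over all EDIAMME levels."""
    return tuple(sum(column) for column in zip(*corpus.values()))


def group_totals(corpus):
    """Column-wise sums per grouped level."""
    result = {}
    for level, counts in corpus.items():
        group = GROUP_OF_LEVEL[level]
        previous = result.get(group, (0, 0, 0))
        result[group] = tuple(a + b for a, b in zip(previous, counts))
    return result


print(totals(CORPUS))
print(group_totals(CORPUS))
```

Summing per group makes the imbalance of the corpus visible: the single level of group 3 contributes more tokens than either of the two lower groups.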
2.2 Feature selection and computation
The set of features investigated as indices of the proficiency level was selected on the basis of previous research on L1 and L2 readability assessment, as well as on second language acquisition and development. These features capture morphological, syntactic, lexical/semantic and other attributes of the text that are salient to the target proficiency level discrimination and prediction task.
In total, 303 text features were identified and computed. These fall broadly into the following categories:
a) Surface features: word and sentence length (e.g. average word length), number of characters, punctuation marks, numbers, etc.
b) Lexical/semantic: lexical density (i.e. content to functional words), lexical variation (e.g. type/token ratio, hapax/dis legomena), including noun and verb variation measures, text entropy, lexical richness, etc.
c) Morphological: frequencies and ratios of the different parts of speech, including their forms, e.g. ratio of passive verbs to verbs, ratio of nouns in the genitive case to nouns, ratio of 1st person personal pronouns to pronouns, etc.
d) Syntactic: frequencies and ratios of the different syntactic roles (e.g. subjects to verbs ratio), measures of the dependency trees (e.g. depth and height of syntactic trees), phrase structure (e.g. length of noun, verb and adjectival phrases), subordination and apposition (e.g. average number of coordinating and subordinating conjunctions per sentence), etc.
e) Discourse-based features: e.g. use of relative pronouns as an index of the degree of anaphora density, frequency of present and past tenses as indices of temporality and narrativity, etc.
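As an illustration of category (b), a few of the lexical measures listed above could be computed along the following lines. This is a minimal sketch under stated assumptions: the function name `lexical_features` and the tiny `FUNCTION_WORDS` list are hypothetical, tokens are assumed to be already split, and the exact definitions used in the study may differ.

```python
from collections import Counter

# Hypothetical, deliberately tiny function-word list, for illustration only.
FUNCTION_WORDS = {"ο", "η", "το", "και", "να", "σε", "με", "για"}


def lexical_features(tokens):
    """Type/token ratio, hapax and dis legomena, and content-to-function ratio."""
    if not tokens:
        raise ValueError("empty text")
    tokens = [t.lower() for t in tokens]
    counts = Counter(tokens)
    n_tokens = len(tokens)
    n_function = sum(1 for t in tokens if t in FUNCTION_WORDS)
    n_content = n_tokens - n_function
    return {
        "type_token_ratio": len(counts) / n_tokens,
        "hapax_legomena": sum(1 for c in counts.values() if c == 1),
        "dis_legomena": sum(1 for c in counts.values() if c == 2),
        # Lexical density expressed as content words per function word.
        "content_to_function_ratio": (
            n_content / n_function if n_function else float("inf")
        ),
    }


features = lexical_features("η γάτα και ο σκύλος και η αλεπού".split())
print(features)
```

In practice the function-word list would be supplied as external data rather than hard-coded, and the counts would be taken over lemmas or word forms depending on the annotation available.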
The defined features were computed with specialized software, the ILSP FeatExt tool, developed in Python. The input of FeatExt is any corpus of Greek texts automatically annotated for part of speech, syntactic dependencies and phrase structure. It calculates the values of raw surface features (frequencies of words, sentences, nouns, verbs, etc.) and computes their standardized values (i.e. meaningful ratios). In order to cater for zero values, a MinMaxScaler transformation is applied to all raw features. The output is a table of extracted feature values, preferably in CSV format. Settings can be modified through an optional configuration file to define, among others, the set of features to be computed, the corpus location, or additional feature-relevant data such as a list of words to be counted (e.g. functional words, basic vocabulary for a specific proficiency level or topic, etc.).
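The scaling and export steps can be sketched in plain Python as follows. This is not the FeatExt implementation, whose internals are not described here; it is a minimal illustration assuming the usual min-max definition with a target range of 0 to 1, and the names `min_max_scale`, `to_csv` and the feature columns are hypothetical.

```python
import csv
import io


def min_max_scale(values):
    """Rescale a feature column to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def to_csv(rows, feature_names):
    """Scale each feature column and serialise one row per text as CSV."""
    columns = [min_max_scale([row[name] for row in rows]) for name in feature_names]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["text_id", *feature_names])
    for i, row in enumerate(rows):
        writer.writerow([row["text_id"], *(f"{col[i]:.3f}" for col in columns)])
    return buffer.getvalue()


# Invented raw counts for three texts, for illustration only.
rows = [
    {"text_id": "t1", "n_words": 50, "n_sentences": 10},
    {"text_id": "t2", "n_words": 150, "n_sentences": 10},
    {"text_id": "t3", "n_words": 250, "n_sentences": 20},
]
csv_text = to_csv(rows, ["n_words", "n_sentences"])
print(csv_text)
```

Scaling each column independently keeps features with very different magnitudes (token counts versus ratios) on a comparable footing before any statistical analysis.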
3 Analysis and results
In order to investigate the underlying associations of text features with the proficiency level, correlation analysis was applied between all the extracted features and the grouped proficiency levels. Table 3 reports the twenty features that exhibited the highest absolute values of Spearman's rho correlation coefficient, in descending order (p<0.05).
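For readers less familiar with the statistic, Spearman's rho is the Pearson correlation of the rank-transformed variables, with tied observations sharing their average rank. The sketch below illustrates this on invented values; the helper names are hypothetical, and no significance test is included.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented example: one feature value per text and the text's grouped level.
feature_values = [0.2, 0.4, 0.5, 0.7, 0.9, 1.1]
grouped_levels = [1, 1, 2, 2, 3, 3]
rho = spearman_rho(feature_values, grouped_levels)
print(round(rho, 3))
```

Because the grouped level takes only three values, ties are unavoidable on that side, which is why average ranks rather than the simplified difference-of-ranks formula are used in the sketch.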
Among the best performing features, the average number of noun phrases in the genitive case per sentence was found to exhibit the highest correlation coefficient (rho=0.542). The association of the genitive case with the text's level is also evidenced by the performance of two more features, i.e. the average number of adjectival phrases in the genitive case per sentence (rho=0.473) and the average length of adjectival phrases in the genitive case (rho=0.448). Looking at these results from a different angle, the influence of phrase structure, especially of the length and relative frequency of nominal phrases, is apparent: out of the 20 best performing features, six are indices of phrase structure (features in ranks 1, 6, 8, 12, 15 and 16 in Table 3). The frequency of use of modifiers, namely of adjectives, also seems to be highly correlated with the proficiency level: the more adjectives used in a text, the more likely it is that the text is addressed to higher level learners. This is evidenced by the average number of adjectival phrases and of adjectives per sentence.
Another important finding is highlighted by the performance of features that attempt to quantify syntactic dependencies. These include the width and height of dependency trees (rho=0.495 and 0.486 respectively), as well as the number of leaves and governor nodes (rho=0.490 and 0.485 respectively). Their emergence in the top ranks of Table 3 qualifies them as key predictors of the proficiency level.
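The dependency tree measures mentioned above can be illustrated with a small sketch. The definitions used here are assumptions rather than the study's own: height is taken as the number of edges on the longest path from the root to a leaf, width as the largest number of nodes found at any single depth, leaves as tokens with no dependents, and governors as tokens with at least one dependent. The input encoding (a list of head indices with -1 for the root) and the function name are likewise hypothetical.

```python
from collections import defaultdict


def tree_measures(heads):
    """heads[i] is the index of token i's governor; the root has head -1."""
    children = defaultdict(list)
    root = None
    for token, head in enumerate(heads):
        if head == -1:
            root = token
        else:
            children[head].append(token)
    if root is None:
        raise ValueError("no root token")

    # Breadth-first walk from the root, counting the nodes at each depth.
    nodes_per_depth = []
    frontier = [root]
    while frontier:
        nodes_per_depth.append(len(frontier))
        frontier = [child for node in frontier for child in children[node]]

    governors = sum(1 for token in range(len(heads)) if children[token])
    return {
        "height": len(nodes_per_depth) - 1,  # edges on the longest root-to-leaf path
        "width": max(nodes_per_depth),       # assumed: most nodes at a single depth
        "leaves": len(heads) - governors,
        "governors": governors,
    }


# "Η Μαρία διαβάζει ένα μεγάλο βιβλίο": the verb (index 2) is the root.
measures = tree_measures([1, 2, -1, 5, 5, 2])
print(measures)
```

Averaging each of these values over the sentences of a text would yield per-text features of the kind ranked in Table 3, under the assumed definitions.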
Table 3 | Top-20 features highly correlated with EDIAMME grouped levels and post hoc multiple comparisons between level-pairs
Rank  Feature                                                    Spearman's rho
1     Av. # of Noun Phrases in gen. case per sentence             0.542
2     Av. Width of dependency trees                               0.495
3     Av. # of leaves in dependency trees                         0.490
4     Av. Height of dependency trees                              0.486
5     Av. Sentence length                                         0.485
6     Av. # of Adjectival Phrases per sentence                    0.485
7     Av. # of governor nodes in dependency trees                 0.485
8     Av. # of Noun Phrases per sentence                          0.480
9     % of sentences with length > 20 words                       0.477
10    Av. # of Adjectives per sentence                            0.474
11    Av. Word length                                             0.474
12    Av. # of Adjectival Phrases in gen. case per sentence       0.473
13    % of sentences with length > 10 words                       0.470
14    Terminal punctuation to total characters ratio             -0.461
15    Av. length of adjectival phrases in gen. case               0.448
16    Av. # of Adjectival Phrases in acc. case per sentence       0.446
17    % of sentences with length > 30 words                       0.443
18    Av. # of Passive verbs per sentence                         0.442
19    Relative pronouns to Pronouns ratio                         0.439
20    Av. # of prepositions per sentence                          0.438
[EDIAMME grouped level-pairs columns (1 vs 2, 2 vs 3, 1 vs 3): per-feature significance marks not recoverable]
Different aspects of syntactic complexity are also highlighted by the average number of passive verbs and prepositions per sentence. As expected, passive constructions are rarely used in lower levels, while learners encounter them more and more frequently in textbooks as their reading skills develop. The same is true for prepositions, a feature that indicates that higher proficiency level texts employ more complex-compound sentences.
The statistically significant correlation exhibited by the ratio of relative pronouns to pronouns (rho=0.439) signifies the role of anaphora. As anaphora resolution is considered a linguistically and cognitively demanding task during reading, anaphoric structures are rare in lower levels but significantly more frequent in upper levels. As a result, the use of relative pronouns can be considered a successful discriminator of proficiency levels.
The list of the best performing features also includes some more "traditional" indices of text complexity, such as word and sentence length. The average sentence length appears in rank 5 in Table 3 (rho=0.485), while relevant features that quantify sentence length from a different perspective are also present (the percentage of sentences with more than 10, 20 and 30 words). Additionally, the presence of the ratio of terminal punctuation to total characters should also be interpreted as an inverse of sentence length. Regarding lexical features, it is noticeable that among the various features investigated (lexical diversity, density, etc.) only the average word length is present in the top performers (rho=0.474).
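The sentence-length features, and the reason the terminal punctuation ratio behaves as their inverse, can be sketched as follows. The function names and the choice of terminal marks are assumptions made for illustration (full stop, exclamation mark and ";", which serves as the question mark in Greek); the example sentences are invented.

```python
def sentence_length_features(sentences):
    """sentences: list of sentences, each given as a list of word tokens."""
    lengths = [len(sentence) for sentence in sentences]
    n = len(lengths)
    features = {"avg_sentence_length": sum(lengths) / n}
    for threshold in (10, 20, 30):
        longer = sum(1 for length in lengths if length > threshold)
        features[f"pct_longer_than_{threshold}"] = 100 * longer / n
    return features


def terminal_punctuation_ratio(text, terminal=".!;"):
    """Terminal punctuation marks divided by the total number of characters."""
    return sum(text.count(mark) for mark in terminal) / len(text)


# Four invented sentences of 5, 12, 25 and 35 words.
toy_sentences = [["w"] * 5, ["w"] * 12, ["w"] * 25, ["w"] * 35]
length_features = sentence_length_features(toy_sentences)

# The same content written as three short sentences or as one longer sentence.
short_ratio = terminal_punctuation_ratio("Ήρθα. Είδα. Έφυγα.")
long_ratio = terminal_punctuation_ratio("Ήρθα και είδα και έφυγα.")
print(length_features, short_ratio, long_ratio)
```

The shorter the sentences, the more terminal marks a text contains per character, which is consistent with the negative sign of this feature's coefficient in Table 3.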
A more thorough investigation of the above features employed one-way ANOVA for means comparison across levels, which resulted in statistically significant main effects for all of the 20 features. Since, however, this type of analysis cannot determine whether the mean values of a feature are statistically different between all possible level pairs, post-hoc multiple comparisons (Bonferroni tests) were also applied. The results are presented in Table 3: statistically different means for each feature are indicated for each level combination separately. These comparisons indicate that all features can successfully discriminate group 3 (i.e. EDIAMME level 5, CEFR C2) from lower levels (both from group 2 and group 1). However, some of the features were not as successful in discriminating group 1 (i.e. EDIAMME levels 1 and 2, CEFR A1, A2) from group 2 (i.e. EDIAMME levels 3, 4, CEFR B1-C1). Poor performers in discriminating group 1 from group 2 were all the features relevant to sentence length, with the exception of the proportion of sentences with more than 20 words. This implies that a group 1 text is unlikely to include sentences longer than 20 words, thus suggesting a possible threshold for the transition from CEFR A2 to B1 level.
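The two-step procedure described above can be sketched as follows: an omnibus F statistic over the three grouped levels, followed by pairwise comparisons judged against a Bonferroni-corrected significance threshold. The sketch computes only the F statistic and the corrected threshold; obtaining p-values requires the F and t distributions from a statistics library and is left out. The function names and the toy feature values are hypothetical and do not come from the corpus.

```python
from itertools import combinations


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    k = len(groups)
    n_total = sum(len(group) for group in groups)
    grand_mean = sum(sum(group) for group in groups) / n_total
    means = [sum(group) / len(group) for group in groups]
    ss_between = sum(
        len(group) * (mean - grand_mean) ** 2 for group, mean in zip(groups, means)
    )
    ss_within = sum(
        (value - mean) ** 2 for group, mean in zip(groups, means) for value in group
    )
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


def bonferroni_alpha(level_names, alpha=0.05):
    """All level pairs and the per-comparison threshold alpha / number of pairs."""
    pairs = list(combinations(level_names, 2))
    return pairs, alpha / len(pairs)


# Invented feature values for three grouped levels, for illustration only.
toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_between, df_within = one_way_anova_f(toy_groups)
pairs, corrected_alpha = bonferroni_alpha([1, 2, 3])
print(f_stat, df_between, df_within, pairs, corrected_alpha)
```

With three grouped levels there are three pairwise comparisons, so each one is judged at one third of the nominal significance level.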
4 Conclusions and discussion
The current investigation highlighted a number of textual features automatically extracted from a morphologically and syntactically annotated Greek L2 corpus. With the aim of identifying indices of text difficulty that are directly associated with the proficiency level, we employed statistical analysis and put forward the best performing features. These can be regarded as potential predictors of the proficiency level of a previously unseen text in an automatic labelling/classification approach.
The results highlight the influence of syntactic features on the characterization of proficiency level: with the exception of average word length, the rest of the best performing features are directly or indirectly related to syntactic complexity. This finding is in line with previous research, where syntax-related features consistently appear in the best-performing prediction models (e.g. Pitler and Nenkova 2008; Schwarm and Ostendorf 2005; Callan and Eskenazi 2007; Kate et al. 2010; Kotani et al. 2008). The frequencies of the genitive case, of adjectives and of prepositions were additionally identified as successful discriminators. Surface features used in traditional readability formulas, such as sentence and word length, were found to be significantly correlated with proficiency levels. Similar recent research in Greek has also highlighted the influence of such surface features on proficiency level classification (Tzimokas and Tantos 2014). It is interesting to notice that some of the features put forward by Georgatou (2016) as the most informative, i.e. sentence length, passive verbs and adjectives, are confirmed by the current study as well, thus qualifying them as reliable indices of the difficulty level of Greek texts.
When the best performing features were tested for their discriminatory power between all possible level pairs, they proved to be highly discriminative of the upper proficiency level. This finding implies a significant shift in L2 reading skills during the transition from C1 to C2 level, and this shift can successfully be measured by the features investigated herein. On the contrary, the transition from A2 to B1 seems to go hand in hand with the acquisition of language skills not captured by the features that emerged from the current analysis.
It is true that the current investigation is subject to limitations imposed by the corpus at hand, which comprised texts drawn from textbooks of a single publisher. As such, the findings may be influenced by the publisher's choices regarding the types and topics of texts and the linguistic descriptors of proficiency levels the editor has adopted. To cater for this limitation, the work described herein is being continued and expanded in order to exploit a larger corpus of Greek L2 texts from different publishers. Proficiency level labelling for this expanded corpus does not rely exclusively on the publisher's labelling. Rather, three independent experts in Greek L2 teaching have judged each text to determine the CEFR proficiency level. The experts' judgements are treated as the dependent variable in a machine learning approach for the automatic labelling of previously unseen texts, which has already yielded significant results.
Reading comprehension is a key skill in L2 development, and reading is an integral part of L2 instruction and assessment. In this view, an automated approach to matching L2 learners to texts suitable for their proficiency level is expected to facilitate selection of reading material, both for learners and teachers. It is at the same time an anticipated aid in assessment procedures, by providing an objective measurement for the estimation of level-appropriateness of items included in diagnostic, placement or achievement language tests.
References
Barzilay, Regina, and Mirella Lapata. 2008. "Modeling Local Coherence: An Entity-based Approach." Computational Linguistics 34(1):1–34.
Centre for the Greek Language. 2013. "Logismiko Anagnosimotitas." Accessed March 1, 2017. http://www.greek-language.gr/certification/readability
Council of Europe. 2001. Common European Framework of Reference for Languages: Learning, Teaching, Assessment (CEFR). www.coe.int/lang-CEFR
Damanakis, Michalis, ed. 2004. Theoritiko Plaisio kai Programmata Spoudon gia tin Ellinoglossi Ekpaideusi sti Diaspora. Rethymno: EDIAMME. http://www.ediamme.edc.uoc.gr/diaspora2/index.php?id=23650010
DuBay, William H. 2006. The Classic Readability Studies. Impact Information, Costa Mesa, California.
EDIAMME. 2014. Epipeda Glossomatheias kai Ekpaideutiko Yliko. http://www.ediamme.edc.uoc.gr/ellinoglossi/index.php/el/ekp-yliko-kepa
François, Thomas, and Cédrick Fairon. 2012. "An "AI readability" Formula for French as a Foreign Language." In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 466–477. Association for Computational Linguistics.
François, Thomas, and Eleni Miltsakaki. 2012. "Do NLP and Machine Learning Improve Traditional Readability Formulas?" In Proceedings of the First Workshop on Predicting and Improving Text Readability for Target Reader Populations, 49–57. Montréal.
Georgatou, Spyridoula. 2016. "Approaching Readability Features in Greek School Books." Master thesis, University of Tübingen.
Giagkou, Maria, Vicky Kantzou, Spyridoula Stamouli, and Maria Tzevelekou. 2015. "Discriminating CEFR Levels in Greek L2: A Corpus-based Study of Young Learners' Written Narratives." Bergen Language and Linguistics Studies 6:153–169.
Heilman, Michael, Kevyn Collins-Thompson, and Maxine Eskenazi. 2008. "An Analysis of Statistical Models and Features for Reading Difficulty Prediction." In The Third Workshop on Innovative Use of NLP for Building Educational Applications: Proceedings of the Workshop, 71–79. ACL.
Callan, Jamie, and Maxine Eskenazi. 2007. "Combining Lexical and Grammatical Features to Improve Readability Measures for First and Second Language Texts." In Proceedings of HLT-NAACL'07, 460–467. Association for Computational Linguistics.
Lu, Xiaofei. 2010. "Automatic Analysis of Syntactic Complexity in Second Language Writing." International Journal of Corpus Linguistics 15(4):474–496.
Lu, Xiaofei. 2011. "A Corpus-Based Evaluation of Syntactic Complexity Measures as Indices of College-Level ESL Writers' Language Development." TESOL Quarterly 45(1):36–62.
Ott, Niels, and Detmar Meurers. 2010. "Information Retrieval for Education: Making Search Engines Language Aware." Themes in Science and Technology Education 3(1–2):9–30.
Petersen, Sarah E., and Mari Ostendorf. 2009. "A Machine Learning Approach to Reading Level Assessment." Computer Speech and Language 23(1):89–106.
Pilán, Ildikó, Elena Volodina, and Richard Johansson. 2014. "Rule-Based and Machine Learning Approaches for Second Language Sentence-Level Readability." In Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications, 174–184. Association for Computational Linguistics.
Pitler, Emily, and Ani Nenkova. 2008. "Revisiting Readability: A Unified Framework for Predicting Text Quality." In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, 186–195. Honolulu: ACL.
Prokopidis, Prokopis, and Harris Papageorgiou. 2014. "Experiments for Dependency Parsing of Greek." In Proceedings of the First Joint Workshop on Statistical Parsing of Morphologically Rich Languages and Syntactic Analysis of Non-Canonical Languages, 90–96. Dublin, Ireland.
Prokopidis, Prokopis, Byron Georgantopoulos, and Harris Papageorgiou. 2011. "A Suite of NLP Tools for Greek." In The 10th International Conference of Greek Linguistics, 373–383. Komotini, Greece.
Schwarm, Sarah E., and Mari Ostendorf. 2005. "Reading Level Assessment Using Support Vector Machines and Statistical Language Models." In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), 523–530. Ann Arbor, Michigan.
Tzimokas, Dimitrios, and Sotirios Tantos. 2014. "Logismiko Anagnosimotitas Ellinikon Keimenon: Ena Neo Ergaleio gia ti Didaskalia tis Ellinikis os Ksenis/Deuteris Glossas kai tin Pistopoiisi Ellinomatheias." Paper presented at Diethnes Synedrio gia ti Didaskalia kai tin Pistopoiisi tis Ellinikis os Ksenis/Deuteris Glossas, Thessaloniki, October 25. http://speakgreek.gr/el/images/pdf/tzimwkas.pdf
Vajjala, Sowmya, and Detmar Meurers. 2012. "On Improving the Accuracy of Readability Classification Using Insights from Second Language Acquisition." In Proceedings of the Seventh Workshop on Building Educational Applications Using NLP, 163–173. Association for Computational Linguistics.
Vyatkina, Nina. 2012. "The Development of Second Language Writing Complexity in Groups and Individuals: A Longitudinal Learner Corpus Study." The Modern Language Journal 96(4):576–598.
Marina vassiliou Angelos Georgaras Prokopis Prokopidis amp Haris Papageorgiou Co-referring or not co-referring Answer the question 1155
Jeroen vis The acquisition of Ancient Greek vocabulary 1171
Christos vlachos Mod(aliti)es of lifting wh-questions 1187
Ευαγγελία Βλάχου amp Κατερίνα Φραντζή Μελέτη της χρήσης των ποσοδεικτών λίγο-λιγάκι σε κείμενα πολιτικού λόγου 1201
Madeleine voga Τι μας διδάσκουν τα ρήματα της ΝΕ σχετικά με την επεξεργασία της μορφολογίας 1213
Werner voigtlaquoΣεληνάκι μου λαμπρό φέγγε μου να περπατώ hellipraquo oder warum es in dem bekannten lied nicht so sondern eben φεγγαράκι heiszligt und ngr φεγγάρι 1227
Μαρία Βραχιονίδου Υποκοριστικά επιρρήματα σε νεοελληνικές διαλέκτους και ιδιώματα 1241
Jeroen van de Weijer amp Marina TzakostaThe Status of Complex in Greek 1259
Theodoros Xioufis The pattern of the metaphor within metonymy in the figurative language of romantic love in modern Greek 1275
FEATURE EXTRACTIoN AND ANAlYSIS IN GREEK | 357
FEATURE EXTR ACTIoN AND ANAlYSIS IN GREEK l2 TEXT S IN vIEW oF AUToMATIC l ABElING FoR
PRoFICIENCY lEvElSMaria Giagkou1 Giorgos Fragkakis Dimitris Pappas1 amp Harris Papageorgiou1
1Institute for language and Speech Processing RC ATHENAmgiagkouilspgr fragakisschgr dpappasilspgr xarisilspgr
Περίληψη
Στο άρθρο διερευνάται ένα σύνολο γλωσσικών χαρακτηριστικών κειμένων που απευθύνο-νται σε μαθητές της Ελληνικής ως Γ2 και εξετάζεται η σχέση των εν λόγω χαρακτηριστικών με το επίπεδο γλωσσομάθειας για το οποίο θεωρούνται κατάλληλα τα κείμενα αυτά Στόχος είναι να διερευνηθεί ποια χαρακτηριστικά παρουσιάζουν επαρκή διακριτική ικανότητα μετα-ξύ των επιπέδων ώστε να αξιοποιηθούν σε μια προσέγγιση αυτόματης κατηγοριοποίησης σε επίπεδα γλωσσομάθειας Προς αυτό το σκοπό αξιοποιείται ένα σώμα κειμένων που συγκρο-τήθηκε από εγχειρίδια της Ελληνικής ως Γ2 Τα αποτελέσματα αναδεικνύουν τη σημαντική επίδραση μεταξύ άλλων χαρακτηριστικών που ποσοτικοποιούν την περιπλοκότητα των συντακτικών δέντρων εξαρτήσεων της γενικής πτώσης και των επιθετικών προσδιορισμών
Keywords L2 reading text complexity linguistic features proficiency levels automatic label-ling
1 introduction
The last two decades have seen increasing interest in modelling text difficulty ie read-ability Automatic readability estimation systems are intended to assess whether a text retrieved from a large collection such as a repository or the web is appropriate for a given group of readers according to their abilities in l1 or by taking into account the
358 | GIAGKoU ET Al
readersrsquo special needs (eg learning difficulties) Readability estimation is particularly relevant for second language (l2) learners as well From the l2 perspective the aim is to automatically identify or retrieve a text given the proficiency level of the learner or group of learners
To this end recent studies attempt to grade l2 texts according to proficiency levels in order to facilitate reading in l2 or as an aid to the selection of assessment material (eg Centre for the Greek language 2013 Tzimokas and Tantos 2014 Franccedilois and Fairon 2012 ott and Meurers 2010 Pilaacuten et al 2014 vajjala and Meurers 2012) In a similar approach the development of productive skills in l2 (mainly writing) is investigated in view of an automated evaluation of l2 writing (eg lu 2010 2011 vyatkina 2012 Giagkou et al 2015)
The long tradition of l1 readability assessment dating back to the early 20th cen-tury (see DuBay 2006) has bequeathed readability formulas (eg Flesch Reading Ease Score Flesch-Kincaid Grade Level Fog index SMOG etc) that assign a difficulty grade or level to a text by relying on surface linguistic features such as sentence and word length as simple proxies for syntactic complexity and vocabulary burden re-spectively More recently advances in NlP have boosted readability research That is new resources (electronically available texts) and new tools (taggers parsers semantic treebanks etc) have made it feasible to apply machine learning techniques in large training corpora and to quantify more thorough and linguistically sound text features Semantic and discourse features are investigated eg named entities (Barzilay amp lapa-ta 2008) and lexical cohesion (Pitler amp Nenkova 2008) Shallow syntactic complexity indicators such as average sentence length are combined with the height of syntactic trees (see also Heilman et al 2008) Instead of simple proxies of vocabulary burden N-gram language Models (lM) are used for predicting the grade level of texts (Callan and Eskenazi 2007 Petersen amp ostendorf 2009 Schwarm and ostendorf 2005)
In this paper we present an investigation of linguistic features of texts addressed to learners of Greek as a second language (l2) The goal of this study is to identify the textual properties that indicate the development of reading skills in Greek l2 with the aim of employing these properties as parameters for automatic proficiency level labelling The set of features investigated in the current study draws on the traditional readability research combined with NlP-enabled features and machine learning tech-niques for text classification as this merging was found to result in performance gain (Franccedilois amp Miltsakaki 2012)
FEATURE EXTRACTIoN AND ANAlYSIS IN GREEK | 359
The paper is organized as follows Section 2 provides information on the corpus used and the features identified selected and computed in order to form the dataset for the analysis In Section 3 the analysis applied on the features is presented and the results are analyzed We conclude with a summary of the main findings and their implications to the directions of future work in view of automatic proficiency level classification for Greek l2
2 datasets
21 Corpus
For the purposes of this investigation a Greek l2 text set that is labelled for proficiency levels in an objective and qualified way and can thus be considered as gold-standard deemed necessary Such dataset was retrieved from the Greek l2 textbooks published by the Centre of Intercultural and Migration Studies (EDIAMME) and freely avail-able online These textbooks are addressed to Greek migrants living abroad from pre-schoolers (aged 6) to 18 year-olds learning Greek as a second or foreign language EDIAMME employs five proficiency levels aligned to the Greek educational system grades and to CEFR levels (Council of Europe 2001) as presented in Table 1
Age | School grade | EDIAMME level | Language content | CEFR level alignment
6-8 | Preschool, 1-2 | 1 | Pre-reading, reading | A1
9-10 | 3-4 | 2 | Speaking and writing consolidation | A2
11-12 | 5-6 | 3 | Further practice in speaking and writing | B1
13-15 | 7-9 | 4 | Independent writing | B2 & C1
16-18 | 10-12 | 5 | Greek language and literature | C2
Table 1 | EDIAMME proficiency levels (Damanakis 2004: 76) and their alignment to CEFR levels (EDIAMME 2014)
Only prose texts were extracted from the textbooks, while poems, lyrics, exercises and guidelines to the exercises were excluded. The selected texts belong to different genres (mainly narrative, descriptive, expository and procedural) and types (letters, announcements, instructions, diary entries, etc.). Dialogues were also included, as they are very frequently used as educational material in L2 textbooks, though the role/name of the speaker was removed.
The final corpus employed in this investigation comprises 753 texts and a total of 112,169 tokens (Table 2). Each individual text inherited the proficiency level assigned to the textbook it was retrieved from, e.g. a text drawn from a textbook labeled as level 5 was considered as addressed to level 5 learners.¹
¹ It should be noted that this decision introduces a degree of "noise" into the data: although a low-level textbook is not expected to include a text addressed to higher levels, the reverse is not equally unlikely. For example, certain texts retrieved from a level 5 textbook may actually address lower-level learners.
Grouped level | EDIAMME level | Texts | Sentences | Tokens
1 (CEFR A1-A2) | 1 | 24 | 136 | 720
1 (CEFR A1-A2) | 2 | 295 | 4,552 | 33,636
2 (CEFR B1-C1) | 3 | 108 | 1,263 | 8,780
2 (CEFR B1-C1) | 4 | 147 | 2,305 | 19,272
3 (CEFR C2) | 5 | 179 | 3,356 | 49,761
Totals | | 753 | 11,612 | 112,169
Table 2 | Corpus description
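The grouping of the five EDIAMME levels into three classes can be expressed as a simple lookup, which also makes the resulting class sizes explicit. The sketch below copies the per-level counts from Table 2; the names GROUP_OF_LEVEL, TABLE_2 and totals_by_group are hypothetical and introduced only for illustration.

```python
# Mapping of EDIAMME levels to the grouped levels used in the analysis (Table 2).
GROUP_OF_LEVEL = {1: 1, 2: 1, 3: 2, 4: 2, 5: 3}

# (texts, sentences, tokens) per EDIAMME level, copied from Table 2.
TABLE_2 = {
    1: (24, 136, 720),
    2: (295, 4552, 33636),
    3: (108, 1263, 8780),
    4: (147, 2305, 19272),
    5: (179, 3356, 49761),
}


def totals_by_group(table, group_of_level):
    """Sum the (texts, sentences, tokens) counts of all levels in each group."""
    totals = {}
    for level, counts in table.items():
        group = group_of_level[level]
        previous = totals.get(group, (0, 0, 0))
        totals[group] = tuple(p + c for p, c in zip(previous, counts))
    return totals


grouped = totals_by_group(TABLE_2, GROUP_OF_LEVEL)
```

Under this grouping the three classes hold 319, 255 and 179 texts respectively, so the classes are of unequal size, which is worth keeping in mind when reading the group comparisons.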
The texts were automatically annotated for morphological types, syntactic dependencies and phrase structure using the Institute for Language and Speech Processing NLP tools pipeline (Prokopidis et al. 2011; Prokopidis and Papageorgiou 2014).
2.2 Feature selection and computation
The set of features investigated as indices of the proficiency level was selected on the basis of previous research on L1 and L2 readability assessment, as well as on second language acquisition and development. These features capture morphological, syntactic, lexical/semantic and other attributes of the text that are salient to the target proficiency level discrimination and prediction task.
In total, 303 text features were identified and computed. These fall roughly into the following categories:
a) Surface features: word and sentence length (e.g. average word length), number of characters, punctuation marks, numbers, etc.
b) Lexical/semantic: lexical density (i.e. content to function words), lexical variation (e.g. type/token ratio, hapax/dis legomena), including noun and verb variation measures, text entropy, lexical richness, etc.
c) Morphological: frequencies and ratios of the different parts of speech, including their forms, e.g. ratio of passive verbs to verbs, ratio of nouns in the genitive case to nouns, ratio of 1st person personal pronouns to pronouns, etc.
d) Syntactic: frequencies and ratios of the different syntactic roles (e.g. subjects to verbs ratio), measures of the dependency trees (e.g. depth and height of syntactic trees), phrase structure (e.g. length of noun, verb and adjectival phrases), subordination and apposition (e.g. average number of coordinating and subordinating conjunctions per sentence), etc.
e) Discourse-based features: e.g. use of relative pronouns as an index of the degree of anaphora density, frequency of present and past tenses as indices of temporality and narrativity, etc.
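A few of the surface and lexical features listed above can be sketched as follows. This is not the FeatExt implementation: the function name, the tiny function-word list and the two toy sentences are assumptions made for illustration, and the input is taken to be already tokenized.

```python
def surface_and_lexical_features(sentences, function_words):
    """Compute a handful of surface/lexical features from tokenized sentences.

    `sentences` is a list of token lists; `function_words` is a set of
    lower-cased function words (a hypothetical, user-supplied list).
    """
    tokens = [token.lower() for sentence in sentences for token in sentence]
    n_tokens = len(tokens)
    n_function = sum(token in function_words for token in tokens)
    n_content = n_tokens - n_function
    return {
        "avg_sentence_length": n_tokens / len(sentences),
        "avg_word_length": sum(len(token) for token in tokens) / n_tokens,
        "type_token_ratio": len(set(tokens)) / n_tokens,
        # Lexical density as the ratio of content to function words.
        "lexical_density": n_content / n_function if n_function else float("nan"),
    }


# Toy example: two short Greek sentences, already tokenized and lower-cased.
toy_sentences = [
    ["η", "μαρία", "διαβάζει", "ένα", "βιβλίο"],
    ["το", "βιβλίο", "είναι", "μεγάλο"],
]
toy_function_words = {"η", "το", "ένα"}
features = surface_and_lexical_features(toy_sentences, toy_function_words)
```

Morphological and syntactic ratios follow the same pattern, with part-of-speech tags or dependency labels taking the place of the function-word test.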
The defined features were computed with specialized software, the ILSP FeatExt tool, developed in Python. The input of FeatExt is any corpus of Greek texts automatically annotated for part of speech, syntactic dependencies and phrase structure. It calculates the values of raw surface features (frequencies of words, sentences, nouns, verbs, etc.) and computes their standardized values (i.e. meaningful ratios). In order to cater for zero values, a MinMaxScaler transformation is applied to all raw features. The output is a table of extracted feature values, preferably in CSV format. Settings can be modified through an optional configuration file to define, among others, the set of features to be computed, the corpus location, or additional feature-relevant data such as a list of words to be counted (e.g. function words, basic vocabulary for a specific proficiency level or topic, etc.).
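The scaling and export steps can be illustrated with a small sketch. The paper names a MinMaxScaler transformation but gives no further detail, so the code below simply rescales each raw feature column to the range [0, 1] using the standard library; mapping a constant column to 0.0 is an assumption of this sketch, and all function and column names are hypothetical.

```python
import csv
import io


def min_max_scale(values):
    """Rescale a list of numbers to [0, 1]; a constant column maps to 0.0."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0.0 for _ in values]
    return [(value - lowest) / (highest - lowest) for value in values]


def scale_table(rows, feature_names):
    """Apply min-max scaling column by column to a list of feature dicts."""
    columns = {
        name: min_max_scale([row[name] for row in rows]) for name in feature_names
    }
    return [
        {name: columns[name][i] for name in feature_names}
        for i in range(len(rows))
    ]


def to_csv(rows, feature_names):
    """Serialize the feature table as CSV text, one row per text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=feature_names, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Toy raw counts for three texts (hypothetical values).
names = ["n_words", "n_sentences"]
raw = [
    {"n_words": 10, "n_sentences": 2},
    {"n_words": 30, "n_sentences": 2},
    {"n_words": 20, "n_sentences": 4},
]
scaled = scale_table(raw, names)
csv_text = to_csv(scaled, names)
```

Because min-max scaling is monotonic, it does not change rank-based statistics computed on a single feature, so it affects comparability across features rather than the ordering of texts within one feature.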
3 Analysis and results
In order to investigate the underlying associations of text features with the proficiency level, correlation analysis was applied between all the extracted features and the grouped proficiency levels. Table 3 reports the twenty features that exhibited the highest absolute values of Spearman's rho correlation coefficient, in descending order (p < 0.05).
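A minimal sketch of this correlation step is given below, assuming that each text contributes one feature value and one grouped level (1 to 3). The paper does not state which software computed Spearman's rho; the pure-Python functions and the toy values here are illustrative only. Since the grouped level takes only three values, ties are unavoidable, and they are handled with average ranks.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    spread_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / (spread_x * spread_y)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data: a feature value per text and the grouped level of each text.
feature_values = [3.1, 4.0, 5.2, 6.8, 7.5, 9.0]
grouped_levels = [1, 1, 2, 2, 3, 3]
rho = spearman_rho(feature_values, grouped_levels)
```

Ranking the features by the absolute value of rho, as done for Table 3, keeps negatively correlated features such as punctuation density in the list.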
Among the best performing features, the average number of noun phrases in the genitive case per sentence was found to exhibit the highest correlation coefficient (rho = 0.542). The association of the genitive case with the text's level is also evidenced by the performance of two more features, i.e. the average number of adjectival phrases in the genitive case per sentence (rho = 0.473) and the average length of adjectival phrases in the genitive case (rho = 0.448). Looking at these results from a different angle, the influence of phrase structure, especially of the length and relative frequency of nominal phrases, is apparent: out of the 20 best performing features, six are indices of phrase structure (features in ranks 1, 6, 8, 12, 15 and 16 in Table 3). The frequency of use of modifiers, namely of adjectives, also seems to be highly correlated with the proficiency level: the more adjectives used in a text, the more likely it is that the text is addressed to higher-level learners. This is evidenced by the average number of adjectival phrases and of adjectives per sentence.
Another important finding is highlighted by the performance of features that attempt to quantify syntactic dependencies. These include the width and height of dependency trees (rho = 0.495 and 0.486 respectively), as well as the number of leaves and governor nodes (rho = 0.490 and 0.485 respectively). Their emergence in the top ranks of Table 3 qualifies them as key predictors of the proficiency level.
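The dependency-tree measures can be illustrated on a single parsed sentence. The paper does not define width, height, leaves or governor nodes formally, so the definitions below are assumptions of this sketch: height is the longest path from the root, width is the largest number of nodes at one depth, leaves are tokens without dependents, and governors are tokens with at least one dependent. The head-index encoding and the toy parse are also hypothetical.

```python
from collections import Counter


def tree_measures(heads):
    """Compute simple shape measures of a dependency tree.

    `heads[i]` is the 0-based index of the governor of token i, or -1 for
    the root. The input is assumed to be a well-formed tree (no cycles).
    """
    def depth(i):
        steps = 0
        while heads[i] != -1:
            i = heads[i]
            steps += 1
        return steps

    depths = [depth(i) for i in range(len(heads))]
    governors = {head for head in heads if head != -1}
    nodes_per_depth = Counter(depths)
    return {
        "height": max(depths),
        "width": max(nodes_per_depth.values()),
        "leaves": len(heads) - len(governors),
        "governors": len(governors),
    }


# Toy parse of "η μαρία διαβάζει ένα βιβλίο":
# η -> μαρία, μαρία -> διαβάζει (root), ένα -> βιβλίο, βιβλίο -> διαβάζει.
toy_heads = [1, 2, -1, 4, 2]
measures = tree_measures(toy_heads)
```

Averaging each measure over the sentences of a text yields per-text features of the kind reported in Table 3.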
Table 3 | Top-20 features highly correlated with EDIAMME grouped levels and post hoc multiple comparisons between level pairs
Rank | Feature | Spearman's rho
1 | Av. number of noun phrases in gen. case per sentence | 0.542
2 | Av. width of dependency trees | 0.495
3 | Av. number of leaves in dependency trees | 0.490
4 | Av. height of dependency trees | 0.486
5 | Av. sentence length | 0.485
6 | Av. number of adjectival phrases per sentence | 0.485
7 | Av. number of governor nodes in dependency trees | 0.485
8 | Av. number of noun phrases per sentence | 0.480
9 | % of sentences with length > 20 words | 0.477
10 | Av. number of adjectives per sentence | 0.474
11 | Av. word length | 0.474
12 | Av. number of adjectival phrases in gen. case per sentence | 0.473
13 | % of sentences with length > 10 words | 0.470
14 | Terminal punctuation to total characters ratio | -0.461
15 | Av. length of adjectival phrases in gen. case | 0.448
16 | Av. number of adjectival phrases in acc. case per sentence | 0.446
17 | % of sentences with length > 30 words | 0.443
18 | Av. number of passive verbs per sentence | 0.442
19 | Relative pronouns to pronouns ratio | 0.439
20 | Av. number of prepositions per sentence | 0.438
EDIAMME grouped level pairs (1 vs 2, 2 vs 3, 1 vs 3): [per-pair significance marks not preserved]
Different aspects of syntactic complexity are also highlighted by the average number of passive verbs and prepositions per sentence. As expected, passive constructions are rarely used at lower levels, while learners encounter them more and more frequently in textbooks as their reading skills develop. The same is true for prepositions, a feature which indicates that higher proficiency level texts employ more complex and compound sentences.
The statistically significant correlation exhibited by the ratio of relative pronouns to pronouns (rho = 0.439) signifies the role of anaphora. As anaphora resolution is considered a linguistically and cognitively demanding task during reading, anaphoric structures are rare at lower levels but significantly more frequent at upper levels. As a result, the use of relative pronouns can be considered a successful discriminator of proficiency levels.
The list of the best performing features also includes some more "traditional" indices of text complexity, such as word and sentence length. The average sentence length appears in rank 5 in Table 3 (rho = 0.485), while related features that quantify sentence length from a different perspective are also present (the percentage of sentences with more than 10, 20 and 30 words). Additionally, the ratio of terminal punctuation to total characters should be interpreted as an inverse measure of sentence length. Regarding lexical features, it is noticeable that among the various features investigated (lexical diversity, density, etc.) only the average word length is present among the top performers (rho = 0.474).
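The sentence-length features discussed here are straightforward to compute, and a toy example shows why punctuation density behaves as an inverse of sentence length. The function names, the set of terminal marks (full stop, exclamation mark and the Greek question mark typed as a semicolon) and the toy inputs are assumptions of this sketch.

```python
def sentence_length_features(sentences, thresholds=(10, 20, 30)):
    """Average sentence length and percentage of sentences above each threshold."""
    lengths = [len(sentence) for sentence in sentences]
    n = len(lengths)
    features = {"avg_sentence_length": sum(lengths) / n}
    for threshold in thresholds:
        n_longer = sum(length > threshold for length in lengths)
        features[f"pct_sentences_gt_{threshold}"] = 100.0 * n_longer / n
    return features


def terminal_punctuation_ratio(text, terminal=".!;"):
    """Share of characters that are sentence-final punctuation marks."""
    return sum(character in terminal for character in text) / len(text)


# Four toy sentences with 5, 12, 25 and 31 tokens.
toy = [["w"] * 5, ["w"] * 12, ["w"] * 25, ["w"] * 31]
features = sentence_length_features(toy)

# The same words split into two sentences yield a higher punctuation ratio.
short_ratio = terminal_punctuation_ratio("a b. c d.")
long_ratio = terminal_punctuation_ratio("a b c d.")
```

Shorter sentences mean more terminal marks per character, which is consistent with the negative sign of this feature's correlation in Table 3.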
A more thorough investigation of the above features employed one-way ANOVA for means comparison across levels, which resulted in statistically significant main effects for all of the 20 features. Since, however, this type of analysis cannot determine whether the mean values of a feature are statistically different between all possible level pairs, post hoc multiple comparisons (Bonferroni tests) were also applied. The results are presented in Table 3: statistically different means for each feature are indicated for each level combination separately. These comparisons indicate that all features can successfully discriminate group 3 (i.e. EDIAMME level 5, CEFR C2) from lower levels (both from group 2 and from group 1). However, some of the features were not as successful in discriminating group 1 (i.e. EDIAMME levels 1 and 2, CEFR A1-A2) from group 2 (i.e. EDIAMME levels 3 and 4, CEFR B1-C1). Poor performers in discriminating group 1 from group 2 were all the features relevant to sentence length, with the exception of the proportion of sentences with more than 20 words. This implies that a group 1 text is unlikely to include lengthier sentences, thus suggesting a possible threshold for the transition from CEFR A2 to B1 level.
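The two-step procedure can be sketched as follows. The paper does not say which statistical package was used, so this is only an illustration: the code computes the one-way ANOVA F statistic directly and the Bonferroni-adjusted significance threshold for the three level pairs. Obtaining p-values would additionally require the F and t distributions, which are left out here; all names and toy values are hypothetical.

```python
from itertools import combinations


def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA over a list of groups of observations."""
    k = len(groups)
    n = sum(len(group) for group in groups)
    means = [sum(group) / len(group) for group in groups]
    grand_mean = sum(sum(group) for group in groups) / n
    ss_between = sum(
        len(group) * (mean - grand_mean) ** 2 for group, mean in zip(groups, means)
    )
    ss_within = sum(
        (x - mean) ** 2 for group, mean in zip(groups, means) for x in group
    )
    # Mean squares use k - 1 and n - k degrees of freedom respectively.
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison significance threshold under the Bonferroni correction."""
    return alpha / n_comparisons


# Toy feature values for the three grouped levels.
toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_statistic = one_way_anova_f(toy_groups)

# With three grouped levels there are three pairwise comparisons.
level_pairs = list(combinations((1, 2, 3), 2))
adjusted_alpha = bonferroni_alpha(0.05, len(level_pairs))
```

With three pairs, each pairwise test is judged against 0.05 / 3, which is why a feature can show a significant main effect and still fail to separate one specific pair of groups.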
4 Conclusions and discussion
The current investigation highlighted a number of textual features automatically extracted from a morphologically and syntactically annotated Greek L2 corpus. With the aim of identifying indices of text difficulty that are directly associated with the proficiency level, we employed statistical analysis and put forward the best performing features. These can be regarded as potential predictors of the proficiency level of a previously unseen text in an automatic labelling/classification approach.
The results highlight the influence of syntactic features on the characterization of proficiency level: with the exception of average word length, the rest of the best performing features are directly or indirectly related to syntactic complexity. This finding is in line with previous research, where syntax-related features consistently appear in the best-performing prediction models (e.g. Pitler and Nenkova 2008; Schwarm and Ostendorf 2005; Callan and Eskenazi 2007; Kate et al. 2010; Kotani et al. 2008). The frequencies of the genitive case, of adjectives and of prepositions were additionally identified as successful discriminators. Surface features used in traditional readability formulas, such as sentence and word length, were found to be significantly correlated with proficiency levels. Similar recent research in Greek has also highlighted the influence of such surface features on proficiency level classification (Tzimokas and Tantos 2014). It is interesting to note that some of the features put forward by Georgatou (2016) as the most informative, i.e. sentence length, passive verbs and adjectives, are confirmed by the current study as well, thus qualifying them as reliable indices of the difficulty level of Greek texts.
When the best performing features were tested for their discriminatory power between all possible level pairs, they proved to be highly discriminative of the upper proficiency level. This finding implies a significant shift in L2 reading skills during the transition from C1 to C2 level, and this shift can successfully be measured by the features investigated herein. By contrast, the transition from A2 to B1 seems to go hand in hand with the acquisition of language skills not captured by the features that emerged from the current analysis.
It is true that the current investigation is subject to limitations imposed by the corpus at hand, which comprised texts drawn from textbooks of a single publisher. As such, the findings may be influenced by the publisher's choices regarding the types and topics of texts and the linguistic descriptors of proficiency levels the editor has adopted. To address this limitation, the work described herein is being continued and expanded in order to exploit a larger corpus of Greek L2 texts from different publishers. Proficiency level labelling for this expanded corpus does not rely exclusively on the publisher's labelling. Rather, three independent experts in Greek L2 teaching have judged each text to determine the CEFR proficiency level. The experts' judgements are treated as the dependent variable in a machine learning approach for the automatic labelling of previously unseen texts, which has already yielded significant results.
Reading comprehension is a key skill in L2 development, and reading is an integral part of L2 instruction and assessment. In this view, an automated approach to matching L2 learners to texts suitable for their proficiency level is expected to facilitate the selection of reading material both for learners and for teachers. It is at the same time an anticipated aid in assessment procedures, by providing an objective measurement for the estimation of level-appropriateness of items included in diagnostic, placement or achievement language tests.
References
Barzilay, Regina, and Mirella Lapata. 2008. "Modeling Local Coherence: An Entity-based Approach." Computational Linguistics 34(1): 1–34.
Centre for the Greek Language. 2013. "Logismiko Anagnosimotitas." Accessed March 1, 2017. httpwwwgreek-languagegrcertificationreadabi-lity
Council of Europe. 2001. Common European Framework of Reference for Languages: Learning, Teaching, Assessment (CEFR). www.coe.int/lang-CEFR
Damanakis, Michalis, ed. 2004. Theoritiko Plaisio kai Programmata Spoudon gia tin Ellinoglossi Ekpaideusi sti Diaspora. Rethymno: EDIAMME. httpwwwediammeedcuocgrdiaspora2indexphpid=23650010
DuBay, William H. 2006. The Classic Readability Studies. Costa Mesa, California: Impact Information.
EDIAMME. 2014. Epipeda Glossomatheias kai Ekpaideutiko Yliko. httpwwwediammeedcuocgrellinoglossiindexphpelekp-yliko-kepa
François, Thomas, and Cédrick Fairon. 2012. "An 'AI Readability' Formula for French as a Foreign Language." In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 466–477. Association for Computational Linguistics.
François, Thomas, and Eleni Miltsakaki. 2012. "Do NLP and Machine Learning Improve Traditional Readability Formulas?" In Proceedings of the First Workshop on Predicting and Improving Text Readability for Target Reader Populations, 49–57. Montréal.
Georgatou, Spyridoula. 2016. "Approaching Readability Features in Greek School Books." Master's thesis, University of Tübingen.
Giagkou, Maria, Vicky Kantzou, Spyridoula Stamouli, and Maria Tzevelekou. 2015. "Discriminating CEFR Levels in Greek L2: A Corpus-based Study of Young Learners' Written Narratives." Bergen Language and Linguistics Studies 6: 153–169.
Heilman, Michael, Kevyn Collins-Thompson, and Maxine Eskenazi. 2008. "An Analysis of Statistical Models and Features for Reading Difficulty Prediction." In The Third Workshop on Innovative Use of NLP for Building Educational Applications: Proceedings of the Workshop, 71–79. ACL.
Callan, Jamie, and Maxine Eskenazi. 2007. "Combining Lexical and Grammatical Features to Improve Readability Measures for First and Second Language Texts." In Proceedings of HLT-NAACL'07, 460–467. Association for Computational Linguistics.
Lu, Xiaofei. 2010. "Automatic Analysis of Syntactic Complexity in Second Language Writing." International Journal of Corpus Linguistics 15(4): 474–496.
Lu, Xiaofei. 2011. "A Corpus-Based Evaluation of Syntactic Complexity Measures as Indices of College-level ESL Writers' Language Development." TESOL Quarterly 45(1): 36–62.
Ott, Niels, and Detmar Meurers. 2010. "Information Retrieval for Education: Making Search Engines Language Aware." Themes in Science and Technology Education 3(1–2): 9–30.
Petersen, Sarah E., and Mari Ostendorf. 2009. "A Machine Learning Approach to Reading Level Assessment." Computer Speech and Language 23(1): 89–106.
Pilán, Ildikó, Elena Volodina, and Richard Johansson. 2014. "Rule-Based and Machine Learning Approaches for Second Language Sentence-level Readability." In Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications, 174–184. Association for Computational Linguistics.
Pitler, Emily, and Ani Nenkova. 2008. "Revisiting Readability: A Unified Framework for Predicting Text Quality." In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, 186–195. Honolulu: ACL.
Prokopidis, Prokopis, and Harris Papageorgiou. 2014. "Experiments for Dependency Parsing of Greek." In Proceedings of the First Joint Workshop on Statistical Parsing of Morphologically Rich Languages and Syntactic Analysis of Non-Canonical Languages, 90–96. Dublin, Ireland.
Prokopidis, Prokopis, Byron Georgantopoulos, and Harris Papageorgiou. 2011. "A Suite of NLP Tools for Greek." In The 10th International Conference of Greek Linguistics, 373–383. Komotini, Greece.
Schwarm, Sarah E., and Mari Ostendorf. 2005. "Reading Level Assessment Using Support Vector Machines and Statistical Language Models." In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), 523–530. Ann Arbor, Michigan.
Tzimokas, Dimitrios, and Sotirios Tantos. 2014. "Logismiko Anagnosimotitas Ellinikon Keimenon: Ena Neo Ergaleio gia ti Didaskalia tis Ellinikis os Ksenis/Deuteris Glossas kai tin Pistopoiisi Ellinomatheias." Paper presented at Diethnes Synedrio gia ti Didaskalia kai tin Pistopoiisi tis Ellinikis os Ksenis/Deuteris Glossas, Thessaloniki, October 25. httpspeakgreekgrelimagespdftzimwkaspdf
Vajjala, Sowmya, and Detmar Meurers. 2012. "On Improving the Accuracy of Readability Classification Using Insights from Second Language Acquisition." In Proceedings of the Seventh Workshop on Building Educational Applications Using NLP, 163–173. Association for Computational Linguistics.
Vyatkina, Nina. 2012. "The Development of Second Language Writing Complexity in Groups and Individuals: A Longitudinal Learner Corpus Study." The Modern Language Journal 96(4): 576–598.
FEATURE EXTRACTION AND ANALYSIS IN GREEK L2 TEXTS IN VIEW OF AUTOMATIC LABELING FOR PROFICIENCY LEVELS
Maria Giagkou¹, Giorgos Fragkakis, Dimitris Pappas¹ & Harris Papageorgiou¹
¹Institute for Language and Speech Processing, RC ATHENA
mgiagkou@ilsp.gr, fragakis@sch.gr, dpappas@ilsp.gr, xaris@ilsp.gr
Abstract
This article investigates a set of linguistic features of texts addressed to learners of Greek as an L2 and examines the relationship between these features and the proficiency level for which the texts are considered appropriate. The aim is to determine which features show sufficient discriminatory power between levels, so that they can be exploited in an approach to automatic classification into proficiency levels. To this end, a corpus compiled from textbooks of Greek as an L2 is used. The results highlight the significant influence of, among others, features that quantify the complexity of syntactic dependency trees, the genitive case and adjectival modifiers.
Keywords: L2 reading, text complexity, linguistic features, proficiency levels, automatic labelling
1 Introduction
The last two decades have seen increasing interest in modelling text difficulty, i.e. readability. Automatic readability estimation systems are intended to assess whether a text retrieved from a large collection, such as a repository or the web, is appropriate for a given group of readers, according to their abilities in L1 or by taking into account the readers' special needs (e.g. learning difficulties). Readability estimation is particularly relevant for second language (L2) learners as well. From the L2 perspective, the aim is to automatically identify or retrieve a text given the proficiency level of the learner or group of learners.
To this end, recent studies attempt to grade L2 texts according to proficiency levels, in order to facilitate reading in L2 or as an aid to the selection of assessment material (e.g. Centre for the Greek Language 2013; Tzimokas and Tantos 2014; François and Fairon 2012; Ott and Meurers 2010; Pilán et al. 2014; Vajjala and Meurers 2012). In a similar approach, the development of productive skills in L2 (mainly writing) is investigated in view of an automated evaluation of L2 writing (e.g. Lu 2010, 2011; Vyatkina 2012; Giagkou et al. 2015).
The long tradition of l1 readability assessment dating back to the early 20th cen-tury (see DuBay 2006) has bequeathed readability formulas (eg Flesch Reading Ease Score Flesch-Kincaid Grade Level Fog index SMOG etc) that assign a difficulty grade or level to a text by relying on surface linguistic features such as sentence and word length as simple proxies for syntactic complexity and vocabulary burden re-spectively More recently advances in NlP have boosted readability research That is new resources (electronically available texts) and new tools (taggers parsers semantic treebanks etc) have made it feasible to apply machine learning techniques in large training corpora and to quantify more thorough and linguistically sound text features Semantic and discourse features are investigated eg named entities (Barzilay amp lapa-ta 2008) and lexical cohesion (Pitler amp Nenkova 2008) Shallow syntactic complexity indicators such as average sentence length are combined with the height of syntactic trees (see also Heilman et al 2008) Instead of simple proxies of vocabulary burden N-gram language Models (lM) are used for predicting the grade level of texts (Callan and Eskenazi 2007 Petersen amp ostendorf 2009 Schwarm and ostendorf 2005)
In this paper we present an investigation of linguistic features of texts addressed to learners of Greek as a second language (l2) The goal of this study is to identify the textual properties that indicate the development of reading skills in Greek l2 with the aim of employing these properties as parameters for automatic proficiency level labelling The set of features investigated in the current study draws on the traditional readability research combined with NlP-enabled features and machine learning tech-niques for text classification as this merging was found to result in performance gain (Franccedilois amp Miltsakaki 2012)
FEATURE EXTRACTIoN AND ANAlYSIS IN GREEK | 359
The paper is organized as follows Section 2 provides information on the corpus used and the features identified selected and computed in order to form the dataset for the analysis In Section 3 the analysis applied on the features is presented and the results are analyzed We conclude with a summary of the main findings and their implications to the directions of future work in view of automatic proficiency level classification for Greek l2
2 datasets
21 Corpus
For the purposes of this investigation a Greek l2 text set that is labelled for proficiency levels in an objective and qualified way and can thus be considered as gold-standard deemed necessary Such dataset was retrieved from the Greek l2 textbooks published by the Centre of Intercultural and Migration Studies (EDIAMME) and freely avail-able online These textbooks are addressed to Greek migrants living abroad from pre-schoolers (aged 6) to 18 year-olds learning Greek as a second or foreign language EDIAMME employs five proficiency levels aligned to the Greek educational system grades and to CEFR levels (Council of Europe 2001) as presented in Table 1
Age school grade ediAMMe level
Language content
cefr level alignment
6 Preschool1 Pre-reading
reading A17 18 29 3
2Speaking and writing consolidation
A210 4
11 53
Further practice in speaking and writing
B112 6
13 74 Independent
writing B2 amp C114 815 9
360 | GIAGKoU ET Al
Table 1 | EDIAMME proficiency levels (Damanakis 2004 76) and their alignment to CEFR levels (EDIAMME 2014)
only prose texts were extracted from the textbooks while poems lyrics exercises and guidelines to the exercises were excluded The selected texts belong to different gen-res (mainly narrative descriptive expository and procedural) and types (letters an-nouncements instructions diary entry etc) Dialogues were also included as they are very frequently used as educational material in l2 textbooks though the rolename of the speaker was removed
The final corpus employed in this investigation comprises 753 texts and a total of 112169 tokens (Table 2) Each individual text inherited the proficiency level assigned to the textbook it was retrieved from eg a text drawn from a textbook labeled as level 5 was considered as addressed to level 5 learners1
grouped levels
ediAMMe levels
texts sentences tokens
1 (CEFR A1-A2)
1 24 136 720
2 295 4552 33636
2 (CEFR B1-C1)
3 108 1263 8780
4 147 2305 19272
3 (CEFR C2) 5 179 3356 49761totals 753 11612 112169
Table 2 | Corpus description
1 It should be noted that this decision imposes a degree of ldquonoiserdquo to the data as although a low level textbook is not expected to include a text addressed to higher levels the reverse is not equally unlikely Eg certain texts retrieved from a level 5 textbook can actually address lower level learners
16 105 Greek language
and literature C217 1118 12
FEATURE EXTRACTIoN AND ANAlYSIS IN GREEK | 361
The texts were automatically annotated for morphological types syntactic dependen-cies and phrase structure using the Institute for language and Speech Processing NlP tools pipeline (Prokopidis et al 2011 Prokopidis and Papageorgiou 2014)
22 Feature selection and computation
The set of features investigated as indices of the proficiency level was selected on the basis of previous research on l1 and l2 readability assessment as well as on second language acquisition and development These features capture morphological syntac-tic lexicalsemantic and other attributes of the text that are salient to the target profi-ciency level discrimination and prediction task
In total 303 text features were identified and computed These fall grossly into the following categories
a) surface features word and sentence length (eg average word length) num-ber of characters punctuation marks numbers etc
b) Lexicalsemantic lexical density (ie content to functional words) lexical var-iation (eg typetoken ratio hapaxdis-legomena) including noun and verb variation measures text entropy lexical richness etc
c) Morphological frequencies and ratios of the different parts of speech includ-ing their forms eg ratio of passive verbs to verbs ratio of nouns in the geni-tive case to nouns ratio of 1st person personal pronouns to pronouns etc
d) syntactic frequencies and ratios of the different syntactic roles (eg subjects to verbs ratio) measures of the dependency trees (eg depth and height of syn-tactic trees) phrase structure (eg length of noun verb and adjectival phras-es) subordination and apposition (eg average number of coordinating and subordinating conjunctions per sentence) etc
e) discourse-based features eg use of relative pronouns as an index of the degree of anaphora density frequency of present and past tenses as indices of temporality and narrativity etc
The defined features were computed with a specialized software the IlSP FeatExt tool developed in Python The input of FeatExt is any corpus of Greek texts automatically annotated for Part of Speech syntactic dependencies and phrase structure It calcu-lates the values of raw surface features (frequencies of words sentences nouns verbs
362 | GIAGKoU ET Al
etc) and computes their standardized values (ie meaningful ratios) In order to cater for zero values MinMaxScaler transformation is applied to all raw features The output is a table of extracted feature values preferably in CSv format Settings can be modi-fied through an optional configuration file to define among others the set of features to be computed the corpus location or additional feature-relevant data such as a list of words to be counted (eg functional words basic vocabulary for a specific proficiency level or topic etc)
3 Analysis and results
In order to investigate the underlying associations of text features with the profi-ciency level correlation analysis was applied between all the extracted features and the grouped proficiency levels Table 3 reports the twenty features that exhibited the highest absolute values of Spearmanrsquo s rho correlation coefficient in descending order (plt005)
Among the best performing features the average number of noun phrases in the genitive case per sentence was found to exhibit the highest correlation coefficient (rho=0542) The association of the genitive case with the textrsquo s level is also evidenced by the performance of two more features ie the average number of adjectival phras-es in the genitive case per sentence (rho=0473) and the average length of adjectival phrases in the gen case (rho=0448) Complementing and looking at these results from a different angle the influence of phrase structure especially of the length and relative frequency of nominal phrases is apparent out of the 20 best performing features six are indices of phrase structure (features in ranks 1 6 8 12 15 and 16 in Table 3) The frequency of use of modifiers namely of adjectives also seems to be highly correlated to the proficiency level the more adjectives used in a text the more likely it is that the text is addressed to higher level learners This is evidenced by the average number of adjectival phrases and of adjectives per sentence
Another important finding concerns the features that attempt to quantify syntactic dependencies. These include the width and height of dependency trees (rho = 0.495 and 0.486, respectively), as well as the number of leaves and governor nodes (rho = 0.490 and 0.485, respectively). Their emergence in the top ranks of Table 3 qualifies them as key predictors of the proficiency level.
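The four dependency-tree measures can be sketched as follows. The definitions used here are assumptions made for illustration only, since the paper does not spell them out: height is the longest root-to-token path counted in arcs, width is the largest number of tokens at the same depth, governors are tokens with at least one dependent, and leaves are tokens with none. The input format (a CoNLL-style list of head indices) is also an assumption.

```python
from collections import Counter


def tree_measures(heads):
    """Compute simple shape measures for one well-formed dependency tree.

    heads[i] is the 1-based index of the governor of token i + 1,
    with 0 marking the root (as in a CoNLL-style head column).
    """
    n = len(heads)
    depth = {}

    def depth_of(tok):
        # Depth = number of arcs between the token and the root.
        if tok not in depth:
            head = heads[tok - 1]
            depth[tok] = 0 if head == 0 else depth_of(head) + 1
        return depth[tok]

    depths = [depth_of(tok) for tok in range(1, n + 1)]
    governors = {h for h in heads if h != 0}
    return {
        "height": max(depths),
        "width": max(Counter(depths).values()),
        "governors": len(governors),
        "leaves": n - len(governors),
    }


# "Το παιδί διαβάζει ένα βιβλίο" (The child reads a book), hand-annotated:
# Το -> παιδί, παιδί -> διαβάζει, διαβάζει = root, ένα -> βιβλίο, βιβλίο -> διαβάζει
print(tree_measures([2, 3, 0, 5, 3]))
# {'height': 2, 'width': 2, 'governors': 3, 'leaves': 2}
```

Averaging each of these values over the sentences of a text yields per-text features of the kind listed in Table 3.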
Table 3 | Top-20 features highly correlated with EDIAMME grouped levels and post hoc multiple comparisons between level-pairs

Rank  Feature                                                 Spearman's rho
----  ------------------------------------------------------  --------------
1     Av. no. of Noun Phrases in gen. case per sentence        0.542
2     Av. Width of dependency trees                            0.495
3     Av. no. of leaves in dependency trees                    0.490
4     Av. Height of dependency trees                           0.486
5     Av. Sentence length                                      0.485
6     Av. no. of Adjectival Phrases per sentence               0.485
7     Av. no. of governor nodes in dependency trees            0.485
8     Av. no. of Noun Phrases per sentence                     0.480
9     % of sentences with length > 20 words                    0.477
10    Av. no. of Adjectives per sentence                       0.474
11    Av. Word length                                          0.474
12    Av. no. of Adjectival Phrases in gen. case per sentence  0.473
13    % of sentences with length > 10 words                    0.470
14    Terminal punctuation to total characters ratio          -0.461
15    Av. length of adjectival phrases in gen. case            0.448
16    Av. no. of Adjectival Phrases in acc. case per sentence  0.446
17    % of sentences with length > 30 words                    0.443
18    Av. no. of Passive verbs per sentence                    0.442
19    Relative pronouns to Pronouns ratio                      0.439
20    Av. no. of prepositions per sentence                     0.438

(The three EDIAMME grouped level-pair columns, 1 vs 2, 2 vs 3 and 1 vs 3, carry no recoverable entries in this copy; the pairwise results are summarized in the text.)
Different aspects of syntactic complexity are also highlighted by the average number of passive verbs and prepositions per sentence (rho = 0.442 and 0.438, respectively). As expected, passive constructions are rarely used in lower levels, while learners encounter them more and more frequently in textbooks as their reading skills develop. The same is true for prepositions, a feature indicating that texts for higher proficiency levels employ more complex-compound sentences.
The statistically significant correlation exhibited by the ratio of relative pronouns to pronouns (rho = 0.439) points to the role of anaphora. As anaphora resolution is considered a linguistically and cognitively demanding task during reading, anaphoric structures are rare in lower levels but significantly more frequent in upper levels. As a result, the use of relative pronouns can be considered a successful discriminator of proficiency levels.
The list of the best-performing features also includes some more “traditional” indices of text complexity, such as word and sentence length. The average sentence length appears in rank 5 in Table 3 (rho = 0.485), while related features that quantify sentence length from a different perspective are also present (the percentage of sentences with more than 10, 20 and 30 words). Additionally, the ratio of terminal punctuation to total characters (rho = -0.461) should be interpreted as an inverse of sentence length: the shorter the sentences, the more terminal punctuation marks a text of a given size contains. Regarding lexical features, it is noticeable that, among the various features investigated (lexical diversity, density, etc.), only the average word length is present among the top performers (rho = 0.474).
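A small sketch may help to see why the punctuation ratio behaves as an inverse of sentence length. The set of terminal marks and the choice to ignore whitespace when counting characters are assumptions made for this example; the paper does not specify either.

```python
# Assumed set of terminal marks; ";" (and U+037E) is the Greek question mark.
TERMINAL_MARKS = {".", "!", ";", "\u037e"}


def sentence_length_features(sentences, thresholds=(10, 20, 30)):
    """sentences: list of sentences, each given as a list of word tokens."""
    lengths = [len(s) for s in sentences]
    feats = {"avg_sentence_length": sum(lengths) / len(lengths)}
    for t in thresholds:
        share = sum(1 for n in lengths if n > t) / len(lengths)
        feats[f"pct_sentences_longer_than_{t}"] = 100 * share
    return feats


def terminal_punctuation_ratio(text):
    """Terminal punctuation marks divided by all non-whitespace characters."""
    chars = [c for c in text if not c.isspace()]
    return sum(1 for c in chars if c in TERMINAL_MARKS) / len(chars)


# The same content written as three short sentences and as one long sentence.
short = "Ο Νίκος τρέχει. Η Μαρία διαβάζει. Ο σκύλος κοιμάται."
long_ = "Ο Νίκος τρέχει ενώ η Μαρία διαβάζει και ο σκύλος κοιμάται."
print(terminal_punctuation_ratio(short) > terminal_punctuation_ratio(long_))  # True
```

The negative sign of the correlation in Table 3 is consistent with this reading: texts for higher levels contain longer sentences and therefore proportionally fewer terminal marks.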
A more thorough investigation of the above features employed one-way ANOVA for means comparison across levels, which resulted in statistically significant main effects for all of the 20 features. Since, however, this type of analysis cannot determine whether the mean values of a feature are statistically different between all possible level pairs, post hoc multiple comparisons (Bonferroni tests) were also applied. The results are presented in Table 3: statistically different means for each feature are indicated for each level combination separately. These comparisons indicate that all features can successfully discriminate group 3 (i.e. EDIAMME level 5, CEFR C2) from lower levels (both from group 2 and group 1). However, some of the features were not as successful in discriminating group 1 (i.e. EDIAMME levels 1 and 2, CEFR A1, A2) from group 2 (i.e. EDIAMME levels 3 and 4, CEFR B1-C1). Poor performers in discriminating group 1 from group 2 were all the features relevant to sentence length, with the exception of the proportion of sentences with more than 20 words. This implies that a group 1 text is unlikely to include lengthier sentences, thus suggesting a possible threshold for the transition from CEFR A2 to B1 level.
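The two-step procedure (an omnibus test followed by corrected pairwise comparisons) can be sketched in plain Python. The F statistic below follows the textbook one-way ANOVA definition; the Bonferroni step multiplies each raw pairwise p-value by the number of comparisons. All numbers are hypothetical, and in practice the p-values themselves would come from a statistics package rather than from this sketch.

```python
def one_way_anova_f(groups):
    """Return the F statistic of a one-way ANOVA for a list of samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


def bonferroni(p_values):
    """Multiply each raw p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return {pair: min(1.0, p * m) for pair, p in p_values.items()}


# Hypothetical values of one feature in the three grouped levels.
group1, group2, group3 = [1, 2, 3], [2, 3, 4], [6, 7, 8]
print(one_way_anova_f([group1, group2, group3]))  # 21.0

# Hypothetical raw p-values of the three pairwise comparisons.
adjusted = bonferroni({"1vs2": 0.04, "2vs3": 0.001, "1vs3": 0.0005})
significant = {pair: p < 0.05 for pair, p in adjusted.items()}
print(significant)
```

In this invented example the 1 vs 2 comparison is no longer significant after correction while the comparisons involving group 3 remain so, which mirrors the kind of pattern described above.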
4 Conclusions and discussion
The current investigation highlighted a number of textual features automatically extracted from a morphologically and syntactically annotated Greek L2 corpus. With the aim of identifying indices of text difficulty that are directly associated with the proficiency level, we employed statistical analysis and put forward the best-performing features. These can be regarded as potential predictors of the proficiency level of a previously unseen text in an automatic labelling/classification approach.
The results highlight the influence of syntactic features on the characterization of proficiency level: with the exception of average word length, the rest of the best-performing features are directly or indirectly related to syntactic complexity. This finding is in line with previous research, where syntax-related features consistently appear in the best-performing prediction models (e.g. Pitler and Nenkova 2008; Schwarm and Ostendorf 2005; Callan and Eskenazi 2007; Kate et al. 2010; Kotani et al. 2008). The frequencies of the genitive case, of adjectives and of prepositions were additionally identified as successful discriminators. Surface features used in traditional readability formulas, such as sentence and word length, were also found to be significantly correlated with proficiency levels. Similar recent research in Greek has also highlighted the influence of such surface features on proficiency level classification (Tzimokas and Tantos 2014). It is interesting to note that some of the features put forward by Georgatou (2016) as the most informative, i.e. sentence length, passive verbs and adjectives, are confirmed by the current study as well, thus qualifying them as reliable indices of the difficulty level of Greek texts.
When the best-performing features were tested for their discriminatory power between all possible level pairs, they proved to be highly discriminative of the upper proficiency level. This finding implies a significant shift in L2 reading skills during the transition from C1 to C2 level, and this shift can successfully be measured by the features investigated herein. By contrast, the transition from A2 to B1 seems to go hand in hand with the acquisition of language skills not captured by the features that emerged from the current analysis.
The current investigation is subject to limitations imposed by the corpus at hand, which comprised texts drawn from textbooks of a single publisher. As such, the findings may be influenced by the publisher's choices regarding the types and topics of texts and by the linguistic descriptors of proficiency levels the editor has adopted. To cater for this limitation, the work described herein is being continued and expanded in order to exploit a larger corpus of Greek L2 texts from different publishers. Proficiency level labelling for this expanded corpus does not rely exclusively on the publisher's labelling. Rather, three independent experts in Greek L2 teaching have judged each text to determine the CEFR proficiency level. The experts' judgements are treated as the dependent variable in a machine learning approach for the automatic labelling of previously unseen texts, which has already yielded significant results.
Reading comprehension is a key skill in L2 development, and reading is an integral part of L2 instruction and assessment. In this view, an automated approach to matching L2 learners to texts suitable for their proficiency level is expected to facilitate the selection of reading material for both learners and teachers. It is at the same time an anticipated aid in assessment procedures, providing an objective measurement for estimating the level-appropriateness of items included in diagnostic, placement or achievement language tests.
References
Barzilay, Regina, and Mirella Lapata. 2008. “Modeling Local Coherence: An Entity-based Approach.” Computational Linguistics 34(1): 1–34.
Callan, Jamie, and Maxine Eskenazi. 2007. “Combining Lexical and Grammatical Features to Improve Readability Measures for First and Second Language Texts.” In Proceedings of HLT-NAACL’07, 460–467. Association for Computational Linguistics.
Centre for the Greek Language. 2013. “Logismiko Anagnosimotitas.” Accessed March 1, 2017. http://www.greek-language.gr/certification/readability
Council of Europe. 2001. Common European Framework of Reference for Languages: Learning, Teaching, Assessment (CEFR). www.coe.int/lang-CEFR
Damanakis, Michalis, ed. 2004. Theoritiko Plaisio kai Programmata Spoudon gia tin Ellinoglossi Ekpaideusi sti Diaspora. Rethymno: EDIAMME. httpwwwediammeedcuocgrdiaspora2indexphpid=23650010
DuBay, William H. 2006. The Classic Readability Studies. Impact Information, Costa Mesa, California.
EDIAMME. 2014. Epipeda Glossomatheias kai Ekpaideutiko Yliko. httpwwwediammeedcuocgrellinoglossiindexphpelekp-yliko-kepa
François, Thomas, and Cédrick Fairon. 2012. “An ‘AI Readability’ Formula for French as a Foreign Language.” In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 466–477. Association for Computational Linguistics.
François, Thomas, and Eleni Miltsakaki. 2012. “Do NLP and Machine Learning Improve Traditional Readability Formulas?” In Proceedings of the First Workshop on Predicting and Improving Text Readability for Target Reader Populations, 49–57. Montréal.
Georgatou, Spyridoula. 2016. “Approaching Readability Features in Greek School Books.” Master's thesis, University of Tübingen.
Giagkou, Maria, Vicky Kantzou, Spyridoula Stamouli, and Maria Tzevelekou. 2015. “Discriminating CEFR Levels in Greek L2: A Corpus-based Study of Young Learners’ Written Narratives.” Bergen Language and Linguistics Studies 6: 153–169.
Heilman, Michael, Kevyn Collins-Thompson, and Maxine Eskenazi. 2008. “An Analysis of Statistical Models and Features for Reading Difficulty Prediction.” In The Third Workshop on Innovative Use of NLP for Building Educational Applications: Proceedings of the Workshop, 71–79. ACL.
Lu, Xiaofei. 2010. “Automatic Analysis of Syntactic Complexity in Second Language Writing.” International Journal of Corpus Linguistics 15(4): 474–496.
Lu, Xiaofei. 2011. “A Corpus-Based Evaluation of Syntactic Complexity Measures as Indices of College-Level ESL Writers’ Language Development.” TESOL Quarterly 45(1): 36–62.
Ott, Niels, and Detmar Meurers. 2010. “Information Retrieval for Education: Making Search Engines Language Aware.” Themes in Science and Technology Education 3(1–2): 9–30.
Petersen, Sarah E., and Mari Ostendorf. 2009. “A Machine Learning Approach to Reading Level Assessment.” Computer Speech and Language 23(1): 89–106.
Pilán, Ildikó, Elena Volodina, and Richard Johansson. 2014. “Rule-Based and Machine Learning Approaches for Second Language Sentence-Level Readability.” In Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications, 174–184. Association for Computational Linguistics.
Pitler, Emily, and Ani Nenkova. 2008. “Revisiting Readability: A Unified Framework for Predicting Text Quality.” In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, 186–195. Honolulu: ACL.
Prokopidis, Prokopis, and Harris Papageorgiou. 2014. “Experiments for Dependency Parsing of Greek.” In Proceedings of the First Joint Workshop on Statistical Parsing of Morphologically Rich Languages and Syntactic Analysis of Non-Canonical Languages, 90–96. Dublin, Ireland.
Prokopidis, Prokopis, Byron Georgantopoulos, and Harris Papageorgiou. 2011. “A Suite of NLP Tools for Greek.” In The 10th International Conference of Greek Linguistics, 373–383. Komotini, Greece.
Schwarm, Sarah E., and Mari Ostendorf. 2005. “Reading Level Assessment Using Support Vector Machines and Statistical Language Models.” In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), 523–530. Ann Arbor, Michigan.
Tzimokas, Dimitrios, and Sotirios Tantos. 2014. “Logismiko Anagnosimotitas Ellinikon Keimenon: Ena Neo Ergaleio gia ti Didaskalia tis Ellinikis os Ksenis/Deuteris Glossas kai tin Pistopoiisi Ellinomatheias.” Paper presented at Diethnes Synedrio gia ti Didaskalia kai tin Pistopoiisi tis Ellinikis os Ksenis/Deuteris Glossas, Thessaloniki, October 25. httpspeakgreekgrelimagespdftzimwkaspdf
Vajjala, Sowmya, and Detmar Meurers. 2012. “On Improving the Accuracy of Readability Classification Using Insights from Second Language Acquisition.” In Proceedings of the Seventh Workshop on Building Educational Applications Using NLP, 163–173. Association for Computational Linguistics.
Vyatkina, Nina. 2012. “The Development of Second Language Writing Complexity in Groups and Individuals: A Longitudinal Learner Corpus Study.” The Modern Language Journal 96(4): 576–598.
FEATURE EXTRACTION AND ANALYSIS IN GREEK L2 TEXTS IN VIEW OF AUTOMATIC LABELING FOR PROFICIENCY LEVELS
Maria Giagkou¹, Giorgos Fragkakis, Dimitris Pappas¹ & Harris Papageorgiou¹
¹Institute for Language and Speech Processing, RC ATHENA
mgiagkou@ilsp.gr, fragakis@sch.gr, dpappas@ilsp.gr, xaris@ilsp.gr
Abstract
This paper investigates a set of linguistic features of texts addressed to learners of Greek as an L2 and examines the relation of these features to the proficiency level for which the texts are considered appropriate. The aim is to explore which features show sufficient discriminatory power between levels, so that they can be exploited in an approach to automatic classification into proficiency levels. To this end, a corpus compiled from textbooks of Greek as an L2 is used. The results highlight the significant effect of, among others, features that quantify the complexity of syntactic dependency trees, the genitive case and adjectival modifiers.
Keywords: L2 reading, text complexity, linguistic features, proficiency levels, automatic labelling
1 Introduction
The last two decades have seen increasing interest in modelling text difficulty, i.e. readability. Automatic readability estimation systems are intended to assess whether a text retrieved from a large collection, such as a repository or the web, is appropriate for a given group of readers, according to their abilities in L1 or by taking into account the readers' special needs (e.g. learning difficulties). Readability estimation is particularly relevant for second language (L2) learners as well. From the L2 perspective, the aim is to automatically identify or retrieve a text given the proficiency level of the learner or group of learners.
To this end, recent studies attempt to grade L2 texts according to proficiency levels, in order to facilitate reading in L2 or as an aid to the selection of assessment material (e.g. Centre for the Greek Language 2013; Tzimokas and Tantos 2014; François and Fairon 2012; Ott and Meurers 2010; Pilán et al. 2014; Vajjala and Meurers 2012). In a similar approach, the development of productive skills in L2 (mainly writing) is investigated in view of an automated evaluation of L2 writing (e.g. Lu 2010, 2011; Vyatkina 2012; Giagkou et al. 2015).
The long tradition of L1 readability assessment, dating back to the early 20th century (see DuBay 2006), has bequeathed readability formulas (e.g. Flesch Reading Ease Score, Flesch-Kincaid Grade Level, Fog index, SMOG, etc.) that assign a difficulty grade or level to a text by relying on surface linguistic features, such as sentence and word length, as simple proxies for syntactic complexity and vocabulary burden, respectively. More recently, advances in NLP have boosted readability research. That is, new resources (electronically available texts) and new tools (taggers, parsers, semantic treebanks, etc.) have made it feasible to apply machine learning techniques to large training corpora and to quantify more thorough and linguistically sound text features. Semantic and discourse features are investigated, e.g. named entities (Barzilay & Lapata 2008) and lexical cohesion (Pitler & Nenkova 2008). Shallow syntactic complexity indicators, such as average sentence length, are combined with the height of syntactic trees (see also Heilman et al. 2008). Instead of simple proxies of vocabulary burden, n-gram language models (LMs) are used for predicting the grade level of texts (Callan and Eskenazi 2007; Petersen & Ostendorf 2009; Schwarm and Ostendorf 2005).
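To make the notion of a surface-feature formula concrete, the sketch below implements two of the formulas named above with their standard published coefficients. These coefficients were calibrated for English, so the example only illustrates how sentence and word length act as proxies; it is not a measure applied in this study, and the counts are hypothetical.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher scores indicate easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level: approximate US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Hypothetical passage: 100 words, 5 sentences, 150 syllables.
print(round(flesch_reading_ease(100, 5, 150), 3))   # 59.635
print(round(flesch_kincaid_grade(100, 5, 150), 2))  # 9.91
```

Both formulas depend only on average sentence length (words per sentence) and average word length (syllables per word), which is exactly the limitation that the NLP-enabled features discussed next try to overcome.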
In this paper, we present an investigation of linguistic features of texts addressed to learners of Greek as a second language (L2). The goal of this study is to identify the textual properties that indicate the development of reading skills in Greek L2, with the aim of employing these properties as parameters for automatic proficiency level labelling. The set of features investigated in the current study draws on traditional readability research combined with NLP-enabled features and machine learning techniques for text classification, as this merging was found to result in a performance gain (François & Miltsakaki 2012).
The paper is organized as follows: Section 2 provides information on the corpus used and the features identified, selected and computed in order to form the dataset for the analysis. In Section 3, the analysis applied to the features is presented and the results are discussed. We conclude with a summary of the main findings and their implications for future work on automatic proficiency level classification for Greek L2.
2 Datasets
2.1 Corpus
For the purposes of this investigation, a Greek L2 text set that is labelled for proficiency levels in an objective and qualified way, and can thus be considered a gold standard, was deemed necessary. Such a dataset was retrieved from the Greek L2 textbooks published by the Centre of Intercultural and Migration Studies (EDIAMME), which are freely available online. These textbooks are addressed to Greek migrants living abroad, from preschoolers (aged 6) to 18-year-olds, learning Greek as a second or foreign language. EDIAMME employs five proficiency levels, aligned to the Greek educational system grades and to CEFR levels (Council of Europe 2001), as presented in Table 1.
Age  School grade  EDIAMME level  Language content                           CEFR level alignment
---  ------------  -------------  -----------------------------------------  --------------------
6    Preschool     1              Pre-reading, reading                       A1
7    1
8    2
9    3             2              Speaking and writing consolidation         A2
10   4
11   5             3              Further practice in speaking and writing   B1
12   6
13   7             4              Independent writing                        B2 & C1
14   8
15   9
16   10            5              Greek language and literature              C2
17   11
18   12

Table 1 | EDIAMME proficiency levels (Damanakis 2004: 76) and their alignment to CEFR levels (EDIAMME 2014)

Only prose texts were extracted from the textbooks, while poems, lyrics, exercises and guidelines to the exercises were excluded. The selected texts belong to different genres (mainly narrative, descriptive, expository and procedural) and types (letters, announcements, instructions, diary entries, etc.). Dialogues were also included, as they are very frequently used as educational material in L2 textbooks, though the role/name of the speaker was removed.
The final corpus employed in this investigation comprises 753 texts and a total of 112,169 tokens (Table 2). Each individual text inherited the proficiency level assigned to the textbook it was retrieved from, e.g. a text drawn from a textbook labelled as level 5 was considered as addressed to level 5 learners.¹

Grouped levels   EDIAMME levels  Texts  Sentences  Tokens
---------------  --------------  -----  ---------  -------
1 (CEFR A1-A2)   1                  24        136      720
                 2                 295      4,552   33,636
2 (CEFR B1-C1)   3                 108      1,263    8,780
                 4                 147      2,305   19,272
3 (CEFR C2)      5                 179      3,356   49,761
Totals                             753     11,612  112,169

Table 2 | Corpus description

¹ It should be noted that this decision introduces a degree of “noise” into the data: although a low-level textbook is not expected to include a text addressed to higher levels, the reverse is not equally unlikely. For example, certain texts retrieved from a level 5 textbook may actually address lower-level learners.
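The grouping of the five EDIAMME levels into three target classes can be written down compactly. The sketch below copies the counts from Table 2 and aggregates them per grouped level; the variable and function names are illustrative only.

```python
# EDIAMME level -> (texts, sentences, tokens), copied from Table 2.
CORPUS = {
    1: (24, 136, 720),
    2: (295, 4552, 33636),
    3: (108, 1263, 8780),
    4: (147, 2305, 19272),
    5: (179, 3356, 49761),
}

# EDIAMME level -> grouped level (1: CEFR A1-A2, 2: CEFR B1-C1, 3: CEFR C2).
GROUP_OF = {1: 1, 2: 1, 3: 2, 4: 2, 5: 3}


def group_totals(corpus, group_of):
    """Sum the per-level counts within each grouped level."""
    totals = {}
    for level, counts in corpus.items():
        group = group_of[level]
        previous = totals.get(group, (0, 0, 0))
        totals[group] = tuple(a + b for a, b in zip(previous, counts))
    return totals


for group, (texts, sentences, tokens) in sorted(group_totals(CORPUS, GROUP_OF).items()):
    print(group, texts, sentences, tokens)
# 1 319 4688 34356
# 2 255 3568 28052
# 3 179 3356 49761
```

The aggregated counts show that the three classes are unevenly populated: group 3 has the fewest texts but the most tokens, i.e. its texts are considerably longer on average.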
The texts were automatically annotated for morphological types, syntactic dependencies and phrase structure using the Institute for Language and Speech Processing NLP tools pipeline (Prokopidis et al. 2011; Prokopidis and Papageorgiou 2014).
2.2 Feature selection and computation
The set of features investigated as indices of the proficiency level was selected on the basis of previous research on L1 and L2 readability assessment, as well as on second language acquisition and development. These features capture morphological, syntactic, lexical/semantic and other attributes of the text that are salient to the target task of proficiency level discrimination and prediction.
In total, 303 text features were identified and computed. These fall roughly into the following categories:
a) Surface features: word and sentence length (e.g. average word length), number of characters, punctuation marks, numbers, etc.
b) Lexical/semantic: lexical density (i.e. content to functional words), lexical variation (e.g. type/token ratio, hapax/dis legomena), including noun and verb variation measures, text entropy, lexical richness, etc.
c) Morphological: frequencies and ratios of the different parts of speech, including their forms, e.g. ratio of passive verbs to verbs, ratio of nouns in the genitive case to nouns, ratio of 1st person personal pronouns to pronouns, etc.
d) Syntactic: frequencies and ratios of the different syntactic roles (e.g. subjects to verbs ratio), measures of the dependency trees (e.g. depth and height of syntactic trees), phrase structure (e.g. length of noun, verb and adjectival phrases), subordination and apposition (e.g. average number of coordinating and subordinating conjunctions per sentence), etc.
e) Discourse-based features: e.g. use of relative pronouns as an index of the degree of anaphora density, frequency of present and past tenses as indices of temporality and narrativity, etc.
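A few of the lexical measures in category (b) can be sketched as follows. The input format (a list of word and part-of-speech pairs) and the tag names are assumptions chosen for the example and do not reflect the tagset of the ILSP tools; lexical density is computed as the ratio of content to functional words, following the definition given in the list.

```python
from collections import Counter

# Assumed set of part-of-speech tags that count as content words.
CONTENT_TAGS = {"NOUN", "VERB", "ADJ", "ADV"}


def lexical_features(tokens):
    """tokens: list of (word, pos) pairs for one text (punctuation removed)."""
    words = [word.lower() for word, _ in tokens]
    freq = Counter(words)
    n_content = sum(1 for _, pos in tokens if pos in CONTENT_TAGS)
    n_function = len(tokens) - n_content
    return {
        "type_token_ratio": len(freq) / len(words),
        "hapax_legomena": sum(1 for c in freq.values() if c == 1),
        "dis_legomena": sum(1 for c in freq.values() if c == 2),
        # Content-to-functional word ratio; infinite if no functional words occur.
        "lexical_density": n_content / n_function if n_function else float("inf"),
    }


# "Το παιδί διαβάζει το βιβλίο" (The child reads the book), hand-tagged.
sample = [("Το", "DET"), ("παιδί", "NOUN"), ("διαβάζει", "VERB"),
          ("το", "DET"), ("βιβλίο", "NOUN")]
print(lexical_features(sample))
# {'type_token_ratio': 0.8, 'hapax_legomena': 3, 'dis_legomena': 1, 'lexical_density': 1.5}
```

The morphological ratios in category (c) follow the same pattern: count the tokens carrying a given tag or form and divide by the count of the enclosing class.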
The defined features were computed with specialized software, the ILSP FeatExt tool, developed in Python. The input of FeatExt is any corpus of Greek texts automatically annotated for part of speech, syntactic dependencies and phrase structure. It calculates the values of raw surface features (frequencies of words, sentences, nouns, verbs, etc.) and computes their standardized values (i.e. meaningful ratios). In order to cater for zero values, a MinMaxScaler transformation is applied to all raw features. The output is a table of extracted feature values, preferably in CSV format. Settings can be modified through an optional configuration file to define, among others, the set of features to be computed, the corpus location, or additional feature-relevant data, such as a list of words to be counted (e.g. functional words, basic vocabulary for a specific proficiency level or topic, etc.).
3 Analysis and results
In order to investigate the underlying associations of text features with the profi-ciency level correlation analysis was applied between all the extracted features and the grouped proficiency levels Table 3 reports the twenty features that exhibited the highest absolute values of Spearmanrsquo s rho correlation coefficient in descending order (plt005)
Among the best performing features the average number of noun phrases in the genitive case per sentence was found to exhibit the highest correlation coefficient (rho=0542) The association of the genitive case with the textrsquo s level is also evidenced by the performance of two more features ie the average number of adjectival phras-es in the genitive case per sentence (rho=0473) and the average length of adjectival phrases in the gen case (rho=0448) Complementing and looking at these results from a different angle the influence of phrase structure especially of the length and relative frequency of nominal phrases is apparent out of the 20 best performing features six are indices of phrase structure (features in ranks 1 6 8 12 15 and 16 in Table 3) The frequency of use of modifiers namely of adjectives also seems to be highly correlated to the proficiency level the more adjectives used in a text the more likely it is that the text is addressed to higher level learners This is evidenced by the average number of adjectival phrases and of adjectives per sentence
Another important finding is highlighted by the performance of features that at-tempt to quantify syntactic dependencies These include the width and height of de-pendency trees (rho=0495 and 0486 respectively) as well as the number of leafs and governor nodes (rho=0490 and 0485 respectively) Their emergence in the top ranks of Table 3 qualifies them as key predictors of the proficiency level
FEATURE EXTRACTIoN AND ANAlYSIS IN GREEK | 363
Table 3 | Top-20 features highly correlated with EDIAMME grouped levels and post hoc multiple comparisons between level-pairs
feature spearmanrsquo s rho
ediAMMe grouped level-pairs1vs2 2vs3 1vs3
1 Av of Noun Phrases in gen case per sentence
0542
2 Av Width of dependency trees 0495
3 Av of leafs in dependency trees 0490
4 Av Height of dependency trees 0486
5 Av Sentence length 0485
6 Av of Adjectival Phrases per sentence 0485
7 Av of governor nodes in dependency trees 0485
8 Av of Noun Phrases per sentence 0480
9 of sentences with lengthgt20 words 0477
10 Av of Adjectives per sentence 0474
11 Av Word length 0474
12 Av of Adjectival Phrases in gen case per sentence
0473
13 of sentences with lengthgt10 words 0470
14 Terminal punctuation to total characters ratio
-0461
15 Av length of adjectival phrases in gen case
0448
16 Av of Adjectival Phrases in acc case per sentence
0446
17 of sentences with lengthgt30 words 0443
18 Av of Passive verbs per sentence 0442
19 Relative pronouns to Pronouns ratio 0439
20 Av of prepositions per sentence 0438
364 | GIAGKoU ET Al
Different aspects of syntactic complexity are also highlighted by the average number of passive verbs and prepositions per sentence As expected passive constructions are rarely used in lower levels while learners encounter them more and more frequently in textbooks as their reading skills develop The same is true for prepositions a feature that indicates that higher proficiency level texts employ more complex-compound sentences
The statistically significant correlation performed by the ratio of relative pronouns to pronouns (rho=0439) signifies the role of anaphora As anaphora resolution is considered a linguistically and cognitively demanding task during reading anaphoric structures are rare in lower levels but significantly more frequent in upper levels As a result the use of relative pronouns can be considered as a successful discriminator of proficiency levels
The list of the best performing features also includes some more ldquotraditionalrdquo indices of text complexity such as word and sentence length The average sentence length ap-pears in rank 5 in Table 3 (rho=0485) while relevant features that quantify sentence length from a different perspective are also present (the percentage of sentences with more than 10 20 and 30 words) Additionally the presence of the ratio of terminal punctuation to total characters should be also interpreted as an inverse to sentence length Regarding lexical features it is noticeable that among the various features in-vestigated (lexical diversity density etc) only the average word length is present in the top performers (rho=0474)
A more thorough investigation of the above features employed one-way ANovA for means comparison across levels which resulted in statistically significant main effects for all of the 20 features Since however this type of analysis cannot determine whether the mean values of a feature are statistically different between all possible level pairs post-hoc multiple comparisons (Bonferroni tests) were also applied The results are presented in Table 3 statistically different means for each feature are indicated for each level combination separately These comparisons indicate that all features can successfully discriminate group 3 (ie EDIAMME level 5 CEFR C2) from lower levels (both from group 2 and group 1) However some of the features were not as successful in discriminating group 1 (ie EDIAMME levels 1 and 2 CEFR A1 A2) from group 2 (ie EDIAMME levels 3 4 CEFR B1-C1) Poor performers in discriminating levels group 1 from group 2 were all the features relevant to sentence length with the excep-tion of the proportion of sentences with more than 20 words This implies that a group 1 text is unlikely to include lengthier sentences thus imposing a possible threshold for the transition from CEFR A2 to B1 level
FEATURE EXTRACTIoN AND ANAlYSIS IN GREEK | 365
4 conclusions and discussion
The current investigation highlighted a number of textual features automatically ex-tracted from a morphologically and syntactically annotated Greek l2 corpus With the aim of identifying indices of text difficulty that are directly associated with the proficiency level we employed statistical analysis and put forward the best perform-ing features These can be regarded as potential predictors of the proficiency level of a previously unseen text in an automatic labellingclassification approach
The results highlight the influence of syntactic features on the characterization of proficiency level with the exception of average word length the rest of the best per-forming features are directly or indirectly related to syntactic complexity This finding is in line with previous research where syntax-related features consistently appear in the best-performing prediction models (eg Pitler and Nenkova 2008 Schwarm and ostendorf 2005 Callan and Eskenazi 2007 Kate et al 2010 Kotani et al 2008) The frequencies of the genitive case of adjectives and prepositions were additionally iden-tified as successful discriminators Surface features used in traditional readability for-mulas such as sentence and word length were found to be significantly correlated to proficiency levels Similar recent research in Greek has also highlighted the influence of such surface features on proficiency level classification (Tzimokas and Tantos 2014) It is interesting to notice that some of the features put forward by Georgatou (2016) as the most informative ie sentence length passive verbs and adjectives are confirmed by the current study as well thus qualifying them as reliable of indices of Greek texts difficulty level
When the best performing features were tested for their discriminatory power be-tween all possible level pairs they proved to be highly discriminative of the upper proficiency level This finding implies a significant shift in l2 reading skills during the transition from C1 to C2 level and this shift can successfully be measured by the fea-tures investigated herein on the contrary the transition from A2 to B1 seems to go in hand with the acquisition of language skills not depicted in the features that emerged from the current analysis
It is true that the current investigation is subject to limitations imposed by the corpus at hand which comprised texts drawn from textbooks of a single publisher As such the findings may be influenced by the publisherrsquo s choices regarding the types and top-ics of texts and the linguistic descriptors of proficiency levels the editor has adopted To cater for this limitation the work described herein is continued and expanded in
366 | GIAGKoU ET Al
order to exploit a larger corpus of Greek l2 texts from different publishers Proficiency level labelling for this expanded corpus does not rely exclusively on the publisherrsquo s labelling Rather three independent experts in Greek l2 teaching have judged each text to determine the CEFR proficiency level The expertrsquo s judgements is treated as the dependent variable in a machine learning approach for the automatic labelling of previously unseen texts which has already yielded significant results
Reading comprehension is a key skill in L2 development, and reading is an integral part of L2 instruction and assessment. In this view, an automated approach to matching L2 learners to texts suitable for their proficiency level is expected to facilitate the selection of reading material both for learners and teachers. It is at the same time an anticipated aid in assessment procedures, providing an objective measurement for the estimation of the level-appropriateness of items included in diagnostic, placement or achievement language tests.
References
Barzilay, Regina, and Mirella Lapata. 2008. “Modeling Local Coherence: An Entity-based Approach.” Computational Linguistics 34(1): 1–34.
Centre for the Greek Language. 2013. “Logismiko Anagnosimotitas.” Accessed March 1, 2017. httpwwwgreek-languagegrcertificationreadabi-lity
Council of Europe. 2001. Common European Framework of Reference for Languages: Learning, Teaching, Assessment (CEFR). wwwcoeintlang-CEFR
Damanakis, Michalis, ed. 2004. Theoritiko Plaisio kai Programmata Spoudon gia tin Ellinoglossi Ekpaideusi sti Diaspora. Rethymno: EDIAMME. httpwwwediammeedcuocgrdiaspora2indexphpid=23650010
DuBay, William H. 2006. The Classic Readability Studies. Impact Information, Costa Mesa, California.
EDIAMME. 2014. Epipeda Glossomatheias kai Ekpaideutiko Yliko. httpwwwediammeedcuocgrellinoglossiindexphpelekp-yliko-kepa
François, Thomas, and Cédrick Fairon. 2012. “An ‘AI Readability’ Formula for French as a Foreign Language.” In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 466–477. Association for Computational Linguistics.
François, Thomas, and Eleni Miltsakaki. 2012. “Do NLP and Machine Learning Improve Traditional Readability Formulas?” In Proceedings of the First Workshop on Predicting and Improving Text Readability for Target Reader Populations, 49–57. Montréal.
Georgatou, Spyridoula. 2016. “Approaching Readability Features in Greek School Books.” Master thesis, University of Tübingen.
Giagkou, Maria, Vicky Kantzou, Spyridoula Stamouli, and Maria Tzevelekou. 2015. “Discriminating CEFR Levels in Greek L2: A Corpus-based Study of Young Learners’ Written Narratives.” Bergen Language and Linguistics Studies 6: 153–169.
Heilman, Michael, Kevyn Collins-Thompson, and Maxine Eskenazi. 2008. “An Analysis of Statistical Models and Features for Reading Difficulty Prediction.” In The Third Workshop on Innovative Use of NLP for Building Educational Applications: Proceedings of the Workshop, 71–79. ACL.
Callan, Jamie, and Maxine Eskenazi. 2007. “Combining Lexical and Grammatical Features to Improve Readability Measures for First and Second Language Texts.” In Proceedings of HLT-NAACL’07, 460–467. Association for Computational Linguistics.
Lu, Xiaofei. 2010. “Automatic Analysis of Syntactic Complexity in Second Language Writing.” International Journal of Corpus Linguistics 15(4): 474–496.
Lu, Xiaofei. 2011. “A Corpus-Based Evaluation of Syntactic Complexity Measures as Indices of College-Level ESL Writers’ Language Development.” TESOL Quarterly 45(1): 36–62.
Ott, Niels, and Detmar Meurers. 2010. “Information Retrieval for Education: Making Search Engines Language Aware.” Themes in Science and Technology Education 3(1–2): 9–30.
Petersen, Sarah E., and Mari Ostendorf. 2009. “A Machine Learning Approach to Reading Level Assessment.” Computer Speech and Language 23(1): 89–106.
Pilán, Ildikó, Elena Volodina, and Richard Johansson. 2014. “Rule-Based and Machine Learning Approaches for Second Language Sentence-Level Readability.” In Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications, 174–184. Association for Computational Linguistics.
Pitler, Emily, and Ani Nenkova. 2008. “Revisiting Readability: A Unified Framework for Predicting Text Quality.” In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, 186–195. Honolulu: ACL.
Prokopidis, Prokopis, and Harris Papageorgiou. 2014. “Experiments for Dependency Parsing of Greek.” In Proceedings of the First Joint Workshop on Statistical Parsing of Morphologically Rich Languages and Syntactic Analysis of Non-Canonical Languages, 90–96. Dublin, Ireland.
Prokopidis, Prokopis, Byron Georgantopoulos, and Harris Papageorgiou. 2011. “A Suite of NLP Tools for Greek.” In The 10th International Conference of Greek Linguistics, 373–383. Komotini, Greece.
Schwarm, Sarah E., and Mari Ostendorf. 2005. “Reading Level Assessment Using Support Vector Machines and Statistical Language Models.” In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), 523–530. Ann Arbor, Michigan.
Tzimokas, Dimitrios, and Sotirios Tantos. 2014. “Logismiko Anagnosimotitas Ellinikon Keimenon: Ena Neo Ergaleio gia ti Didaskalia tis Ellinikis os Ksenis/Deuteris Glossas kai tin Pistopoiisi Ellinomatheias.” Paper presented at Diethnes Synedrio gia ti Didaskalia kai tin Pistopoiisi tis Ellinikis os Ksenis/Deuteris Glossas, Thessaloniki, October 25. httpspeakgreekgrelimagespdftzimwkaspdf
Vajjala, Sowmya, and Detmar Meurers. 2012. “On Improving the Accuracy of Readability Classification Using Insights from Second Language Acquisition.” In Proceedings of the Seventh Workshop on Building Educational Applications Using NLP, 163–173. Association for Computational Linguistics.
Vyatkina, Nina. 2012. “The Development of Second Language Writing Complexity in Groups and Individuals: A Longitudinal Learner Corpus Study.” The Modern Language Journal 96(4): 576–598.
readers' special needs (e.g. learning difficulties). Readability estimation is particularly relevant for second language (L2) learners as well. From the L2 perspective, the aim is to automatically identify or retrieve a text given the proficiency level of the learner or group of learners.
To this end, recent studies attempt to grade L2 texts according to proficiency levels in order to facilitate reading in L2 or as an aid to the selection of assessment material (e.g. Centre for the Greek Language 2013; Tzimokas and Tantos 2014; François and Fairon 2012; Ott and Meurers 2010; Pilán et al. 2014; Vajjala and Meurers 2012). In a similar approach, the development of productive skills in L2 (mainly writing) is investigated in view of an automated evaluation of L2 writing (e.g. Lu 2010, 2011; Vyatkina 2012; Giagkou et al. 2015).
The long tradition of L1 readability assessment, dating back to the early 20th century (see DuBay 2006), has bequeathed readability formulas (e.g. Flesch Reading Ease Score, Flesch-Kincaid Grade Level, Fog index, SMOG, etc.) that assign a difficulty grade or level to a text by relying on surface linguistic features, such as sentence and word length, as simple proxies for syntactic complexity and vocabulary burden respectively. More recently, advances in NLP have boosted readability research. That is, new resources (electronically available texts) and new tools (taggers, parsers, semantic treebanks, etc.) have made it feasible to apply machine learning techniques to large training corpora and to quantify more thorough and linguistically sound text features. Semantic and discourse features are investigated, e.g. named entities (Barzilay & Lapata 2008) and lexical cohesion (Pitler & Nenkova 2008). Shallow syntactic complexity indicators, such as average sentence length, are combined with the height of syntactic trees (see also Heilman et al. 2008). Instead of simple proxies of vocabulary burden, N-gram Language Models (LM) are used for predicting the grade level of texts (Callan and Eskenazi 2007; Petersen & Ostendorf 2009; Schwarm and Ostendorf 2005).
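To make the notion of a surface-based formula concrete, the two Flesch measures named above can be sketched as follows. The coefficients are the standard published ones, which were calibrated on English; the counts in the example are hypothetical and are not drawn from any corpus discussed here.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease: higher scores indicate easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level: an approximate US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Hypothetical counts for a short passage (illustrative only).
counts = {"words": 120, "sentences": 8, "syllables": 180}
print(round(flesch_reading_ease(**counts), 2))
print(round(flesch_kincaid_grade(**counts), 2))
```

Both functions depend only on average sentence length and average syllables per word, which is why such formulas are described as simple proxies for syntactic complexity and vocabulary burden.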
In this paper, we present an investigation of linguistic features of texts addressed to learners of Greek as a second language (L2). The goal of this study is to identify the textual properties that indicate the development of reading skills in Greek L2, with the aim of employing these properties as parameters for automatic proficiency level labelling. The set of features investigated in the current study draws on traditional readability research combined with NLP-enabled features and machine learning techniques for text classification, as this merging was found to result in performance gain (François & Miltsakaki 2012).
The paper is organized as follows: Section 2 provides information on the corpus used and the features identified, selected and computed in order to form the dataset for the analysis. In Section 3, the analysis applied to the features is presented and the results are analyzed. We conclude with a summary of the main findings and their implications for the directions of future work in view of automatic proficiency level classification for Greek L2.
2 Datasets
2.1 Corpus
For the purposes of this investigation, a Greek L2 text set that is labelled for proficiency levels in an objective and qualified way, and can thus be considered a gold standard, was deemed necessary. Such a dataset was retrieved from the Greek L2 textbooks published by the Centre of Intercultural and Migration Studies (EDIAMME) and freely available online. These textbooks are addressed to Greek migrants living abroad, from preschoolers (aged 6) to 18-year-olds, learning Greek as a second or foreign language. EDIAMME employs five proficiency levels aligned to the Greek educational system grades and to CEFR levels (Council of Europe 2001), as presented in Table 1.
Age  School grade  EDIAMME level  Language content                          CEFR level alignment
6    Preschool     1              Pre-reading, reading                      A1
7    1
8    2
9    3             2              Speaking and writing consolidation        A2
10   4
11   5             3              Further practice in speaking and writing  B1
12   6
13   7             4              Independent writing                       B2 & C1
14   8
15   9
16   10            5              Greek language and literature             C2
17   11
18   12
Table 1 | EDIAMME proficiency levels (Damanakis 2004, 76) and their alignment to CEFR levels (EDIAMME 2014)
Only prose texts were extracted from the textbooks, while poems, lyrics, exercises and guidelines to the exercises were excluded. The selected texts belong to different genres (mainly narrative, descriptive, expository and procedural) and types (letters, announcements, instructions, diary entries, etc.). Dialogues were also included, as they are very frequently used as educational material in L2 textbooks, though the role/name of the speaker was removed.
The final corpus employed in this investigation comprises 753 texts and a total of 112169 tokens (Table 2). Each individual text inherited the proficiency level assigned to the textbook it was retrieved from, e.g. a text drawn from a textbook labelled as level 5 was considered as addressed to level 5 learners.¹
Grouped levels    EDIAMME levels  Texts  Sentences  Tokens
1 (CEFR A1-A2)    1               24     136        720
                  2               295    4552       33636
2 (CEFR B1-C1)    3               108    1263       8780
                  4               147    2305       19272
3 (CEFR C2)       5               179    3356       49761
Totals                            753    11612      112169
Table 2 | Corpus description
¹ It should be noted that this decision introduces a degree of “noise” into the data: although a low-level textbook is not expected to include a text addressed to higher levels, the reverse is not equally unlikely. E.g. certain texts retrieved from a level 5 textbook can actually address lower-level learners.
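The label inheritance and the level grouping of Table 2 can be sketched in a few lines. The per-level counts below are copied from Table 2; the function and variable names are hypothetical and serve illustration only.

```python
# Per-level counts copied from Table 2: (texts, sentences, tokens).
TABLE_2 = {
    1: (24, 136, 720),
    2: (295, 4552, 33636),
    3: (108, 1263, 8780),
    4: (147, 2305, 19272),
    5: (179, 3356, 49761),
}

# EDIAMME level -> grouped level, following the grouping in Table 2.
GROUP_OF_LEVEL = {1: 1, 2: 1, 3: 2, 4: 2, 5: 3}


def label_text(textbook_level: int) -> int:
    """A text inherits its textbook's level, which maps to a grouped level."""
    return GROUP_OF_LEVEL[textbook_level]


def group_totals(table):
    """Sum texts, sentences and tokens per grouped level."""
    totals = {}
    for level, counts in table.items():
        group = GROUP_OF_LEVEL[level]
        previous = totals.get(group, (0, 0, 0))
        totals[group] = tuple(p + c for p, c in zip(previous, counts))
    return totals


totals = group_totals(TABLE_2)
corpus_total = tuple(sum(column) for column in zip(*TABLE_2.values()))
print(totals)
print(corpus_total)
```

Summing per group makes the uneven size of the three grouped levels visible, which is worth keeping in mind when reading the level-pair comparisons.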
The texts were automatically annotated for morphological types, syntactic dependencies and phrase structure using the Institute for Language and Speech Processing NLP tools pipeline (Prokopidis et al. 2011; Prokopidis and Papageorgiou 2014).
2.2 Feature selection and computation
The set of features investigated as indices of the proficiency level was selected on the basis of previous research on L1 and L2 readability assessment, as well as on second language acquisition and development. These features capture morphological, syntactic, lexical/semantic and other attributes of the text that are salient to the target proficiency level discrimination and prediction task.
In total, 303 text features were identified and computed. These fall roughly into the following categories:
a) Surface features: word and sentence length (e.g. average word length), number of characters, punctuation marks, numbers, etc.
b) Lexical/semantic: lexical density (i.e. content to functional words), lexical variation (e.g. type/token ratio, hapax/dis-legomena), including noun and verb variation measures, text entropy, lexical richness, etc.
c) Morphological: frequencies and ratios of the different parts of speech, including their forms, e.g. ratio of passive verbs to verbs, ratio of nouns in the genitive case to nouns, ratio of 1st person personal pronouns to pronouns, etc.
d) Syntactic: frequencies and ratios of the different syntactic roles (e.g. subjects to verbs ratio), measures of the dependency trees (e.g. depth and height of syntactic trees), phrase structure (e.g. length of noun, verb and adjectival phrases), subordination and apposition (e.g. average number of coordinating and subordinating conjunctions per sentence), etc.
e) Discourse-based features: e.g. use of relative pronouns as an index of the degree of anaphora density, frequency of present and past tenses as indices of temporality and narrativity, etc.
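As an illustration of category (b), a few of the lexical measures can be sketched as below. This is a minimal sketch, not the tool used in the study: the function name, the toy sentence and the function-word list are hypothetical, and lexical density is taken here as the ratio of content to function word tokens, following the gloss above.

```python
from collections import Counter


def lexical_features(tokens, function_words):
    """Compute a few lexical variation and density measures for one text."""
    tokens = [t.lower() for t in tokens]
    counts = Counter(tokens)
    n_function = sum(1 for t in tokens if t in function_words)
    n_content = len(tokens) - n_function
    return {
        # Distinct word forms divided by running words.
        "type_token_ratio": len(counts) / len(tokens),
        # Word forms occurring exactly once / exactly twice.
        "hapax_legomena": sum(1 for c in counts.values() if c == 1),
        "dis_legomena": sum(1 for c in counts.values() if c == 2),
        # Content to function words; None when no function word occurs.
        "lexical_density": n_content / n_function if n_function else None,
    }


# Toy example; the function-word list is an illustrative assumption.
tokens = "το παιδί διαβάζει το βιβλίο και το παιδί γράφει".split()
features = lexical_features(tokens, function_words={"το", "και"})
print(features)
```

In practice, content and function words would be separated with the part-of-speech annotation rather than with a word list.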
The defined features were computed with specialized software, the ILSP FeatExt tool, developed in Python. The input of FeatExt is any corpus of Greek texts automatically annotated for part of speech, syntactic dependencies and phrase structure. It calculates the values of raw surface features (frequencies of words, sentences, nouns, verbs, etc.) and computes their standardized values (i.e. meaningful ratios). In order to cater for zero values, a MinMaxScaler transformation is applied to all raw features. The output is a table of extracted feature values, preferably in CSV format. Settings can be modified through an optional configuration file to define, among others, the set of features to be computed, the corpus location, or additional feature-relevant data, such as a list of words to be counted (e.g. functional words, basic vocabulary for a specific proficiency level or topic, etc.).
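The two steps described above, ratios computed from raw counts and min-max scaling of raw features, can be sketched as follows. This is not FeatExt code: the helper names and the counts are hypothetical, and the scaling is a plain reimplementation of what a min-max scaler such as scikit-learn's MinMaxScaler computes with its default range of 0 to 1.

```python
def ratio(numerator, denominator):
    """Ratio feature that falls back to 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def min_max_scale(values):
    """Map values linearly onto [0, 1]; a constant column maps to 0.0."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


# Hypothetical raw counts for three texts.
texts = [
    {"passive_verbs": 0, "verbs": 12},
    {"passive_verbs": 3, "verbs": 15},
    {"passive_verbs": 6, "verbs": 20},
]

passive_ratio = [ratio(t["passive_verbs"], t["verbs"]) for t in texts]
passive_scaled = min_max_scale([t["passive_verbs"] for t in texts])
print(passive_ratio)
print(passive_scaled)
```

The zero-denominator fallback in `ratio` is a design choice of this sketch; the source does not state how FeatExt handles that case.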
3 Analysis and results
In order to investigate the underlying associations of text features with the proficiency level, correlation analysis was applied between all the extracted features and the grouped proficiency levels. Table 3 reports the twenty features that exhibited the highest absolute values of Spearman's rho correlation coefficient, in descending order (p<0.05).
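For reference, Spearman's rho is the Pearson correlation of the rank-transformed variables, with tied values receiving the mean of their ranks. The sketch below implements this definition directly; the feature values and level labels are hypothetical and do not come from the corpus.

```python
def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical average sentence lengths and grouped levels for six texts.
avg_sentence_length = [6.1, 7.4, 9.8, 9.0, 14.2, 15.5]
grouped_level = [1, 1, 2, 2, 3, 3]
rho = spearman_rho(avg_sentence_length, grouped_level)
print(round(rho, 3))
```

Because the grouped level takes only three values, ties are unavoidable, which is why average ranks are used rather than the shortcut formula based on squared rank differences.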
Among the best-performing features, the average number of noun phrases in the genitive case per sentence was found to exhibit the highest correlation coefficient (rho=0.542). The association of the genitive case with the text's level is also evidenced by the performance of two more features, i.e. the average number of adjectival phrases in the genitive case per sentence (rho=0.473) and the average length of adjectival phrases in the genitive case (rho=0.448). Looking at these results from a different angle, the influence of phrase structure, especially of the length and relative frequency of nominal phrases, is apparent: out of the 20 best-performing features, six are indices of phrase structure (features in ranks 1, 6, 8, 12, 15 and 16 in Table 3). The frequency of use of modifiers, namely of adjectives, also seems to be highly correlated with the proficiency level: the more adjectives used in a text, the more likely it is that the text is addressed to higher-level learners. This is evidenced by the average number of adjectival phrases and of adjectives per sentence.
Another important finding is highlighted by the performance of features that attempt to quantify syntactic dependencies. These include the width and height of dependency trees (rho=0.495 and 0.486 respectively), as well as the number of leaves and governor nodes (rho=0.490 and 0.485 respectively). Their emergence in the top ranks of Table 3 qualifies them as key predictors of the proficiency level.
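The dependency-tree measures mentioned here can be sketched for a single sentence as below. The exact definitions used in the study are not given in the text, so the following are assumptions: height is the longest root-to-node path in edges, width is the largest number of nodes at any one depth, leaves are tokens without dependents, and governors are tokens with at least one dependent. The input encoding (a list of head indices, with -1 for the root) is likewise hypothetical.

```python
from collections import Counter


def tree_measures(heads):
    """heads[i] is the index of token i's head; the root has head -1.

    Assumes a well-formed tree (one root, no cycles).
    """
    def depth(i):
        d = 0
        while heads[i] != -1:
            i = heads[i]
            d += 1
        return d

    depths = [depth(i) for i in range(len(heads))]
    governors = {h for h in heads if h != -1}
    return {
        "height": max(depths),
        "width": max(Counter(depths).values()),
        "leaves": len(heads) - len(governors),
        "governors": len(governors),
    }


# "Το παιδί διαβάζει ένα βιβλίο": each article depends on its noun,
# both nouns depend on the verb, and the verb is the root.
heads = [1, 2, -1, 4, 2]
measures = tree_measures(heads)
print(measures)
```

Averaging each measure over the sentences of a text would give per-text features of the kind listed in Table 3.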
Table 3 | Top-20 features highly correlated with EDIAMME grouped levels and post hoc multiple comparisons between level-pairs
Rank  Feature                                                 Spearman's rho  EDIAMME grouped level-pairs (1 vs 2, 2 vs 3, 1 vs 3)
1     Av. no. of noun phrases in gen. case per sentence       0.542
2     Av. width of dependency trees                           0.495
3     Av. no. of leaves in dependency trees                   0.490
4     Av. height of dependency trees                          0.486
5     Av. sentence length                                     0.485
6     Av. no. of adjectival phrases per sentence              0.485
7     Av. no. of governor nodes in dependency trees           0.485
8     Av. no. of noun phrases per sentence                    0.480
9     % of sentences with length > 20 words                   0.477
10    Av. no. of adjectives per sentence                      0.474
11    Av. word length                                         0.474
12    Av. no. of adjectival phrases in gen. case per sentence 0.473
13    % of sentences with length > 10 words                   0.470
14    Terminal punctuation to total characters ratio          -0.461
15    Av. length of adjectival phrases in gen. case           0.448
16    Av. no. of adjectival phrases in acc. case per sentence 0.446
17    % of sentences with length > 30 words                   0.443
18    Av. no. of passive verbs per sentence                   0.442
19    Relative pronouns to pronouns ratio                     0.439
20    Av. no. of prepositions per sentence                    0.438
Different aspects of syntactic complexity are also highlighted by the average number of passive verbs and prepositions per sentence. As expected, passive constructions are rarely used in lower levels, while learners encounter them more and more frequently in textbooks as their reading skills develop. The same is true for prepositions, a feature indicating that higher proficiency level texts employ more complex and compound sentences.
The statistically significant correlation exhibited by the ratio of relative pronouns to pronouns (rho=0.439) signifies the role of anaphora. As anaphora resolution is considered a linguistically and cognitively demanding task during reading, anaphoric structures are rare in lower levels but significantly more frequent in upper levels. As a result, the use of relative pronouns can be considered a successful discriminator of proficiency levels.
The list of the best-performing features also includes some more “traditional” indices of text complexity, such as word and sentence length. The average sentence length appears in rank 5 in Table 3 (rho=0.485), while relevant features that quantify sentence length from a different perspective are also present (the percentage of sentences with more than 10, 20 and 30 words). Additionally, the presence of the ratio of terminal punctuation to total characters should also be interpreted as an inverse of sentence length. Regarding lexical features, it is noticeable that among the various features investigated (lexical diversity, density, etc.), only the average word length is present among the top performers (rho=0.474).
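The sentence-length features and their relation to terminal punctuation can be illustrated with a small sketch. The function names, the default set of terminal marks (which includes ";" on the assumption that it serves as the Greek question mark) and the choice to count spaces among the characters are all assumptions of this example.

```python
def sentence_length_features(sentences, thresholds=(10, 20, 30)):
    """sentences: list of token lists. Returns the average length and
    the percentage of sentences longer than each threshold."""
    lengths = [len(s) for s in sentences]
    features = {"avg_sentence_length": sum(lengths) / len(lengths)}
    for t in thresholds:
        longer = sum(1 for n in lengths if n > t)
        features[f"pct_sentences_gt_{t}"] = 100 * longer / len(lengths)
    return features


def terminal_punctuation_ratio(text, terminals=".!;"):
    """Terminal punctuation marks divided by all characters in the text."""
    return sum(1 for ch in text if ch in terminals) / len(text)


# Toy input: three sentences of 5, 12 and 25 tokens.
sentences = [["w"] * 5, ["w"] * 12, ["w"] * 25]
length_features = sentence_length_features(sentences)

# The same six words as three short sentences or one long sentence.
short_ratio = terminal_punctuation_ratio("A b. C d. E f.")
long_ratio = terminal_punctuation_ratio("A b c d e f.")
print(length_features)
print(short_ratio > long_ratio)
```

Shorter sentences yield more terminal marks per character, which is consistent with the negative sign of this feature's correlation in Table 3.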
A more thorough investigation of the above features employed one-way ANOVA for means comparison across levels, which resulted in statistically significant main effects for all of the 20 features. Since, however, this type of analysis cannot determine whether the mean values of a feature are statistically different between all possible level pairs, post-hoc multiple comparisons (Bonferroni tests) were also applied. The results are presented in Table 3: statistically different means for each feature are indicated for each level combination separately. These comparisons indicate that all features can successfully discriminate group 3 (i.e. EDIAMME level 5, CEFR C2) from lower levels (both from group 2 and group 1). However, some of the features were not as successful in discriminating group 1 (i.e. EDIAMME levels 1 and 2, CEFR A1, A2) from group 2 (i.e. EDIAMME levels 3, 4, CEFR B1-C1). Poor performers in discriminating group 1 from group 2 were all the features relevant to sentence length, with the exception of the proportion of sentences with more than 20 words. This implies that a group 1 text is unlikely to include lengthier sentences, thus suggesting a possible threshold for the transition from CEFR A2 to B1 level.
4 Conclusions and discussion
The current investigation highlighted a number of textual features automatically extracted from a morphologically and syntactically annotated Greek L2 corpus. With the aim of identifying indices of text difficulty that are directly associated with the proficiency level, we employed statistical analysis and put forward the best-performing features. These can be regarded as potential predictors of the proficiency level of a previously unseen text in an automatic labelling/classification approach.
FEATURE EXTRACTIoN AND ANAlYSIS IN GREEK | 359
The paper is organized as follows Section 2 provides information on the corpus used and the features identified selected and computed in order to form the dataset for the analysis In Section 3 the analysis applied on the features is presented and the results are analyzed We conclude with a summary of the main findings and their implications to the directions of future work in view of automatic proficiency level classification for Greek l2
2 datasets
21 Corpus
For the purposes of this investigation a Greek l2 text set that is labelled for proficiency levels in an objective and qualified way and can thus be considered as gold-standard deemed necessary Such dataset was retrieved from the Greek l2 textbooks published by the Centre of Intercultural and Migration Studies (EDIAMME) and freely avail-able online These textbooks are addressed to Greek migrants living abroad from pre-schoolers (aged 6) to 18 year-olds learning Greek as a second or foreign language EDIAMME employs five proficiency levels aligned to the Greek educational system grades and to CEFR levels (Council of Europe 2001) as presented in Table 1
Age school grade ediAMMe level
Language content
cefr level alignment
6 Preschool1 Pre-reading
reading A17 18 29 3
2Speaking and writing consolidation
A210 4
11 53
Further practice in speaking and writing
B112 6
13 74 Independent
writing B2 amp C114 815 9
360 | GIAGKoU ET Al
Table 1 | EDIAMME proficiency levels (Damanakis 2004 76) and their alignment to CEFR levels (EDIAMME 2014)
only prose texts were extracted from the textbooks while poems lyrics exercises and guidelines to the exercises were excluded The selected texts belong to different gen-res (mainly narrative descriptive expository and procedural) and types (letters an-nouncements instructions diary entry etc) Dialogues were also included as they are very frequently used as educational material in l2 textbooks though the rolename of the speaker was removed
The final corpus employed in this investigation comprises 753 texts and a total of 112169 tokens (Table 2) Each individual text inherited the proficiency level assigned to the textbook it was retrieved from eg a text drawn from a textbook labeled as level 5 was considered as addressed to level 5 learners1
grouped levels
ediAMMe levels
texts sentences tokens
1 (CEFR A1-A2)
1 24 136 720
2 295 4552 33636
2 (CEFR B1-C1)
3 108 1263 8780
4 147 2305 19272
3 (CEFR C2) 5 179 3356 49761totals 753 11612 112169
Table 2 | Corpus description
1 It should be noted that this decision imposes a degree of ldquonoiserdquo to the data as although a low level textbook is not expected to include a text addressed to higher levels the reverse is not equally unlikely Eg certain texts retrieved from a level 5 textbook can actually address lower level learners
16 105 Greek language
and literature C217 1118 12
FEATURE EXTRACTIoN AND ANAlYSIS IN GREEK | 361
The texts were automatically annotated for morphological types syntactic dependen-cies and phrase structure using the Institute for language and Speech Processing NlP tools pipeline (Prokopidis et al 2011 Prokopidis and Papageorgiou 2014)
2.2 Feature selection and computation
The set of features investigated as indices of the proficiency level was selected on the basis of previous research on L1 and L2 readability assessment, as well as on second language acquisition and development. These features capture morphological, syntactic, lexical/semantic and other attributes of the text that are salient to the target proficiency level discrimination and prediction task.
In total, 303 text features were identified and computed. These fall roughly into the following categories:
a) Surface features: word and sentence length (e.g. average word length), number of characters, punctuation marks, numbers, etc.
b) Lexical/semantic: lexical density (i.e. content to functional words), lexical variation (e.g. type/token ratio, hapax/dis legomena), including noun and verb variation measures, text entropy, lexical richness, etc.
c) Morphological: frequencies and ratios of the different parts of speech, including their forms, e.g. ratio of passive verbs to verbs, ratio of nouns in the genitive case to nouns, ratio of 1st person personal pronouns to pronouns, etc.
d) Syntactic: frequencies and ratios of the different syntactic roles (e.g. subjects to verbs ratio), measures of the dependency trees (e.g. depth and height of syntactic trees), phrase structure (e.g. length of noun, verb and adjectival phrases), subordination and apposition (e.g. average number of coordinating and subordinating conjunctions per sentence), etc.
e) Discourse-based features: e.g. use of relative pronouns as an index of the degree of anaphora density, frequency of present and past tenses as indices of temporality and narrativity, etc.
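A few of the surface and lexical measures in categories (a) and (b) can be illustrated with a short sketch. It assumes the text is already split into sentences and word tokens, and it uses a tiny, hypothetical function-word list (`FUNCTION_WORDS`); the function name and the exact formulas are illustrative assumptions, not the definitions used in the study.

```python
import math
from collections import Counter

# Hypothetical, deliberately tiny function-word list (a real one is much longer).
FUNCTION_WORDS = {"ο", "η", "το", "και", "να", "σε", "με", "για", "που", "από"}


def surface_and_lexical_features(sentences, function_words=FUNCTION_WORDS):
    """sentences: list of sentences, each a list of word tokens (no punctuation)."""
    words = [w.lower() for sent in sentences for w in sent]
    counts = Counter(words)
    n_words = len(words)
    n_function = sum(1 for w in words if w in function_words)
    n_content = n_words - n_function
    # Shannon entropy of the word-form distribution, in bits
    entropy = -sum((c / n_words) * math.log2(c / n_words) for c in counts.values())
    return {
        "avg_sentence_length": n_words / len(sentences),
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "type_token_ratio": len(counts) / n_words,
        "hapax_legomena": sum(1 for c in counts.values() if c == 1),
        "dis_legomena": sum(1 for c in counts.values() if c == 2),
        # content-to-function-word ratio; max() avoids division by zero
        "lexical_density": n_content / max(n_function, 1),
        "entropy": entropy,
    }


if __name__ == "__main__":
    sample = [
        ["η", "γάτα", "κοιμάται"],
        ["η", "γάτα", "και", "ο", "σκύλος", "παίζουν"],
    ]
    for name, value in surface_and_lexical_features(sample).items():
        print(name, round(value, 3))
```

Measures such as the type/token ratio depend on text length, which is one reason a corpus with texts of very different sizes may call for normalized variants.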
The defined features were computed with a specialized piece of software, the ILSP FeatExt tool, developed in Python. The input of FeatExt is any corpus of Greek texts automatically annotated for part of speech, syntactic dependencies and phrase structure. It calculates the values of raw surface features (frequencies of words, sentences, nouns, verbs, etc.) and computes their standardized values (i.e. meaningful ratios). In order to cater for zero values, a MinMaxScaler transformation is applied to all raw features. The output is a table of extracted feature values, preferably in CSV format. Settings can be modified through an optional configuration file to define, among others, the set of features to be computed, the corpus location, or additional feature-relevant data such as a list of words to be counted (e.g. functional words, basic vocabulary for a specific proficiency level or topic, etc.).
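The processing steps described above (raw counts, ratios, rescaling, CSV output) can be sketched in a few lines. This is not the FeatExt code: the function names and feature keys are hypothetical, and the sketch assumes that the MinMaxScaler mentioned in the text denotes the usual min-max rescaling of each feature column to the interval [0, 1], as done by the scikit-learn class of that name with its default settings.

```python
import csv
import io


def to_ratios(raw):
    """Turn the raw counts of one text into length-independent ratios."""
    sentences = max(raw["sentences"], 1)  # guard against empty texts
    verbs = max(raw["verbs"], 1)          # guard against verbless texts
    return {
        "words_per_sentence": raw["words"] / sentences,
        "nouns_per_sentence": raw["nouns"] / sentences,
        "passive_to_verbs": raw["passive_verbs"] / verbs,
    }


def min_max_scale(column):
    """Rescale a list of values to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def feature_table_csv(rows):
    """rows: list of dicts sharing the same keys; returns the table as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    raw_counts = [
        {"words": 20, "sentences": 4, "nouns": 6, "verbs": 5, "passive_verbs": 1},
        {"words": 90, "sentences": 6, "nouns": 30, "verbs": 12, "passive_verbs": 4},
    ]
    print(feature_table_csv([to_ratios(r) for r in raw_counts]))
    print(min_max_scale([r["words"] for r in raw_counts]))
```

Ratios make texts of different lengths comparable, while min-max rescaling puts raw counts with very different ranges on a common scale.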
3 Analysis and results
In order to investigate the underlying associations of text features with the proficiency level, correlation analysis was applied between all the extracted features and the grouped proficiency levels. Table 3 reports the twenty features that exhibited the highest absolute values of Spearman's rho correlation coefficient, in descending order (p<0.05).
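For readers unfamiliar with the statistic, Spearman's rho is the Pearson correlation of the rank-transformed variables, with tied values sharing their mean rank. The sketch below computes it in plain Python and ranks features by absolute rho; the function names and the toy data are hypothetical, and the significance test behind the reported p-values is not reproduced here.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share this rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    return pearson(average_ranks(x), average_ranks(y))


def rank_features(feature_columns, levels, top=20):
    """Return the `top` (name, rho) pairs with the highest absolute rho."""
    scores = {name: spearman_rho(col, levels) for name, col in feature_columns.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)[:top]


if __name__ == "__main__":
    grouped_levels = [1, 1, 2, 2, 3, 3]           # toy data, not from the corpus
    toy_feature = [0.1, 0.3, 0.2, 0.5, 0.6, 0.9]  # toy data, not from the corpus
    print(round(spearman_rho(toy_feature, grouped_levels), 3))
```

Because the grouped level takes only three values, ties are unavoidable, so the rank-averaging step matters. In practice a library routine such as `scipy.stats.spearmanr` would also supply the p-value.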
Among the best performing features, the average number of noun phrases in the genitive case per sentence was found to exhibit the highest correlation coefficient (rho=0.542). The association of the genitive case with the text's level is also evidenced by the performance of two more features, i.e. the average number of adjectival phrases in the genitive case per sentence (rho=0.473) and the average length of adjectival phrases in the genitive case (rho=0.448). Viewed from a different angle, these results also show the influence of phrase structure, especially of the length and relative frequency of nominal phrases: out of the 20 best performing features, six are indices of phrase structure (features in ranks 1, 6, 8, 12, 15 and 16 in Table 3). The frequency of use of modifiers, namely of adjectives, also seems to be highly correlated to the proficiency level: the more adjectives used in a text, the more likely it is that the text is addressed to higher level learners. This is evidenced by the average number of adjectival phrases and of adjectives per sentence (rho=0.485 and 0.474, respectively).
Another important finding is highlighted by the performance of features that attempt to quantify syntactic dependencies. These include the width and height of dependency trees (rho=0.495 and 0.486, respectively), as well as the number of leaves and governor nodes (rho=0.490 and 0.485, respectively). Their emergence in the top ranks of Table 3 qualifies them as key predictors of the proficiency level.
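The dependency-tree measures named above can be computed from a sentence's head indices. The paper does not spell out its definitions, so the following sketch works under stated assumptions: height is the longest path from the root to any token, width is the largest number of tokens found at the same depth, leaves are tokens without dependents, and governors are tokens with at least one dependent. The function name and the example parse are hypothetical.

```python
from collections import Counter


def tree_measures(heads):
    """heads[i] is the 1-based index of the governor of token i+1; 0 marks the root.

    Assumes a well-formed tree (exactly one root, no cycles).
    """
    n = len(heads)
    dependents = Counter(h for h in heads if h != 0)

    def depth(token):
        # number of arcs between the token and the root
        steps = 0
        while heads[token - 1] != 0:
            token = heads[token - 1]
            steps += 1
        return steps

    depths = [depth(t) for t in range(1, n + 1)]
    tokens_per_depth = Counter(depths)
    return {
        "height": max(depths),
        "width": max(tokens_per_depth.values()),
        "leaves": sum(1 for t in range(1, n + 1) if dependents[t] == 0),
        "governors": sum(1 for t in range(1, n + 1) if dependents[t] > 0),
    }


if __name__ == "__main__":
    # Hypothetical parse of a five-token sentence: tokens 1 and 4 are determiners
    # attached to tokens 2 and 5, which both depend on the verb (token 3, the root).
    print(tree_measures([2, 3, 0, 5, 3]))
```

Averaging these values over all sentences of a text yields per-text features of the kind ranked 2, 3, 4 and 7 in Table 3.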
Table 3 | Top-20 features highly correlated with EDIAMME grouped levels and post hoc multiple comparisons between level-pairs

| Rank | Feature | Spearman's rho |
|---|---|---|
| 1 | Av. no. of noun phrases in gen. case per sentence | 0.542 |
| 2 | Av. width of dependency trees | 0.495 |
| 3 | Av. no. of leaves in dependency trees | 0.490 |
| 4 | Av. height of dependency trees | 0.486 |
| 5 | Av. sentence length | 0.485 |
| 6 | Av. no. of adjectival phrases per sentence | 0.485 |
| 7 | Av. no. of governor nodes in dependency trees | 0.485 |
| 8 | Av. no. of noun phrases per sentence | 0.480 |
| 9 | % of sentences with length > 20 words | 0.477 |
| 10 | Av. no. of adjectives per sentence | 0.474 |
| 11 | Av. word length | 0.474 |
| 12 | Av. no. of adjectival phrases in gen. case per sentence | 0.473 |
| 13 | % of sentences with length > 10 words | 0.470 |
| 14 | Terminal punctuation to total characters ratio | -0.461 |
| 15 | Av. length of adjectival phrases in gen. case | 0.448 |
| 16 | Av. no. of adjectival phrases in acc. case per sentence | 0.446 |
| 17 | % of sentences with length > 30 words | 0.443 |
| 18 | Av. no. of passive verbs per sentence | 0.442 |
| 19 | Relative pronouns to pronouns ratio | 0.439 |
| 20 | Av. no. of prepositions per sentence | 0.438 |

EDIAMME grouped level-pairs compared: 1 vs 2, 2 vs 3, 1 vs 3 (the per-pair significance marks are not preserved in this copy).
Different aspects of syntactic complexity are also highlighted by the average number of passive verbs and prepositions per sentence. As expected, passive constructions are rarely used in lower levels, while learners encounter them more and more frequently in textbooks as their reading skills develop. The same is true for prepositions, a feature that indicates that higher proficiency level texts employ more complex-compound sentences.
The statistically significant correlation of the ratio of relative pronouns to pronouns (rho=0.439) points to the role of anaphora. As anaphora resolution is considered a linguistically and cognitively demanding task during reading, anaphoric structures are rare in lower levels but significantly more frequent in upper levels. As a result, the use of relative pronouns can be considered a successful discriminator of proficiency levels.
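As an illustration of how such indices could be derived from part-of-speech annotation, the sketch below counts passive verbs and prepositions per sentence and the share of relative pronouns among all pronouns. The tag labels (`VERB_PASS`, `PREP`, `PRON_REL`, `PRON_PERS`) are hypothetical placeholders, not the ILSP tagset, and the function name is likewise an assumption.

```python
def morphosyntactic_indices(tagged_sentences):
    """tagged_sentences: list of sentences, each a list of (token, tag) pairs.

    Tags are hypothetical labels chosen for this sketch.
    """
    tags = [tag for sentence in tagged_sentences for _, tag in sentence]
    n_sentences = len(tagged_sentences)
    pronouns = sum(1 for tag in tags if tag.startswith("PRON"))
    relative = tags.count("PRON_REL")
    return {
        "passive_verbs_per_sentence": tags.count("VERB_PASS") / n_sentences,
        "prepositions_per_sentence": tags.count("PREP") / n_sentences,
        # 0.0 when a text contains no pronouns at all
        "relative_to_pronouns": relative / pronouns if pronouns else 0.0,
    }


if __name__ == "__main__":
    sample = [
        [("το", "DET"), ("βιβλίο", "NOUN"), ("που", "PRON_REL"),
         ("γράφτηκε", "VERB_PASS"), ("από", "PREP"), ("αυτόν", "PRON_PERS")],
        [("αυτή", "PRON_PERS"), ("διαβάζει", "VERB_ACT")],
    ]
    print(morphosyntactic_indices(sample))
```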
The list of the best performing features also includes some more "traditional" indices of text complexity, such as word and sentence length. The average sentence length appears in rank 5 in Table 3 (rho=0.485), while related features that quantify sentence length from a different perspective are also present (the percentage of sentences with more than 10, 20 and 30 words). Additionally, the ratio of terminal punctuation to total characters (rho=-0.461) should also be interpreted as an inverse index of sentence length: the fewer sentence-final marks per character, the longer the sentences. Regarding lexical features, it is noticeable that among the various features investigated (lexical diversity, density, etc.) only the average word length is present in the top performers (rho=0.474).
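The sentence-length indices discussed here are simple to compute, as the following sketch shows. It is an illustration under assumptions: sentences arrive as lists of word tokens, the set of terminal marks is a guess that includes the Greek question mark (written ";"), and whitespace is excluded from the character total, which the paper does not specify. All names are hypothetical.

```python
# Assumed sentence-final marks; ";" (or U+037E) is the Greek question mark.
TERMINAL_MARKS = {".", "!", ";", "\u037e"}


def sentence_length_profile(sentences, thresholds=(10, 20, 30)):
    """sentences: list of token lists (words only).

    Returns the average sentence length and the percentage of sentences
    longer than each threshold.
    """
    lengths = [len(sentence) for sentence in sentences]
    profile = {"avg_sentence_length": sum(lengths) / len(lengths)}
    for threshold in thresholds:
        longer = sum(1 for n in lengths if n > threshold)
        profile[f"pct_longer_than_{threshold}"] = 100.0 * longer / len(lengths)
    return profile


def terminal_punctuation_ratio(text, marks=TERMINAL_MARKS):
    """Share of sentence-final marks among all non-whitespace characters."""
    chars = [c for c in text if not c.isspace()]
    return sum(1 for c in chars if c in marks) / len(chars)


if __name__ == "__main__":
    toy_sentences = [["w"] * 5, ["w"] * 12, ["w"] * 25, ["w"] * 31]
    print(sentence_length_profile(toy_sentences))
    print(terminal_punctuation_ratio("Ήρθε. Έφυγε."))
    print(terminal_punctuation_ratio("Ήρθε και μετά έφυγε."))
```

The second and third calls show the inverse relation: the same amount of text with fewer, longer sentences yields a lower punctuation ratio.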
A more thorough investigation of the above features employed one-way ANOVA for means comparison across levels, which resulted in statistically significant main effects for all of the 20 features. Since, however, this type of analysis cannot determine whether the mean values of a feature are statistically different between all possible level pairs, post hoc multiple comparisons (Bonferroni tests) were also applied. The results are presented in Table 3: statistically different means for each feature are indicated for each level combination separately. These comparisons indicate that all features can successfully discriminate group 3 (i.e. EDIAMME level 5, CEFR C2) from lower levels (both from group 2 and group 1). However, some of the features were not as successful in discriminating group 1 (i.e. EDIAMME levels 1 and 2, CEFR A1-A2) from group 2 (i.e. EDIAMME levels 3 and 4, CEFR B1-C1). Poor performers in discriminating group 1 from group 2 were all the features relevant to sentence length, with the exception of the proportion of sentences with more than 20 words. This implies that a group 1 text is unlikely to include sentences of more than 20 words, thus suggesting a possible threshold for the transition from CEFR A2 to B1 level.
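The two-step procedure (an omnibus one-way ANOVA followed by Bonferroni-corrected pairwise comparisons) can be sketched as follows. The code computes only the F statistic and applies the Bonferroni threshold to p-values supplied by the caller; turning F or pairwise test statistics into p-values needs a statistics library and is left out. Function names and all numbers in the usage example are hypothetical.

```python
def one_way_anova_f(groups):
    """groups: one list of feature values per level.

    Returns (F, df_between, df_within).
    """
    all_values = [v for group in groups for v in group]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(group) / len(group) for group in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


def bonferroni_significant(p_values, alpha=0.05):
    """p_values: dict mapping a level pair to its unadjusted p-value.

    A pair counts as significant if its p-value is below alpha divided by
    the number of comparisons.
    """
    adjusted_alpha = alpha / len(p_values)
    return {pair: p < adjusted_alpha for pair, p in p_values.items()}


if __name__ == "__main__":
    # Toy feature values for the three grouped levels (not corpus data)
    print(one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]]))
    # Hypothetical unadjusted p-values for the three level pairs
    print(bonferroni_significant({(1, 2): 0.03, (2, 3): 0.001, (1, 3): 0.0001}))
```

With three groups there are three pairwise comparisons, so each is tested at 0.05/3, or about 0.0167. In the hypothetical example the pair (1, 2) would pass an uncorrected test at 0.05 but fails after correction, which is the kind of pattern reported for most sentence-length features. `scipy.stats.f_oneway` is a library routine that returns both F and its p-value.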
4 Conclusions and discussion
The current investigation highlighted a number of textual features automatically extracted from a morphologically and syntactically annotated Greek L2 corpus. With the aim of identifying indices of text difficulty that are directly associated with the proficiency level, we employed statistical analysis and put forward the best performing features. These can be regarded as potential predictors of the proficiency level of a previously unseen text in an automatic labelling/classification approach.
The results highlight the influence of syntactic features on the characterization of proficiency level: with the exception of average word length, the rest of the best performing features are directly or indirectly related to syntactic complexity. This finding is in line with previous research, where syntax-related features consistently appear in the best-performing prediction models (e.g. Pitler and Nenkova 2008; Schwarm and Ostendorf 2005; Callan and Eskenazi 2007; Kate et al. 2010; Kotani et al. 2008). The frequencies of the genitive case, of adjectives and of prepositions were additionally identified as successful discriminators. Surface features used in traditional readability formulas, such as sentence and word length, were also found to be significantly correlated to proficiency levels. Similar recent research in Greek has also highlighted the influence of such surface features on proficiency level classification (Tzimokas and Tantos 2014). It is interesting to notice that some of the features put forward by Georgatou (2016) as the most informative, i.e. sentence length, passive verbs and adjectives, are confirmed by the current study as well, thus qualifying them as reliable indices of the difficulty level of Greek texts.
When the best performing features were tested for their discriminatory power between all possible level pairs, they proved to be highly discriminative of the upper proficiency level. This finding implies a significant shift in L2 reading skills during the transition from C1 to C2 level, and this shift can successfully be measured by the features investigated herein. By contrast, the transition from A2 to B1 seems to go hand in hand with the acquisition of language skills not captured by the features that emerged from the current analysis.
It is true that the current investigation is subject to limitations imposed by the corpus at hand, which comprised texts drawn from textbooks of a single publisher. As such, the findings may be influenced by the publisher's choices regarding the types and topics of texts and the linguistic descriptors of proficiency levels the editor has adopted. To cater for this limitation, the work described herein is being continued and expanded in order to exploit a larger corpus of Greek L2 texts from different publishers. Proficiency level labelling for this expanded corpus does not rely exclusively on the publisher's labelling. Rather, three independent experts in Greek L2 teaching have judged each text to determine the CEFR proficiency level. The experts' judgements are treated as the dependent variable in a machine learning approach for the automatic labelling of previously unseen texts, which has already yielded significant results.
Reading comprehension is a key skill in L2 development and reading is an integral part of L2 instruction and assessment. In this view, an automated approach to matching L2 learners to texts suitable for their proficiency level is expected to facilitate the selection of reading material both for learners and teachers. It is at the same time an anticipated aid in assessment procedures, by providing an objective measurement for the estimation of level-appropriateness of items included in diagnostic, placement or achievement language tests.
References
Barzilay, Regina, and Mirella Lapata. 2008. "Modeling Local Coherence: An Entity-based Approach." Computational Linguistics 34(1): 1–34.
Centre for the Greek Language. 2013. "Logismiko Anagnosimotitas." Accessed March 1, 2017. http://www.greek-language.gr/certification/readability
Council of Europe. 2001. Common European Framework of Reference for Languages: Learning, Teaching, Assessment (CEFR). www.coe.int/lang-CEFR
Damanakis, Michalis, ed. 2004. Theoritiko Plaisio kai Programmata Spoudon gia tin Ellinoglossi Ekpaideusi sti Diaspora. Rethymno: EDIAMME. httpwwwediammeedcuocgrdiaspora2indexphpid=23650010
DuBay, William H. 2006. The Classic Readability Studies. Impact Information, Costa Mesa, California.
EDIAMME. 2014. Epipeda Glossomatheias kai Ekpaideutiko Yliko. httpwwwediammeedcuocgrellinoglossiindexphpelekp-yliko-kepa
François, Thomas, and Cédrick Fairon. 2012. "An 'AI readability' Formula for French as a Foreign Language." In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 466–477. Association for Computational Linguistics.
François, Thomas, and Eleni Miltsakaki. 2012. "Do NLP and Machine Learning Improve Traditional Readability Formulas?" In Proceedings of the First Workshop on Predicting and improving text readability for target reader populations, 49–57. Montréal.
Georgatou, Spyridoula. 2016. "Approaching Readability Features in Greek School Books." Master's thesis, University of Tübingen.
Giagkou, Maria, Vicky Kantzou, Spyridoula Stamouli, and Maria Tzevelekou. 2015. "Discriminating CEFR Levels in Greek L2: A Corpus-based Study of Young Learners' Written Narratives." Bergen Language and Linguistics Studies 6: 153–169.
Heilman, Michael, Kevyn Collins-Thompson, and Maxine Eskenazi. 2008. "An Analysis of Statistical Models and Features for Reading Difficulty Prediction." In The Third Workshop on Innovative Use of NLP for Building Educational Applications: Proceedings of the Workshop, 71–79. ACL.
Callan, Jamie, and Maxine Eskenazi. 2007. "Combining Lexical and Grammatical Features to Improve Readability Measures for First and Second Language Texts." In Proceedings of HLT-NAACL'07, 460–467. Association for Computational Linguistics.
Lu, Xiaofei. 2010. "Automatic Analysis of Syntactic Complexity in Second Language Writing." International Journal of Corpus Linguistics 15(4): 474–496.
Lu, Xiaofei. 2011. "A Corpus-Based Evaluation of Syntactic Complexity Measures as Indices of College-level ESL Writers' Language Development." TESOL Quarterly 45(1): 36–62.
Ott, Niels, and Detmar Meurers. 2010. "Information Retrieval for Education: Making Search Engines Language Aware." Themes in Science and Technology Education 3(1–2): 9–30.
Petersen, Sarah E., and Mari Ostendorf. 2009. "A Machine Learning Approach to Reading Level Assessment." Computer Speech and Language 23(1): 89–106.
Pilán, Ildikó, Elena Volodina, and Richard Johansson. 2014. "Rule-Based and Machine Learning Approaches for Second Language Sentence-level Readability." In Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications, 174–184. Association for Computational Linguistics.
Pitler, Emily, and Ani Nenkova. 2008. "Revisiting Readability: A Unified Framework for Predicting Text Quality." In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, 186–195. Honolulu: ACL.
Prokopidis, Prokopis, and Harris Papageorgiou. 2014. "Experiments for Dependency Parsing of Greek." In Proceedings of the First Joint Workshop on Statistical Parsing of Morphologically Rich Languages and Syntactic Analysis of Non-Canonical Languages, 90–96. Dublin, Ireland.
Prokopidis, Prokopis, Byron Georgantopoulos, and Harris Papageorgiou. 2011. "A suite of NLP tools for Greek." In The 10th International Conference of Greek Linguistics, 373–383. Komotini, Greece.
Schwarm, Sarah E., and Mari Ostendorf. 2005. "Reading Level Assessment Using Support Vector Machines and Statistical Language Models." In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), 523–530. Ann Arbor, Michigan.
Tzimokas, Dimitrios, and Sotirios Tantos. 2014. "Logismiko Anagnosimotitas Ellinikon Keimenon: Ena Neo Ergaleio gia ti Didaskalia tis Ellinikis os Ksenis/Deuteris Glossas kai tin Pistopoiisi Ellinomatheias." Paper presented at Diethnes Synedrio gia ti Didaskalia kai tin Pistopoiisi tis Ellinikis os Ksenis/Deuteris Glossas, Thessaloniki, October 25. httpspeakgreekgrelimagespdftzimwkaspdf
Vajjala, Sowmya, and Detmar Meurers. 2012. "On Improving the Accuracy of Readability Classification Using Insights from Second Language Acquisition." In Proceedings of the Seventh Workshop on Building Educational Applications Using NLP, 163–173. Association for Computational Linguistics.
Vyatkina, Nina. 2012. "The Development of Second Language Writing Complexity in Groups and Individuals: A Longitudinal Learner Corpus Study." The Modern Language Journal 96(4): 576–598.
360 | GIAGKoU ET Al
Table 1 | EDIAMME proficiency levels (Damanakis 2004 76) and their alignment to CEFR levels (EDIAMME 2014)
only prose texts were extracted from the textbooks while poems lyrics exercises and guidelines to the exercises were excluded The selected texts belong to different gen-res (mainly narrative descriptive expository and procedural) and types (letters an-nouncements instructions diary entry etc) Dialogues were also included as they are very frequently used as educational material in l2 textbooks though the rolename of the speaker was removed
The final corpus employed in this investigation comprises 753 texts and a total of 112169 tokens (Table 2) Each individual text inherited the proficiency level assigned to the textbook it was retrieved from eg a text drawn from a textbook labeled as level 5 was considered as addressed to level 5 learners1
grouped levels
ediAMMe levels
texts sentences tokens
1 (CEFR A1-A2)
1 24 136 720
2 295 4552 33636
2 (CEFR B1-C1)
3 108 1263 8780
4 147 2305 19272
3 (CEFR C2) 5 179 3356 49761totals 753 11612 112169
Table 2 | Corpus description
1 It should be noted that this decision imposes a degree of ldquonoiserdquo to the data as although a low level textbook is not expected to include a text addressed to higher levels the reverse is not equally unlikely Eg certain texts retrieved from a level 5 textbook can actually address lower level learners
16 105 Greek language
and literature C217 1118 12
FEATURE EXTRACTIoN AND ANAlYSIS IN GREEK | 361
The texts were automatically annotated for morphological types syntactic dependen-cies and phrase structure using the Institute for language and Speech Processing NlP tools pipeline (Prokopidis et al 2011 Prokopidis and Papageorgiou 2014)
22 Feature selection and computation
The set of features investigated as indices of the proficiency level was selected on the basis of previous research on l1 and l2 readability assessment as well as on second language acquisition and development These features capture morphological syntac-tic lexicalsemantic and other attributes of the text that are salient to the target profi-ciency level discrimination and prediction task
In total 303 text features were identified and computed These fall grossly into the following categories
a) surface features word and sentence length (eg average word length) num-ber of characters punctuation marks numbers etc
b) Lexicalsemantic lexical density (ie content to functional words) lexical var-iation (eg typetoken ratio hapaxdis-legomena) including noun and verb variation measures text entropy lexical richness etc
c) Morphological frequencies and ratios of the different parts of speech includ-ing their forms eg ratio of passive verbs to verbs ratio of nouns in the geni-tive case to nouns ratio of 1st person personal pronouns to pronouns etc
d) syntactic frequencies and ratios of the different syntactic roles (eg subjects to verbs ratio) measures of the dependency trees (eg depth and height of syn-tactic trees) phrase structure (eg length of noun verb and adjectival phras-es) subordination and apposition (eg average number of coordinating and subordinating conjunctions per sentence) etc
e) discourse-based features eg use of relative pronouns as an index of the degree of anaphora density frequency of present and past tenses as indices of temporality and narrativity etc
The defined features were computed with a specialized software the IlSP FeatExt tool developed in Python The input of FeatExt is any corpus of Greek texts automatically annotated for Part of Speech syntactic dependencies and phrase structure It calcu-lates the values of raw surface features (frequencies of words sentences nouns verbs
362 | GIAGKoU ET Al
etc) and computes their standardized values (ie meaningful ratios) In order to cater for zero values MinMaxScaler transformation is applied to all raw features The output is a table of extracted feature values preferably in CSv format Settings can be modi-fied through an optional configuration file to define among others the set of features to be computed the corpus location or additional feature-relevant data such as a list of words to be counted (eg functional words basic vocabulary for a specific proficiency level or topic etc)
3 Analysis and results
In order to investigate the underlying associations of text features with the profi-ciency level correlation analysis was applied between all the extracted features and the grouped proficiency levels Table 3 reports the twenty features that exhibited the highest absolute values of Spearmanrsquo s rho correlation coefficient in descending order (plt005)
Among the best performing features the average number of noun phrases in the genitive case per sentence was found to exhibit the highest correlation coefficient (rho=0542) The association of the genitive case with the textrsquo s level is also evidenced by the performance of two more features ie the average number of adjectival phras-es in the genitive case per sentence (rho=0473) and the average length of adjectival phrases in the gen case (rho=0448) Complementing and looking at these results from a different angle the influence of phrase structure especially of the length and relative frequency of nominal phrases is apparent out of the 20 best performing features six are indices of phrase structure (features in ranks 1 6 8 12 15 and 16 in Table 3) The frequency of use of modifiers namely of adjectives also seems to be highly correlated to the proficiency level the more adjectives used in a text the more likely it is that the text is addressed to higher level learners This is evidenced by the average number of adjectival phrases and of adjectives per sentence
Another important finding is highlighted by the performance of features that at-tempt to quantify syntactic dependencies These include the width and height of de-pendency trees (rho=0495 and 0486 respectively) as well as the number of leafs and governor nodes (rho=0490 and 0485 respectively) Their emergence in the top ranks of Table 3 qualifies them as key predictors of the proficiency level
FEATURE EXTRACTIoN AND ANAlYSIS IN GREEK | 363
Table 3 | Top-20 features highly correlated with EDIAMME grouped levels and post hoc multiple comparisons between level-pairs
feature spearmanrsquo s rho
ediAMMe grouped level-pairs1vs2 2vs3 1vs3
1 Av of Noun Phrases in gen case per sentence
0542
2 Av Width of dependency trees 0495
3 Av of leafs in dependency trees 0490
4 Av Height of dependency trees 0486
5 Av Sentence length 0485
6 Av of Adjectival Phrases per sentence 0485
7 Av of governor nodes in dependency trees 0485
8 Av of Noun Phrases per sentence 0480
9 of sentences with lengthgt20 words 0477
10 Av of Adjectives per sentence 0474
11 Av Word length 0474
12 Av of Adjectival Phrases in gen case per sentence
0473
13 of sentences with lengthgt10 words 0470
14 Terminal punctuation to total characters ratio
-0461
15 Av length of adjectival phrases in gen case
0448
16 Av of Adjectival Phrases in acc case per sentence
0446
17 of sentences with lengthgt30 words 0443
18 Av of Passive verbs per sentence 0442
19 Relative pronouns to Pronouns ratio 0439
20 Av of prepositions per sentence 0438
364 | GIAGKoU ET Al
Different aspects of syntactic complexity are also highlighted by the average number of passive verbs and prepositions per sentence As expected passive constructions are rarely used in lower levels while learners encounter them more and more frequently in textbooks as their reading skills develop The same is true for prepositions a feature that indicates that higher proficiency level texts employ more complex-compound sentences
The statistically significant correlation performed by the ratio of relative pronouns to pronouns (rho=0439) signifies the role of anaphora As anaphora resolution is considered a linguistically and cognitively demanding task during reading anaphoric structures are rare in lower levels but significantly more frequent in upper levels As a result the use of relative pronouns can be considered as a successful discriminator of proficiency levels
The list of the best performing features also includes some more ldquotraditionalrdquo indices of text complexity such as word and sentence length The average sentence length ap-pears in rank 5 in Table 3 (rho=0485) while relevant features that quantify sentence length from a different perspective are also present (the percentage of sentences with more than 10 20 and 30 words) Additionally the presence of the ratio of terminal punctuation to total characters should be also interpreted as an inverse to sentence length Regarding lexical features it is noticeable that among the various features in-vestigated (lexical diversity density etc) only the average word length is present in the top performers (rho=0474)
A more thorough investigation of the above features employed one-way ANovA for means comparison across levels which resulted in statistically significant main effects for all of the 20 features Since however this type of analysis cannot determine whether the mean values of a feature are statistically different between all possible level pairs post-hoc multiple comparisons (Bonferroni tests) were also applied The results are presented in Table 3 statistically different means for each feature are indicated for each level combination separately These comparisons indicate that all features can successfully discriminate group 3 (ie EDIAMME level 5 CEFR C2) from lower levels (both from group 2 and group 1) However some of the features were not as successful in discriminating group 1 (ie EDIAMME levels 1 and 2 CEFR A1 A2) from group 2 (ie EDIAMME levels 3 4 CEFR B1-C1) Poor performers in discriminating levels group 1 from group 2 were all the features relevant to sentence length with the excep-tion of the proportion of sentences with more than 20 words This implies that a group 1 text is unlikely to include lengthier sentences thus imposing a possible threshold for the transition from CEFR A2 to B1 level
FEATURE EXTRACTIoN AND ANAlYSIS IN GREEK | 365
4 conclusions and discussion
The current investigation highlighted a number of textual features automatically ex-tracted from a morphologically and syntactically annotated Greek l2 corpus With the aim of identifying indices of text difficulty that are directly associated with the proficiency level we employed statistical analysis and put forward the best perform-ing features These can be regarded as potential predictors of the proficiency level of a previously unseen text in an automatic labellingclassification approach
The results highlight the influence of syntactic features on the characterization of proficiency level with the exception of average word length the rest of the best per-forming features are directly or indirectly related to syntactic complexity This finding is in line with previous research where syntax-related features consistently appear in the best-performing prediction models (eg Pitler and Nenkova 2008 Schwarm and ostendorf 2005 Callan and Eskenazi 2007 Kate et al 2010 Kotani et al 2008) The frequencies of the genitive case of adjectives and prepositions were additionally iden-tified as successful discriminators Surface features used in traditional readability for-mulas such as sentence and word length were found to be significantly correlated to proficiency levels Similar recent research in Greek has also highlighted the influence of such surface features on proficiency level classification (Tzimokas and Tantos 2014) It is interesting to notice that some of the features put forward by Georgatou (2016) as the most informative ie sentence length passive verbs and adjectives are confirmed by the current study as well thus qualifying them as reliable of indices of Greek texts difficulty level
When the best performing features were tested for their discriminatory power be-tween all possible level pairs they proved to be highly discriminative of the upper proficiency level This finding implies a significant shift in l2 reading skills during the transition from C1 to C2 level and this shift can successfully be measured by the fea-tures investigated herein on the contrary the transition from A2 to B1 seems to go in hand with the acquisition of language skills not depicted in the features that emerged from the current analysis
It is true that the current investigation is subject to limitations imposed by the corpus at hand which comprised texts drawn from textbooks of a single publisher As such the findings may be influenced by the publisherrsquo s choices regarding the types and top-ics of texts and the linguistic descriptors of proficiency levels the editor has adopted To cater for this limitation the work described herein is continued and expanded in
366 | GIAGKoU ET Al
order to exploit a larger corpus of Greek l2 texts from different publishers Proficiency level labelling for this expanded corpus does not rely exclusively on the publisherrsquo s labelling Rather three independent experts in Greek l2 teaching have judged each text to determine the CEFR proficiency level The expertrsquo s judgements is treated as the dependent variable in a machine learning approach for the automatic labelling of previously unseen texts which has already yielded significant results
Reading comprehension is a key skill in l2 development and reading is an inte-gral part of l2 instruction and assessment In this view an automated approach to matching l2 learners to texts suitable for their proficiency level is expected to facilitate selection of reading material both for learners and teachers It is at the same time an anticipated aid in assessment procedures by providing an objective measurement for the estimation of level-appropriateness of items included in diagnostic placement or achievement language tests
references
Barzilay Regina and Mirella lapata 2008 ldquoModeling local Coherence An Entity-based Approachrdquo Computational Linguistics 34(1)1ndash34
Centre for the Greek language 2013 ldquologismiko Anagnosimotitasrdquo Accessed March 1 2017 httpwwwgreek-languagegrcertificationreadabi-lity
Council of Europe 2001 Common European Framework of Reference for Languages Learning Teaching Assessment (CEFR) wwwcoeintlang-CEFR
Damanakis Michalis ed 2004 Theoritiko Plaisio kai Programmata Spoudon gia tin Elli-noglossi Ekpaideusi sti Diaspora Rethymno EDIAMME httpwwwediammeedcuocgrdiaspora2indexphpid=23650010
DuBay William H 2006 The Classic Readability Studies Impact Information Costa Mesa California
EDIAMME 2014 Epipeda Glossomatheias kai Ekpaideutiko Yliko httpwwwediammeedcuocgrellinoglossiindexphpelekp-yliko-kepa
François, Thomas, and Cédrick Fairon. 2012. "An 'AI Readability' Formula for French as a Foreign Language." In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 466–477. Association for Computational Linguistics.
François, Thomas, and Eleni Miltsakaki. 2012. "Do NLP and Machine Learning Improve Traditional Readability Formulas?" In Proceedings of the First Workshop on Predicting and Improving Text Readability for Target Reader Populations, 49–57. Montréal.
Georgatou, Spyridoula. 2016. "Approaching Readability Features in Greek School Books." Master's thesis, University of Tübingen.
Giagkou, Maria, Vicky Kantzou, Spyridoula Stamouli, and Maria Tzevelekou. 2015. "Discriminating CEFR Levels in Greek L2: A Corpus-based Study of Young Learners' Written Narratives." Bergen Language and Linguistics Studies 6:153–169.
Heilman, Michael, Kevyn Collins-Thompson, and Maxine Eskenazi. 2008. "An Analysis of Statistical Models and Features for Reading Difficulty Prediction." In The Third Workshop on Innovative Use of NLP for Building Educational Applications: Proceedings of the Workshop, 71–79. ACL.
Callan, Jamie, and Maxine Eskenazi. 2007. "Combining Lexical and Grammatical Features to Improve Readability Measures for First and Second Language Texts." In Proceedings of HLT-NAACL'07, 460–467. Association for Computational Linguistics.
Lu, Xiaofei. 2010. "Automatic Analysis of Syntactic Complexity in Second Language Writing." International Journal of Corpus Linguistics 15(4):474–496.
Lu, Xiaofei. 2011. "A Corpus-Based Evaluation of Syntactic Complexity Measures as Indices of College-Level ESL Writers' Language Development." TESOL Quarterly 45(1):36–62.
Ott, Niels, and Detmar Meurers. 2010. "Information Retrieval for Education: Making Search Engines Language Aware." Themes in Science and Technology Education 3(1–2):9–30.
Petersen, Sarah E., and Mari Ostendorf. 2009. "A Machine Learning Approach to Reading Level Assessment." Computer Speech and Language 23(1):89–106.
Pilán, Ildikó, Elena Volodina, and Richard Johansson. 2014. "Rule-Based and Machine Learning Approaches for Second Language Sentence-Level Readability." In Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications, 174–184. Association for Computational Linguistics.
Pitler, Emily, and Ani Nenkova. 2008. "Revisiting Readability: A Unified Framework for Predicting Text Quality." In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, 186–195. Honolulu: ACL.
Prokopidis, Prokopis, and Harris Papageorgiou. 2014. "Experiments for Dependency Parsing of Greek." In Proceedings of the First Joint Workshop on Statistical Parsing of Morphologically Rich Languages and Syntactic Analysis of Non-Canonical Languages, 90–96. Dublin, Ireland.
Prokopidis, Prokopis, Byron Georgantopoulos, and Harris Papageorgiou. 2011. "A Suite of NLP Tools for Greek." In The 10th International Conference of Greek Linguistics, 373–383. Komotini, Greece.
Schwarm, Sarah E., and Mari Ostendorf. 2005. "Reading Level Assessment Using Support Vector Machines and Statistical Language Models." In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), 523–530. Ann Arbor, Michigan.
Tzimokas, Dimitrios, and Sotirios Tantos. 2014. "Logismiko Anagnosimotitas Ellinikon Keimenon: Ena Neo Ergaleio gia ti Didaskalia tis Ellinikis os Ksenis/Deuteris Glossas kai tin Pistopoiisi Ellinomatheias." Paper presented at Diethnes Synedrio gia ti Didaskalia kai tin Pistopoiisi tis Ellinikis os Ksenis/Deuteris Glossas, Thessaloniki, October 25. http://speakgreek.gr/el/images/pdf/tzimwkas.pdf
Vajjala, Sowmya, and Detmar Meurers. 2012. "On Improving the Accuracy of Readability Classification Using Insights from Second Language Acquisition." In Proceedings of the Seventh Workshop on Building Educational Applications Using NLP, 163–173. Association for Computational Linguistics.
Vyatkina, Nina. 2012. "The Development of Second Language Writing Complexity in Groups and Individuals: A Longitudinal Learner Corpus Study." The Modern Language Journal 96(4):576–598.
The texts were automatically annotated for morphological types, syntactic dependencies and phrase structure using the Institute for Language and Speech Processing NLP tools pipeline (Prokopidis et al. 2011; Prokopidis and Papageorgiou 2014).
2.2 Feature selection and computation
The set of features investigated as indices of the proficiency level was selected on the basis of previous research on L1 and L2 readability assessment, as well as on second language acquisition and development. These features capture morphological, syntactic, lexical/semantic and other attributes of the text that are salient to the target task of proficiency level discrimination and prediction.
In total, 303 text features were identified and computed. These fall roughly into the following categories:
a) Surface features: word and sentence length (e.g. average word length), number of characters, punctuation marks, numbers, etc.
b) Lexical/semantic: lexical density (i.e. the ratio of content to function words), lexical variation (e.g. type/token ratio, hapax/dis legomena), including noun and verb variation measures, text entropy, lexical richness, etc.
c) Morphological: frequencies and ratios of the different parts of speech, including their forms, e.g. ratio of passive verbs to verbs, ratio of nouns in the genitive case to nouns, ratio of 1st person personal pronouns to pronouns, etc.
d) Syntactic: frequencies and ratios of the different syntactic roles (e.g. subjects to verbs ratio), measures of the dependency trees (e.g. depth and height of syntactic trees), phrase structure (e.g. length of noun, verb and adjectival phrases), subordination and apposition (e.g. average number of coordinating and subordinating conjunctions per sentence), etc.
e) Discourse-based features: e.g. use of relative pronouns as an index of the degree of anaphora density, frequency of present and past tenses as indices of temporality and narrativity, etc.
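By way of illustration, a handful of the surface and lexical features listed above can be sketched in a few lines of Python. This is a minimal sketch under simplifying assumptions, not the FeatExt implementation: the function name, the regular-expression tokenizer and the naive sentence splitter are all hypothetical, whereas the features reported here were computed over automatically annotated texts.

```python
import re
from collections import Counter

# Assumption: sentences end at '.', '!', '?' or ';' (';' serves as the Greek question mark).
TERMINAL = ".!?;"


def surface_lexical_features(text):
    """Return a few illustrative surface and lexical features for one raw text."""
    sentences = [s for s in re.split(r"[.!?;]+", text) if s.strip()]
    tokens = re.findall(r"\w+", text.lower())  # \w is Unicode-aware, so Greek letters match
    if not tokens or not sentences:
        raise ValueError("text contains no tokens")

    counts = Counter(tokens)
    hapax = sum(1 for c in counts.values() if c == 1)  # word types occurring exactly once
    n_terminal = sum(text.count(ch) for ch in TERMINAL)

    return {
        "avg_word_length": sum(len(t) for t in tokens) / len(tokens),
        "avg_sentence_length": len(tokens) / len(sentences),
        "type_token_ratio": len(counts) / len(tokens),
        "hapax_ratio": hapax / len(tokens),
        "terminal_punct_to_chars": n_terminal / len(text),
    }


if __name__ == "__main__":
    sample = "Η γάτα κοιμάται. Η γάτα τρώει ψάρι."
    for name, value in surface_lexical_features(sample).items():
        print(f"{name}: {value:.3f}")
```

Morphological, syntactic and discourse-based features cannot be derived from raw text in this way; they presuppose the part-of-speech, dependency and phrase-structure annotation described earlier.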
The defined features were computed with specialized software, the ILSP FeatExt tool, developed in Python. The input of FeatExt is any corpus of Greek texts automatically annotated for part of speech, syntactic dependencies and phrase structure. It calculates the values of raw surface features (frequencies of words, sentences, nouns, verbs, etc.) and computes their standardized values (i.e. meaningful ratios). In order to cater for zero values, a MinMaxScaler transformation is applied to all raw features. The output is a table of extracted feature values, preferably in CSV format. Settings can be modified through an optional configuration file to define, among others, the set of features to be computed, the corpus location, or additional feature-relevant data such as a list of words to be counted (e.g. function words, basic vocabulary for a specific proficiency level or topic, etc.).
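The processing steps just described (raw counts, ratios, scaling, tabular output) can be pictured with the following sketch. It is not the FeatExt code: the function names and the dictionary keys (`n_verbs`, `n_passive_verbs`, `n_sentences`) are hypothetical, and the scaling is a plain-Python reimplementation of min-max scaling to the range [0, 1], on the assumption that this is what the MinMaxScaler transformation refers to (as in scikit-learn's default settings).

```python
import csv
import io


def min_max_scale(values):
    """Rescale values to [0, 1] as (x - min) / (max - min); constant columns map to 0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def build_feature_table(raw_counts):
    """Turn per-text raw counts (hypothetical keys) into ratios plus a scaled raw feature."""
    rows = []
    for rc in raw_counts:
        n_verbs = rc["n_verbs"]
        rows.append({
            "text_id": rc["text_id"],
            # Ratio feature; a text without verbs gets 0.0 rather than a division error.
            "passive_to_verbs": rc["n_passive_verbs"] / n_verbs if n_verbs else 0.0,
            "n_sentences": rc["n_sentences"],
        })
    scaled = min_max_scale([r["n_sentences"] for r in rows])
    for row, value in zip(rows, scaled):
        row["n_sentences_scaled"] = value
    return rows


def to_csv(rows):
    """Serialize the feature table as CSV text, one row per text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    counts = [
        {"text_id": "t1", "n_verbs": 10, "n_passive_verbs": 2, "n_sentences": 5},
        {"text_id": "t2", "n_verbs": 0, "n_passive_verbs": 0, "n_sentences": 15},
        {"text_id": "t3", "n_verbs": 8, "n_passive_verbs": 4, "n_sentences": 10},
    ]
    print(to_csv(build_feature_table(counts)))
```

Ratios make features comparable across texts of different length, while scaling puts raw counts on a common range; which features receive which treatment in the actual tool is governed by its configuration and is not reproduced here.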
3 Analysis and results
In order to investigate the underlying associations of text features with the proficiency level, correlation analysis was applied between all the extracted features and the grouped proficiency levels. Table 3 reports the twenty features that exhibited the highest absolute values of Spearman's rho correlation coefficient, in descending order (p < 0.05).
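For readers less familiar with the statistic, the ranking step can be sketched as follows. Spearman's rho is computed here as the Pearson correlation of average ranks, which handles the many ties that arise when the dependent variable has only three grouped levels. The function names and the toy data are hypothetical, and significance testing (the p < 0.05 criterion) is deliberately left out of this sketch.

```python
def average_ranks(values):
    """Assign 1-based ranks, giving tied values the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (vx * vy) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def rank_features(feature_values, levels):
    """Sort features by the absolute value of their rho with the level labels."""
    scored = [(name, spearman_rho(vals, levels)) for name, vals in feature_values.items()]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)


if __name__ == "__main__":
    levels = [1, 1, 2, 3, 3]  # toy grouped levels, one per text
    features = {
        "avg_sentence_length": [6.0, 7.5, 11.0, 16.0, 18.5],
        "terminal_punct_ratio": [0.031, 0.028, 0.019, 0.012, 0.010],
    }
    for name, rho in rank_features(features, levels):
        print(f"{name}: rho = {rho:.3f}")
```

Sorting by absolute value keeps inversely related features in the ranking, which is how a negatively correlated feature can appear among the top twenty.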
Among the best performing features, the average number of noun phrases in the genitive case per sentence was found to exhibit the highest correlation coefficient (rho = 0.542). The association of the genitive case with the text's level is also evidenced by the performance of two more features, i.e. the average number of adjectival phrases in the genitive case per sentence (rho = 0.473) and the average length of adjectival phrases in the genitive case (rho = 0.448). Looking at these results from a different angle, the influence of phrase structure, especially of the length and relative frequency of nominal phrases, is apparent: out of the 20 best performing features, six are indices of phrase structure (features in ranks 1, 6, 8, 12, 15 and 16 in Table 3). The frequency of use of modifiers, namely of adjectives, also seems to be highly correlated with the proficiency level: the more adjectives used in a text, the more likely it is that the text is addressed to higher level learners. This is evidenced by the average number of adjectival phrases and of adjectives per sentence.
Another important finding is highlighted by the performance of features that attempt to quantify syntactic dependencies. These include the width and height of dependency trees (rho = 0.495 and 0.486 respectively), as well as the number of leaves and governor nodes (rho = 0.490 and 0.485 respectively). Their emergence in the top ranks of Table 3 qualifies them as key predictors of the proficiency level.
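The four dependency-tree measures can be made concrete with a short sketch. The definitions used here are assumptions, since the text does not spell them out: height is taken as the number of levels on the longest path from the root, width as the largest number of nodes on any one level, leaves as tokens with no dependents, and governor nodes as tokens with at least one dependent. The input format (a list of 1-based head indices, with 0 marking the root, as in CoNLL-style annotation) and the function name are likewise hypothetical.

```python
from collections import defaultdict


def tree_metrics(heads):
    """Compute height, width, leaf count and governor count for one dependency tree.

    heads[i] is the 1-based index of the head of token i + 1; 0 marks the root.
    """
    children = defaultdict(list)
    for token, head in enumerate(heads, start=1):
        children[head].append(token)

    height = 0
    width = 0
    level = children[0]  # start from the root token(s)
    while level:
        height += 1
        width = max(width, len(level))
        # Move one level down: collect the dependents of every node on this level.
        level = [child for node in level for child in children[node]]

    governors = sum(1 for token in range(1, len(heads) + 1) if children[token])
    leaves = len(heads) - governors
    return {"height": height, "width": width, "leaves": leaves, "governors": governors}


if __name__ == "__main__":
    # "Η γάτα τρώει ψάρι": Η -> γάτα, γάτα -> τρώει, τρώει = root, ψάρι -> τρώει
    print(tree_metrics([2, 3, 0, 3]))
```

Averaging each measure over the sentences of a text would yield per-text features of the kind listed in Table 3.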
Table 3 | Top-20 features highly correlated with EDIAMME grouped levels, and post hoc multiple comparisons between level pairs

                                                                            EDIAMME grouped level pairs
Rank  Feature                                               Spearman's rho  1 vs 2   2 vs 3   1 vs 3
1     Av. no. of noun phrases in gen. case per sentence      0.542
2     Av. width of dependency trees                          0.495
3     Av. no. of leaves in dependency trees                  0.490
4     Av. height of dependency trees                         0.486
5     Av. sentence length                                    0.485
6     Av. no. of adjectival phrases per sentence             0.485
7     Av. no. of governor nodes in dependency trees          0.485
8     Av. no. of noun phrases per sentence                   0.480
9     % of sentences with length > 20 words                  0.477
10    Av. no. of adjectives per sentence                     0.474
11    Av. word length                                        0.474
12    Av. no. of adjectival phrases in gen. case per sentence  0.473
13    % of sentences with length > 10 words                  0.470
14    Terminal punctuation to total characters ratio        -0.461
15    Av. length of adjectival phrases in gen. case          0.448
16    Av. no. of adjectival phrases in acc. case per sentence  0.446
17    % of sentences with length > 30 words                  0.443
18    Av. no. of passive verbs per sentence                  0.442
19    Relative pronouns to pronouns ratio                    0.439
20    Av. no. of prepositions per sentence                   0.438
Different aspects of syntactic complexity are also highlighted by the average number of passive verbs and of prepositions per sentence. As expected, passive constructions are rarely used at lower levels, while learners encounter them more and more frequently in textbooks as their reading skills develop. The same is true for prepositions, a feature which indicates that higher proficiency level texts employ more complex-compound sentences.
The statistically significant correlation exhibited by the ratio of relative pronouns to pronouns (rho = 0.439) signifies the role of anaphora. As anaphora resolution is considered a linguistically and cognitively demanding task during reading, anaphoric structures are rare at lower levels but significantly more frequent at upper levels. As a result, the use of relative pronouns can be considered a successful discriminator of proficiency levels.
The list of the best performing features also includes some more "traditional" indices of text complexity, such as word and sentence length. The average sentence length appears in rank 5 in Table 3 (rho = 0.485), while related features that quantify sentence length from a different perspective are also present (the percentage of sentences with more than 10, 20 and 30 words). Additionally, the ratio of terminal punctuation to total characters (rho = -0.461) should also be interpreted as an inverse index of sentence length: the more terminal punctuation marks a text contains per character, the shorter its sentences. Regarding lexical features, it is noticeable that among the various features investigated (lexical diversity, density, etc.) only the average word length is present among the top performers (rho = 0.474).
A more thorough investigation of the above features employed one-way ANOVA for the comparison of means across levels, which resulted in statistically significant main effects for all 20 features. Since, however, this type of analysis cannot determine whether the mean values of a feature are statistically different between all possible level pairs, post hoc multiple comparisons (Bonferroni tests) were also applied. The results are presented in Table 3: statistically different means for each feature are indicated for each level combination separately. These comparisons indicate that all features can successfully discriminate group 3 (i.e. EDIAMME level 5, CEFR C2) from lower levels (both from group 2 and from group 1). However, some of the features were not as successful in discriminating group 1 (i.e. EDIAMME levels 1 and 2, CEFR A1, A2) from group 2 (i.e. EDIAMME levels 3 and 4, CEFR B1-C1). Poor performers in discriminating group 1 from group 2 were all the features relevant to sentence length, with the exception of the proportion of sentences with more than 20 words. This implies that a group 1 text is unlikely to include lengthier sentences, thus suggesting a possible threshold for the transition from CEFR A2 to B1 level.
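The logic of this two-step analysis can be outlined in code. The sketch below computes only the one-way ANOVA F statistic and the Bonferroni-adjusted significance threshold for the three level pairs; it does not compute p-values, which would require the F and t distributions from a statistics library. The function names and toy values are hypothetical, and the sketch does not reproduce the analysis reported here.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of numeric values."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]

    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of individual values around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)

    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison significance threshold under the Bonferroni correction."""
    return alpha / n_comparisons


if __name__ == "__main__":
    # Toy values of one feature for texts in grouped levels 1, 2 and 3.
    level_groups = [[8.0, 9.5, 10.0], [11.0, 12.5, 13.0], [17.0, 18.5, 20.0]]
    f_stat, df_b, df_w = one_way_anova_f(level_groups)
    print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
    # Three pairwise comparisons: 1 vs 2, 2 vs 3, 1 vs 3.
    print(f"Bonferroni threshold: {bonferroni_alpha(0.05, 3):.4f}")
```

With three level pairs, each pairwise test is judged against 0.05 / 3 rather than 0.05, which is what keeps the overall error rate of the post hoc comparisons under control.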
4 Conclusions and discussion
The current investigation highlighted a number of textual features automatically extracted from a morphologically and syntactically annotated Greek L2 corpus. With the aim of identifying indices of text difficulty that are directly associated with the proficiency level, we employed statistical analysis and put forward the best performing features. These can be regarded as potential predictors of the proficiency level of a previously unseen text in an automatic labelling/classification approach.
The results highlight the influence of syntactic features on the characterization of proficiency level: with the exception of average word length, the best performing features are all directly or indirectly related to syntactic complexity. This finding is in line with previous research, where syntax-related features consistently appear in the best-performing prediction models (e.g. Pitler and Nenkova 2008; Schwarm and Ostendorf 2005; Callan and Eskenazi 2007; Kate et al. 2010; Kotani et al. 2008). The frequencies of the genitive case, of adjectives and of prepositions were additionally identified as successful discriminators. Surface features used in traditional readability formulas, such as sentence and word length, were found to be significantly correlated with proficiency levels. Similar recent research on Greek has also highlighted the influence of such surface features on proficiency level classification (Tzimokas and Tantos 2014). It is interesting to note that some of the features put forward by Georgatou (2016) as the most informative, i.e. sentence length, passive verbs and adjectives, are confirmed by the current study as well, thus qualifying them as reliable indices of the difficulty level of Greek texts.
When the best performing features were tested for their discriminatory power between all possible level pairs, they proved to be highly discriminative of the upper proficiency level. This finding implies a significant shift in L2 reading skills during the transition from C1 to C2 level, a shift that can successfully be measured by the features investigated herein. By contrast, the transition from A2 to B1 seems to go hand in hand with the acquisition of language skills that are not captured by the features that emerged from the current analysis.
Callan Jamie and Maxine Eskenazi 2007 ldquoCombining lexical and Grammati-cal Features to Improve Readability Measures for First and Second language Textsrdquo In Proceedings of HLT-NAACLrsquo07 460ndash467 Association for Computational linguistics
lu Xiaofei 2010 ldquoAutomatic Analysis of Syntactic Complexity in Second lan-guage Writingrdquo International Journal of Corpus Linguistics 15(4)474ndash496
lu Xiaofei 2011 ldquoA Corpus-Based Evaluation of Syntactic Complexity Measures as Indices of College-level ESl Writersrsquo language Develop-mentrdquo TESOL Quarterly 45(1)36ndash62
ott Niels and Detmar Meurers 2010 ldquoInformation Retrieval for Education Making Search Engines language Awarerdquo Themes in Science and Tech-nology Education 3(1ndash2)9ndash30
Petersen Sarah E and Mari ostendorf 2009 ldquoA Machine learning Approach to Reading level Assessmentrdquo Computer Speech and Language 23(1)89ndash106
Pilaacuten Ildikoacute volodina Elena and Richard Johansson 2014 ldquoRule-Based and Ma-chine learning Approaches for Second language Sentence-level Readabilityrdquo In Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications 174ndash184 Association for Computational linguistics
Pitler Emily and Ani Nenkova 2008 ldquoRevisiting Readability A Unified Framework
368 | GIAGKoU ET Al
for Predicting Text Qualityrdquo In Proceedings of the 2008 Con-ference on Empirical Methods in Natural Language Processing 186ndash195 Honolulu ACl
Prokopidis Prokopis and Harris Papageorgiou 2014 ldquoExperiments for Dependency Parsing of Greekrdquo In Proceedings of the First Joint Workshop on Statistical Parsing of Morphologically Rich Languages and Syntactic Analysis of Non-Canonical Languages 90ndash96 Dub-lin Ireland
Prokopidis Prokopis Georgantopoulos Byron and Harris Papageorgiou 2011 ldquoA suite of NlP tools for Greekrdquo In The 10 th International Confer-ence of Greek Linguistics 373ndash383 Komotini Greece
Schwarm Sarah E and Mari ostendorf 2005 ldquoReading level Assessment Using Sup port vector Machines and Statistical language Modelsrdquo In Proceed-ings of the 43 rd Annual Meeting of the Association for Computa-tional Linguistics (ACLrsquo05) 523ndash530 Ann Arbor Michigan
Tzimokas Dimitrios and Sotirios Tantos 2014 ldquologismiko Anagnosimotitas Ellinikon Keimenon Ena Neo Ergaleio gia ti Didaskalia tis Ellinikis os KsenisDeuteris Glossas kai tin Pistopoiisi Ellinomatheiasrdquo Paper presented at Diethnes Synedrio gia ti Didaskalia kai tin Pistopoiisi tis Ellinikis os KsenisDeuteris Glossas Thessaloniki october 25 httpspeakgreekgrelimagespdftzimwkaspdf
vajjala Sowmya and Detmar Meurers 2012 ldquoon Improving the Accuracy of Reada-bility Classification Using Insights from Second language Ac-quisitionrdquo In Proceedings of the Seventh Workshop on Building Educational Applications Using NLP 163ndash173 Association for Computational linguistics
vyatkina Nina 2012 ldquoThe Development of Second language Writing Complexity in Groups and Individuals A longitudinal learner Corpus Studyrdquo The Modern Language Journal 96(4)576ndash598
FEATURE EXTRACTIoN AND ANAlYSIS IN GREEK | 363
Table 3 | Top-20 features highly correlated with EDIAMME grouped levels, and post hoc multiple comparisons between level pairs
(The per-feature marks for the EDIAMME grouped level pairs 1 vs 2, 2 vs 3 and 1 vs 3 are not recoverable from this copy; only the correlations are listed.)

Rank  Feature                                                      Spearman's rho
 1    Av. no. of noun phrases in genitive case per sentence         0.542
 2    Av. width of dependency trees                                 0.495
 3    Av. no. of leaves in dependency trees                         0.490
 4    Av. height of dependency trees                                0.486
 5    Av. sentence length                                           0.485
 6    Av. no. of adjectival phrases per sentence                    0.485
 7    Av. no. of governor nodes in dependency trees                 0.485
 8    Av. no. of noun phrases per sentence                          0.480
 9    % of sentences with length > 20 words                         0.477
10    Av. no. of adjectives per sentence                            0.474
11    Av. word length                                               0.474
12    Av. no. of adjectival phrases in genitive case per sentence   0.473
13    % of sentences with length > 10 words                         0.470
14    Terminal punctuation to total characters ratio               -0.461
15    Av. length of adjectival phrases in genitive case             0.448
16    Av. no. of adjectival phrases in accusative case per sentence 0.446
17    % of sentences with length > 30 words                         0.443
18    Av. no. of passive verbs per sentence                         0.442
19    Relative pronouns to pronouns ratio                           0.439
20    Av. no. of prepositions per sentence                          0.438
364 | GIAGKoU ET Al
Different aspects of syntactic complexity are also highlighted by the average number of passive verbs and prepositions per sentence. As expected, passive constructions are rarely used at lower levels, while learners encounter them more and more frequently in textbooks as their reading skills develop. The same is true for prepositions, a feature indicating that texts at higher proficiency levels employ more complex-compound sentences.
The statistically significant correlation of the ratio of relative pronouns to pronouns with level (rho = 0.439) signifies the role of anaphora. As anaphora resolution is considered a linguistically and cognitively demanding task during reading, anaphoric structures are rare at lower levels but significantly more frequent at upper levels. As a result, the use of relative pronouns can be considered a successful discriminator of proficiency levels.
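As a rough illustration of how a rank correlation such as the rho values in Table 3 can be computed, the following Python sketch implements Spearman's rho as the Pearson correlation of average ranks (so that tied values, which are unavoidable with only three grouped levels, are handled). The feature values and level codes are invented for the example and are not data from the study; the function names are hypothetical.

```python
from math import sqrt

def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed samples."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Hypothetical data: relative-pronoun-to-pronoun ratio per text and its grouped level (1-3).
ratio = [0.02, 0.05, 0.04, 0.10, 0.08, 0.15, 0.12, 0.20]
level = [1, 1, 1, 2, 2, 2, 3, 3]
print(round(spearman_rho(ratio, level), 3))
```

A positive value indicates that the feature tends to increase with level; a negative one, as for the terminal punctuation ratio in Table 3, indicates the opposite.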
The list of the best-performing features also includes some more “traditional” indices of text complexity, such as word and sentence length. Average sentence length appears at rank 5 in Table 3 (rho = 0.485), while related features that quantify sentence length from a different perspective are also present (the percentage of sentences with more than 10, 20 and 30 words). Additionally, the ratio of terminal punctuation to total characters should be interpreted as an inverse of sentence length. Regarding lexical features, it is noticeable that, among the various features investigated (lexical diversity, density, etc.), only average word length is present among the top performers (rho = 0.474).
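The surface indices mentioned above can be sketched in a few lines of Python. This is only an approximation under stated assumptions: sentences are split naively on terminal punctuation, words are runs of word characters, and “total characters” is taken to be the length of the raw text. The study itself worked on an annotated corpus, so its exact definitions may differ; the function and key names below are hypothetical.

```python
import re

TERMINAL = ".!?;"  # assumption: ';' is included because it is the Greek question mark

def surface_features(text):
    """Compute simple length-based indices for one text (naive tokenization)."""
    # Split after terminal punctuation that is followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?;])\s+", text.strip()) if s]
    # Words are runs of word characters; in Python 3 this also covers Greek script.
    tokenized = [re.findall(r"\w+", s) for s in sentences]
    tokenized = [words for words in tokenized if words]
    lengths = [len(words) for words in tokenized]
    all_words = [w for words in tokenized for w in words]
    n_sent = len(lengths)
    if n_sent == 0:
        raise ValueError("text contains no words")
    features = {
        "avg_sentence_length": sum(lengths) / n_sent,
        "avg_word_length": sum(len(w) for w in all_words) / len(all_words),
        "terminal_punct_ratio": sum(text.count(c) for c in TERMINAL) / len(text),
    }
    # Percentage of sentences longer than 10, 20 and 30 words.
    for threshold in (10, 20, 30):
        longer = sum(1 for n in lengths if n > threshold)
        features[f"pct_sentences_gt_{threshold}"] = 100 * longer / n_sent
    return features

sample = "Η Μαρία διαβάζει ένα βιβλίο. Πού πηγαίνει ο Γιάννης; Το σπίτι είναι μεγάλο!"
for name, value in surface_features(sample).items():
    print(name, round(value, 3))
```

Because every sentence contributes one terminal mark regardless of its length, the punctuation ratio falls as sentences grow longer, which is why it behaves as an inverse of sentence length.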
A more thorough investigation of the above features employed one-way ANOVA for the comparison of means across levels, which resulted in statistically significant main effects for all 20 features. Since, however, this type of analysis cannot determine whether the mean values of a feature differ significantly between all possible level pairs, post hoc multiple comparisons (Bonferroni tests) were also applied. The results are presented in Table 3: statistically different means for each feature are indicated for each level combination separately. These comparisons indicate that all features can successfully discriminate group 3 (i.e. EDIAMME level 5, CEFR C2) from the lower levels (both group 2 and group 1). However, some of the features were not as successful in discriminating group 1 (i.e. EDIAMME levels 1 and 2, CEFR A1 and A2) from group 2 (i.e. EDIAMME levels 3 and 4, CEFR B1–C1). Poor performers in discriminating group 1 from group 2 were all the features relevant to sentence length, with the exception of the proportion of sentences with more than 20 words. This implies that a group 1 text is unlikely to include lengthier sentences, thus suggesting a possible threshold for the transition from CEFR A2 to B1 level.
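The two-step procedure described above (an omnibus one-way ANOVA followed by Bonferroni-corrected pairwise comparisons) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the F statistic is computed from its textbook definition, the Bonferroni step simply multiplies each raw pairwise p-value by the number of comparisons, and both the feature values and the raw p-values in the example are made up. In practice the p-values would come from a statistics package.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of samples, one per level group."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

def bonferroni(p_values, alpha=0.05):
    """Map each comparison to (adjusted p, significant?) using p * m, capped at 1."""
    m = len(p_values)
    return {pair: (min(1.0, p * m), p * m < alpha) for pair, p in p_values.items()}

# Hypothetical average sentence lengths for texts in grouped levels 1, 2 and 3.
groups = [[8.0, 9.5, 10.0, 9.0], [10.5, 11.0, 12.5, 11.5], [17.0, 18.5, 20.0, 19.0]]
print(one_way_anova_f(groups))

# Hypothetical raw p-values for the three pairwise comparisons.
raw_p = {"1vs2": 0.04, "2vs3": 0.001, "1vs3": 0.0005}
for pair, (adjusted, significant) in bonferroni(raw_p).items():
    print(pair, round(adjusted, 4), significant)
```

With three level pairs, a raw p-value must fall below 0.05 / 3 to count as significant, which is how a feature can show a significant main effect overall and still fail to separate group 1 from group 2.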
4 Conclusions and discussion
The current investigation highlighted a number of textual features automatically extracted from a morphologically and syntactically annotated Greek L2 corpus. With the aim of identifying indices of text difficulty that are directly associated with proficiency level, we employed statistical analysis and put forward the best-performing features. These can be regarded as potential predictors of the proficiency level of a previously unseen text in an automatic labelling/classification approach.
The results highlight the influence of syntactic features on the characterization of proficiency level: with the exception of average word length, the rest of the best-performing features are directly or indirectly related to syntactic complexity. This finding is in line with previous research, where syntax-related features consistently appear in the best-performing prediction models (e.g. Pitler and Nenkova 2008; Schwarm and Ostendorf 2005; Callan and Eskenazi 2007; Kate et al. 2010; Kotani et al. 2008). The frequencies of the genitive case, adjectives and prepositions were additionally identified as successful discriminators. Surface features used in traditional readability formulas, such as sentence and word length, were found to be significantly correlated with proficiency levels. Similar recent research on Greek has also highlighted the influence of such surface features on proficiency level classification (Tzimokas and Tantos 2014). It is interesting to note that some of the features put forward by Georgatou (2016) as the most informative, i.e. sentence length, passive verbs and adjectives, are confirmed by the current study as well, thus qualifying them as reliable indices of the difficulty level of Greek texts.
When the best-performing features were tested for their discriminatory power between all possible level pairs, they proved to be highly discriminative of the upper proficiency level. This finding implies a significant shift in L2 reading skills during the transition from C1 to C2 level, and this shift can successfully be measured by the features investigated herein. On the contrary, the transition from A2 to B1 seems to go hand in hand with the acquisition of language skills not captured by the features that emerged from the current analysis.
It is true that the current investigation is subject to limitations imposed by the corpus at hand, which comprised texts drawn from textbooks of a single publisher. As such, the findings may be influenced by the publisher’s choices regarding the types and topics of texts and the linguistic descriptors of proficiency levels the editor has adopted. To cater for this limitation, the work described herein is being continued and expanded in order to exploit a larger corpus of Greek L2 texts from different publishers. Proficiency level labelling for this expanded corpus does not rely exclusively on the publisher’s labelling. Rather, three independent experts in Greek L2 teaching have judged each text to determine its CEFR proficiency level. The experts’ judgements are treated as the dependent variable in a machine learning approach for the automatic labelling of previously unseen texts, which has already yielded significant results.
Reading comprehension is a key skill in L2 development, and reading is an integral part of L2 instruction and assessment. In this view, an automated approach to matching L2 learners to texts suitable for their proficiency level is expected to facilitate the selection of reading material for both learners and teachers. At the same time, it is expected to aid assessment procedures by providing an objective measurement for estimating the level-appropriateness of items included in diagnostic, placement or achievement language tests.
References
Barzilay, Regina, and Mirella Lapata. 2008. “Modeling Local Coherence: An Entity-based Approach.” Computational Linguistics 34(1): 1–34.
Centre for the Greek Language. 2013. “Logismiko Anagnosimotitas.” Accessed March 1, 2017. httpwwwgreek-languagegrcertificationreadabi-lity
Council of Europe. 2001. Common European Framework of Reference for Languages: Learning, Teaching, Assessment (CEFR). www.coe.int/lang-CEFR
Damanakis, Michalis, ed. 2004. Theoritiko Plaisio kai Programmata Spoudon gia tin Ellinoglossi Ekpaideusi sti Diaspora. Rethymno: EDIAMME. httpwwwediammeedcuocgrdiaspora2indexphpid=23650010
DuBay, William H. 2006. The Classic Readability Studies. Costa Mesa, California: Impact Information.
EDIAMME. 2014. Epipeda Glossomatheias kai Ekpaideutiko Yliko. httpwwwediammeedcuocgrellinoglossiindexphpelekp-yliko-kepa
François, Thomas, and Cédrick Fairon. 2012. “An ‘AI readability’ Formula for French as a Foreign Language.” In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 466–477. Association for Computational Linguistics.
François, Thomas, and Eleni Miltsakaki. 2012. “Do NLP and Machine Learning Improve Traditional Readability Formulas?” In Proceedings of the First Workshop on Predicting and Improving Text Readability for Target Reader Populations, 49–57. Montréal.
Georgatou, Spyridoula. 2016. “Approaching Readability Features in Greek School Books.” Master’s thesis, University of Tübingen.
Giagkou, Maria, Vicky Kantzou, Spyridoula Stamouli, and Maria Tzevelekou. 2015. “Discriminating CEFR Levels in Greek L2: A Corpus-based Study of Young Learners’ Written Narratives.” Bergen Language and Linguistics Studies 6: 153–169.
Heilman, Michael, Kevyn Collins-Thompson, and Maxine Eskenazi. 2008. “An Analysis of Statistical Models and Features for Reading Difficulty Prediction.” In The Third Workshop on Innovative Use of NLP for Building Educational Applications: Proceedings of the Workshop, 71–79. ACL.
Callan, Jamie, and Maxine Eskenazi. 2007. “Combining Lexical and Grammatical Features to Improve Readability Measures for First and Second Language Texts.” In Proceedings of HLT-NAACL’07, 460–467. Association for Computational Linguistics.
Lu, Xiaofei. 2010. “Automatic Analysis of Syntactic Complexity in Second Language Writing.” International Journal of Corpus Linguistics 15(4): 474–496.
Lu, Xiaofei. 2011. “A Corpus-Based Evaluation of Syntactic Complexity Measures as Indices of College-Level ESL Writers’ Language Development.” TESOL Quarterly 45(1): 36–62.
Ott, Niels, and Detmar Meurers. 2010. “Information Retrieval for Education: Making Search Engines Language Aware.” Themes in Science and Technology Education 3(1–2): 9–30.
Petersen, Sarah E., and Mari Ostendorf. 2009. “A Machine Learning Approach to Reading Level Assessment.” Computer Speech and Language 23(1): 89–106.
Pilán, Ildikó, Elena Volodina, and Richard Johansson. 2014. “Rule-Based and Machine Learning Approaches for Second Language Sentence-Level Readability.” In Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications, 174–184. Association for Computational Linguistics.
Pitler, Emily, and Ani Nenkova. 2008. “Revisiting Readability: A Unified Framework for Predicting Text Quality.” In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, 186–195. Honolulu: ACL.
Prokopidis, Prokopis, and Harris Papageorgiou. 2014. “Experiments for Dependency Parsing of Greek.” In Proceedings of the First Joint Workshop on Statistical Parsing of Morphologically Rich Languages and Syntactic Analysis of Non-Canonical Languages, 90–96. Dublin, Ireland.
Prokopidis, Prokopis, Byron Georgantopoulos, and Harris Papageorgiou. 2011. “A Suite of NLP Tools for Greek.” In The 10th International Conference of Greek Linguistics, 373–383. Komotini, Greece.
Schwarm, Sarah E., and Mari Ostendorf. 2005. “Reading Level Assessment Using Support Vector Machines and Statistical Language Models.” In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), 523–530. Ann Arbor, Michigan.
Tzimokas, Dimitrios, and Sotirios Tantos. 2014. “Logismiko Anagnosimotitas Ellinikon Keimenon: Ena Neo Ergaleio gia ti Didaskalia tis Ellinikis os Ksenis/Deuteris Glossas kai tin Pistopoiisi Ellinomatheias.” Paper presented at Diethnes Synedrio gia ti Didaskalia kai tin Pistopoiisi tis Ellinikis os Ksenis/Deuteris Glossas, Thessaloniki, October 25. httpspeakgreekgrelimagespdftzimwkaspdf
Vajjala, Sowmya, and Detmar Meurers. 2012. “On Improving the Accuracy of Readability Classification Using Insights from Second Language Acquisition.” In Proceedings of the Seventh Workshop on Building Educational Applications Using NLP, 163–173. Association for Computational Linguistics.
Vyatkina, Nina. 2012. “The Development of Second Language Writing Complexity in Groups and Individuals: A Longitudinal Learner Corpus Study.” The Modern Language Journal 96(4): 576–598.
364 | GIAGKoU ET Al
Different aspects of syntactic complexity are also highlighted by the average number of passive verbs and prepositions per sentence As expected passive constructions are rarely used in lower levels while learners encounter them more and more frequently in textbooks as their reading skills develop The same is true for prepositions a feature that indicates that higher proficiency level texts employ more complex-compound sentences
The statistically significant correlation performed by the ratio of relative pronouns to pronouns (rho=0439) signifies the role of anaphora As anaphora resolution is considered a linguistically and cognitively demanding task during reading anaphoric structures are rare in lower levels but significantly more frequent in upper levels As a result the use of relative pronouns can be considered as a successful discriminator of proficiency levels
The list of the best performing features also includes some more ldquotraditionalrdquo indices of text complexity such as word and sentence length The average sentence length ap-pears in rank 5 in Table 3 (rho=0485) while relevant features that quantify sentence length from a different perspective are also present (the percentage of sentences with more than 10 20 and 30 words) Additionally the presence of the ratio of terminal punctuation to total characters should be also interpreted as an inverse to sentence length Regarding lexical features it is noticeable that among the various features in-vestigated (lexical diversity density etc) only the average word length is present in the top performers (rho=0474)
A more thorough investigation of the above features employed one-way ANovA for means comparison across levels which resulted in statistically significant main effects for all of the 20 features Since however this type of analysis cannot determine whether the mean values of a feature are statistically different between all possible level pairs post-hoc multiple comparisons (Bonferroni tests) were also applied The results are presented in Table 3 statistically different means for each feature are indicated for each level combination separately These comparisons indicate that all features can successfully discriminate group 3 (ie EDIAMME level 5 CEFR C2) from lower levels (both from group 2 and group 1) However some of the features were not as successful in discriminating group 1 (ie EDIAMME levels 1 and 2 CEFR A1 A2) from group 2 (ie EDIAMME levels 3 4 CEFR B1-C1) Poor performers in discriminating levels group 1 from group 2 were all the features relevant to sentence length with the excep-tion of the proportion of sentences with more than 20 words This implies that a group 1 text is unlikely to include lengthier sentences thus imposing a possible threshold for the transition from CEFR A2 to B1 level
FEATURE EXTRACTIoN AND ANAlYSIS IN GREEK | 365
4 conclusions and discussion
The current investigation highlighted a number of textual features automatically ex-tracted from a morphologically and syntactically annotated Greek l2 corpus With the aim of identifying indices of text difficulty that are directly associated with the proficiency level we employed statistical analysis and put forward the best perform-ing features These can be regarded as potential predictors of the proficiency level of a previously unseen text in an automatic labellingclassification approach
The results highlight the influence of syntactic features on the characterization of proficiency level with the exception of average word length the rest of the best per-forming features are directly or indirectly related to syntactic complexity This finding is in line with previous research where syntax-related features consistently appear in the best-performing prediction models (eg Pitler and Nenkova 2008 Schwarm and ostendorf 2005 Callan and Eskenazi 2007 Kate et al 2010 Kotani et al 2008) The frequencies of the genitive case of adjectives and prepositions were additionally iden-tified as successful discriminators Surface features used in traditional readability for-mulas such as sentence and word length were found to be significantly correlated to proficiency levels Similar recent research in Greek has also highlighted the influence of such surface features on proficiency level classification (Tzimokas and Tantos 2014) It is interesting to notice that some of the features put forward by Georgatou (2016) as the most informative ie sentence length passive verbs and adjectives are confirmed by the current study as well thus qualifying them as reliable of indices of Greek texts difficulty level
When the best performing features were tested for their discriminatory power be-tween all possible level pairs they proved to be highly discriminative of the upper proficiency level This finding implies a significant shift in l2 reading skills during the transition from C1 to C2 level and this shift can successfully be measured by the fea-tures investigated herein on the contrary the transition from A2 to B1 seems to go in hand with the acquisition of language skills not depicted in the features that emerged from the current analysis
It is true that the current investigation is subject to limitations imposed by the corpus at hand which comprised texts drawn from textbooks of a single publisher As such the findings may be influenced by the publisherrsquo s choices regarding the types and top-ics of texts and the linguistic descriptors of proficiency levels the editor has adopted To cater for this limitation the work described herein is continued and expanded in
366 | GIAGKoU ET Al
order to exploit a larger corpus of Greek l2 texts from different publishers Proficiency level labelling for this expanded corpus does not rely exclusively on the publisherrsquo s labelling Rather three independent experts in Greek l2 teaching have judged each text to determine the CEFR proficiency level The expertrsquo s judgements is treated as the dependent variable in a machine learning approach for the automatic labelling of previously unseen texts which has already yielded significant results
Reading comprehension is a key skill in l2 development and reading is an inte-gral part of l2 instruction and assessment In this view an automated approach to matching l2 learners to texts suitable for their proficiency level is expected to facilitate selection of reading material both for learners and teachers It is at the same time an anticipated aid in assessment procedures by providing an objective measurement for the estimation of level-appropriateness of items included in diagnostic placement or achievement language tests
references
Barzilay Regina and Mirella lapata 2008 ldquoModeling local Coherence An Entity-based Approachrdquo Computational Linguistics 34(1)1ndash34
Centre for the Greek language 2013 ldquologismiko Anagnosimotitasrdquo Accessed March 1 2017 httpwwwgreek-languagegrcertificationreadabi-lity
Council of Europe 2001 Common European Framework of Reference for Languages Learning Teaching Assessment (CEFR) wwwcoeintlang-CEFR
Damanakis Michalis ed 2004 Theoritiko Plaisio kai Programmata Spoudon gia tin Elli-noglossi Ekpaideusi sti Diaspora Rethymno EDIAMME httpwwwediammeedcuocgrdiaspora2indexphpid=23650010
DuBay William H 2006 The Classic Readability Studies Impact Information Costa Mesa California
EDIAMME 2014 Epipeda Glossomatheias kai Ekpaideutiko Yliko httpwwwediammeedcuocgrellinoglossiindexphpelekp-yliko-kepa
Franccedilois Thomas and Ceacutedrick Fairon 2012 ldquoAn ldquoAI readabilityrdquo Formula for French as a Foreign languagerdquo In Proceedings of the 2012 Joint Con-
FEATURE EXTRACTIoN AND ANAlYSIS IN GREEK | 367
ference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning 466ndash477 Association for Computational linguistics
Franccedilois Thomas and Eleni Miltsakaki 2012 ldquoDo NlP and Machine learning Im-prove Traditional Readability Formulasrdquo In Proceedings of the First Workshop on Predicting and improving text readabil-ity for target reader populations 49ndash57 Montreacuteal
Georgatou Spyridoula 2016 ldquoApproaching Readability Features in Greek School Booksrdquo Master thesis University of Tuumlbingen
Giagkou Maria Kantzou vicky Stamouli Spyridoula and Maria Tzevelekou 2015 ldquoDiscriminating CEFR levels in Greek l2 A Corpus-based Study of Young learnersrsquo Written Narrativesrdquo Bergen Lan-guage and Linguistics Studies 6153ndash169
Heilman Michael Kevyn Collins-Thompson and Maxine Eskenazi 2008 ldquoAn Ana-lysis of Statistical Models and Features for Reading Difficulty Predictionrdquo In The Third Workshop on Innovative Use of NLP for building Educational Applications Proceedings of the Work-shop 71ndash79 ACl
Callan Jamie and Maxine Eskenazi 2007 ldquoCombining lexical and Grammati-cal Features to Improve Readability Measures for First and Second language Textsrdquo In Proceedings of HLT-NAACLrsquo07 460ndash467 Association for Computational linguistics
lu Xiaofei 2010 ldquoAutomatic Analysis of Syntactic Complexity in Second lan-guage Writingrdquo International Journal of Corpus Linguistics 15(4)474ndash496
lu Xiaofei 2011 ldquoA Corpus-Based Evaluation of Syntactic Complexity Measures as Indices of College-level ESl Writersrsquo language Develop-mentrdquo TESOL Quarterly 45(1)36ndash62
ott Niels and Detmar Meurers 2010 ldquoInformation Retrieval for Education Making Search Engines language Awarerdquo Themes in Science and Tech-nology Education 3(1ndash2)9ndash30
Petersen Sarah E and Mari ostendorf 2009 ldquoA Machine learning Approach to Reading level Assessmentrdquo Computer Speech and Language 23(1)89ndash106
Pilaacuten Ildikoacute volodina Elena and Richard Johansson 2014 ldquoRule-Based and Ma-chine learning Approaches for Second language Sentence-level Readabilityrdquo In Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications 174ndash184 Association for Computational linguistics
Pitler Emily and Ani Nenkova 2008 ldquoRevisiting Readability A Unified Framework
368 | GIAGKoU ET Al
for Predicting Text Qualityrdquo In Proceedings of the 2008 Con-ference on Empirical Methods in Natural Language Processing 186ndash195 Honolulu ACl
Prokopidis Prokopis and Harris Papageorgiou 2014 ldquoExperiments for Dependency Parsing of Greekrdquo In Proceedings of the First Joint Workshop on Statistical Parsing of Morphologically Rich Languages and Syntactic Analysis of Non-Canonical Languages 90ndash96 Dub-lin Ireland
Prokopidis Prokopis Georgantopoulos Byron and Harris Papageorgiou 2011 ldquoA suite of NlP tools for Greekrdquo In The 10 th International Confer-ence of Greek Linguistics 373ndash383 Komotini Greece
Schwarm Sarah E and Mari ostendorf 2005 ldquoReading level Assessment Using Sup port vector Machines and Statistical language Modelsrdquo In Proceed-ings of the 43 rd Annual Meeting of the Association for Computa-tional Linguistics (ACLrsquo05) 523ndash530 Ann Arbor Michigan
Tzimokas Dimitrios and Sotirios Tantos 2014 ldquologismiko Anagnosimotitas Ellinikon Keimenon Ena Neo Ergaleio gia ti Didaskalia tis Ellinikis os KsenisDeuteris Glossas kai tin Pistopoiisi Ellinomatheiasrdquo Paper presented at Diethnes Synedrio gia ti Didaskalia kai tin Pistopoiisi tis Ellinikis os KsenisDeuteris Glossas Thessaloniki october 25 httpspeakgreekgrelimagespdftzimwkaspdf
vajjala Sowmya and Detmar Meurers 2012 ldquoon Improving the Accuracy of Reada-bility Classification Using Insights from Second language Ac-quisitionrdquo In Proceedings of the Seventh Workshop on Building Educational Applications Using NLP 163ndash173 Association for Computational linguistics
vyatkina Nina 2012 ldquoThe Development of Second language Writing Complexity in Groups and Individuals A longitudinal learner Corpus Studyrdquo The Modern Language Journal 96(4)576ndash598
FEATURE EXTRACTIoN AND ANAlYSIS IN GREEK | 365
4 conclusions and discussion
The current investigation highlighted a number of textual features automatically ex-tracted from a morphologically and syntactically annotated Greek l2 corpus With the aim of identifying indices of text difficulty that are directly associated with the proficiency level we employed statistical analysis and put forward the best perform-ing features These can be regarded as potential predictors of the proficiency level of a previously unseen text in an automatic labellingclassification approach
The results highlight the influence of syntactic features on the characterization of proficiency level with the exception of average word length the rest of the best per-forming features are directly or indirectly related to syntactic complexity This finding is in line with previous research where syntax-related features consistently appear in the best-performing prediction models (eg Pitler and Nenkova 2008 Schwarm and ostendorf 2005 Callan and Eskenazi 2007 Kate et al 2010 Kotani et al 2008) The frequencies of the genitive case of adjectives and prepositions were additionally iden-tified as successful discriminators Surface features used in traditional readability for-mulas such as sentence and word length were found to be significantly correlated to proficiency levels Similar recent research in Greek has also highlighted the influence of such surface features on proficiency level classification (Tzimokas and Tantos 2014) It is interesting to notice that some of the features put forward by Georgatou (2016) as the most informative ie sentence length passive verbs and adjectives are confirmed by the current study as well thus qualifying them as reliable of indices of Greek texts difficulty level
When the best performing features were tested for their discriminatory power be-tween all possible level pairs they proved to be highly discriminative of the upper proficiency level This finding implies a significant shift in l2 reading skills during the transition from C1 to C2 level and this shift can successfully be measured by the fea-tures investigated herein on the contrary the transition from A2 to B1 seems to go in hand with the acquisition of language skills not depicted in the features that emerged from the current analysis
It is true that the current investigation is subject to limitations imposed by the corpus at hand, which comprised texts drawn from textbooks of a single publisher. As such, the findings may be influenced by the publisher’s choices regarding the types and topics of texts and the linguistic descriptors of proficiency levels the editor has adopted. To cater for this limitation, the work described herein is being continued and expanded in order to exploit a larger corpus of Greek L2 texts from different publishers. Proficiency level labelling for this expanded corpus does not rely exclusively on the publishers’ labelling. Rather, three independent experts in Greek L2 teaching have judged each text to determine the CEFR proficiency level. The experts’ judgements are treated as the dependent variable in a machine learning approach for the automatic labelling of previously unseen texts, which has already yielded significant results.
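How the three expert judgements are combined into a single label is not specified here. As a minimal sketch under that caveat, one simple option is to take the median CEFR level of the three judgements, which coincides with the majority label whenever two experts agree. The text identifiers, the labels and the function name `consensus_level` are hypothetical.

```python
# CEFR levels in ascending order, used to treat labels as ordinal values.
CEFR_ORDER = ["A1", "A2", "B1", "B2", "C1", "C2"]


def consensus_level(judgements):
    """Return the median CEFR label of an odd number of expert judgements."""
    ranks = sorted(CEFR_ORDER.index(label) for label in judgements)
    return CEFR_ORDER[ranks[len(ranks) // 2]]


# Hypothetical texts, each judged independently by three experts.
expert_labels = {
    "text_01": ["B1", "B1", "B2"],
    "text_02": ["C1", "C2", "C2"],
    "text_03": ["A2", "B1", "B2"],
}

# The consensus label would then serve as the dependent variable for a classifier.
targets = {
    text_id: consensus_level(labels) for text_id, labels in expert_labels.items()
}
print(targets)
```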
Reading comprehension is a key skill in L2 development, and reading is an integral part of L2 instruction and assessment. In this view, an automated approach to matching L2 learners to texts suitable for their proficiency level is expected to facilitate selection of reading material both for learners and teachers. It is at the same time an anticipated aid in assessment procedures, by providing an objective measurement for the estimation of level-appropriateness of items included in diagnostic, placement or achievement language tests.
References
Barzilay, Regina, and Mirella Lapata. 2008. “Modeling Local Coherence: An Entity-based Approach.” Computational Linguistics 34(1):1–34.
Centre for the Greek Language. 2013. “Logismiko Anagnosimotitas.” Accessed March 1, 2017. http://www.greek-language.gr/certification/readability
Council of Europe. 2001. Common European Framework of Reference for Languages: Learning, Teaching, Assessment (CEFR). www.coe.int/lang-CEFR
Damanakis, Michalis, ed. 2004. Theoritiko Plaisio kai Programmata Spoudon gia tin Ellinoglossi Ekpaideusi sti Diaspora. Rethymno: EDIAMME. http://www.ediamme.edc.uoc.gr/diaspora2/index.php?id=23650010
DuBay, William H. 2006. The Classic Readability Studies. Costa Mesa, California: Impact Information.
EDIAMME. 2014. Epipeda Glossomatheias kai Ekpaideutiko Yliko. http://www.ediamme.edc.uoc.gr/ellinoglossi/index.php/el/ekp-yliko-kepa
François, Thomas, and Cédrick Fairon. 2012. “An ‘AI Readability’ Formula for French as a Foreign Language.” In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 466–477. Association for Computational Linguistics.
François, Thomas, and Eleni Miltsakaki. 2012. “Do NLP and Machine Learning Improve Traditional Readability Formulas?” In Proceedings of the First Workshop on Predicting and Improving Text Readability for Target Reader Populations, 49–57. Montréal.
Georgatou, Spyridoula. 2016. “Approaching Readability Features in Greek School Books.” Master’s thesis, University of Tübingen.
Giagkou, Maria, Vicky Kantzou, Spyridoula Stamouli, and Maria Tzevelekou. 2015. “Discriminating CEFR Levels in Greek L2: A Corpus-based Study of Young Learners’ Written Narratives.” Bergen Language and Linguistics Studies 6:153–169.
Heilman, Michael, Kevyn Collins-Thompson, and Maxine Eskenazi. 2008. “An Analysis of Statistical Models and Features for Reading Difficulty Prediction.” In The Third Workshop on Innovative Use of NLP for Building Educational Applications: Proceedings of the Workshop, 71–79. ACL.
Callan, Jamie, and Maxine Eskenazi. 2007. “Combining Lexical and Grammatical Features to Improve Readability Measures for First and Second Language Texts.” In Proceedings of HLT-NAACL’07, 460–467. Association for Computational Linguistics.
Lu, Xiaofei. 2010. “Automatic Analysis of Syntactic Complexity in Second Language Writing.” International Journal of Corpus Linguistics 15(4):474–496.
Lu, Xiaofei. 2011. “A Corpus-Based Evaluation of Syntactic Complexity Measures as Indices of College-Level ESL Writers’ Language Development.” TESOL Quarterly 45(1):36–62.
Ott, Niels, and Detmar Meurers. 2010. “Information Retrieval for Education: Making Search Engines Language Aware.” Themes in Science and Technology Education 3(1–2):9–30.
Petersen, Sarah E., and Mari Ostendorf. 2009. “A Machine Learning Approach to Reading Level Assessment.” Computer Speech and Language 23(1):89–106.
Pilán, Ildikó, Elena Volodina, and Richard Johansson. 2014. “Rule-Based and Machine Learning Approaches for Second Language Sentence-Level Readability.” In Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications, 174–184. Association for Computational Linguistics.
Pitler, Emily, and Ani Nenkova. 2008. “Revisiting Readability: A Unified Framework for Predicting Text Quality.” In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, 186–195. Honolulu: ACL.
Prokopidis, Prokopis, and Harris Papageorgiou. 2014. “Experiments for Dependency Parsing of Greek.” In Proceedings of the First Joint Workshop on Statistical Parsing of Morphologically Rich Languages and Syntactic Analysis of Non-Canonical Languages, 90–96. Dublin, Ireland.
Prokopidis, Prokopis, Byron Georgantopoulos, and Harris Papageorgiou. 2011. “A Suite of NLP Tools for Greek.” In The 10th International Conference of Greek Linguistics, 373–383. Komotini, Greece.
Schwarm, Sarah E., and Mari Ostendorf. 2005. “Reading Level Assessment Using Support Vector Machines and Statistical Language Models.” In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), 523–530. Ann Arbor, Michigan.
Tzimokas, Dimitrios, and Sotirios Tantos. 2014. “Logismiko Anagnosimotitas Ellinikon Keimenon: Ena Neo Ergaleio gia ti Didaskalia tis Ellinikis os Ksenis/Deuteris Glossas kai tin Pistopoiisi Ellinomatheias.” Paper presented at Diethnes Synedrio gia ti Didaskalia kai tin Pistopoiisi tis Ellinikis os Ksenis/Deuteris Glossas, Thessaloniki, October 25. http://speakgreek.gr/el/images/pdf/tzimwkas.pdf
Vajjala, Sowmya, and Detmar Meurers. 2012. “On Improving the Accuracy of Readability Classification Using Insights from Second Language Acquisition.” In Proceedings of the Seventh Workshop on Building Educational Applications Using NLP, 163–173. Association for Computational Linguistics.
Vyatkina, Nina. 2012. “The Development of Second Language Writing Complexity in Groups and Individuals: A Longitudinal Learner Corpus Study.” The Modern Language Journal 96(4):576–598.