Date post: 13-Apr-2017
Upload: scottish-language-dictionaries
Page 1: Dr. Iztok Kosem - Innovations in Slovenian (e-)lexicography: from (semi-)automatic data extraction to crowdsourcing and beyond

Innovations in Slovenian (e-)lexicography: from (semi-)automatic data extraction to crowdsourcing and beyond

Dr Iztok Kosem

Faculty of Arts, University of Ljubljana & Centre for Applied Linguistics, Trojina Institute

Page 2

Lexicographical process (Klosa, 2013)

Page 3

Born-digital dictionaries

• ANW (Dictionary of Contemporary Dutch)
  • 51,079 entries (incl. partly complete entries)
  • Innovative features (e.g. semagrams)

• Great Dictionary of Polish
  • A great deal of manual work included (Zmigrodzki 2014)
  • Immediate release of final entries
  • 15,000 entries in 5 years (not many examples!)

• Estonian collocations dictionary (Kallas et al. 2015)
  • Starting point: automatically extracted data
  • Problems: examples extracted using a very general configuration; missing collocation clustering etc.
  • Publication of the entire dictionary at the end

Page 4

Dictionary situation in Slovenia

• Last comprehensive dictionary of Slovene published in 1991 (with many entries older, from the 70s and 80s)
  • Based on material from the late 19th century to the 1970s
  • Dictionary database not accessible (also question marks about its usefulness)

• Second edition published in 2014
  • Minor updates to the first edition (also opposing the conceptual framework of the first version; Krek 2014; Ahlin et al. 2014)
  • Online version requires a purchase of a printed version
  • Database is not available

• Dictionary publishing in general:
  • Commercial publishers closing dictionary departments (no new projects)
  • General monolingual projects publicly funded

Page 5

Dictionary of Contemporary Slovene Language

• Challenges:
  • Compiling a corpus-based dictionary from scratch, using state-of-the-art lexicographic methods and theoretical underpinnings
  • Meeting the needs of dictionary users (digital natives)
  • Meeting the needs of NLP and language technology communities

• Communication in Slovene (2008-2013)
  • Gigafida corpus (1.2 billion words)
  • New POS-tagger, parser and lexicon of word forms

• Slovene Lexical Database (Gantar et al. 2016)
  • Testing new methods and approaches

Page 6

Lexicography and automation

• Which parts of a dictionary entry can be (semi-)automatically extracted:
  • List of words (e.g. terms)
  • New words (Cook et al. 2013)
  • Definitions (e.g. Pearson 1998; Pollak 2014)
  • Some types of labels (Rundell & Kilgarriff 2011)
  • Grammatical relations, collocations, multi-word expressions (PARSEME COST Action)
  • Corpus examples (Kosem et al. 2013; Gantar et al. 2016; Cook et al. 2014)

Page 7
Page 8

authority ("manual" Sketch Grammar): 35 gramrels

authority (automatic Sketch Grammar): 39 gramrels; 19 gramrels with 92 multi-word links (separate page)

Page 9

“it is more efficient to edit out the computer’s errors than to go through the whole data-selection process from the beginning”

(Rundell & Kilgarriff, 2011)

“too many choices early in the data-selection process leave more room for error”

(Kosem, Gantar & Krek, 2013)

Page 10

Main (unproven) criticisms

• Automatic tools cannot replace lexicographers

• Important information can be missed

• Analysis is not as detailed and reliable as with the manual approach

• Etc.

• Evaluation (Kosem et al. 2015)

Page 11

SLD entries   coverage of syntactic structures   coverage of collocates under structures
nouns         82.40%                             72.79%
adjectives    94.33%                             75.80%
adverbs       92.78%                             78.32%
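
One way to read the coverage figures above is to treat each manually compiled SLD entry as the reference and measure what share of its items also appears in the automatically extracted data. The sketch below assumes that reading; the function name and the sample structures are hypothetical and are not taken from the evaluation in Kosem et al. (2015).

```python
def coverage(manual_items, extracted_items):
    """Share of manually recorded items that also occur in the extracted data."""
    manual = set(manual_items)
    if not manual:
        return 1.0  # nothing to cover, so coverage is trivially complete
    return len(manual & set(extracted_items)) / len(manual)


# Hypothetical entry: structures recorded by lexicographers vs. by extraction
manual_structures = {"adj + noun", "noun + noun-gen", "verb + noun-acc", "noun + prep + noun"}
extracted_structures = {"adj + noun", "noun + noun-gen", "verb + noun-acc", "noun + and + noun"}

structure_coverage = coverage(manual_structures, extracted_structures)
print(f"{structure_coverage:.2%}")
```

Under this definition, items that only the automatic extraction finds do not lower the score; they are extra material rather than misses.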

Page 12

• 100% coverage of all collocates:
  • 12% of noun entries
  • 8.4% of verb entries
  • 16.4% of adjective entries
  • 25% of adverb entries

• 100% coverage of collocates under syntactic structures:
  • 9.7% of noun entries
  • 18.5% of adjective entries
  • 22.5% of adverb entries

• 100% coverage of syntactic structures:
  • 35.4% of noun entries
  • 81.1% of adjective entries
  • 82.5% of adverb entries

Page 13

Why not always 100%?

11.8.2015 Herstmonceux castle, eLex 2015

• Errors in SLD – a small amount (e.g. typos, wrong case of collocate under a certain syntactic structure)

• Different corpora and sketch grammars used

• Parameters for automatic extraction quite strict
  • E.g. a structure is not exported if no collocates match the minimum criteria, so the structure is marked as not found by ADE

• On the other hand:
  • Five to six times more collocates extracted
  • Several syntactic structures in automatically extracted data which were not detected by lexicographers
  • Several (good) examples match (more examples analysed)
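
The effect of strict extraction parameters can be illustrated with a small sketch. It assumes that each candidate collocate carries a frequency and an association score, and that a structure is exported only if at least one collocate passes both minimums. The threshold values, field names and data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Collocate:
    word: str
    frequency: int
    score: float  # association score, assumed to come from the corpus tool


def export_structures(candidates, min_frequency=5, min_score=3.0):
    """Keep collocates that meet both minimums; drop structures left empty."""
    exported = {}
    for structure, collocates in candidates.items():
        kept = [c for c in collocates
                if c.frequency >= min_frequency and c.score >= min_score]
        if kept:  # a structure with no qualifying collocates is not exported
            exported[structure] = kept
    return exported


candidates = {
    "adj + noun": [Collocate("local", 120, 7.2), Collocate("odd", 2, 1.1)],
    "noun + prep + noun": [Collocate("matter", 3, 2.4)],
}
exported = export_structures(candidates)
```

Here the second structure disappears from the output even though it occurs in the corpus, which is how a structure can end up marked as not found.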

Page 14

Post-processing

• Tasks that are automated:
  • Converting extracted data into the correct form (lemma + collocate)
  • Removing duplicate examples
  • Cleaning examples of noise (e.g. removing any extra spaces before full stops and commas)
  • Assigning IDs of lemmas from the lexicon of word forms

• Other issues:
  • False collocates (e.g. tagging problems)
  • Incorrect examples (i.e. where the collocation does not match the grammatical relation it belongs to)
  • Grouping collocates, attributing them under senses, etc.
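
Two of the automated tasks listed above, cleaning noise from examples and removing duplicates, can be sketched in a few lines. This is a minimal illustration, not the project's actual pipeline; the function names and sample sentences are hypothetical, and duplicates are assumed to be exact string matches after cleaning.

```python
import re


def clean_example(text):
    """Remove spaces before full stops and commas and collapse repeated spaces."""
    text = re.sub(r"\s+([.,])", r"\1", text)
    return re.sub(r"\s{2,}", " ", text).strip()


def deduplicate(examples):
    """Drop repeated examples, keeping the first occurrence and the original order."""
    seen = set()
    unique = []
    for example in examples:
        if example not in seen:
            seen.add(example)
            unique.append(example)
    return unique


raw = [
    "The local authority approved the plan .",
    "The local authority approved the plan.",
    "She spoke with great  authority , as always .",
]
cleaned = deduplicate([clean_example(e) for e in raw])
```

Cleaning before deduplication matters here: the first two sentences only become identical once the stray space before the full stop is removed.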

Page 15

"Crowdsourcing" in lexicography: (improving) the final product

(Abel & Meyer, 2013)

Page 16

Crowdsourcing – dividing a complex task into a series of simple ones

• Why is crowdsourcing needed in lexicography:

• challenges:
  • lexicographers are facing increasing time constraints & amounts of data
  • lexicographers are overqualified for routine post-editing of automatic procedures

• potential:
  • non-expert individuals are talented, creative & productive enough to solve such tasks
  • modern technology makes using the potential of the crowd simple, affordable & effective

Page 17

Crowdsourcing - caveats

• estimate of the required investment with respect to time, money & personnel is crucial (should not take up more time & resources than conventional methods)
  • if fully integrated in the project, microtasks can be designed according to the same principles, use the same pre- & post-processing chains & platforms (economizing the initial investment)

Page 18

Lessons learned

• Instructions must be clearly formulated and simple; answers must not allow grading (only YES, NO, I DON’T KNOW)

• not all automatically extracted data is suitable for crowdsourcing:
  • e.g. some grammatical relations are too complex for evaluation

• users need to focus on some other objective: competition, credits, money (micro payments)

• Gamification:
  • examples: language games such as ESP Game (von Ahn, 2006) and Phrase Detectives (Chamberlain et al., 2008)
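
A closed answer set makes crowd responses easy to aggregate. The sketch below assumes a simple majority rule with a minimum agreement threshold, which is an illustrative choice rather than the procedure used in the project; the function name and the threshold are hypothetical.

```python
from collections import Counter

ANSWERS = {"YES", "NO", "I DON'T KNOW"}


def aggregate(votes, min_agreement=0.6):
    """Return the majority answer, or None if agreement is too low."""
    invalid = set(votes) - ANSWERS
    if invalid:
        raise ValueError(f"unexpected answers: {sorted(invalid)}")
    # "I DON'T KNOW" never wins, but it still counts towards the total
    decisive = [v for v in votes if v != "I DON'T KNOW"]
    if not decisive:
        return None
    answer, count = Counter(decisive).most_common(1)[0]
    return answer if count / len(votes) >= min_agreement else None


accepted = aggregate(["YES", "YES", "YES", "NO"])
undecided = aggregate(["YES", "NO"])
```

Items that come back as `None` would be natural candidates for review by a lexicographer instead of the crowd.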

Page 19

Lexicographical process of DCSL

Page 20
Page 21

DCSL – implementation and future

• Meeting the needs of users

• Release of entries at each stage (thus, the dictionary is available from the start)

• Making the database available to the NLP community, researchers etc.

• A parallel project for testing and improving the first stages of the procedure: Collocations dictionary of Slovene

Page 22

Thank you!

• Funded by the Slovenian Research Agency project Koncept madžarsko-slovenskega slovarja: od jezikovnega vira do uporabnika (Concept of a Hungarian-Slovene dictionary: from language resource to user; V6-1509)

