Data in computational historical linguistics

Gerhard Jäger

ESSLLI 2016

Gerhard Jäger Data sources ESSLLI 2016 1 / 25

Background

- the comparative method focuses strongly on two types of data:
  - morphological paradigms
  - regular sound correspondences

- neither is well suited to computational approaches, because
  - morphological categories are not easily comparable across languages, especially if we look beyond individual language families
  - also, isolating languages have no morphology
  - identifying regular sound correspondences automatically is a surprisingly hard problem, due to data sparseness
  - this is currently one of the hot topics and far from resolved (List, 2014; Hruschka et al., 2015; Bouchard-Côté et al., 2013)

- what we need (especially if we apply statistical methods):
  - data types which are applicable to all natural languages
  - ideally lots of data

- current practice:
  - word lists + expert annotations about cognacy (currently the dominant paradigm)
  - unannotated word lists in phonetic transcription
  - discrete grammatical categorizations (compiled by human experts)

Cognate-coded Swadesh lists

Swadesh lists

- collections of 100–200 concepts (there are different versions)
- core vocabulary:
  - not culture dependent
  - diachronically stable, i.e. resistant against both semantic change and borrowing
- proposed by Morris Swadesh (Swadesh, 1955, 1971) to facilitate an early attempt to automate certain tasks in historical linguistics
- popular among computational historical linguists because it is a standard
- see List (2016) for a thoughtful discussion of the notion of cognacy

Cognates

- cognates are words that have the same origin
  - Latin filius ⇒ French fils, Italian figlio
- traditionally, cognacy excludes loanwords, but terminology among computationalists is sometimes less strict:
  - Latin persona ⇒ English person would also qualify as a cognate pair
- on average, the more closely two languages are related, the more cognate pairs they share

- during language change, the word for a given concept is sometimes replaced by a non-cognate one
- causes: semantic change, borrowing, morphological word formation
  - ‘bone’: Old High German Bein (cognate to Engl. bone) ⇒ New High German Knochen
  - Bein is still part of the German lexicon, but it now means ‘leg’
- cognate replacement is comparable to a mutation in biological evolution

Caveats

- cognacy is not binary, but a matter of degree
  - English woman ⇐ Old English wīf-man
  - the first component is cognate to wife, German Weib etc., and the second component to man, German Mann etc. Are woman and Weib cognate or not?
- for distantly related languages, experts often disagree about cognacy
  - Ancient Greek ὕλη / Latin silva ‘woods’

IELex

- Indo-European Lexical Cognacy Database
- freely available online at http://ielex.mpi.nl/
- based on Dyen et al. (1992)
- current version curated by a group at MPI Nijmegen
- recently migrated to the new MPI in Jena; new version not public yet

- 207-item Swadesh lists for 135 Indo-European languages
- words in orthographic and partially in phonetic transcription (IPA)
- entries are assigned to cognate classes
- sample entries (a missing iso_code is shown as "-"):

language        iso_code  gloss  global_id  local_id  transcription  cognate_class
ELFDALIAN       qov       woman  962        woman     kɛl̀ɪŋg         woman:Ag
DUTCH           nld       woman  962        woman     vrɑu           woman:B
GERMAN          deu       woman  962        woman     fraŭ           woman:B
DANISH          dan       woman  962        woman     g̥ʰvenə         woman:D
DANISH_FJOLDE   -         woman  962        woman     kvinʲ          woman:D
GUTNISH_LAU     -         woman  962        woman     kvɪnːˌfolk     woman:D
LATIN           lat       woman  962        woman     mulier         woman:E
LATIN           lat       woman  962        woman     feːmina        woman:G
ENGLISH         eng       woman  962        woman     wʊmən          woman:H
GERMAN          deu       woman  962        woman     vaĭp           woman:H
DANISH          dan       woman  962        woman     d̥ɛːmə          woman:K
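As a rough illustration of what "entries are assigned to cognate classes" means in practice, the following sketch groups the sample entries above by cognate class. The data structure and the function name `languages_by_class` are assumptions made for this example; they are not part of IELex or of the course material.

```python
from collections import defaultdict

# (language, cognate_class) pairs copied from the sample entries for 'woman'.
ENTRIES = [
    ("ELFDALIAN", "woman:Ag"),
    ("DUTCH", "woman:B"),
    ("GERMAN", "woman:B"),
    ("DANISH", "woman:D"),
    ("DANISH_FJOLDE", "woman:D"),
    ("GUTNISH_LAU", "woman:D"),
    ("LATIN", "woman:E"),
    ("LATIN", "woman:G"),
    ("ENGLISH", "woman:H"),
    ("GERMAN", "woman:H"),
    ("DANISH", "woman:K"),
]


def languages_by_class(entries):
    """Map each cognate class to the set of languages attesting it."""
    groups = defaultdict(set)
    for language, cognate_class in entries:
        groups[cognate_class].add(language)
    return dict(groups)


if __name__ == "__main__":
    for cognate_class, languages in sorted(languages_by_class(ENTRIES).items()):
        print(cognate_class, sorted(languages))
```

Note that a language can appear in several classes for the same concept (German has both Frau and Weib), which is why sets per class are used rather than a single class per language.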

Other publicly available cognacy data sources

- Austronesian Basic Vocabulary Database: http://language.psy.auckland.ac.nz/austronesian/
- ten collections of cognate-coded Swadesh lists from various language families collected by Johann-Mattis List [1]
- ten collections of short (40–100 items) cognate-coded Swadesh lists from various language families collected by Søren Wichmann and Eric Holman [2]
- 88 cognate-coded Swadesh lists from Central Asian languages [3]

[1] List, J.-M. (2014): Data from: Sequence comparison in historical linguistics. GitHub Repository. http://github.com/SequenceComparison/SupplementaryMaterial. Release: 1.0.

[2] Supplementary material to Wichmann and Holman (2013)

[3] Supplementary material to Mennecier et al. (2016)

Phonetically transcribed Swadesh lists

The Automated Similarity Judgment Program (ASJP)

- project originally hosted at MPI EVA in Leipzig, centered around Søren Wichmann
- running since 2009; currently version 17 (2016)
- covers more than 7,000 languages and dialects (4,574 languages with ISO code)
- basic vocabulary of 40 words for each language, in uniform phonetic transcription
- freely available at http://asjp.clld.org/

concepts used: I, you, we, one, two, person, fish, dog, louse, tree, leaf, skin, blood, bone, horn, ear, eye, nose, tooth, tongue, knee, hand, breast, liver, drink, see, hear, die, come, sun, star, water, stone, fire, path, mountain, night, full, new, name

Phonetic transcription

- 41 sound classes, all coded as ASCII characters
- various diacritics to capture finer phonetic distinctions, e.g.
  - ph~: aspirated p
  - a*: nasalized a
  - hkw$: pre-aspirated labialized k
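The three examples suggest how the modifiers combine symbols: `~` fuses the two preceding symbols into one segment, `$` fuses the three preceding symbols, and `*` marks the preceding symbol as nasalized. A minimal sketch of a segment splitter under that reading follows; the function name `asjp_segments` is hypothetical, other ASJP modifiers are not handled, and stripping a leading `%` (which marks some entries in ASJP word lists) is an assumption of this example.

```python
def asjp_segments(word):
    """Split an ASJP-coded word into segments, keeping modifiers attached.

    Assumed reading of the modifiers (based on the examples ph~, a*, hkw$):
      ~  the two preceding symbols form one segment
      $  the three preceding symbols form one segment
      *  the preceding symbol is nasalized
    """
    segments = []
    for ch in word.lstrip("%"):  # assumption: a leading % is a marker, not a sound
        if ch == "~":
            merged = "".join(segments[-2:])
            segments[-2:] = [merged + "~"]
        elif ch == "$":
            merged = "".join(segments[-3:])
            segments[-3:] = [merged + "$"]
        elif ch == "*" and segments:
            segments[-1] += "*"
        else:
            segments.append(ch)
    return segments


if __name__ == "__main__":
    for example in ["ph~", "a*", "hkw$", "saNgw~is"]:
        print(example, asjp_segments(example))
```

Treating a modified group as a single segment matters for any later alignment or distance computation, since otherwise an aspirated p would count as two sounds.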

Metadata

- language family, language genus, classification according to Ethnologue and Glottolog
- geographic location
- population size

ASJP sound classes (from Brown et al. 2013)

ASJP code  Description                                                          IPA symbols
p          voiceless bilabial stop and fricative                                p, ɸ
b          voiced bilabial stop and fricative                                   b, β
f          voiceless labiodental fricative                                      f
v          voiced labiodental fricative                                         v
m          bilabial nasal                                                       m
w          voiced bilabial-velar approximant                                    w
8          voiceless and voiced dental fricative                                θ, ð
4          dental nasal                                                         n̪
t          voiceless alveolar stop                                              t
d          voiced alveolar stop                                                 d
s          voiceless alveolar fricative                                         s
z          voiced alveolar fricative                                            z
c          voiceless and voiced alveolar affricate                              ts, dz
n          alveolar nasal                                                       n
r          voiced apico-alveolar flap and all other varieties of “r-sounds”     ɾ, r, ʀ, ɽ
l          voiced alveolar lateral approximant                                  l
S          voiceless post-alveolar fricative                                    ʃ
Z          voiced post-alveolar fricative                                       ʒ
C          voiceless palato-alveolar affricate                                  ʧ
j          voiced palato-alveolar affricate                                     ʤ
T          voiceless and voiced palatal stop                                    c, ɟ
5          palatal nasal                                                        ɲ
y          palatal approximant                                                  j
k          voiceless velar stop                                                 k
g          voiced velar stop                                                    g
x          voiceless and voiced velar fricative                                 x, ɣ
N          velar nasal                                                          ŋ
q          voiceless uvular stop                                                q
G          voiced uvular stop                                                   ɢ
X          voiceless and voiced uvular fricative, voiceless and voiced
           pharyngeal fricative                                                 χ, ʁ, ħ, ʕ
h          voiceless and voiced glottal fricative                               h, ɦ
7          voiceless glottal stop                                               ʔ
L          all other laterals                                                   ʟ, ɭ, ʎ
!          all varieties of “click-sounds”                                      !, ǀ, ǁ, ǂ
i          high front vowel, rounded and unrounded                              i, ɪ, y, ʏ
e          mid front vowel, rounded and unrounded                               e, ø
E          low front vowel, rounded and unrounded                               æ, ɛ, œ, ɶ
3          high and mid central vowel, rounded and unrounded                    ɨ, ɘ, ə, ɜ, ʉ, ɵ, ɞ
a          low central vowel, unrounded                                         a, ɐ
u          high back vowel, rounded and unrounded                               ɯ, u
o          mid and low back vowel, rounded and unrounded                        ɤ, ʌ, ɑ, o, ɔ, ɒ
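The sound-class table is essentially a many-to-one mapping from IPA symbols to ASCII codes. The sketch below inverts a small subset of the table into a lookup and applies it symbol by symbol. The names `ASJP_CLASSES` and `ipa_to_asjp` are hypothetical, the IPA inputs are illustrative broad transcriptions chosen for this example, and multi-symbol units such as affricates or diacritics would need extra handling that is not shown.

```python
# Subset of the sound-class table: ASJP code -> IPA symbols it covers.
ASJP_CLASSES = {
    "p": "pɸ",
    "b": "bβ",
    "f": "f",
    "m": "m",
    "8": "θð",
    "t": "t",
    "d": "d",
    "s": "s",
    "n": "n",
    "S": "ʃ",
    "k": "k",
    "N": "ŋ",
    "i": "iɪyʏ",
    "e": "eø",
    "u": "ɯu",
}

# Inverted lookup: single IPA symbol -> ASJP code.
IPA_TO_ASJP = {
    ipa: code for code, symbols in ASJP_CLASSES.items() for ipa in symbols
}


def ipa_to_asjp(word):
    """Replace each IPA symbol by its ASJP class code.

    Symbols outside the subset above are passed through unchanged.
    """
    return "".join(IPA_TO_ASJP.get(symbol, symbol) for symbol in word)


if __name__ == "__main__":
    print(ipa_to_asjp("fɪʃ"))  # English 'fish'
    print(ipa_to_asjp("tuθ"))  # English 'tooth'
```

The mapping is lossy by design: distinctions such as i versus ɪ or θ versus ð disappear, which trades phonetic detail for a transcription that can be applied uniformly across thousands of languages.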

Automated Similarity Judgment Program: sample word lists

concept   Latin          English
I         ego            Ei
you       tu             yu
we        nos            wi
one       unus           w3n
two       duo            tu
person    persona, homo  %pers3n
fish      piskis         fiS
dog       kanis          dag
louse     pedikulus      laus
tree      arbor          tri
leaf      foly~u*        lif
skin      kutis          %skin
blood     saNgw~is       bl3d
bone      os             bon
horn      kornu          horn
ear       auris          ir
eye       okulus         Ei
nose      nasus          nos
tooth     dens           tu8
tongue    liNgw~E        t3N
knee      genu           ni
hand      manus          hEnd
breast    pektus, mama   brest
liver     yekur          liv3r
drink     bibere         drink
see       widere         si
hear      audire         hir
die       mori           dEi
come      wenire         k3m
sun       sol            s3n
star      stela          star
water     akw~a          wat3r
stone     lapis          ston
fire      iNnis          fEir
path      viya           pE8
mountain  mons           %maunt3n
night     noks           nEit
full      plenus         ful
new       nowus          nu
name      nomen          nem
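Word lists like these can be compared without any cognate annotation, for instance with a string distance. The sketch below computes the Levenshtein (edit) distance between two ASJP strings and normalizes it by the length of the longer word. This is offered as one common choice for illustration, not as the method used in the slides; the function names are hypothetical, and the example words are picked so that they contain no `~`, `*` or `%` markers, which a real implementation would have to treat as part of a segment rather than as separate characters.

```python
def levenshtein(a, b):
    """Minimal number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_distance(a, b):
    """Edit distance divided by the length of the longer string (0.0 to 1.0)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


if __name__ == "__main__":
    # (concept, Latin, English) taken from the sample word lists.
    pairs = [("two", "duo", "tu"), ("nose", "nasus", "nos"), ("star", "stela", "star")]
    for concept, latin, english in pairs:
        print(concept, latin, english, round(normalized_distance(latin, english), 2))
```

Normalizing by length keeps long words from dominating an average over the 40 concepts.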

NorthEuraLex

- massive data collection effort of the Tübingen EVOLAEMP project
- (currently) translations of 1,017 concepts into 103 (mostly) Northern Eurasian languages (cf. Dellert, 2015)
- everything transcribed in IPA
- (so far) no manual cognate coding

Grammatical classifications

Grammatical classification databases

- World Atlas of Language Structures (WALS): http://wals.info/
- Syntactic Structures of the World’s Languages (SSWL): http://sswl.railsplayground.net/
- collection of syntactic parameters (in the Chomskyan sense) for a few dozen languages, collected in the LanGeLin project (Giuseppe Longobardi)

Expert family trees

- Ethnologue: https://www.ethnologue.com/
- Glottolog: http://glottolog.org/
  - in many ways an improved version of Ethnologue
  - strives to apply uniform standards across all languages
  - rather conservative in accepting family status

Running example

- 25 living Indo-European languages
- three types of data:
  - Swadesh lists in IPA transcription, taken from IELex
  - expert cognate classifications of Swadesh list entries (likewise taken from IELex) [4], and
  - phonological, grammatical and semantic classifications of languages (taken from WALS)

[4] I only included those entries from IELex where both an IPA transcription and a cognate classification are given.

sample entries:

language    phonological form (IELex)  cognate class (IELex)  order of subject, object and verb (WALS)
Bengali     -                          -                      SOV
Breton      -                          -                      SVO
Bulgarian   muˈrɛ                      sea:B                  SVO
Catalan     mar; maɾ; ma               sea:B                  SVO
Czech       ˈmɔr̝ɛ                      sea:B                  SVO
Danish      hɑw/søˀ                    sea:K/sea:J            SVO
Dutch       ze                         sea:J                  no dominant order
English     si:                        sea:J                  SVO
French      mɛʀ                        sea:B                  SVO
German      ze:/’o:ts͜ea:n/me:ɐ̯         sea:J/sea:E/sea:B      no dominant order
Greek       ˈθalaˌsa                   sea:F                  no dominant order
Hindi       -                          -                      SOV
Icelandic   haːv/sjouːr                sea:K/sea:J            SVO
Irish       ˈfˠæɾˠɟɪ                   sea:G                  VSO
Italian     ˈmare                      sea:B                  SVO
Lithuanian  ˈju:rɐ                     sea:H                  SVO
Nepali      -                          -                      SOV
Polish      ˈmɔʐɛ                      sea:B                  SVO
Portuguese  maɾ                        sea:B                  SVO
Romanian    ˈmare                      sea:B                  SVO
Russian     ˈmɔrʲɛ                     sea:B                  SVO
Spanish     maɾ                        sea:B                  SVO
Swedish     hɑːv/ɧøː                   sea:K/sea:J            SVO
Ukrainian   ˈmɔrɛ                      sea:B                  SVO
Welsh       -                          -                      VSO
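Cognate-class data of this kind is often recoded as presence/absence characters before further analysis: one column per cognate class, 1 if the language has a word in that class, 0 if not, and a missing-data symbol where no entry exists. The slides do not spell out such an encoding, so the sketch below is only an illustration under that assumption; the function name `binary_matrix` and the choice of `?` for missing data are hypothetical.

```python
# Cognate classes for 'sea', copied from a few rows of the sample table.
# An empty set stands for "-" (no IELex entry).
SEA = {
    "Danish": {"sea:K", "sea:J"},
    "Dutch": {"sea:J"},
    "English": {"sea:J"},
    "French": {"sea:B"},
    "German": {"sea:J", "sea:E", "sea:B"},
    "Welsh": set(),
}


def binary_matrix(classes_by_language):
    """Return (sorted class names, {language: string of 0/1/? per class})."""
    all_classes = sorted(set().union(*classes_by_language.values()))
    matrix = {}
    for language, classes in classes_by_language.items():
        if not classes:
            # No entry at all: every character is unknown rather than absent.
            matrix[language] = "?" * len(all_classes)
        else:
            matrix[language] = "".join(
                "1" if c in classes else "0" for c in all_classes
            )
    return all_classes, matrix


if __name__ == "__main__":
    columns, rows = binary_matrix(SEA)
    print(columns)
    for language, row in rows.items():
        print(f"{language:10s}{row}")
```

Coding a missing entry as unknown rather than as all zeros avoids treating a gap in the database as evidence that the language lacks every cognate class.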

Exercises

1. Access the files ielexData.csv and walsData.csv from our running example from http://www.sfs.uni-tuebingen.de/~gjaeger/esslli2016/data/
   1. Are there any WALS feature values occurring exclusively in the Romance languages?
   2. Are there any cognate classes occurring exclusively in the Romance languages?
   3. Are there any sound shifts (with instances in our data) occurring exclusively in the Romance languages?
   4. Answer the same questions for the Slavic languages.
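One possible way to approach the "exclusively occurring" questions is a set difference: collect the values found inside the group, collect the values found outside it, and subtract. The sketch below shows the idea on the cognate classes for 'sea' from the sample table, using the Germanic languages as the group so that the exercise itself stays open. The column layout of ielexData.csv and walsData.csv is not shown in the slides, so the data is hard-coded here; the function name `exclusive_values` and the group membership list are assumptions of this example.

```python
# Cognate classes for 'sea' per language, from the sample table
# (languages without an entry are omitted).
SEA_CLASSES = {
    "Bulgarian": {"sea:B"}, "Catalan": {"sea:B"}, "Czech": {"sea:B"},
    "Danish": {"sea:K", "sea:J"}, "Dutch": {"sea:J"}, "English": {"sea:J"},
    "French": {"sea:B"}, "German": {"sea:J", "sea:E", "sea:B"},
    "Greek": {"sea:F"}, "Icelandic": {"sea:K", "sea:J"}, "Irish": {"sea:G"},
    "Italian": {"sea:B"}, "Lithuanian": {"sea:H"}, "Polish": {"sea:B"},
    "Portuguese": {"sea:B"}, "Romanian": {"sea:B"}, "Russian": {"sea:B"},
    "Spanish": {"sea:B"}, "Swedish": {"sea:K", "sea:J"}, "Ukrainian": {"sea:B"},
}

# Group membership is supplied by hand; it is not part of the table.
GERMANIC = {"Danish", "Dutch", "English", "German", "Icelandic", "Swedish"}


def exclusive_values(values_by_language, group):
    """Values attested in at least one group member and in no other language."""
    inside, outside = set(), set()
    for language, values in values_by_language.items():
        (inside if language in group else outside).update(values)
    return inside - outside


if __name__ == "__main__":
    print(sorted(exclusive_values(SEA_CLASSES, GERMANIC)))
```

The same function works for WALS feature values if each language is mapped to a set of (feature, value) pairs instead of cognate classes.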

References

Bouchard-Côté, A., D. Hall, T. L. Griffiths, and D. Klein (2013). Automated reconstruction of ancient languages using probabilistic models of sound change. Proceedings of the National Academy of Sciences, 110(11):4224–4229.

Brown, C. H., E. Holman, and S. Wichmann (2013). Sound correspondences in the world’s languages. Language, 89(1):4–29.

Dellert, J. (2015). Compiling the Uralic dataset for NorthEuraLex, a lexicostatistical database of Northern Eurasia. Proceedings of the First International Workshop on Computational Linguistics for Uralic Languages. January 16, Tromsø, Norway.

Dyen, I., J. B. Kruskal, and P. Black (1992). An Indoeuropean classification: A lexicostatistical experiment. Transactions of the American Philosophical Society, 82(5):1–132.

Hruschka, D. J., S. Branford, E. D. Smith, J. Wilkins, A. Meade, M. Pagel, and T. Bhattacharya (2015). Detecting regular sound changes in linguistics as events of concerted evolution. Current Biology, 25(1):1–9.

List, J.-M. (2014). Sequence Comparison in Historical Linguistics. Düsseldorf University Press, Düsseldorf.

List, J.-M. (2016). Beyond cognacy: historical relations between words and their implication for phylogenetic reconstruction. Journal of Language Evolution, 1(1):119–136. doi: 10.1093/jole/lzw006.

Mennecier, P., J. Nerbonne, E. Heyer, and F. Manni (2016). A Central Asian language survey: Collecting data, measuring relatedness and detecting loans. Language Dynamics and Change, 6(1). In press.

Swadesh, M. (1955). Towards greater accuracy in lexicostatistic dating. International Journal of American Linguistics, 21:121–137.

Swadesh, M. (1971). The Origin and Diversification of Language. Aldine, Chicago.

Wichmann, S. and E. W. Holman (2013). Languages with longer words have more lexical change. In L. Borin and A. Saxena, eds., Approaches to Measuring Linguistic Differences, pp. 249–284. Mouton de Gruyter, Berlin.
