Data in computational historical linguistics
Gerhard Jäger
ESSLLI 2016
Gerhard Jäger Data sources ESSLLI 2016 1 / 25
Background
- the comparative method focuses strongly on two types of data:
  - morphological paradigms
  - regular sound correspondences
- neither is well suited to computational approaches, because
  - morphological categories are not easily comparable across languages, especially if we look beyond individual language families
  - also, isolating languages have no morphology
  - identifying regular sound correspondences automatically is a surprisingly hard problem, due to data sparseness
  - it is currently one of the hot topics and far from resolved (List, 2014; Hruschka et al., 2015; Bouchard-Côté et al., 2013)
- what we need (especially if we apply statistical methods):
  - data types which are applicable to all natural languages
  - ideally lots of data
- current practice:
  - word lists + expert annotations about cognacy (currently the dominant paradigm)
  - unannotated word lists in phonetic transcriptions
  - discrete grammatical categorizations (compiled by human experts)
Cognate-coded Swadesh lists
Swadesh lists
- collections of 100–200 concepts (there are different versions)
- core vocabulary:
  - not culture-dependent
  - diachronically stable, i.e. resistant against both semantic change and borrowing
- proposed by Morris Swadesh (Swadesh, 1955, 1971) to facilitate an early attempt to automate certain tasks in historical linguistics
- popular among computational historical linguists because it is a standard
- see List (2016) for a thoughtful discussion of the notion of cognacy
Cognates
- cognates are words that have the same origin
  - Latin filius ⇒ French fils, Italian figlio
- traditionally, cognacy excludes loanwords, but terminology among computationalists is sometimes less strict:
  - Latin persona ⇒ English person would also qualify as a cognate pair
- on average, the more closely two languages are related, the more cognate pairs they share
- during language change, the word for a given concept is sometimes replaced by a non-cognate one
- causes: semantic change, borrowing, morphological word formation
  - 'bone': Old High German Bein (cognate to Engl. bone) ⇒ New High German Knochen
  - Bein is still part of the German lexicon, but it now means 'leg'
- cognate replacement is comparable to a mutation in biological evolution
Caveats
- cognacy is not binary, but a matter of degree
  - English woman ⇐ Old English wīf-mann
  - the first component is cognate to wife, German Weib etc., and the second component to man, German Mann etc. Are woman and Weib cognate or not?
- for distantly related languages, experts often disagree about cognacy
  - Ancient Greek ὕλη / Latin silva 'woods'
IELex
- Indo-European Lexical Cognacy Database
- freely available online at http://ielex.mpi.nl/
- based on Dyen et al. (1992)
- current version curated by a group at MPI Nijmegen
- recently migrated to the new MPI in Jena; the new version is not public yet
- 207-item Swadesh lists for 135 Indo-European languages
- words in orthographic and partially in phonetic transcription (IPA)
- entries are assigned to cognate classes
- sample entries:

language       iso_code  gloss  global_id  local_id  transcription  cognate_class
ELFDALIAN      qov       woman  962        woman     kɛl̀ɪŋg         woman:Ag
DUTCH          nld       woman  962        woman     vrɑu           woman:B
GERMAN         deu       woman  962        woman     fraŭ           woman:B
DANISH         dan       woman  962        woman     g̥ʰvenə         woman:D
DANISH_FJOLDE            woman  962        woman     kvinʲ          woman:D
GUTNISH_LAU              woman  962        woman     kvɪnːˌfolk     woman:D
LATIN          lat       woman  962        woman     mulier         woman:E
LATIN          lat       woman  962        woman     feːmina        woman:G
ENGLISH        eng       woman  962        woman     wʊmən          woman:H
GERMAN         deu       woman  962        woman     vaĭp           woman:H
DANISH         dan       woman  962        woman     d̥ɛːmə          woman:K
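
Cognate-coded entries of this kind are easy to group by cognate class. The following is a minimal Python sketch, assuming the rows have already been split into fields. The tuple layout and the helper names `languages_by_class` and `shared_classes` are illustrative and not part of IELex.

```python
from collections import defaultdict

# (language, transcription, cognate_class), copied from the sample entries
ENTRIES = [
    ("ELFDALIAN", "kɛl̀ɪŋg", "woman:Ag"),
    ("DUTCH", "vrɑu", "woman:B"),
    ("GERMAN", "fraŭ", "woman:B"),
    ("DANISH", "g̥ʰvenə", "woman:D"),
    ("DANISH_FJOLDE", "kvinʲ", "woman:D"),
    ("GUTNISH_LAU", "kvɪnːˌfolk", "woman:D"),
    ("LATIN", "mulier", "woman:E"),
    ("LATIN", "feːmina", "woman:G"),
    ("ENGLISH", "wʊmən", "woman:H"),
    ("GERMAN", "vaĭp", "woman:H"),
    ("DANISH", "d̥ɛːmə", "woman:K"),
]


def languages_by_class(entries):
    """Map each cognate class to the set of languages attesting it."""
    groups = defaultdict(set)
    for language, _form, cognate_class in entries:
        groups[cognate_class].add(language)
    return dict(groups)


def shared_classes(entries, lang_a, lang_b):
    """Return the cognate classes attested in both languages."""
    classes_a = {c for lang, _form, c in entries if lang == lang_a}
    classes_b = {c for lang, _form, c in entries if lang == lang_b}
    return classes_a & classes_b


groups = languages_by_class(ENTRIES)
print(sorted(groups["woman:H"]))                    # ['ENGLISH', 'GERMAN']
print(shared_classes(ENTRIES, "GERMAN", "DUTCH"))   # {'woman:B'}
print(shared_classes(ENTRIES, "GERMAN", "DANISH"))  # set()
```

Note that a language can have several entries for one concept (German Frau and Weib), so classes are collected as sets per language rather than as single values.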
Other publicly available cognacy data sources
- Austronesian Basic Vocabulary Database, http://language.psy.auckland.ac.nz/austronesian/
- ten collections of cognate-coded Swadesh lists from various language families, collected by Johann-Mattis List [1]
- ten collections of short (40–100 items) cognate-coded Swadesh lists from various language families, collected by Søren Wichmann and Eric Holman [2]
- 88 cognate-coded Swadesh lists from Central Asian languages [3]

[1] List, J.-M. (2014): Data from: Sequence comparison in historical linguistics. GitHub Repository. http://github.com/SequenceComparison/SupplementaryMaterial. Release: 1.0.
[2] Supplementary material to Wichmann and Holman (2013)
[3] Supplementary material to Mennecier et al. (2016)
Phonetically transcribed Swadesh lists
The Automated Similarity Judgment Program (ASJP)
- project centred on Søren Wichmann, originally hosted at MPI EVA in Leipzig
- since 2009; currently version 17 (2016)
- covers more than 7,000 languages and dialects (4,574 languages with ISO code)
- basic vocabulary of 40 words for each language, in uniform phonetic transcription
- freely available at http://asjp.clld.org/
- used concepts: I, you, we, one, two, person, fish, dog, louse, tree, leaf, skin, blood, bone, horn, ear, eye, nose, tooth, tongue, knee, hand, breast, liver, drink, see, hear, die, come, sun, star, water, stone, fire, path, mountain, night, full, new, name
Phonetic transcription
- 41 sound classes, all coded as ASCII characters
- various diacritics to capture finer phonetic distinctions, e.g.
  - ph~: aspirated p
  - a*: nasalized a
  - hkw$: pre-aspirated labialized k
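
The diacritics can be resolved mechanically when splitting a transcription into segments. The sketch below is one possible reading of the three examples, not an official ASJP tool: it assumes that `~` fuses the two preceding symbols into one segment, that `$` fuses the three preceding symbols, and that `*` attaches to the preceding segment. It also drops a leading `%`, which appears on some ASJP entries and is assumed here to flag loanwords. The function name `tokenize_asjp` is hypothetical.

```python
def tokenize_asjp(word):
    """Split an ASJP-coded word into segments.

    Assumptions (inferred from the examples, not an official specification):
    '~' fuses the two preceding symbols, '$' fuses the three preceding
    symbols, '*' (nasalization) attaches to the preceding segment, and a
    leading '%' is a flag that is not part of the transcription.
    """
    fuse_width = {"~": 2, "$": 3}
    segments = []
    for char in word.lstrip("%"):
        if char == "*":
            if segments:
                segments[-1] += char
            else:
                segments.append(char)
        elif char in fuse_width:
            width = fuse_width[char]
            merged = "".join(segments[-width:])
            del segments[-width:]
            segments.append(merged + char)
        else:
            segments.append(char)
    return segments


print(tokenize_asjp("ph~"))       # ['ph~']
print(tokenize_asjp("hkw$"))      # ['hkw$']
print(tokenize_asjp("saNgw~is"))  # ['s', 'a', 'N', 'gw~', 'i', 's']
```

Keeping the diacritic inside the segment string preserves the original coding, so joining the segments reproduces the input (minus a leading `%`).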
Metadata
- language family, language genus, classification according to Ethnologue and Glottolog
- geographic location
- population size
ASJP sound classes (from Brown et al. 2013)

ASJP code  Description                                            IPA symbols
p          voiceless bilabial stop and fricative                  p, ɸ
b          voiced bilabial stop and fricative                     b, β
f          voiceless labiodental fricative                        f
v          voiced labiodental fricative                           v
m          bilabial nasal                                         m
w          voiced bilabial-velar approximant                      w
8          voiceless and voiced dental fricative                  θ, ð
4          dental nasal                                           n̪
t          voiceless alveolar stop                                t
d          voiced alveolar stop                                   d
s          voiceless alveolar fricative                           s
z          voiced alveolar fricative                              z
c          voiceless and voiced alveolar affricate                ts, dz
n          alveolar nasal                                         n
r          voiced apico-alveolar flap and all other varieties of  ɾ, r, ʀ, ɽ
           “r-sounds”
l          voiced alveolar lateral approximant                    l
S          voiceless post-alveolar fricative                      ʃ
Z          voiced post-alveolar fricative                         ʒ
C          voiceless palato-alveolar affricate                    ʧ
j          voiced palato-alveolar affricate                       ʤ
T          voiceless and voiced palatal stop                      c, ɟ
5          palatal nasal                                          ɲ
y          palatal approximant                                    j
k          voiceless velar stop                                   k
g          voiced velar stop                                      g
x          voiceless and voiced velar fricative                   x, ɣ
N          velar nasal                                            ŋ
q          voiceless uvular stop                                  q
G          voiced uvular stop                                     ɢ
X          voiceless and voiced uvular fricative, voiceless and   χ, ʁ, ħ, ʕ
           voiced pharyngeal fricative
h          voiceless and voiced glottal fricative                 h, ɦ
7          voiceless glottal stop                                 ʔ
L          all other laterals                                     ʟ, ɭ, ʎ
!          all varieties of “click-sounds”                        !, ǀ, ǁ, ǂ
i          high front vowel, rounded and unrounded                i, ɪ, y, ʏ
e          mid front vowel, rounded and unrounded                 e, ø
E          low front vowel, rounded and unrounded                 æ, ɛ, œ, ɶ
3          high and mid central vowel, rounded and unrounded      ɨ, ɘ, ə, ɜ, ʉ, ɵ, ɞ
a          low central vowel, unrounded                           a, ɐ
u          high back vowel, rounded and unrounded                 ɯ, u
o          mid and low back vowel, rounded and unrounded          ɤ, ʌ, ɑ, o, ɔ, ɒ
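
The sound-class table amounts to a many-to-one mapping from IPA symbols to ASJP codes. The following is a rough Python sketch of that mapping, not the converter used by the ASJP project. It covers only the single-character IPA symbols of the table (so the classes 4 and c, whose IPA values span more than one character, are left out), assigns ɣ to the velar fricative class x and ɤ to the back vowel class o, and silently drops every symbol it does not know, such as stress marks and length marks. The names `ASJP_CLASSES`, `IPA_TO_ASJP` and `to_asjp` are hypothetical.

```python
# ASJP code -> IPA symbols of that class (single-character symbols only)
ASJP_CLASSES = {
    "p": "pɸ", "b": "bβ", "f": "f", "v": "v", "m": "m", "w": "w",
    "8": "θð", "t": "t", "d": "d", "s": "s", "z": "z", "n": "n",
    "r": "ɾrʀɽ", "l": "l", "S": "ʃ", "Z": "ʒ", "C": "ʧ", "j": "ʤ",
    "T": "cɟ", "5": "ɲ", "y": "j", "k": "k", "g": "g", "x": "xɣ",
    "N": "ŋ", "q": "q", "G": "ɢ", "X": "χʁħʕ", "h": "hɦ", "7": "ʔ",
    "L": "ʟɭʎ", "!": "!ǀǁǂ",
    "i": "iɪyʏ", "e": "eø", "E": "æɛœɶ", "3": "ɨɘəɜʉɵɞ",
    "a": "aɐ", "u": "ɯu", "o": "ɤʌɑoɔɒ",
}

# Invert the table: one entry per IPA symbol
IPA_TO_ASJP = {
    ipa: code for code, symbols in ASJP_CLASSES.items() for ipa in symbols
}


def to_asjp(ipa_word):
    """Replace each known IPA symbol by its ASJP code; drop everything else."""
    return "".join(IPA_TO_ASJP[ch] for ch in ipa_word if ch in IPA_TO_ASJP)


print(to_asjp("ˈθalaˌsa"))  # 8alasa
print(to_asjp("ˈmɔrɛ"))     # morE
print(to_asjp("mɛʀ"))       # mEr
```

Dropping unknown symbols is a deliberate simplification: a faithful converter would also have to handle affricates, diacritics and the ASJP modifiers.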
Automated Similarity Judgment Program

concept   Latin           English
I         ego             Ei
you       tu              yu
we        nos             wi
one       unus            w3n
two       duo             tu
person    persona, homo   %pers3n
fish      piskis          fiS
dog       kanis           dag
louse     pedikulus       laus
tree      arbor           tri
leaf      foly~u*         lif
skin      kutis           %skin
blood     saNgw~is        bl3d
bone      os              bon
horn      kornu           horn
ear       auris           ir
eye       okulus          Ei
nose      nasus           nos
tooth     dens            tu8
tongue    liNgw~E         t3N
knee      genu            ni
hand      manus           hEnd
breast    pektus, mama    brest
liver     yekur           liv3r
drink     bibere          drink
see       widere          si
hear      audire          hir
die       mori            dEi
come      wenire          k3m
sun       sol             s3n
star      stela           star
water     akw~a           wat3r
stone     lapis           ston
fire      iNnis           fEir
path      viya            pE8
mountain  mons            %maunt3n
night     noks            nEit
full      plenus          ful
new       nowus           nu
name      nomen           nem
NorthEuraLex
- massive data collection effort of the Tübingen EVOLAEMP project
- (currently) translations of 1,017 concepts into 103 (mostly) Northern Eurasian languages (cf. Dellert, 2015)
- everything transcribed in IPA
- (so far) no manual cognate coding
Grammatical classifications
Grammatical classification databases
- World Atlas of Language Structures (WALS), http://wals.info/
- Syntactic Structures of the World’s Languages (SSWL), http://sswl.railsplayground.net/
- collection of syntactic parameters (in the Chomskyan sense) for a few dozen languages, collected in the LanGeLin project (Giuseppe Longobardi)
Expert family trees
- Ethnologue, https://www.ethnologue.com/
- Glottolog, http://glottolog.org/
  - in many ways an improved version of Ethnologue
  - strives to apply uniform standards across all languages
  - rather conservative in accepting family status
Running example
- 25 living Indo-European languages
- three types of data:
  - Swadesh lists in IPA transcription, taken from IELex
  - expert cognate classifications of Swadesh list entries (likewise taken from IELex) [4]
  - phonological, grammatical and semantic classifications of languages (taken from WALS)

[4] I only included those entries from IELex for which both an IPA transcription and a cognate classification are given.
sample entries:

language    phonological form (IELex)  cognate class (IELex)  order of subject, object and verb (WALS)
Bengali     -                          -                      SOV
Breton      -                          -                      SVO
Bulgarian   muˈrɛ                      sea:B                  SVO
Catalan     mar; maɾ; ma               sea:B                  SVO
Czech       ˈmɔr̝ɛ                      sea:B                  SVO
Danish      hɑw/søˀ                    sea:K/sea:J            SVO
Dutch       ze                         sea:J                  no dominant order
English     si:                        sea:J                  SVO
French      mɛʀ                        sea:B                  SVO
German      ze:/’o:ts͜ea:n/me:ɐ̯         sea:J/sea:E/sea:B      no dominant order
Greek       ˈθalaˌsa                   sea:F                  no dominant order
Hindi       -                          -                      SOV
Icelandic   haːv/sjouːr                sea:K/sea:J            SVO
Irish       ˈfˠæɾˠɟɪ                   sea:G                  VSO
Italian     ˈmare                      sea:B                  SVO
Lithuanian  ˈju:rɐ                     sea:H                  SVO
Nepali      -                          -                      SOV
Polish      ˈmɔʐɛ                      sea:B                  SVO
Portuguese  maɾ                        sea:B                  SVO
Romanian    ˈmare                      sea:B                  SVO
Russian     ˈmɔrʲɛ                     sea:B                  SVO
Spanish     maɾ                        sea:B                  SVO
Swedish     hɑːv/ɧøː                   sea:K/sea:J            SVO
Ukrainian   ˈmɔrɛ                      sea:B                  SVO
Welsh       -                          -                      VSO
Exercises
1. Access the files ielexData.csv and walsData.csv from our running example from http://www.sfs.uni-tuebingen.de/~gjaeger/esslli2016/data/
   1. Are there any WALS feature values exclusively occurring in the Romance languages?
   2. Are there any cognate classes exclusively occurring in the Romance languages?
   3. Are there any sound shifts (with instances in our data) exclusively occurring in the Romance languages?
   4. Answer the same questions for the Slavic languages.
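
One way to approach questions of the form "which values occur exclusively in group X" is plain set arithmetic. The sketch below does not read ielexData.csv or walsData.csv, since their column layout is not described here. Instead it uses the cognate classes for 'sea' from the sample entries of the running example. The subgroup memberships in `ROMANCE` and `GERMANIC` follow the standard classification and are supplied by hand; the helper name `exclusive_values` is hypothetical.

```python
# language -> cognate class cell for 'sea', as in the sample entries;
# languages without an entry ('-') are left out
SEA_CELLS = {
    "Bulgarian": "sea:B", "Catalan": "sea:B", "Czech": "sea:B",
    "Danish": "sea:K/sea:J", "Dutch": "sea:J", "English": "sea:J",
    "French": "sea:B", "German": "sea:J/sea:E/sea:B", "Greek": "sea:F",
    "Icelandic": "sea:K/sea:J", "Irish": "sea:G", "Italian": "sea:B",
    "Lithuanian": "sea:H", "Polish": "sea:B", "Portuguese": "sea:B",
    "Romanian": "sea:B", "Russian": "sea:B", "Spanish": "sea:B",
    "Swedish": "sea:K/sea:J", "Ukrainian": "sea:B",
}

# Cells with several synonyms list their classes separated by '/'
SEA_CLASSES = {lang: set(cell.split("/")) for lang, cell in SEA_CELLS.items()}

# Assumed subgroup memberships (not part of the sample table)
ROMANCE = {"Catalan", "French", "Italian", "Portuguese", "Romanian", "Spanish"}
GERMANIC = {"Danish", "Dutch", "English", "German", "Icelandic", "Swedish"}


def exclusive_values(values_by_language, group):
    """Return the values found in the group but in no language outside it."""
    inside = set()
    outside = set()
    for language, values in values_by_language.items():
        if language in group:
            inside |= values
        else:
            outside |= values
    return inside - outside


print(exclusive_values(SEA_CLASSES, ROMANCE))           # set()
print(sorted(exclusive_values(SEA_CLASSES, GERMANIC)))  # ['sea:E', 'sea:J', 'sea:K']
```

For this single concept no cognate class is exclusive to Romance, because sea:B also occurs in Slavic and in German. The same function works for WALS feature values if each language is mapped to a set of feature-value pairs.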
References
Bouchard-Côté, A., D. Hall, T. L. Griffiths, and D. Klein (2013). Automated reconstruction of ancient languages using probabilistic models of sound change. Proceedings of the National Academy of Sciences, 110(11):4224–4229.
Brown, C. H., E. Holman, and S. Wichmann (2013). Sound correspondences in the world’s languages. Language, 89(1):4–29.
Dellert, J. (2015). Compiling the Uralic dataset for NorthEuraLex, a lexicostatistical database of Northern Eurasia. Proceedings of the First International Workshop on Computational Linguistics for Uralic Languages. January 16, Tromsø, Norway.
Dyen, I., J. B. Kruskal, and P. Black (1992). An Indoeuropean classification: A lexicostatistical experiment. Transactions of the American Philosophical Society, 82(5):1–132.
Hruschka, D. J., S. Branford, E. D. Smith, J. Wilkins, A. Meade, M. Pagel, and T. Bhattacharya (2015). Detecting regular sound changes in linguistics as events of concerted evolution. Current Biology, 25(1):1–9.
List, J.-M. (2014). Sequence Comparison in Historical Linguistics. Düsseldorf University Press, Düsseldorf.
List, J.-M. (2016). Beyond cognacy: historical relations between words and their implication for phylogenetic reconstruction. Journal of Language Evolution, 1(1):119–136. doi: 10.1093/jole/lzw006.
Mennecier, P., J. Nerbonne, E. Heyer, and F. Manni (2016). A Central Asian language survey: Collecting data, measuring relatedness and detecting loans. Language Dynamics and Change, 6(1). In press.
Swadesh, M. (1955). Towards greater accuracy in lexicostatistic dating. International Journal of American Linguistics, 21:121–137.
Swadesh, M. (1971). The Origin and Diversification of Language. Aldine, Chicago.
Wichmann, S. and E. W. Holman (2013). Languages with longer words have more lexical change. In L. Borin and A. Saxena, eds., Approaches to Measuring Linguistic Differences, pp. 249–284. Mouton de Gruyter, Berlin.