Quantitative approaches to linguistic similarity

Michael Cysouw

Work-in-Progress, 9 November 2004

Part 1

Distribution of rare characteristics

• Using the WALS-data to approach some perennial questions:

• Are there languages that have many typologically rare characteristics?

• Are there regions that show a relatively high density of rare features?

• Does rarity cluster?

Rarity Index Ri

n = number of feature values
f_i = frequency of feature value i
f_tot = total number of languages included

For f_i / f_tot < 1/n:

$$R_{f_i} = n \cdot \frac{f_i}{f_{tot}}$$

For f_i / f_tot > 1/n:

$$R_{f_i} = \frac{1}{n-1}\left(n \cdot \frac{f_i}{f_{tot}} - 1\right) + 1$$

Order of Object and Verb (by Matthew Dryer)

Computing Rarity Index

• Three feature values (n = 3)

• Frequencies f_i: 640 (OV), 639 (VO), 91 (no preference)

• Total f_tot = 640 + 639 + 91 = 1370

• R_OV = 1.20, R_VO = 1.20, R_nopref = 0.20
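The computation above can be reproduced with a short Python sketch. The function name `rarity_index` and the dictionary layout are illustrative choices, not part of the WALS software; the boundary case f_i / f_tot = 1/n, which the slides leave open, is assigned to the first branch here because both branches give 1 at that point.

```python
def rarity_index(f_i: int, f_tot: int, n: int) -> float:
    """Rarity index of one feature value, using the two-case definition above."""
    ratio = n * f_i / f_tot  # 1.0 means "exactly as frequent as expected (1/n)"
    if ratio <= 1:
        # Rarer than expected: index falls between 0 and 1.
        return ratio
    # More common than expected: rescale so the index falls between 1 and 2.
    return (ratio - 1) / (n - 1) + 1


# Frequencies for "Order of Object and Verb" as given on the slide.
frequencies = {"OV": 640, "VO": 639, "no preference": 91}
f_tot = sum(frequencies.values())
n = len(frequencies)

indices = {value: round(rarity_index(f, f_tot, n), 2)
           for value, f in frequencies.items()}
print(indices)  # {'OV': 1.2, 'VO': 1.2, 'no preference': 0.2}
```

With this definition a value shared by all languages gets the maximum index of 2, which matches the 0.0 to 2.0 range of the rarity-index axes in the plots.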

Indices of English

[Figure: two histograms of rarity index (x-axis, 0.0 to 2.0) against number of occurrences (y-axis). Panel "All languages": y-axis 0 to 15000. Panel "English": y-axis 0 to 30.]

Indices of Indonesian

[Figure: two histograms of rarity index (x-axis, 0.0 to 2.0) against number of occurrences (y-axis). Panel "All languages": y-axis 0 to 15000. Panel "Indonesian": y-axis 0 to 30.]

Median of Rarity Indices

[Figure: two histograms of the median of rarity indices per language (x-axis) against frequency (y-axis). First panel: x-axis 0.5 to 1.5, y-axis 0 to 1000. Second panel: x-axis 0.2 to 1.0, y-axis 0 to 80.]

Influence of amount of data

[Figure: number of features coded (x-axis, 0 to 140) against median of rarity indices (y-axis, 0.5 to 1.5).]

Number of indices smaller than one

[Figure: number of features coded (x-axis, 0 to 140) against number of rarity indices smaller than 1 (y-axis, 0 to 50).]
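The per-language summaries plotted on these slides (number of features coded, median of the rarity indices, and number of indices below a threshold) could be computed along the following lines. This is a minimal sketch: the function name `summarize`, the data layout, and the numbers in `toy` are invented for illustration and are not WALS values.

```python
from statistics import median


def summarize(indices_by_language, threshold=1.0):
    """Per language: features coded, median rarity index, count below threshold."""
    summary = {}
    for language, indices in indices_by_language.items():
        summary[language] = {
            "coded": len(indices),                              # x-axis of the plots
            "median": median(indices),                          # median of rarity indices
            "below": sum(1 for r in indices if r < threshold),  # rare values
        }
    return summary


# Made-up rarity indices for two hypothetical languages.
toy = {
    "Language A": [1.2, 0.2, 1.5, 0.9],
    "Language B": [1.2, 1.1, 0.4],
}

summary = summarize(toy)                   # threshold 1, as on this slide
extreme = summarize(toy, threshold=0.5)    # stricter threshold of .5
```

Plotting `below` against `coded` gives the kind of scatter shown here; the count can only grow with the number of features coded, so languages have to be compared at similar amounts of data.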

Indo-European languages

[Figure: number of features coded (x-axis, 0 to 140) against number of rarity indices smaller than 1 (y-axis, 0 to 50). Labels in the plot: German, French, English, Russian, Greek, Irish, Latvian.]

Germanic

[Figure: number of features coded (x-axis, 0 to 140) against number of rarity indices smaller than 1 (y-axis, 0 to 50). Legend: Germanic, Indo-European.]

Slavic

[Figure: number of features coded (x-axis, 0 to 140) against number of rarity indices smaller than 1 (y-axis, 0 to 50). Legend: Slavic, Germanic, Indo-European.]

Going more extreme... (Number of rarity indices smaller than .5)

[Figure: number of features coded (x-axis, 0 to 140) against number of rarity indices smaller than .5 (y-axis, 0 to 20).]

Languages with many rare features

[Figure: number of features coded (x-axis, 0 to 140) against number of rarity indices smaller than .5 (y-axis, 0 to 20).]

Caucasian languages

[Figure: number of features coded (x-axis, 0 to 140) against number of rarity indices smaller than .5 (y-axis, 0 to 20).]

Next steps

• Improve the integration of R with the WALS program

• Going from exploring the data to testing hypotheses

• Taking the feature-perspective: which rare features cluster?

Part 2

Quantitative approaches to historical relatedness

• A range of recent papers uses methods from biological phylogenetic reconstruction to infer linguistic family trees

• But: only the final part of the historical-comparative method is taken up

• Almost all deal with higher groupings of Indo-European

• But: one should first check validity by applying the methods to agreed-upon classifications

Inferring a tree is just one of the many possibilities of quantitative approaches

Dictionaries, etc.

Cognate sets

Regular sound correspondences

Family tree

Infer missing cognates

Testing Holm’s approach (together with Søren Wichmann and David Kamholz)

• Holm’s idea: use an etymological dictionary instead of Swadesh-style wordlists

• By counting the number of shared retentions for each pair of languages, he estimates the relative point of split between each pair (dissimilarity estimates)

• In simulations (by Kamholz) the approach seems to work

• We tested the method on Mixe-Zoque data
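The counting step that the method starts from can be sketched as follows. This only illustrates counting shared retentions per language pair; how Holm turns these counts into split estimates is not shown. The function name, the data layout (etymon mapped to the set of languages with a reflex), and the toy entries are assumptions made for illustration, not Mixe-Zoque data.

```python
from collections import Counter
from itertools import combinations


def shared_retention_counts(reflexes):
    """Count, for every language pair, the etyma retained in both languages.

    `reflexes` maps each reconstructed etymon to the set of languages
    in which a reflex of that etymon is attested.
    """
    counts = Counter()
    for languages in reflexes.values():
        # Every pair of languages retaining this etymon shares one retention.
        for pair in combinations(sorted(languages), 2):
            counts[pair] += 1
    return counts


# Invented toy dictionary; the language abbreviations are used as labels only.
toy = {
    "etymon_1": {"LM", "MM", "NHM"},
    "etymon_2": {"LM", "MM"},
    "etymon_3": {"ChisZ", "SoZ", "LM"},
}

counts = shared_retention_counts(toy)
print(counts[("LM", "MM")])  # 2
```

Sorting the languages before pairing keeps each pair in one canonical order, so ("LM", "MM") and ("MM", "LM") are never counted separately.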

Interpreting the estimates

Following Holm’s interpretation:

[Figure: tree diagram with the labels SHM, OlP, SaP, ChZ, AyZ, SoZ, ChisZ, NHM, MM, LM]

Using ADDTREE on the estimates:

How did it work out?

[Figure: result of Holm’s method compared with Wichmann (1995)]

The Popolucan errors

[Figure: result of Holm’s method compared with Wichmann (1995)]

• Error 1: they are grouped together, because of many shared retentions

• But: there are no shared innovations!

• Error 2: they are grouped with Zoque instead of with Mixe

• Circularity problem: reconstruction depends on tree, and Holm makes tree out of reconstruction

The other two errors

[Figure: result of Holm’s method compared with Wichmann (1995)]

Difficult to place in the tree

[Figure: tree diagram with the labels SHM, OlP, SaP, ChZ, AyZ, SoZ, ChisZ, NHM, MM, LM]

Language   Number of dictionary entries
LM         7000
ChisZ      6000
NHM        5600
MM         4100
OlP        4000
SaP        3600
AyZ        2000
ChZ        1600
SoZ         800
SHM         700

Estimates of available knowledge about Mixe-Zoque

Number of retentions depends on available knowledge

[Figure: number of retentions (x-axis, 0 to 600) against estimates of available data (y-axis, 0 to 8000); R² = 0.7497]

Spread of estimates depends on available knowledge

[Figure: spread of estimates, as standard deviation (x-axis, 0 to 80), against estimates of available data (y-axis, 0 to 8000); R² = 0.6197]
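The slides do not say how the R² values were obtained. Assuming they come from an ordinary least-squares line through the points, R² equals the squared Pearson correlation and can be computed as in the sketch below. The function name `r_squared` and the toy numbers are illustrative only; they are not the Mixe-Zoque measurements.

```python
def r_squared(xs, ys):
    """Coefficient of determination for a simple least-squares line y = a + b*x.

    For a single predictor this equals the squared Pearson correlation.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)


# Made-up points: a perfect linear relation and a noisier one.
perfect = r_squared([1, 2, 3, 4], [2, 4, 6, 8])
noisy = r_squared([1, 2, 3, 4], [1, 3, 2, 4])
```

A value near 1 means that one variable is almost fully predictable from the other, which is the sense in which the number of retentions and the spread of the estimates are said to depend on the amount of available data.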

Summary of problems

• Absence of shared innovations is not counted

• The data that enter in the analysis (i.e. reconstructed etyma) partly depend on the outcome (i.e. the tree)

• Unbalanced amount of available data distorts the estimates

The End