Word Frequency Distributions across Languages

Trudie Strauss (1), Michael J. Von Maltitz (1), Damián E. Blasi (2)

(1) Department of Mathematical Statistics and Actuarial Science, University of the Free State, South Africa

(2) University of Zurich, Switzerland

TBI Winterseminar, Bled, 2018

Outline

1 Introduction
    Importance of Word Frequency
    History

2 Approach
    Overview
    Data and Methods
    Expectations

3 Preliminary Results
    Results
    Explanation

Importance of Word Frequency

- Word frequency distributions are a central object of study in the language sciences
- The frequency of words determines many important phenomena in language
  - e.g. age of acquisition, rate of change through time...

History

- Simple and explicit parametric models: power-law distributions
- Zipf's Law:

  $$f(r) \propto \frac{1}{r^{\alpha}}, \qquad \alpha \approx 1$$

  where $f(r)$ is the frequency of the word of rank $r$.

- Adapted, "improved" models of higher complexity
- Why do word frequencies follow the distribution they do?
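As a minimal illustration of Zipf's law (a Python sketch that is not taken from the slides; the function names are hypothetical), the snippet below builds ideal rank-frequency values with f(r) proportional to r^(-α) and recovers α from the slope of a least-squares line on log-log axes. Log-log regression is used here only because it is easy to follow; it should not be read as the estimator used in the talk.

```python
import math


def zipf_frequencies(n_ranks, alpha=1.0):
    """Relative frequencies f(r) proportional to r**(-alpha) for r = 1..n_ranks."""
    weights = [r ** (-alpha) for r in range(1, n_ranks + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def loglog_slope_alpha(frequencies):
    """Estimate alpha as minus the least-squares slope of log f(r) on log r.

    `frequencies` must be positive and sorted from most to least frequent.
    """
    xs = [math.log(r) for r in range(1, len(frequencies) + 1)]
    ys = [math.log(f) for f in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return -sxy / sxx


# Ideal Zipf data with alpha = 1: the most frequent word is twice as
# frequent as the second, three times as frequent as the third, and so on.
freqs = zipf_frequencies(1000, alpha=1.0)
alpha_hat = loglog_slope_alpha(freqs)
```

On real word lists the points rarely fall on a perfect line, so the slope depends on which ranks are included in the regression.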

Overview

Our Approach

- data-oriented
- available data
- computing power

Calculate 32 word frequency distribution measures of lexical diversity, which together form a multidimensional space describing the distribution, e.g.:

- mean frequency of words
- skewness
- kurtosis
- entropy
- number of hapax and dis legomena (words occurring exactly once or exactly twice)

Token / Type

- Tokens: total number of words in a text
- Types: number of unique words in a text

Find the best parametric Zipf fit for each text, then compare every measure with simulations from the fitted theoretical distribution.
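A few of the measures listed above can be computed directly from a list of tokens. The following Python sketch is only an illustration: the function name, the dictionary keys, the use of population moments for skewness and kurtosis, and the choice of base-2 entropy are all assumptions, and the slides do not say which definitions the authors used.

```python
import math
from collections import Counter


def frequency_measures(tokens):
    """Compute a few summary measures of a word frequency distribution."""
    counts = Counter(tokens)              # type -> frequency
    freqs = list(counts.values())
    n_tokens = sum(freqs)                 # total number of words
    n_types = len(freqs)                  # number of unique words
    mean = n_tokens / n_types             # mean frequency of a type
    # Central moments of the type frequencies (population form, assumed).
    m2 = sum((f - mean) ** 2 for f in freqs) / n_types
    m3 = sum((f - mean) ** 3 for f in freqs) / n_types
    m4 = sum((f - mean) ** 4 for f in freqs) / n_types
    # Shannon entropy (in bits) of the relative frequencies.
    entropy = -sum((f / n_tokens) * math.log2(f / n_tokens) for f in freqs)
    return {
        "tokens": n_tokens,
        "types": n_types,
        "mean_frequency": mean,
        "skewness": m3 / m2 ** 1.5 if m2 > 0 else float("nan"),
        "kurtosis": m4 / m2 ** 2 if m2 > 0 else float("nan"),
        "entropy_bits": entropy,
        "hapax_legomena": sum(1 for f in freqs if f == 1),
        "dis_legomena": sum(1 for f in freqs if f == 2),
    }


# Toy input: 11 tokens, 8 types.
example = "the cat sat on the mat and the dog sat too".split()
measures = frequency_measures(example)
```

Each text then becomes one point in a space with one coordinate per measure, which is the sense in which the measures describe the distribution jointly.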

Data and Methods

Data

For each language in the Leipzig Corpora Collection [1]:

- download the largest, most recent Wikipedia text file (sentences)
- texts range from 90,000 to 20,000,000 words
- create a word list with the R package tidytext

[1] D. Goldhahn, T. Eckart & U. Quasthoff: Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages. In: Proceedings of the 8th International Language Resources and Evaluation (LREC'12), 2012
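The word-list step can be pictured with a short Python sketch. This is not the tidytext workflow used in the talk: the regular expression, the lowercasing, and the function name are assumptions made for illustration, and the two sentences are made up.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of letters or digits, optionally joined by
# internal apostrophes, compared in lowercase. The tokenizer used in the
# talk may split text differently.
WORD_PATTERN = re.compile(r"[^\W_]+(?:'[^\W_]+)*")


def build_word_list(sentences):
    """Return (word, count) pairs sorted from most to least frequent."""
    counts = Counter()
    for sentence in sentences:
        counts.update(WORD_PATTERN.findall(sentence.lower()))
    return counts.most_common()


# Made-up input standing in for a sentence file.
sentences = [
    "Die kat sit op die mat.",
    "Die hond sit ook.",
]
word_list = build_word_list(sentences)
```

The resulting list of counts is the input for every later step: the measures, the model fits, and the subsampling.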

Method

We fit the following parametric models to the data:

- power law
- log-normal
- exponential
- Poisson

Parameters are estimated from the empirical data, using the R package poweRlaw.

Example language: Afrikaans
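To give a feel for what estimating a power-law parameter from data involves, here is a hedged Python sketch. It does not reproduce the poweRlaw API. It uses a standard continuous approximation to the maximum-likelihood estimate for a discrete power law, which is only reasonable when the lower cut-off `xmin` is not too small; the function names, the cut-off of 10, and the synthetic sample are all assumptions. The exponent here describes the distribution of frequency counts, which is a different parameterisation from the rank exponent in Zipf's law.

```python
import math
import random


def approx_powerlaw_alpha(values, xmin):
    """Approximate MLE of alpha for p(x) ~ x**(-alpha), integer x >= xmin.

    Continuous approximation: alpha ~= 1 + n / sum(log(x / (xmin - 0.5))).
    """
    tail = [x for x in values if x >= xmin]
    log_sum = sum(math.log(x / (xmin - 0.5)) for x in tail)
    return 1.0 + len(tail) / log_sum


def sample_discrete_powerlaw(n, alpha, xmin, rng):
    """Draw n approximate discrete power-law values by rounding a continuous draw."""
    return [
        int((xmin - 0.5) * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) + 0.5)
        for _ in range(n)
    ]


# Synthetic check: draw from a known exponent and try to recover it.
rng = random.Random(0)
sample = sample_discrete_powerlaw(20000, alpha=2.5, xmin=10, rng=rng)
alpha_hat = approx_powerlaw_alpha(sample, xmin=10)
```

Comparing candidate models such as the log-normal or exponential would additionally need their likelihoods on the same tail, which this sketch leaves out.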

[Figure: Afrikaans. S(n) plotted against n on logarithmic axes, with fitted Power Law, Log-Normal, Poisson and Exponential curves.]

Method

Empirical Data:

- Define sample sizes n_i ranging from 1000 to N (for i = 1, ..., 100)
- Sample n_i words from the initial word list
- Determine the value of each of the measures
- Each language: 100 values of n_i × 35 measures

Simulated Data:

- Fit a power law distribution to each sample, using the α value calculated on the entire data set
- Simulate from the theoretical distribution for n_i ranging from 1000 to N (for i = 1, ..., 100)
- Compute the value of each measure, taking the mean over all simulations for each n_i
- 100 values of n_i × 35 means of simulated measures
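The two procedures above can be sketched in Python as follows. Several details are assumptions rather than facts from the slides: the sample sizes are taken to be linearly spaced, empirical samples are drawn without replacement, the theoretical distribution is simulated as a Zipf law over a fixed number of ranks (`n_ranks`), and `n_sims` simulations are averaged per sample size. The number of types stands in for any one of the measures.

```python
import random


def sample_sizes(n_total, n_min=1000, steps=100):
    """Assumed: `steps` linearly spaced sample sizes from n_min to n_total."""
    span = n_total - n_min
    return [n_min + round(i * span / (steps - 1)) for i in range(steps)]


def type_count(tokens):
    """Example measure: number of distinct types in a sample."""
    return len(set(tokens))


def empirical_curve(tokens, sizes, measure, rng):
    """Measure on a random sample (without replacement) of each size."""
    return [measure(rng.sample(tokens, n)) for n in sizes]


def simulated_curve(sizes, alpha, n_ranks, measure, rng, n_sims=5):
    """Mean of the measure over n_sims simulated samples of each size.

    Assumed: ranks 1..n_ranks drawn with probability proportional to r**(-alpha).
    """
    ranks = range(1, n_ranks + 1)
    weights = [r ** (-alpha) for r in ranks]
    curve = []
    for n in sizes:
        values = [measure(rng.choices(ranks, weights=weights, k=n))
                  for _ in range(n_sims)]
        curve.append(sum(values) / n_sims)
    return curve


# Toy run: a synthetic "text" of 5000 tokens over 500 possible types.
rng = random.Random(1)
toy_tokens = rng.choices(range(1, 501),
                         weights=[r ** -1.0 for r in range(1, 501)], k=5000)
sizes = sample_sizes(len(toy_tokens), n_min=1000, steps=5)
empirical = empirical_curve(toy_tokens, sizes, type_count, rng)
simulated = simulated_curve(sizes, alpha=1.0, n_ranks=500,
                            measure=type_count, rng=rng)
```

Repeating this for every measure gives, per language, one empirical and one simulated value for each sample size, which is the table the comparison relies on.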

Expectations

Go beyond simple parametric models, and determine:

- whether individual measures differ across languages
- to what extent
- the influence of N

From the power law fit we see that the power law distribution is reasonable for some N. We therefore expect that:

- empirical measures should correspond to simulated measures for certain values of N
- we can identify an "optimal" N for which the measures correspond to the theoretical distribution.
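One simple way to make the idea of an "optimal" N concrete is to pick the sample size at which an empirical measure is closest to its simulated counterpart. The slides do not say how correspondence is judged, so the relative-gap criterion in this Python sketch is an assumption, the function names are hypothetical, and the numbers are made up purely to show the call.

```python
def relative_gap(empirical, simulated):
    """Absolute difference relative to the simulated value, per sample size.

    Assumes every simulated value is non-zero.
    """
    return [abs(e - s) / abs(s) for e, s in zip(empirical, simulated)]


def best_matching_size(sizes, empirical, simulated):
    """Return (size, gap) where the empirical and simulated measure are closest."""
    gaps = relative_gap(empirical, simulated)
    best = min(range(len(sizes)), key=gaps.__getitem__)
    return sizes[best], gaps[best]


# Made-up values of one measure at four sample sizes.
sizes = [1000, 2000, 3000, 4000]
empirical = [310.0, 520.0, 690.0, 905.0]
simulated = [350.0, 530.0, 700.0, 830.0]
best_size, best_gap = best_matching_size(sizes, empirical, simulated)
```

A single measure can be matched by chance at some sample size, so a criterion of this kind is more informative when it is checked across many measures at once.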

Results

[Figure: Afrikaans. S(n) plotted against n on logarithmic axes, with fitted Power Law, Log-Normal, Poisson and Exponential curves.]

Individual Measures

Explanation

But why?

Assuming the analysis was done correctly, this discrepancy between the power law fit and the individual measures could make sense:

- Expected: while Zipf's law seems to be a good approximation of the distribution as a whole, it fails to deliver in many respects when you zoom in
- Data, reasoning, or algorithmic errors

Thank you

Damián E. Blasi

University of the Free State: Michael J. von Maltitz, Sean van der Merwe

Leipzig University: Peter Stadler, Nancy Retzlaff, Sarah Berkemer

Funding

South African National Research Foundation: Knowledge, Interchange and Collaboration Grant
