www.linguistik.fau.de | www.stefan-evert.de
Making sense of multivariate analyses of linguistic variation
Stefan Evert
Computational Corpus Linguistics Group | Professur für Korpuslinguistik
Multidimensional analysis (Biber 1988)
§ 481 texts, 67 lexico-grammatical features
§ unsupervised FA
§ validation: separation of “known” genre categories
Problems
§ choice of features & texts
§ interpretation of FA weights
Biber, Douglas (1988). Variation Across Speech and Writing. Cambridge University Press, Cambridge.
Diwersy, Sascha; Evert, Stefan; Neumann, Stella (2014). A weakly supervised multivariate approach to the study of language variation. In: Aggregating Dialectology, Typology, and Register Analysis. Linguistic Variation in Text and Speech, pages 174–204. De Gruyter, Berlin, Boston.
Evert, Stefan; Neumann, Stella (2017). The impact of translation direction on characteristics of translated texts. A multivariate analysis for English and German. In: Empirical Translation Studies. New Theoretical and Methodological Traditions, TiLSM 300, pages 47–80. Mouton de Gruyter, Berlin.
Evert, Stefan; Proisl, Thomas; Jannidis, Fotis; Pielström, Steffen; Schöch, Christof; Vitt, Thorsten (2015). Towards a better understanding of Burrows's Delta in literary authorship attribution. In: Proceedings of the Fourth Workshop on Computational Linguistics for Literature, pages 79–88, Denver, CO.
Evert, Stefan; Proisl, Thomas; Jannidis, Fotis; Reger, Isabella; Pielström, Steffen; Schöch, Christof; Vitt, Thorsten (2017). Understanding and explaining Delta measures for authorship attribution. Digital Scholarship in the Humanities. Advance access https://doi.org/10.1093/llc/fqw046.
Case study II: Evidence for shining-through in translations
minimally supervised PCA (linear discriminant analysis)
§ 298 texts from CroCo corpus (78× EN➞DE, 71× DE➞EN)
§ 27 features grounded in SFL
§ LDA for DE vs. EN originals
§ position of translations ➞ evidence for shining-through
Problems
§ interpretation of LDA weights
§ are weights stable or do they depend on choice of texts?
§ is our selection of features crucial to the results?
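The LDA step of this case study can be sketched as follows. This is a minimal illustration, not the analysis from Evert & Neumann (2017): it assumes a plain two-class Fisher discriminant, uses only two made-up features instead of the 27 SFL-based ones, and all data values and helper names (`lda_axis`, `score`) are hypothetical. The discriminant is fitted on originals only; a translation is then projected onto the axis.

```python
def mean_vec(rows):
    """Column-wise mean of a list of feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def within_scatter(groups):
    """Pooled within-group scatter matrix S_W."""
    d = len(groups[0][0])
    S = [[0.0] * d for _ in range(d)]
    for rows in groups:
        mu = mean_vec(rows)
        for r in rows:
            c = [x - m for x, m in zip(r, mu)]
            for i in range(d):
                for j in range(d):
                    S[i][j] += c[i] * c[j]
    return S

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def lda_axis(de_orig, en_orig):
    """Unit-length Fisher discriminant w ~ S_W^-1 (mu_EN - mu_DE); EN is positive."""
    S = within_scatter([de_orig, en_orig])
    diff = [e - d for d, e in zip(mean_vec(de_orig), mean_vec(en_orig))]
    w = solve(S, diff)
    norm = sum(v * v for v in w) ** 0.5
    return [v / norm for v in w]

def score(w, x, center):
    """Position of text x on the discriminant axis, relative to center."""
    return sum(wi * (xi - ci) for wi, xi, ci in zip(w, x, center))

# Hypothetical standardized feature values (2 features per text)
de_orig = [[-1.0, 0.2], [-1.2, -0.1], [-0.8, 0.1], [-1.1, -0.3]]
en_orig = [[1.0, 0.1], [0.9, -0.2], [1.2, 0.3], [1.1, -0.1]]

w = lda_axis(de_orig, en_orig)
center = [(d + e) / 2 for d, e in zip(mean_vec(de_orig), mean_vec(en_orig))]
de_scores = [score(w, x, center) for x in de_orig]
en_scores = [score(w, x, center) for x in en_orig]

# A hypothetical German translation of an English source text
trans_score = score(w, [-0.4, 0.0], center)
print(de_scores, en_scores, trans_score)
```

In this toy setting the translation lies on the DE side of the axis but is shifted towards EN relative to the DE originals, which is the kind of pattern the poster reads as shining-through.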
Case study I: Authorship attribution with Burrows’s Delta
unsupervised clustering
§ 25 authors × 3 novels for EN, DE, FR
§ 200 – 5000 features
§ Ward clustering / PAM
Problems
§ only 75 texts
§ how & why does ΔB work so well?

$$\Delta_B(D_1, D_2) = \sum_{i=1}^{n_w} \left| z_i(D_1) - z_i(D_2) \right|$$
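Burrows's Delta as defined above (sum of absolute z-score differences over the n_w most frequent words) can be sketched in a few lines. The frequency values below are invented for illustration, and the use of the sample standard deviation for the z-scores is an assumption, not something the poster specifies.

```python
from statistics import mean, stdev

def zscores(rel_freqs):
    """Standardize each word (column) across all texts (rows)."""
    columns = list(zip(*rel_freqs))
    mus = [mean(c) for c in columns]
    sds = [stdev(c) for c in columns]  # sample s.d. (an assumption)
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, mus, sds)]
        for row in rel_freqs
    ]

def burrows_delta(z1, z2):
    """Sum of absolute z-score differences between two texts."""
    return sum(abs(a - b) for a, b in zip(z1, z2))

# Hypothetical relative frequencies (per 1000 tokens) of 4 words in 3 texts
freqs = [
    [60.0, 30.0, 25.0, 10.0],  # text A1
    [58.0, 31.0, 24.0, 11.0],  # text A2 (same hypothetical author as A1)
    [45.0, 40.0, 15.0, 20.0],  # text B1 (different hypothetical author)
]
z = zscores(freqs)
d_same = burrows_delta(z[0], z[1])
d_diff = burrows_delta(z[0], z[2])
print(f"same author: {d_same:.2f}, different authors: {d_diff:.2f}")
```

The resulting distance matrix would then be passed to a clustering procedure such as Ward clustering or PAM.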
[Figure: dendrogram, Ward clustering (English, z-scores, BD, n=1000); height axis 0–2500; one leaf per novel, labeled "author: title"]
[Figure: English Corpus | L2 normalization | PAM clustering. x-axis: number of mfw (10 to 10000); y-axis: adjusted Rand index (%), 0 to 100. Legend: Cosine Delta, L1 2−Delta, Burrows (L1) Delta, Quadratic (L2) Delta, L4−Delta. Evert et al. (2015, 2017)]
[Figure: distribution of the 27 features; y-axis: z-score = standardized relative frequency (−5 to 5). Features: nn / T, adja / T, nominal / T, finites / S, past / F, passive / V, modals / V, imperatives / S, interrogatives / S, coordination / T, subordination / T, pronouns / T, place adv / T, time adv / T, adv theme / TH, text theme / TH, obj theme / TH, verb theme / TH, subj theme / TH, prep / T, modal adv / T, contractions / T, colloquialism / T, titles / T, lexical density, lexical TTR, token / S]
[Figure: density of discriminant scores; x-axis: discriminant score (−4 to 4); y-axis: density (0.0 to 0.8). Groups: DE: orig, DE: trans, EN: orig, EN: trans. Diwersy et al. (2014); Evert & Neumann (2017); www.stefan-evert.de/PUB/EvertNeumann2017/]
[Figure legend: DE, EN; orig, trans; axis −0.3 to 0.3]
[Figure: standardized z-scores | L2 normalization; C. Brontë: Jane Eyre, C. Brontë: Shirley]
Interpretation of dimension weights
§ standard approach based on magnitude and sign of weights (EN on positive side of axis)
§ interprets features as correlated rather than complementary
§ better approach: what does each feature contribute to the LDA positions of texts?
§ reveals entirely different patterns
§ correlated features help LDA to reduce within-group variance
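One way to read "what does each feature contribute to the LDA positions of texts" is to split the axis score of a text, the sum of w_i · x_i over features, into its per-feature terms and average these within each group. The sketch below does this for invented weights and z-scores; the two features, their values and the helper names are hypothetical, and the poster's actual computation may differ.

```python
def contributions(w, x):
    """Per-feature contribution w_i * x_i to the axis score of one text."""
    return [wi * xi for wi, xi in zip(w, x)]

def mean_contributions(w, texts):
    """Average contribution of each feature across a group of texts."""
    per_text = [contributions(w, x) for x in texts]
    return [sum(col) / len(texts) for col in zip(*per_text)]

# Hypothetical discriminant weights (EN on the positive side of the axis)
w = [0.8, -0.6]

# Hypothetical z-scores: both features are more frequent in EN than in DE
de = [[-1.0, -0.5], [-1.2, -0.7]]
en = [[1.0, 0.5], [1.2, 0.7]]

de_contrib = mean_contributions(w, de)
en_contrib = mean_contributions(w, en)
print("DE:", de_contrib)
print("EN:", en_contrib)
```

Read by sign alone, the negative weight of the second feature would suggest a "German" feature, although in this toy data it is more frequent in EN. Its contribution pulls EN texts back towards the centre of the axis, which illustrates how correlated features can end up with opposite signs when LDA uses them to reduce within-group variance.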
[Figure: EN / DE discriminant; y-axis: normalized feature weights (−0.2 to 0.2); x-axis: the 27 features (nn_T, adja_T, nominal_T, …, lexical.density, lexical.TTR, token_S)]
[Figure: two panels over the 27 features. Upper panel: weight (−0.2 to 0.2). Lower panel: contribution to axis scores (−1 to 2), shown separately by group (DE, EN). Features prefixed with (−): adja / T, finites / S, past / F, passive / V, modals / V, imperatives / S, interrogatives / S, coordination / T, pronouns / T, obj theme / TH, modal adv / T]
[Figure title: DE / EN discriminant (original texts)]
What are the characteristic words?
§ supervised recursive feature elimination ➞ 233 words as features
§ not just mfw, but none unique to one author
§ with, so, t, But, And, upon, don, head, Then, looking, almost, indeed, nor, …, XXXVII (df=34), XLI (df=29), XLIII (df=26), hereabout (df=11), vilest (df=15), contours (df=9), Ecod (df=4), …
§ validation for DE: new novels from same authors: 97% accuracy
Work in progress
§ contribution of features to silhouette width of clustering
§ assess relevance to each author
§ identify features responsible for mis-classifications
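The silhouette width mentioned under work in progress is the standard quantity s_i = (b_i − a_i) / max(a_i, b_i), where a_i is the mean distance of text i to the other members of its own cluster and b_i the smallest mean distance to any other cluster. A minimal sketch, assuming Manhattan distance (as in Burrows's Delta) and toy 2-dimensional points; how individual features would be attributed a share of s_i is not shown here, since the poster does not describe it.

```python
def manhattan(a, b):
    """L1 distance between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def silhouette_widths(points, labels, dist=manhattan):
    """Silhouette width s_i for every point; singletons get 0 by convention."""
    clusters = sorted(set(labels))
    widths = []
    for i, (p, li) in enumerate(zip(points, labels)):
        own = [dist(p, q) for j, (q, lj) in enumerate(zip(points, labels))
               if lj == li and j != i]
        if not own:
            widths.append(0.0)  # singleton cluster
            continue
        a = sum(own) / len(own)
        b = min(
            sum(dist(p, q) for q, lj in zip(points, labels) if lj == c)
            / labels.count(c)
            for c in clusters if c != li
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths

# Toy data: two tight clusters and one singleton (hypothetical values)
points = [[0, 0], [0, 1], [5, 5], [5, 6], [10, 0]]
labels = [0, 0, 1, 1, 2]
s = silhouette_widths(points, labels)
print([round(v, 3) for v in s])
```

Averaging s_i within each cluster gives per-cluster values of the kind annotated in a silhouette plot.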
[Figure axis label: document frequency (# novels)]
[Figure: Silhouette widths (English, z-scores, BD, n=1000, Ward); x-axis: silhouette width s_i (0.0 to 1.0); n = 75 texts, 25 clusters C_j; one bar per novel, labeled "author: title"]
Cluster annotation, j : n_j | ave_{i∈C_j} s_i
1 : 3 | 0.33; 2 : 3 | 0.11; 3 : 3 | 0.51; 4 : 3 | 0.33; 5 : 3 | 0.24
6 : 3 | 0.24; 7 : 3 | 0.06; 8 : 3 | 0.16; 9 : 6 | 0.07; 10 : 3 | 0.17
11 : 3 | 0.28; 12 : 4 | 0.04; 13 : 3 | 0.06; 14 : 3 | 0.40; 15 : 3 | 0.10
16 : 2 | 0.08; 17 : 3 | 0.16; 18 : 3 | 0.11; 19 : 3 | 0.19; 20 : 3 | 0.18
21 : 3 | 0.22; 22 : 3 | 0.18; 23 : 3 | 0.18; 24 : 2 | 0.16; 25 : 1 | 0.00
Reliability of the clustering
§ bootstrapping texts not applicable to clustering & high-dimensional feature space
§ bootstrapping features ➞ unclear
§ biggest factor: choice of authors (empirical study on Gutenberg archive)
Bootstrapping latent dimensions
§ bootstrapping / cross-validation can be used to assess stability of LDA & PCA dimensions (applicable because of small # of features)
§ LDA axis “wobbles” by approx. 10° across folds
§ moderate variability of feature weights: σ < 0.05
§ but positions of texts on LDA axis are stable (r = .987)
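The stability check can be sketched as a small cross-validation loop: refit the discriminant with some texts held out, then compare each fold's axis with the full-data axis by angle, and compare text positions by Pearson correlation. The sketch assumes a two-feature Fisher discriminant with a closed-form 2×2 solve and leave-one-pair-out folds; the data, fold scheme and helper names are hypothetical and not those of the study.

```python
import math

def mean_vec(rows):
    """Column-wise mean of a list of 2-d points."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def lda_axis_2d(de, en):
    """Unit Fisher discriminant for two features; EN on the positive side."""
    sxx = sxy = syy = 0.0
    for rows in (de, en):
        mx, my = mean_vec(rows)
        for x, y in rows:
            sxx += (x - mx) ** 2
            sxy += (x - mx) * (y - my)
            syy += (y - my) ** 2
    (dmx, dmy), (emx, emy) = mean_vec(de), mean_vec(en)
    dx, dy = emx - dmx, emy - dmy
    det = sxx * syy - sxy * sxy
    wx = (syy * dx - sxy * dy) / det
    wy = (sxx * dy - sxy * dx) / det
    norm = math.hypot(wx, wy)
    return (wx / norm, wy / norm)

def angle_deg(u, v):
    """Angle between two unit vectors in degrees."""
    dot = max(-1.0, min(1.0, u[0] * v[0] + u[1] * v[1]))
    return math.degrees(math.acos(dot))

def pearson(a, b):
    """Pearson correlation of two equally long sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

# Hypothetical standardized feature values (2 features per text)
de = [(-1.0, 0.2), (-1.2, -0.1), (-0.8, 0.1), (-1.1, -0.3)]
en = [(1.0, 0.1), (0.9, -0.2), (1.2, 0.3), (1.1, -0.1)]

full = lda_axis_2d(de, en)
texts = de + en
full_pos = [full[0] * x + full[1] * y for x, y in texts]

angles, correlations = [], []
for k in range(len(de)):
    # hold out the k-th DE and the k-th EN text, refit on the rest
    fold = lda_axis_2d(de[:k] + de[k + 1:], en[:k] + en[k + 1:])
    angles.append(angle_deg(full, fold))
    fold_pos = [fold[0] * x + fold[1] * y for x, y in texts]
    correlations.append(pearson(full_pos, fold_pos))

print("angles (deg):", [round(a, 1) for a in angles])
print("position correlations:", [round(r, 3) for r in correlations])
```

Even when the axis direction shifts by several degrees between folds, the positions of texts along it can remain almost perfectly correlated, because the separation between groups dominates the projection.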