The Theory behind Keyword Analysis
Václav Cvrček
Workshop on Quantitative Text Analysis for SSH
Brown University
April 8, 2016
Introduction
Text interpretation
▶ at the core of the humanities’ mission
▶ our interpretation + other people’s interpretation
▶ interpretation with minimum amount of extra-textual information and intuition
▶ frame of reference, scheme, expectations, communicative norms…
▶ there is no objective interpretation – depends on point of view (recipient)
Language corpus
What is a corpus?
▶ sample of naturally occurring written texts or transcribed speeches
▶ stored electronically (searchable)
▶ basis for linguistic analysis and description
CADS = corpus-assisted discourse studies
“A Needle in a Haystack” (collaborative research project of Brown and Charles University)
▶ how does language reflect the changing nature of society?
▶ how different are the interpretations of the contemporary and the historical reader?
▶ how can we test the limits of the corpus-based quantitative analysis of text?
http://brown.edu/research/projects/needle-in-haystack/
Prominent items
Interpretation and prominence
How do we start with interpretation?
▶ what is striking in a text?
▶ topics, motives, themes – expressed by words
▶ interaction between words, topics…
▶ function/meaning of words, topics…
▶ minimize researcher bias
Why don’t you simply count them?
Top 10 lemmas:
G. Orwell: 1984: the, be, of, a, and, to, he, it, have, in
SOTU 2009–2016: the, and, to, be, of, we, that, a, our, in
JRRT: Hobbit: the, and, be, of, to, a, he, have, in, they
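The simple counting approach above can be sketched in a few lines. This is only an illustration: it counts lowercased word forms, whereas the lists on the slide are lemmas, which would require a lemmatizer that is not shown here. The function name `top_words` and the sample text are hypothetical.

```python
import re
from collections import Counter


def top_words(text, n=10):
    """Return the n most frequent lowercased word forms with their counts."""
    # Letters only (Unicode-aware); digits and punctuation are dropped.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(tokens).most_common(n)


sample = "The war is peace. The freedom is slavery. The ignorance."
print(top_words(sample, 2))
```

As the slide suggests, such a list is dominated by function words for any text, which is why raw frequency alone says little about what a text is about.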
Thematic concentration
▶ content words with “abnormal” frequency
▶ Zipf’s word-frequency distribution
▶ h-point: rank = frequency
▶ h-point – approximately separates autosemantic and synsemantic branch
▶ content words above the h-point are TC words
▶ © Popescu & Altmann 2006
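A minimal sketch of the h-point idea follows. It makes two simplifying assumptions that are not from the slides: the h-point is taken as the last rank whose frequency is at least as large as the rank (the published definition also handles the case where no rank exactly equals its frequency), and content words are identified with a hypothetical stoplist `function_words` rather than by part-of-speech tagging.

```python
from collections import Counter


def h_point(frequencies):
    """Last rank r (1-based) at which the r-th highest frequency is >= r."""
    h = 0
    for rank, freq in enumerate(sorted(frequencies, reverse=True), start=1):
        if freq >= rank:
            h = rank
        else:
            break
    return h


def thematic_words(tokens, function_words):
    """Content words ranked at or above the h-point (simplified TC words)."""
    ranked = Counter(tokens).most_common()
    h = h_point([freq for _, freq in ranked])
    return [word for word, _ in ranked[:h] if word not in function_words]


# Toy token list, invented for illustration.
tokens = ["the"] * 5 + ["party"] * 4 + ["of"] * 3 + ["war"] * 2 + ["eye"]
print(thematic_words(tokens, function_words={"the", "of"}))
```

Note that nothing but the text itself enters the computation, which is the "no reference corpus required" property discussed on the following slide.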
Thematic concentration
TC words
1984: Winston, say, know, Party, face, word, O’Brien, seem, look, never, think, moment, always, hand, year, way, long, now, eye, day, possible, war…
SOTU: year, job, work, America, new, people, american, know, need, help, country, business, time, world, economy, family, right, tax, Congress, nation…
Hobbit: come, go, Bilbo, see, dwarves, time, long, make, think, great, good, know, far, still, goblin, find, way, look, little, light…
TC: discussion
Pros and cons of TC words
+ objective – based on frequency distribution
+ text analytical applications: comparing texts according to their thematic compactness
+ no reference corpus required
- set of TC words is invariant
- “interpretation without interpreter” – interpretation always depends on the point of view
Keyword analysis
Keywords and KWA
Keywords
▶ homonymous term¹
▶ words with higher relative frequency in a text
▶ based on comparison with reference corpus
▶ significance testing: χ² test, log-likelihood (G) test, Fisher test
A word-form which recurs within the text in question will be more likely to be key in it. (Scott–Tribble 2006)
¹ For other meanings see e.g. Williams 1976 or Wierzbicka 1997.
Obtaining keywords – algorithm
Procedure
▶ count the frequency of each word – most frequent words are the, of, was…
▶ compare it with the frequency of the same word in a reference corpus
▶ use statistical tests (χ², log-likelihood or Fisher) to find out if the difference is significant
▶ interpret top X most prominent keywords
Keywords: words in a text or corpus that are statistically significantly more frequent than would be expected by chance when compared to a corpus which is larger or of equal size.
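The procedure above can be sketched as follows, using the log-likelihood (G) statistic as the test. This is an illustrative sketch, not the implementation of any particular tool: the function names are hypothetical, the statistic is computed over the full 2×2 table (word vs. all other words, target vs. reference), and the default threshold 3.84 is the χ² critical value for one degree of freedom at p < 0.05, chosen here only as an example.

```python
import math

CRITICAL_G2 = 3.84  # chi-square critical value, df = 1, p < 0.05


def log_likelihood(a, b, n_target, n_ref):
    """G2 for a word with frequency a in the target (size n_target)
    and frequency b in the reference corpus (size n_ref)."""
    observed = [a, b, n_target - a, n_ref - b]
    total = n_target + n_ref
    word_total = a + b
    other_total = total - word_total
    expected = [
        word_total * n_target / total,
        word_total * n_ref / total,
        other_total * n_target / total,
        other_total * n_ref / total,
    ]
    # Cells with an observed count of 0 contribute nothing to the sum.
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


def keywords(target_counts, ref_counts, threshold=CRITICAL_G2):
    """Words relatively more frequent in the target whose G2 reaches the threshold."""
    n_target = sum(target_counts.values())
    n_ref = sum(ref_counts.values())
    result = {}
    for word, a in target_counts.items():
        b = ref_counts.get(word, 0)
        if a / n_target <= b / n_ref:
            continue  # not over-represented in the target text
        g2 = log_likelihood(a, b, n_target, n_ref)
        if g2 >= threshold:
            result[word] = g2
    return result


# Toy frequency lists, invented for illustration.
target = {"the": 60, "party": 40}
reference = {"the": 6000, "party": 10, "other": 3990}
print(sorted(keywords(target, reference)))
```

In the toy data "the" is very frequent but equally frequent in both corpora, so it is not returned; "party" is rarer in absolute terms but strongly over-represented in the target.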
Note on significance and effect size
Gabrielatos, C. & Marchi, A. (2012): there is a difference between (statistical) significance and (linguistic) relevance (effect size)
Metrics used to calculate keyness
▶ significance – level of certainty we have that the difference exists (N.B. χ² test is “asymptotically true”)
▶ relevance – importance of the difference (for interpretation)
▶ crucial for the top X approach:
  1. identification of KWs – statistical tests
  2. ranking of KWs – task for a different metric
DIN coefficient
Variation on the Sørensen–Dice coefficient²:
\[
\mathrm{DIN} = 100 \times \frac{\mathrm{RelFq}(\mathrm{Target}) - \mathrm{RelFq}(\mathrm{Reference})}{\mathrm{RelFq}(\mathrm{Target}) + \mathrm{RelFq}(\mathrm{Reference})}
\]
▶ values of DIN
  ▶ -100 (= when a word is present only in the RefC)
  ▶ 0 (= when a word occurs equally in target and RefC)
  ▶ 100 (= when a word is present only in the target corpus)
▶ represents the proportion of the difference of relative frequencies to their mean (× 50)
▶ identical value of DIN for words appearing in a target text only (!)
▶ useful for ranking of KWs (not for their identification!)
² cf. Hofland–Johansson (1982).
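The DIN formula translates directly into code. The sketch below assumes both arguments are relative frequencies on the same scale (e.g. per million words); the function name `din` and the guard against a word absent from both corpora are illustrative choices, not part of the slides.

```python
def din(rel_target, rel_reference):
    """DIN = 100 * (RelFq(Target) - RelFq(Reference)) / (RelFq(Target) + RelFq(Reference))."""
    total = rel_target + rel_reference
    if total == 0:
        raise ValueError("word occurs in neither the target nor the reference corpus")
    return 100 * (rel_target - rel_reference) / total


print(din(5.0, 0.0))  # word only in the target
print(din(0.0, 5.0))  # word only in the reference corpus
print(din(2.0, 2.0))  # equal relative frequency
print(din(3.0, 1.0))  # three times more frequent in the target
```

The first call shows the caveat from the slide: every word found only in the target gets the maximum value regardless of how often it occurs, so DIN is suited to ranking words that a significance test has already identified, not to identifying them.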
Example 1: Grammatical words
Keywords from all Husák’s New Year’s Addresses
[Figure: scatter plot “All NYA (gram. words highlighted)”; x-axis: Dice (rank), y-axis: Log-likelihood (rank), both from 0 to 600; points are labelled with the Czech keywords, with grammatical words highlighted.]
Tools for KWA
KWords
http://kwords.korpus.cz
KWords: analysed text
KWords: list of keywords
Two types of prominent units: keywords and thematic concentration
Dispersion plot
SOTU 2016: terrorism × economy
Keyword links
Keyword links according to the size of the window
Distant KW links = KWs appearing in distant context, (-15;-5) and (5;15); these KW links indicate that the themes represented by these KWs may form a discourse-semantic network
Immediate KW links (multi-word KWs) = co-occurrence of two or more KWs within immediate or near context, (-2;2); these adjacent KWs may signal a multi-word KW unit (e.g. American people, better politics)
custom = arbitrary size of the span
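The window-based definition of keyword links can be illustrated with a short sketch. It is not the KWords implementation: the function name `keyword_links`, the treatment of links as unordered pairs, and the exclusion of a keyword linking to itself are assumptions made for this example. The window is given as a minimum and maximum token distance, so (1, 2) corresponds to the immediate span (-2;2) and (5, 15) to the distant spans.

```python
from collections import Counter


def keyword_links(tokens, keywords, min_dist, max_dist):
    """Count unordered pairs of different keywords whose token distance
    lies between min_dist and max_dist (inclusive)."""
    positions = [(i, tok) for i, tok in enumerate(tokens) if tok in keywords]
    links = Counter()
    for idx, (i, first) in enumerate(positions):
        for j, second in positions[idx + 1:]:
            distance = j - i
            if distance > max_dist:
                break  # positions are sorted, later ones are even farther
            if distance >= min_dist and first != second:
                links[tuple(sorted((first, second)))] += 1
    return links


# Toy sentence and keyword set, invented for illustration.
tokens = "the american people want better politics for the american economy".split()
kws = {"american", "people", "better", "politics", "economy"}

immediate = keyword_links(tokens, kws, 1, 2)
distant = keyword_links(tokens, kws, 5, 15)
print(immediate)
print(distant)
```

A custom span is simply another pair of distances passed to the same function.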
[Figure: keyword links visualization; legible labels include isil, terrorists, americans, america, leadership, democracy, climate, economy, retirement, businesses, politics, citizens, nation, jobs, tonight, congress, families, opportunity, energy, security, military, future, country, world, people.]
Comparison
For comparing texts – time series (SOTU)
State of the Union
Obama’s State of the Union Address
Eight addresses (2009–2016)
Year    2009  2010  2011  2012  2013  2014  2015  2016
Tokens  6346  8024  7611  7743  7427  7506  7479  6671
Types   1347  1531  1505  1555  1561  1598  1521  1422
source: http://www.whitehouse.gov
Average length N = 7351
Average vocabulary V(N) = 1505
Reference corpus: British National Corpus (BNC)
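The two averages follow directly from the token and type counts of the eight addresses; a short check, with the figures copied from the slide and variable names chosen for this example:

```python
# Token and type counts per address, 2009-2016, as given on the slide.
tokens = [6346, 8024, 7611, 7743, 7427, 7506, 7479, 6671]
types = [1347, 1531, 1505, 1555, 1561, 1598, 1521, 1422]

average_length = sum(tokens) / len(tokens)      # average N
average_vocabulary = sum(types) / len(types)    # average V(N)

print(round(average_length), round(average_vocabulary))
```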
Permanent KWs (Key KWs)
KWs appearing in all eight addresses: america, american, americans, businesses, congress, country, economy, jobs, nation, tonight
KWs appearing in six or seven addresses: energy, every, let, more, new, people, democrats, families, make, millions, republicans, tax, why
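Given one keyword list per address, permanent keywords are those found in every list. A minimal sketch, assuming the keyword sets have already been computed; the function names and the three-year toy data are hypothetical and do not reproduce the actual SOTU keyword lists.

```python
from collections import Counter


def keyword_persistence(keywords_by_text):
    """For each keyword, the number of texts in which it is a keyword."""
    counts = Counter()
    for kws in keywords_by_text.values():
        counts.update(set(kws))
    return counts


def permanent_keywords(keywords_by_text):
    """Keywords that appear in the keyword list of every text."""
    n_texts = len(keywords_by_text)
    counts = keyword_persistence(keywords_by_text)
    return sorted(word for word, c in counts.items() if c == n_texts)


# Invented keyword sets for three years, for illustration only.
toy = {
    2009: {"america", "jobs", "energy"},
    2010: {"america", "jobs"},
    2011: {"america", "jobs", "energy"},
}
print(permanent_keywords(toy))
print(keyword_persistence(toy)["energy"])
```

Lowering the threshold from "all texts" to "six or seven of eight" yields the second group on the slide.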
SOTU: Topics – economy/politics
Reference corpus in KWA
What does the reference corpus affect?
size: bigger reference corpus ⇒ more KWs
composition: different reference corpora represent different readers (conceptualized reader)
  ▶ balanced corpus ∼ general reader
  ▶ specialized corpus ∼ specific reader (e.g. from the past, with specific background…)
Different readers = different interpretations
Contrastive KWA
▶ different RefCs: interferences – time, style, topic differences
▶ New Year’s Addresses of the last communist president of Czechoslovakia, Gustáv Husák (1975–1989)
  ▶ contemporary reader (SYN2010) × reader from the past (Totalita)
▶ State of the Union addresses of Barack Obama (2009–2016)
  ▶ general reader (BNC) × politician/expert (rest of Obama’s speeches)
Husák: Influence of the reference corpora
What happens if we compare texts to different RefCs?
▶ the inventory of KWs does not differ substantially
▶ the difference is in ranking (prominence of KWs – DIN)
Historical reader (Totalita)
→ genre differences
▶ Modal verbs: want, can
▶ Verbs: 1st person sg./pl.
Contemporary reader (SYN2010)
→ connected with historical events
▶ ideology
▶ archaisms, historicisms
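The effect described above, the same keyword inventory but a different ranking, can be illustrated by scoring one set of words against two reference corpora with DIN. All frequencies below are invented for illustration and are not taken from SYN2010 or Totalita; the function names are hypothetical, and English glosses stand in for the Czech word forms.

```python
def din(rel_target, rel_reference):
    """DIN coefficient from two relative frequencies on the same scale."""
    return 100 * (rel_target - rel_reference) / (rel_target + rel_reference)


def rank_keywords(target_rel, reference_rel):
    """Keywords ordered from highest to lowest DIN against one reference corpus."""
    scores = {w: din(f, reference_rel.get(w, 0.0)) for w, f in target_rel.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Invented relative frequencies (per million words), for illustration only.
target = {"socialist": 500.0, "want": 300.0, "peace": 400.0}
reference_past = {"socialist": 450.0, "want": 30.0, "peace": 200.0}
reference_now = {"socialist": 5.0, "want": 100.0, "peace": 50.0}

ranking_past = rank_keywords(target, reference_past)
ranking_now = rank_keywords(target, reference_now)
print(ranking_past)
print(ranking_now)
```

In this toy setup an ideological term is unremarkable against a reference corpus from the same period but rises to the top against a contemporary one, while the set of words being ranked stays the same.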
Detailed comparison: 3 thematic groups
Cold war: mír, míru, mírová, mírové, mírového, mírovému, mírovou, mírový, mírových, mírovými, mírumilovné, mírumilovných, mírumilovným; napětí; odzbrojení, výzbroje, zbrojení, zbrojením, ozbrojených
(forms of "peace", "peaceful", "peace-loving"; "tension"; "disarmament", "armaments", "arming", "armed")
Collective possession: náš, naše, našeho, našem, našemu, naši, naší, našich, našim, naším, našimi
(forms of "our")
Ideological markers: socialismu, socialismus, socialistická, socialistické, socialistického, socialistickém, socialistickému, socialistickou, socialistický, socialistických, socialistickým, socialistickými; komunismu, komunisté, komunistů, ksč, komunistům, komunisty, komunistická, komunistické, komunistickým
(forms of "socialism", "socialist"; "communism", "communist(s)"; KSČ = Communist Party of Czechoslovakia)
Cold war
[Figure: Cold War KWs in SYN-KWA and TOT-KWA. x-axis: Year (1975–1989); y-axis: DIN (ticks 40–100); lines: SYN-KWA, TOT-KWA. Source: Fidler–Cvrček (2015)]
Collective possession
[Figure: KWs "our" in SYN-KWA and TOT-KWA. x-axis: Year (1975–1989); y-axis: DIN (ticks 65–95); lines: SYN-KWA, TOT-KWA. Source: Fidler–Cvrček (2015)]
Ideological markers
[Figure: Ideological markers KWs in SYN-KWA and TOT-KWA. x-axis: Year (1975–1989); y-axis: DIN (ticks 30–100); lines: SYN-KWA, TOT-KWA. Source: Fidler–Cvrček (2015)]
Reader from the past × contemporary reader
Totalita
1. style & genre: fellow citizens, friends
2. propaganda: blossom, succeed
3. 1st pers. sg. (greet, wish)
SYN2010
1. ideology-related: comrade(s), socialist, five-year plan
2. period-specific: brotherly, liberating, feverish, dutiful, imperialistic
Difference in sensitivity
▶ lines for both readers have similar tendencies
▶ contemporary reader has higher overall level of prominence (DIN)
▶ tendencies are more visible for reader from the past
▶ contemporary reader cannot distinguish subtle changes (overwhelmed by unusual lexicon)
▶ astute reader from the past might notice slight shifts in the discourse
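One possible reading of the points above, offered as an assumption rather than as the author's stated explanation: DIN is bounded at 100, so a word that is nearly absent from the reference corpus already sits close to the ceiling and has little room to move. The sketch below assumes DIN = 100 × (f_txt − f_ref) / (f_txt + f_ref) and uses invented frequencies for a single ideological word.

```python
def din(f_txt: float, f_ref: float) -> float:
    """Assumed difference index: 100 * (f_txt - f_ref) / (f_txt + f_ref)."""
    return 100.0 * (f_txt - f_ref) / (f_txt + f_ref)


# Hypothetical relative frequencies (per million) of one word.
ref_past = 300.0  # fairly common in texts of the period (Totalita-like RefC)
ref_now = 5.0     # rare in contemporary texts (SYN2010-like RefC)

# The word doubles in frequency between an earlier and a later speech.
early, late = 500.0, 1000.0

shift_past = din(late, ref_past) - din(early, ref_past)
shift_now = din(late, ref_now) - din(early, ref_now)

print(f"reader from the past: {din(early, ref_past):.1f} -> {din(late, ref_past):.1f}")
print(f"contemporary reader:  {din(early, ref_now):.1f} -> {din(late, ref_now):.1f}")
print(f"shift seen from the past: {shift_past:.1f}, seen today: {shift_now:.1f}")
```

Under these assumptions the contemporary reader sees a higher DIN throughout, yet the doubling barely registers, while the reader from the past sees a clear rise.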
2015 SOTU: General vs. specialist’s view
Rank  General view     Specialist’s view
1     rebekah          ben
2     bipartisan       economics
3     hardworking      childcare
4     loopholes        rebekah
5     childcare        west
6     republicans      believed
7     folks            striving
8     veterans         tools
9     diplomacy        sick
10    striving         spread
11    terrorists       write
12    americans        paid
13    infrastructure   surely
14    fastest          issues
15    democrats        scientists
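The two rankings can be compared mechanically. The sketch below takes the two top-15 lists from the slide and reports which keywords both readers share and at which rank. The variable names are illustrative, and the assignment of the first list to the general view and the second to the specialist's view follows the slide title and is an assumption.

```python
# Top-15 keywords as listed on the slide (rank = position in the list).
general = [
    "rebekah", "bipartisan", "hardworking", "loopholes", "childcare",
    "republicans", "folks", "veterans", "diplomacy", "striving",
    "terrorists", "americans", "infrastructure", "fastest", "democrats",
]
specialist = [
    "ben", "economics", "childcare", "rebekah", "west",
    "believed", "striving", "tools", "sick", "spread",
    "write", "paid", "surely", "issues", "scientists",
]

# Keywords present in both lists, kept in the order of the general view.
shared = [w for w in general if w in specialist]
only_general = [w for w in general if w not in specialist]
only_specialist = [w for w in specialist if w not in general]

for w in shared:
    print(f"{w}: general rank {general.index(w) + 1}, "
          f"specialist rank {specialist.index(w) + 1}")
print("only general:   ", only_general)
print("only specialist:", only_specialist)
```

Only three of the fifteen keywords (rebekah, childcare, striving) appear in both lists, which shows how strongly the choice of reference corpus shapes the top of the ranking.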
2015 SOTU: Differences
Specialist’s view:
Rhetoric/style: believed, write, thing, tools
Important topics: economics, leave, paid, childcare
General view:
Political speeches: bipartisan, republicans, folks, veterans, diplomacy, terrorists, americans, infrastructure, democrats, combat, sanctions
Topical words: hardworking, loopholes, childcare, striving, fastest
References
▶ Cvrček, V. – Vondřička, P. (2013): KWords. FF UK. Praha. Available at: <http://kwords.korpus.cz>.
▶ Fidler, M. – Cvrček, V. (2015): A Data-Driven Analysis of Reader Viewpoints: Reconstructing the Historical Reader Using Keyword Analysis. Journal of Slavic Linguistics 23(2), p. 197–239.
▶ Gabrielatos, C. – Marchi, A. (2012): Keyness: appropriate metrics and practical issues. Paper presented at the CADS International Conference, Bologna, Italy, September 2012 (www.gabrielatos.com/Presentations.htm).
▶ Hofland, K. – Johansson, S. (1982): Word frequencies in British and American English. Bergen: The Norwegian Computing Centre for the Humanities.
▶ Popescu, I. – Altmann, G. (2006): Some Aspects of Word Frequencies. Glottometrics 13, p. 23–46.
▶ Scott, M. – Tribble, C. (2006): Textual patterns: Keyword and corpus analysis in language education. Amsterdam: Benjamins.
▶ Wierzbicka, A. (1997): Understanding cultures through their keywords. English, Russian, Polish, German, and Japanese. Oxford, New York: Oxford UP.
▶ Williams, R. (1976/85): Keywords: A vocabulary of culture and society. New York: Oxford UP.
Thank you for your attention!