The Theory behind Keyword Analysis

Václav Cvrček
Workshop on Quantitative Text Analysis for SSH
Brown University, April 8, 2016

Introduction

Text interpretation

▶ at the core of the humanities’ mission
▶ our interpretation + other people’s interpretation
▶ interpretation with minimum amount of extra-textual information and intuition
▶ frame of reference, scheme, expectations, communicative norms…
▶ there is no objective interpretation – depends on point of view (recipient)

Language corpus

What is a corpus?

▶ sample of naturally occurring written texts or transcribed speeches
▶ stored electronically (searchable)
▶ basis for linguistic analysis and description

CADS = corpus-assisted discourse studies

“A Needle in a Haystack” (collaborative research project of Brown and Charles University)

▶ how does language reflect the changing nature of society?
▶ how different are the interpretations of the contemporary and the historical reader?
▶ how can we test the limits of corpus-based quantitative analysis of text?

http://brown.edu/research/projects/needle-in-haystack/

Prominent items

Interpretation and prominence

How do we start with interpretation?

▶ what is striking in a text?
▶ topics, motifs, themes – expressed by words
▶ interaction between words, topics…
▶ function/meaning of words, topics…
▶ minimize researcher bias

Why don’t you simply count them?

Top 10 lemmas:

G. Orwell: 1984 – the, be, of, a, and, to, he, it, have, in
SOTU 2009–2016 – the, and, to, be, of, we, that, a, our, in
JRRT: Hobbit – the, and, be, of, to, a, he, have, in, they
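The point of the slide can be reproduced with a few lines of Python. The following is only a minimal sketch: it counts lowercased word forms rather than lemmas (no lemmatizer is assumed), the function name `top_words` is hypothetical, and the sample sentence is invented for illustration.

```python
import re
from collections import Counter

def top_words(text, n=10):
    """Return the n most frequent lowercased word forms in text."""
    # Runs of letters, optionally followed by an apostrophe and more letters.
    tokens = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)?", text.lower())
    return Counter(tokens).most_common(n)

# Invented toy sentence, not taken from any of the three texts.
sample = ("The party told the people that the war was the peace "
          "and that the past was the future of the party.")
print(top_words(sample, 1))
```

Even in this toy input the top of the list is a function word, which is why raw counts alone say little about what a text is about.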

Thematic concentration

▶ content words with “abnormal” frequency
▶ Zipf’s word-frequency distribution
▶ h-point: rank = frequency
▶ h-point – approximately separates autosemantic and synsemantic branch
▶ content words above the h-point are TC words
▶ © Popescu & Altmann 2006
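The h-point idea can be sketched in Python as follows. This is a simplified illustration, not the published procedure: the h-point is approximated as the largest rank whose frequency is at least as large as the rank (the original definition may interpolate when no rank equals its frequency exactly), and the content/function word distinction is faked with a tiny, hypothetical stop list. The names `h_point`, `tc_words` and `FUNCTION_WORDS` are assumptions made for this example.

```python
from collections import Counter

# Hypothetical, tiny function-word list; a real analysis would rely on
# part-of-speech tags or a much fuller list of synsemantic words.
FUNCTION_WORDS = {"the", "be", "of", "a", "and", "to", "he", "it", "have", "in"}

def h_point(freqs_desc):
    """Largest 1-based rank r with frequency >= r (0 for an empty list).

    freqs_desc must be sorted in descending order.
    """
    h = 0
    for rank, freq in enumerate(freqs_desc, start=1):
        if freq >= rank:
            h = rank
        else:
            break
    return h

def tc_words(lemmas, function_words=FUNCTION_WORDS):
    """Content words ranked at or above the (approximate) h-point."""
    ranked = Counter(lemmas).most_common()
    h = h_point([freq for _, freq in ranked])
    return [word for word, _ in ranked[:h] if word not in function_words]

# Invented toy frequency profile: 6, 5, 4, 3, 2, 1.
sample = (["the"] * 6 + ["be"] * 5 + ["Winston"] * 4
          + ["of"] * 3 + ["say"] * 2 + ["war"])
print(h_point([6, 5, 4, 3, 2, 1]), tc_words(sample))
```

In the toy profile only one content word is frequent enough to sit among the function words at the head of the distribution, which is the sense of "abnormal" frequency used on the slide.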

Thematic concentration

TC words

1984: Winston, say, know, Party, face, word, O’Brien, seem, look, never, think, moment, always, hand, year, way, long, now, eye, day, possible, war…

SOTU: year, job, work, America, new, people, american, know, need, help, country, business, time, world, economy, family, right, tax, Congress, nation…

Hobbit: come, go, Bilbo, see, dwarves, time, long, make, think, great, good, know, far, still, goblin, find, way, look, little, light…

TC: discussion

Pros and cons of TC words

+ objective – based on frequency distribution
+ text analytical applications: comparing texts according to their thematic compactness
+ no reference corpus required
- set of TC words is invariant
- “interpretation without interpreter” – interpretation always depends on the point of view

Keyword analysis

Keywords and KWA

Keywords

▶ homonymous term¹
▶ words with higher relative frequency in a text
▶ based on comparison with reference corpus
▶ significance testing: χ² test, log-likelihood (G) test, Fisher test

A word-form which recurs within the text in question will be more likely to be key in it. (Scott–Tribble 2006)

¹ For other meanings see e.g. Williams 1976 or Wierzbicka 1997.

Obtaining keywords – algorithm

Procedure

▶ count frequency of each word – most frequent words are the, of, was…
▶ compare it with the frequency of the same word in a reference corpus
▶ use statistical tests: χ², log-likelihood or Fisher to find out if the difference is significant
▶ interpret top X most prominent keywords

Keywords: Words which appear in a text or corpus that are statistically significantly more frequent than would be expected by chance when compared to a corpus which is larger or of equal size.
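The procedure above can be sketched in Python. This is an illustrative sketch under stated assumptions, not the implementation behind the slides: it uses the common two-term form of the log-likelihood statistic (the full contingency-table G also includes the cells for all other words), the threshold 3.84 is the χ² critical value for one degree of freedom at p < 0.05, and the function names and toy token lists are hypothetical. Both token lists are assumed to be non-empty.

```python
import math
from collections import Counter

def log_likelihood(a, b, n1, n2):
    """Two-term log-likelihood for a word occurring a times in a target
    of n1 tokens and b times in a reference corpus of n2 tokens."""
    e1 = n1 * (a + b) / (n1 + n2)   # expected frequency in the target
    e2 = n2 * (a + b) / (n1 + n2)   # expected frequency in the reference
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2

def keywords(target_tokens, reference_tokens, critical=3.84):
    """Words significantly over-represented in the target, best first."""
    tgt, ref = Counter(target_tokens), Counter(reference_tokens)
    n1, n2 = sum(tgt.values()), sum(ref.values())
    found = []
    for word, a in tgt.items():
        b = ref[word]                # Counter returns 0 for unseen words
        if a / n1 > b / n2:          # keep only higher relative frequency
            g2 = log_likelihood(a, b, n1, n2)
            if g2 > critical:
                found.append((word, g2))
    return sorted(found, key=lambda item: item[1], reverse=True)

# Invented toy data: "party" is 30% of the target but 1% of the reference.
target = ["party"] * 30 + ["the"] * 70
reference = ["party"] * 10 + ["the"] * 990
print(keywords(target, reference))
```

Note that "the" is the most frequent word in the toy target and still does not come out as a keyword, because it is even more frequent in the reference.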

Note on significance and effect size

Gabrielatos, C. & Marchi, A. (2012): there is a difference between (statistical) significance and (linguistic) relevance (effect size)

Metrics used to calculate keyness

▶ significance – level of certainty we have that the difference exists (N.B. χ² test is “asymptotically true”)
▶ relevance – importance of the difference (for interpretation)
▶ crucial for the top X approach:
  1. identification of KWs – statistical tests
  2. ranking of KWs – task for a different metric
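A small numerical illustration of why significance and effect size need to be kept apart, again as a sketch with invented numbers. It assumes the two-term log-likelihood statistic and the χ² critical value 3.84 (one degree of freedom, p < 0.05); `log_likelihood` is a hypothetical helper name. The same 1.2 : 1 ratio of relative frequencies is tested once on small samples and once on samples a hundred times larger.

```python
import math

def log_likelihood(a, b, n1, n2):
    """Two-term log-likelihood for frequencies a, b in corpora of n1, n2 tokens."""
    e1 = n1 * (a + b) / (n1 + n2)
    e2 = n2 * (a + b) / (n1 + n2)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2

CRITICAL = 3.84  # chi-square critical value, df = 1, p < 0.05

# Same relative frequencies, corpora 100 times larger in the second case.
g_small = log_likelihood(12, 10, 10_000, 10_000)
g_large = log_likelihood(1_200, 1_000, 1_000_000, 1_000_000)
ratio_small = (12 / 10_000) / (10 / 10_000)
ratio_large = (1_200 / 1_000_000) / (1_000 / 1_000_000)

print(g_small > CRITICAL, g_large > CRITICAL)  # significance changes
print(ratio_small, ratio_large)                # effect size does not
```

The test statistic grows with the amount of data while the size of the difference stays the same, so a significance test is suited to identifying keywords but not to ranking them.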

DIN coefficient

Variation on the Sørensen–Dice coefficient²:

\[
\mathrm{DIN} = 100 \times \frac{\mathrm{RelFq}(\mathrm{Target}) - \mathrm{RelFq}(\mathrm{Reference})}{\mathrm{RelFq}(\mathrm{Target}) + \mathrm{RelFq}(\mathrm{Reference})}
\]

▶ values of DIN
  ▶ -100 (= when a word is present only in the RefC)
  ▶ 0 (= when a word occurs equally in target and RefC)
  ▶ 100 (= when a word is present only in the target corpus)
▶ represents the proportion of the difference of relative frequencies to their mean (× 50)
▶ identical value of DIN for words appearing in a target text only (!)
▶ useful for ranking of KWs (not for their identification!)

² cf. Hofland–Johansson (1982).
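The DIN formula translates directly into code. The sketch below follows the formula on the slide; the function name `din`, its parameter names, and the example frequencies are assumptions made for illustration.

```python
def din(freq_target, size_target, freq_ref, size_ref):
    """DIN = 100 * (RelFq(Target) - RelFq(Reference))
                 / (RelFq(Target) + RelFq(Reference))."""
    rel_t = freq_target / size_target
    rel_r = freq_ref / size_ref
    if rel_t + rel_r == 0:
        raise ValueError("word occurs in neither corpus")
    return 100 * ((rel_t - rel_r) / (rel_t + rel_r))

# Invented frequencies covering the three boundary cases from the slide.
print(din(0, 1_000, 5, 1_000_000))     # only in the reference corpus
print(din(10, 1_000, 100, 10_000))     # equal relative frequency
print(din(5, 1_000, 0, 1_000_000))     # only in the target
print(din(500, 1_000, 0, 1_000_000))   # only in the target, far more frequent
```

The last two calls return the same value although the word is a hundred times more frequent in the second case, which is the caveat marked with (!) above and the reason DIN is used for ranking only after keywords have been identified by a significance test.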

Example 1: Grammatical words

Keywords from all Husák’s New Year’s Addresses

[Figure: All NYA (gram. words highlighted). Scatter plot of Czech keywords; x-axis: Dice (rank), y-axis: Log-likelihood (rank), both from 0 to 600.]

významné

republika

nezávislosti

aktivita

československé

stavbách

společenskýotevřeně

pokračovat

rychleji

životní

nadále

budoucnosti

řešit

dobrá

xiv

mezinárodních

pracovat

dobrýbezpečnost

společný

podporu

světem

pracovištích

vlastenectví

abych

cestou

věnovat

jsme

cesta

mírové

přání

národní

lepší

úspěchy

rostoucí

rozvoji

dalšího

pozitivní

našimi

máme

životanaše

milióny

rozvíjet

výsledků

zbrojení

pokrok

státu

našemu

důsledně

návrhy

podmínkách

volby

vztazích

dalším

šesté

komunistické

oblastech

školstvíveškeré

zeměmi

občané

roce

ozbrojených

zajištění

sovětskou

zlepšeníjednoty

zachování

úsilí

zájmům

úroveň

dalších

inteligencezávěrykaždém

socialistických

vysoce

mezinárodním

jednou

postup

díky

vysokounedostatkůmateriální

úspěšně

vývoj

evropě

příští

kterém

úkoly

vše

socialistická

plní

pracujícího

potřebné

krize

upevnění

rozvoj

cestu

odpovídápostupujakýdalšímu

svým

společné

mezinárodní

program

naši

musíme

výročí

národů

spoluprácemnoho

ekonomického

cíle

hospodářství

možností

úrovně

politika

států

bezpečnosti

plně

velkou

kterýmlidstva

nás

společnosti

vědeckých

splnění

celémzájmu

nedostatky

rozvoje

prací

přátelství

práci

znovu

našem

míru

svobody

cen

socialistické

spolupráci

politice

tvůrčí

problémů

postavení

dále

sjezd

výboru

spolu

všech

problémy

úkolů

národy

sociální

hospodářského

organizací

potřeb

politiku

společenské

země

nových

orgánůvelký

lid

celého

růstuvztahy

celé

další

i

plněnísíly

lidí

nám

zasedání

sovětskéhopracujících

řešení

sil

politiky

pracovní

svousjezdu

proto

aby

let

svazu

práce

všechny

které

pro

si

a

to

strany

ve

s

19871986

19831982

1981

našim

vám

abyste

1978

našeho

vstříc

všem

vás

nimiž

naší

náš

našich

abychom

abych

našimi

naše

našemu

veškeré

každém

díkykterém

vše

jaký

svým

naši

kterým

nás

našem

všech

i

nám

svou

proto

aby

všechny

které

pro

si

a

tove

s

Tools for KWA

KWords

http://kwords.korpus.cz

KWords: analysed text

KWords: list of keywords

Two types of prominent units: keywords and thematic concentration

Dispersion plot

SOTU 2016: terrorism × economy

Keyword links

Keyword links according to the size of the window

Distant KW links = KWs appearing in distant context, (-15;-5) and (5;15); these KW links indicate that the themes represented by these KWs may form a discourse-semantic network

Immediate KW links (multi-word KWs) = co-occurrence of two or more KWs within the immediate or near context (-2;2); these adjacent KWs may signal a multi-word KW unit (e.g. American people, better politics)

custom = arbitrary size of the span
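
The window definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the KWords implementation: the function name `kw_links`, the toy sentence, and the keyword set are all hypothetical. Because pairs are treated as unordered, the symmetric spans (-15;-5) and (5;15) reduce to a single distance range of 5 to 15 tokens.

```python
from collections import Counter

def kw_links(tokens, keywords, min_dist, max_dist):
    """Count unordered pairs of different keywords whose distance d
    (in tokens) satisfies min_dist <= d <= max_dist."""
    positions = [(i, t) for i, t in enumerate(tokens) if t in keywords]
    links = Counter()
    for a, (i, kw_a) in enumerate(positions):
        for j, kw_b in positions[a + 1:]:
            d = j - i
            if d > max_dist:
                break  # positions are sorted, later keywords are even farther
            if d >= min_dist and kw_a != kw_b:
                links[tuple(sorted((kw_a, kw_b)))] += 1
    return links

# Hypothetical toy text and keyword set, for illustration only.
tokens = "the american people want better politics and the american economy".split()
keywords = {"american", "people", "better", "politics", "economy"}

immediate = kw_links(tokens, keywords, 1, 2)   # span (-2;2)
distant = kw_links(tokens, keywords, 5, 15)    # spans (-15;-5) and (5;15)
```

A custom span corresponds to calling the same function with other distance bounds.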

[Figure: network of keyword links; node labels include isil, qaeda, bipartisan, hardworking, terrorists, america, allies, leadership, democracy, climate, economy, jobs, congress, families, energy, security, military.]

Comparison

For comparing texts – time series (SOTU)

State of the Union

Obama’s State of the Union Address

Eight addresses (2009–2016)

         2009  2010  2011  2012  2013  2014  2015  2016
Tokens   6346  8024  7611  7743  7427  7506  7479  6671
Types    1347  1531  1505  1555  1561  1598  1521  1422

source: http://www.whitehouse.gov

Average length N = 7351

Average vocabulary V(N) = 1505

Reference corpus: British National Corpus (BNC)
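
The two averages can be checked directly from the per-year counts on this slide. The snippet below is a small verification sketch; the variable names are chosen for illustration.

```python
# Token and type counts per address, copied from the table above.
tokens = {2009: 6346, 2010: 8024, 2011: 7611, 2012: 7743,
          2013: 7427, 2014: 7506, 2015: 7479, 2016: 6671}
types = {2009: 1347, 2010: 1531, 2011: 1505, 2012: 1555,
         2013: 1561, 2014: 1598, 2015: 1521, 2016: 1422}

avg_tokens = sum(tokens.values()) / len(tokens)  # average length N
avg_types = sum(types.values()) / len(types)     # average vocabulary V(N)

print(round(avg_tokens), round(avg_types))
```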

Permanent KWs (Key KWs)

KWs appearing in all eight addresses:
america, american, americans, businesses, congress, country, economy, jobs, nation, tonight

KWs appearing in six or seven addresses:
energy, every, let, more, new, people, democrats, families, make, millions, republicans, tax, why
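
One way to find such permanent KWs is to count, for each keyword, the number of texts whose keyword list contains it. The sketch below assumes the per-text keyword lists have already been extracted; the function name `kw_persistence` and the three toy lists are hypothetical and are not the actual SOTU results.

```python
from collections import Counter

def kw_persistence(kw_lists):
    """Map each keyword to the number of texts whose keyword list contains it."""
    counts = Counter()
    for kws in kw_lists:
        counts.update(set(kws))  # count a keyword at most once per text
    return counts

# Hypothetical keyword lists for three texts, for illustration only.
kw_by_text = {
    "text_a": ["america", "jobs", "energy"],
    "text_b": ["america", "jobs", "tax"],
    "text_c": ["america", "energy", "jobs", "jobs"],
}

counts = kw_persistence(kw_by_text.values())
n_texts = len(kw_by_text)
permanent = sorted(kw for kw, c in counts.items() if c == n_texts)
frequent = sorted(kw for kw, c in counts.items() if c == n_texts - 1)
```

With eight addresses, the first list on the slide corresponds to a count of 8 and the second to counts of 6 or 7.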

SOTU: Topics – economy/politics

Reference corpus in KWA

Reference corpus in KWA

What does the reference corpus affect?

size: bigger reference corpus ⇒ more KWs

composition: different reference corpora represent different readers (conceptualized reader)

▶ balanced corpus ∼ general reader

▶ specialized corpus ∼ specific reader (e.g. from the past, with specific background…)

Different readers = different interpretations

Contrastive KWA

▶ different RefCs: interferences – time, style, topic differences

▶ New Year’s Addresses of the last communist president of Czechoslovakia, Gustáv Husák (1975–1989)

  ▶ contemporary reader (SYN2010) × reader from the past (Totalita)

▶ State of the Union addresses of Barack Obama (2009–2016)

  ▶ general reader (BNC) × politician/expert (rest of Obama’s speeches)

Husák: Influence of the reference corpora

What happens if we compare texts to different RefCs?

▶ the inventory of KWs does not differ substantially

▶ the difference is in ranking (prominence of KWs – DIN)

Historical reader (Totalita)

→ genre differences

▶ Modal verbs: want, can

▶ Verbs: 1st person sg./pl.

Contemporary reader (SYN2010)

→ connected with historical events

▶ ideology

▶ archaisms, historisms

Detailed comparison – 3 thematic groups

Cold war: mír, míru, mírová, mírové, mírového, mírovému, mírovou, mírový, mírových, mírovými, mírumilovné, mírumilovných, mírumilovným; napětí; odzbrojení, výzbroje, zbrojení, zbrojením, ozbrojených (forms of 'peace', 'peaceful', 'peace-loving'; 'tension'; 'disarmament', 'armaments', 'arming', 'armed')

Collective possession: náš, naše, našeho, našem, našemu, naši, naší, našich, našim, naším, našimi (forms of 'our')

Ideological markers: socialismu, socialismus, socialistická, socialistické, socialistického, socialistickém, socialistickému, socialistickou, socialistický, socialistických, socialistickým, socialistickými; komunismu, komunisté, komunistů, ksč, komunistům, komunisty, komunistická, komunistické, komunistickým (forms of 'socialism', 'socialist', 'communism', 'communist'; ksč = Communist Party of Czechoslovakia)
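
The prominence of a thematic group in one address could be computed roughly as follows. This sketch rests on two assumptions that are not stated on this slide: that DIN is the difference index 100 × (RelFq_text − RelFq_ref) / (RelFq_text + RelFq_ref), and that a group's value is the mean DIN over its word forms that occur in the text. The function names and all frequencies below are hypothetical.

```python
def din(freq_text, size_text, freq_ref, size_ref):
    """Difference index between relative frequencies (assumed definition),
    ranging from -100 to 100."""
    rel_text = freq_text / size_text
    rel_ref = freq_ref / size_ref
    return 100 * (rel_text - rel_ref) / (rel_text + rel_ref)

def group_din(forms, text_counts, size_text, ref_counts, size_ref):
    """Mean DIN over the forms of a thematic group that occur in the text
    (the averaging step is an assumption made for this sketch)."""
    values = [
        din(text_counts[f], size_text, ref_counts.get(f, 0), size_ref)
        for f in forms
        if text_counts.get(f, 0) > 0
    ]
    return sum(values) / len(values) if values else None

# Hypothetical frequencies, for illustration only.
possession = ["náš", "naše", "našeho"]
text_counts = {"náš": 10, "naše": 5}
ref_counts = {"náš": 1000, "naše": 500}

score = group_din(possession, text_counts, 1000, ref_counts, 1_000_000)
```

Running the same computation against two reference corpora (for example one for each type of reader) yields the two values that can then be compared year by year.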

Cold war

[Figure: Cold War KWs in SYN−KWA and TOT−KWA. x-axis: Year (1975–1989); y-axis: DIN (ticks 40–100); lines: SYN−KWA, TOT−KWA.]

Fidler–Cvrček (2015)

Collective possession

[Figure: KWs "our" in SYN−KWA and TOT−KWA. x-axis: Year (1975–1989); y-axis: DIN (ticks 65–95); lines: SYN−KWA, TOT−KWA.]

Fidler–Cvrček (2015)

Ideological markers

[Figure: Ideological markers KWs in SYN−KWA and TOT−KWA. x-axis: Year (1975–1989); y-axis: DIN (ticks 30–100); lines: SYN−KWA, TOT−KWA.]

Fidler–Cvrček (2015)

Reader from the past × contemporary reader

Totalita

1. style & genre: fellow citizens, friends

2. propaganda: blossom, succeed

3. 1st pers. sg. (greet, wish)

SYN2010

1. ideology-related: comrade(s), socialist, five-year plan

2. period-specific: brotherly, liberating, feverish, dutiful, imperialistic

Difference in sensitivity

▶ lines for both readers have similar tendencies

▶ contemporary reader has higher overall level of prominence (DIN)

▶ tendencies are more visible for reader from the past

▶ contemporary reader cannot distinguish subtle changes (overwhelmed by unusual lexicon)

▶ astute reader from the past might notice slight shifts in the discourse

2015 SOTU: General vs. specialist’s view

Rank  General view     Specialist’s view
1     rebekah          ben
2     bipartisan       economics
3     hardworking      childcare
4     loopholes        rebekah
5     childcare        west
6     republicans      believed
7     folks            striving
8     veterans         tools
9     diplomacy        sick
10    striving         spread
11    terrorists       write
12    americans        paid
13    infrastructure   surely
14    fastest          issues
15    democrats        scientists
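
The two rankings can be compared mechanically. The sketch below uses the two top-15 lists from this slide and reports which keywords are shared and how their ranks differ; the variable names are chosen for illustration.

```python
# Top-15 keywords from the slide, in rank order.
general = ["rebekah", "bipartisan", "hardworking", "loopholes", "childcare",
           "republicans", "folks", "veterans", "diplomacy", "striving",
           "terrorists", "americans", "infrastructure", "fastest", "democrats"]
specialist = ["ben", "economics", "childcare", "rebekah", "west",
              "believed", "striving", "tools", "sick", "spread",
              "write", "paid", "surely", "issues", "scientists"]

rank_g = {kw: r for r, kw in enumerate(general, start=1)}
rank_s = {kw: r for r, kw in enumerate(specialist, start=1)}

# Keywords present in both lists, ordered by their rank in the general view.
shared = sorted(set(general) & set(specialist), key=rank_g.get)
only_general = [kw for kw in general if kw not in rank_s]
only_specialist = [kw for kw in specialist if kw not in rank_g]

for kw in shared:
    print(kw, rank_g[kw], rank_s[kw])
```

Only three of the fifteen keywords are shared between the two top-15 lists, which is what the grouping on the following slide elaborates.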

2015 SOTU: Differences

Specialist’s view:
Rhetoric/style: believed, write, thing, tools
Important topics: economics, leave, paid, childcare

General view:
Political speeches: bipartisan, republicans, folks, veterans, diplomacy, terrorists, americans, infrastructure, democrats, combat, sanctions
Topical words: hardworking, loopholes, childcare, striving, fastest

References

▶ Cvrček, V. – Vondřička, P. (2013): KWords. FF UK. Praha. Available at: <http://kwords.korpus.cz>.

▶ Fidler, M. – Cvrček, V. (2015): A Data-Driven Analysis of Reader Viewpoints: Reconstructing the Historical Reader Using Keyword Analysis. Journal of Slavic Linguistics 23(2), (p. 197–239).

▶ Gabrielatos, C. – Marchi, A. (2012): Keyness: appropriate metrics and practical issues. Paper presented at the CADS International Conference, Bologna, Italy, September 2012 (www.gabrielatos.com/Presentations.htm).

▶ Hofland, K. – Johansson, S. (1982): Word frequencies in British and American English. Bergen: The Norwegian Computing Centre for the Humanities.

▶ Popescu, I. – Altmann, G. (2006): Some Aspects of Word Frequencies. Glottometrics 13, (p. 23–46).

▶ Scott, M. – Tribble, C. (2006): Textual patterns: Keyword and corpus analysis in language education. Amsterdam: Benjamins.

▶ Wierzbicka, A. (1997): Understanding cultures through their keywords. English, Russian, Polish, German, and Japanese. Oxford, New York: Oxford UP.

▶ Williams, R. (1976/85): Keywords: A vocabulary of culture and society. New York: Oxford UP.

Thank you for your attention!

