Corpus Linguistics
Collocations

Outline:
- Collocations
- Related concepts
- Defining a collocation
- Calculating collocations
- Filtering
- PMI
- chi-square
- Other issues
Corpus Linguistics (L415/L615)
Collocations

Markus Dickinson
Department of Linguistics, Indiana University
Fall 2015
Collocations

Collocations are characteristic co-occurrence patterns of two (or more) lexical items

1. Firthian definition: combinations of words that co-occur more frequently than by chance
   - "You shall know a word by the company it keeps" (Firth 1957)
2. Phraseological definition: the meaning tends to be more than the sum of its parts
   - "a sequence of two or more consecutive words, . . . whose exact and unambiguous meaning cannot be derived directly from the meaning or connotation of its components" (Choueka 1988)
Collocations

Some examples by different definitions:
- Firth + phraseology: couch potato
- Firth only: potato peeler
- Phraseology only: broken record
Collocations

Collocations are hard to define by intuition:
- Corpora have been able to reveal connections previously unseen
- Though, it may not be clear what the theoretical basis of collocations is
- Q: how (where) do they fit into grammar?

The Firthian definition is empirical ⇒ we need a test for "co-occur more frequently than by chance"
- Significance tests / information-theoretic measures
Related concepts: Colligations

A colligation is a slightly different concept:
- Collocation of a node word with a particular class of words (e.g., determiners)

Colligations often create noise in a list of collocations
- e.g., this house, because this is so common on its own, and determiners appear before nouns
- Thus, people sometimes use stop words to filter out non-collocations
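As a minimal sketch of this kind of filtering (the stop-word list and candidate bigrams below are tiny made-up samples, not a standard list):

```python
# Illustrative stop-word filter for collocation candidates.
# The stop-word set here is a hypothetical toy sample.
STOP_WORDS = {"the", "a", "this", "that", "of", "in", "to", "on"}

def filter_candidates(bigrams):
    """Keep only bigrams in which neither word is a stop word."""
    return [(w1, w2) for (w1, w2) in bigrams
            if w1 not in STOP_WORDS and w2 not in STOP_WORDS]

candidates = [("this", "house"), ("couch", "potato"), ("of", "the")]
print(filter_candidates(candidates))  # [('couch', 'potato')]
```

This removes colligation-style noise like this house at the cost of also discarding any genuine collocation that happens to contain a stop word.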
Related concepts: Semantic prosody & preference

Semantic prosody = "a form of meaning which is established through the proximity of a consistent series of collocates" (Louw 2000)
- Idea: you can tell the semantic prosody of a word by the types of words it frequently co-occurs with
- These are typically negative: e.g., peddle, ripe for, get oneself verbed
- This type of co-occurrence often leads to general semantic preferences
- e.g., utterly, totally, etc. typically have a feature of 'absence or change of state'
Defining a collocation: Towards corpus-based metrics

Collocations are expressions of two or more words that are in some sense conventionalized as a group
- strong tea (cf. ??powerful tea)
- international best practice
- kick the bucket

Importance of the context: "You shall know a word by the company it keeps" (Firth 1957)
- There are lexical properties that more general syntactic properties do not capture

This slide and the next 3 adapted from Manning and Schutze (1999), Foundations of Statistical Natural Language Processing
Prototypical collocations

Prototypically, collocations meet the following criteria:
- Non-compositional: the meaning of kick the bucket is not composed of the meaning of its parts
- Non-substitutable: orange hair is just as accurate as red hair, but speakers don't say it
- Non-modifiable: often we cannot modify a collocation, even though we normally could modify one of those words: ??kick the red bucket
Compositionality tests

The previous properties may be hard to verify with corpus data

(At least) two tests we can use with corpora:
- Is the collocation translated word-by-word into another language?
  - e.g., the collocation make a decision is not translated literally into French
- Do the two words co-occur more frequently together than we would otherwise expect?
  - e.g., of the is frequent, but both words are frequent, so we might expect this
Kinds of collocations

Calculations ideally take into account variability:
- Light verbs: the verb conveys very little meaning but must be the right one:
  - make|*take a decision, take|*make a walk
- Phrasal verbs: main verb and particle combination, often translated as a single word:
  - to tell off, to call up
- Proper nouns: slightly different from the others, but each refers to a single idea (e.g., Brooks Brothers)
- Terminological expressions: technical terms that form a unit (e.g., hydraulic oil filter)
- Syntactically adaptable expressions: bite|biting|bit the dust, take leave of his|her|your senses
- Non-adjacent collocations: faint (stale|apricot) smell
Ideas for calculating collocations

We want to tell if two words occur together more than by chance, meaning we should examine:
- Observed frequency of the two words together
- Expected frequency of the two words together
  - This is often derived from observed frequencies of the individual words
- Metrics for combining observed & expected frequencies
  - e.g., t = (observed − expected) / √observed (from Gries 2009)
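The t-score above is a one-liner; as a sketch (the observed and expected counts in the demo call are hypothetical toy numbers):

```python
import math

def t_score(observed, expected):
    """t = (observed - expected) / sqrt(observed), as in Gries (2009)."""
    return (observed - expected) / math.sqrt(observed)

# Toy numbers: a bigram seen 20 times where ~5 occurrences were expected
print(round(t_score(20, 5), 2))  # 3.35
```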
Calculating collocations

Simplest approach: use frequency counts
- Two words appearing together a lot are a collocation

The problem is that we get lots of uninteresting pairs of function words (M&S 1999, table 5.1):

C(w1, w2)   w1   w2
80871       of   the
58841       in   the
26430       to   the
21842       on   the

(Slides 12-24 are based on Manning & Schutze (M&S) 1999)
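Raw bigram frequency counts of this kind can be sketched with a few lines of Python (the sample sentence is made up for illustration):

```python
from collections import Counter

def bigram_counts(tokens):
    """Count adjacent word pairs in a token list."""
    return Counter(zip(tokens, tokens[1:]))

tokens = "of the people in the house of the lord".split()
counts = bigram_counts(tokens)
print(counts.most_common(1))  # [(('of', 'the'), 2)]
```

Even in this toy corpus the function-word pair of the comes out on top, which is exactly the problem noted above.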
POS filtering

To remove frequent pairings which are uninteresting, we can use a POS filter (Justeson and Katz 1995)
- Only examine word sequences which fit a particular part-of-speech pattern:

  A N, N N, A A N, A N N, N A N, N N N, N P N

  A N     linear function
  N A N   mean squared error
  N P N   degrees of freedom

- Crucially, all other sequences are removed:

  P D     of the
  M V     has been
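A Justeson & Katz-style filter can be sketched as a simple membership test on the tag sequence (the single-letter tags follow the abbreviations used above; the example n-grams are illustrative):

```python
# Keep a tagged n-gram only if its tag sequence matches an allowed pattern.
PATTERNS = {"A N", "N N", "A A N", "A N N", "N A N", "N N N", "N P N"}

def passes_filter(tagged):
    """tagged: list of (word, tag) pairs, tags abbreviated as A/N/P/D/M/V."""
    return " ".join(tag for _, tag in tagged) in PATTERNS

print(passes_filter([("linear", "A"), ("function", "N")]))  # True
print(passes_filter([("of", "P"), ("the", "D")]))           # False
```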
POS filtering (2)

Some results after tag filtering (M&S 1999, table 5.3):

C(w1, w2)   w1       w2        Tag pattern
11487       New      York      A N
7261        United   States    A N
5412        Los      Angeles   N N
3301        last     year      A N

⇒ Fairly simple, but surprisingly effective
- Needs to be refined to handle verb-particle collocations
- Kind of inconvenient to write out the patterns you want
(Pointwise) Mutual Information

Pointwise mutual information (PMI) compares:
- Observed: the actual probability of the two words appearing together (p(w1 w2))
- Expected: the probability of the two words appearing together if they are independent (p(w1) p(w2))

The pointwise mutual information is a measure to do this:

(1) I(w1, w2) = log [ p(w1 w2) / (p(w1) p(w2)) ]

- The higher the value, the more surprising it is
Pointwise Mutual Information Equation

Probabilities (p(w1 w2), p(w1), p(w2)) are calculated as:

(2) p(x) = C(x) / N

- N is the number of words in the corpus
- The number of bigrams ≈ the number of unigrams

(3) I(w1, w2) = log [ p(w1 w2) / (p(w1) p(w2)) ]
             = log [ (C(w1 w2)/N) / ((C(w1)/N) (C(w2)/N)) ]
             = log [ N C(w1 w2) / (C(w1) C(w2)) ]
Mutual Information example

We want to know if Ayatollah Ruhollah is a collocation in a data set we have:
- C(Ayatollah) = 42
- C(Ruhollah) = 20
- C(Ayatollah Ruhollah) = 20
- N = 14,307,668

(4) I(Ayatollah, Ruhollah) = log2 [ (20/N) / ((42/N) × (20/N)) ]
                           = log2 [ N × 20 / (42 × 20) ]
                           ≈ 18.38

To see how good a collocation this is, we need to compare it to others
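The count form of PMI derived above can be sketched directly in code, using the counts from this slide:

```python
import math

def pmi(c_w1, c_w2, c_bigram, n):
    """Pointwise mutual information: log2(N * C(w1 w2) / (C(w1) * C(w2)))."""
    return math.log2(n * c_bigram / (c_w1 * c_w2))

# Counts from the Ayatollah Ruhollah example
print(round(pmi(42, 20, 20, 14_307_668), 2))  # 18.38
```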
Problems for Mutual Information

A few problems:
- Sparse data: infrequent bigrams for infrequent words get high scores
- Tends to measure independence (value of 0) better than dependence
- Doesn't account for how often the words do not appear together (M&S 1999, table 5.15)
Motivating Contingency Tables

What we can instead get at is: which bigrams are likely, out of a range of possibilities?

Looking at the Arthur Conan Doyle story A Case of Identity, we find the following possibilities for one particular bigram:
- sherlock followed by holmes
- sherlock followed by some word other than holmes
- some word other than sherlock preceding holmes
- two words: the first not being sherlock, the second not being holmes
Contingency Tables

We can count up these different possibilities and put them into a contingency table (or 2x2 table):

                B = holmes   B ≠ holmes   Total
A = sherlock         7            0          7
A ≠ sherlock        39         7059       7098
Total               46         7059       7105

The Total row and Total column are the marginals
- Values in this chart are the observed frequencies (fo)
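Filling the four cells from a token stream can be sketched as follows (the token list in the demo is made up, not the actual story text):

```python
def contingency_table(tokens, w1, w2):
    """2x2 bigram table: [[w1 w2, w1 not-w2], [not-w1 w2, not-w1 not-w2]]."""
    a = b = c = d = 0
    for first, second in zip(tokens, tokens[1:]):
        if first == w1 and second == w2:
            a += 1          # w1 followed by w2
        elif first == w1:
            b += 1          # w1 followed by some other word
        elif second == w2:
            c += 1          # some other word preceding w2
        else:
            d += 1          # neither position matches
    return [[a, b], [c, d]]

toks = ["sherlock", "holmes", "said", "sherlock", "smiled", "oh", "holmes"]
print(contingency_table(toks, "sherlock", "holmes"))  # [[1, 1], [1, 3]]
```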
Observed bigram probabilities

Each cell indicates a bigram: divide each cell by the total number of bigrams (7105) to get probabilities:

              holmes    ¬holmes   Total
sherlock      0.00099   0.0       0.00099
¬sherlock     0.00549   0.99353   0.99901
Total         0.00647   0.99353   1.0

Marginal probabilities indicate probabilities for a given word
- e.g., p(sherlock) = 0.00099 and p(holmes) = 0.00647
Expected bigram probabilities

Assuming sherlock & holmes are independent results in:

              holmes              ¬holmes             Total
sherlock      0.00647 × 0.00099   0.99353 × 0.00099   0.00099
¬sherlock     0.00647 × 0.99901   0.99353 × 0.99901   0.99901
Total         0.00647             0.99353             1.0

- This is simply pe(w1, w2) = p(w1) p(w2)
Expected bigram frequencies

Multiplying by 7105 (the total number of bigrams) gives us the expected number of times we should see each bigram:

              holmes   ¬holmes    Total
sherlock      0.05     6.95       7
¬sherlock     45.95    7052.05    7098
Total         46       7059       7105

- Values in this chart are expected frequencies (fe)
Pearson's chi-square test

The chi-square (χ²) test measures how far the observed values are from the expected values:

(5) χ² = Σ (fo − fe)² / fe

(6) χ² = (7 − 0.05)²/0.05 + (0 − 6.95)²/6.95 + (39 − 45.95)²/45.95 + (7059 − 7052.05)²/7052.05
       = 966.05 + 6.95 + 1.05 + 0.01
       ≈ 974.06

Looking this up in a table shows it's unlikely to be chance
- The χ² test does not work well for rare events, i.e., when any fe ≤ 5
- Other tests can be employed using the same tables
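The hand calculation above can be reproduced as a short sketch. Expected cells are computed from the marginals (row total × column total / N) and rounded to two decimals to match the tables above; note that the tiny sherlock-holmes expected cell makes the final value quite sensitive to this rounding:

```python
def chi_square(observed):
    """Pearson chi-square for a 2x2 table, with expected cells derived
    from the marginals and rounded to two decimals as in the slides."""
    n = sum(sum(row) for row in observed)
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            fe = round(row_totals[i] * col_totals[j] / n, 2)
            chi2 += (observed[i][j] - fe) ** 2 / fe
    return chi2

observed = [[7, 0], [39, 7059]]  # the sherlock/holmes table from above
print(round(chi_square(observed), 2))  # 974.06
```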
Other issues

Gries (2009) lists some other points to consider:
- Fertility: # of unique types associated with a word
- Lexical gravity: window-based approaches that find the most informative contextual slots
- Multi-word collocations: breaking down the string into the most informative units for expected frequencies
- Variable n: bottom-up approaches to defining the size of n for n-gram collocates
- Discontinuous n-grams