Corpus Linguistics
Collocations

Outline:
- Collocations
- Related concepts
- Defining a collocation
- Calculating collocations
- Filtering
- PMI
- chi-square
- Other issues
Corpus Linguistics (L415/L615)
Collocations

Markus Dickinson
Department of Linguistics, Indiana University
Fall 2015
Collocations

Collocations are characteristic co-occurrence patterns of two (or more) lexical items

1. Firthian definition: combinations of words that co-occur more frequently than by chance
   - "You shall know a word by the company it keeps" (Firth 1957)
2. Phraseological definition: the meaning tends to be more than the sum of its parts
   - "a sequence of two or more consecutive words, . . . whose exact and unambiguous meaning cannot be derived directly from the meaning or connotation of its components" (Choueka 1988)
Collocations

Some examples by different definitions:
- Firth + phraseology: couch potato
- Firth only: potato peeler
- Phraseology only: broken record
Collocations

Collocations are hard to define by intuition:
- Corpora have been able to reveal connections previously unseen
- Though, it may not be clear what the theoretical basis of collocations is
- Q: how (where) do they fit into grammar?

The Firthian definition is empirical ⇒ we need a test for "co-occur more frequently than by chance"
- Significance tests / information-theoretic measures
Related concepts: Colligations

A colligation is a slightly different concept:
- Collocation of a node word with a particular class of words (e.g., determiners)

Colligations often create noise in a list of collocations
- e.g., this house, because this is so common on its own, and determiners appear before nouns
- Thus, people sometimes use stop words to filter out non-collocations
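As a minimal sketch of this kind of filtering (the stop-word list and candidate bigrams below are tiny made-up samples, not a standard list):

```python
# Illustrative stop-word filter for collocation candidates.
# The stop-word set here is a hypothetical toy sample.
STOP_WORDS = {"the", "a", "this", "that", "of", "in", "to", "on"}

def filter_candidates(bigrams):
    """Keep only bigrams in which neither word is a stop word."""
    return [(w1, w2) for (w1, w2) in bigrams
            if w1 not in STOP_WORDS and w2 not in STOP_WORDS]

candidates = [("this", "house"), ("couch", "potato"), ("of", "the")]
print(filter_candidates(candidates))  # [('couch', 'potato')]
```

This removes colligation-style noise like this house at the cost of also discarding any genuine collocation that happens to contain a stop word.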
Related concepts: Semantic prosody & preference

Semantic prosody = "a form of meaning which is established through the proximity of a consistent series of collocates" (Louw 2000)
- Idea: you can tell the semantic prosody of a word by the types of words it frequently co-occurs with
- These are typically negative: e.g., peddle, ripe for, get oneself verbed
- This type of co-occurrence often leads to general semantic preferences
- e.g., utterly, totally, etc. typically have a feature of 'absence or change of state'
Defining a collocation: Towards corpus-based metrics

Collocations are expressions of two or more words that are in some sense conventionalized as a group
- strong tea (cf. ??powerful tea)
- international best practice
- kick the bucket

Importance of the context: "You shall know a word by the company it keeps" (Firth 1957)
- There are lexical properties that more general syntactic properties do not capture

This slide and the next 3 adapted from Manning and Schutze (1999), Foundations of Statistical Natural Language Processing
Prototypical collocations

Prototypically, collocations meet the following criteria:
- Non-compositional: the meaning of kick the bucket is not composed of the meaning of its parts
- Non-substitutable: orange hair is just as accurate as red hair, but speakers don't say it
- Non-modifiable: often we cannot modify a collocation, even though we normally could modify one of those words: ??kick the red bucket
Compositionality tests

The previous properties may be hard to verify with corpus data

(At least) two tests we can use with corpora:
- Is the collocation translated word-by-word into another language?
  - e.g., the collocation make a decision is not translated literally into French
- Do the two words co-occur more frequently together than we would otherwise expect?
  - e.g., of the is frequent, but both words are frequent, so we might expect this
Kinds of collocations

Calculations ideally take into account variability:
- Light verbs: the verb conveys very little meaning but must be the right one:
  - make|*take a decision, take|*make a walk
- Phrasal verbs: main verb and particle combination, often translated as a single word:
  - to tell off, to call up
- Proper nouns: slightly different from the others, but each refers to a single idea (e.g., Brooks Brothers)
- Terminological expressions: technical terms that form a unit (e.g., hydraulic oil filter)
- Syntactically adaptable expressions: bite|biting|bit the dust, take leave of his|her|your senses
- Non-adjacent collocations: faint (stale|apricot) smell
Ideas for calculating collocations

We want to tell if two words occur together more than by chance, meaning we should examine:
- Observed frequency of the two words together
- Expected frequency of the two words together
  - This is often derived from observed frequencies of the individual words
- Metrics for combining observed & expected frequencies
  - e.g., t = (observed − expected) / √observed (from Gries 2009)
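The t-score above is a one-liner; as a sketch (the observed and expected counts in the demo call are hypothetical toy numbers):

```python
import math

def t_score(observed, expected):
    """t = (observed - expected) / sqrt(observed), as in Gries (2009)."""
    return (observed - expected) / math.sqrt(observed)

# Toy numbers: a bigram seen 20 times where ~5 occurrences were expected
print(round(t_score(20, 5), 2))  # 3.35
```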
Calculating collocations

Simplest approach: use frequency counts
- Two words appearing together a lot are a collocation

The problem is that we get lots of uninteresting pairs of function words (M&S 1999, table 5.1):

C(w1, w2)   w1   w2
80871       of   the
58841       in   the
26430       to   the
21842       on   the

(Slides 12-24 are based on Manning & Schutze (M&S) 1999)
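Raw bigram frequency counts of this kind can be sketched with a few lines of Python (the sample sentence is made up for illustration):

```python
from collections import Counter

def bigram_counts(tokens):
    """Count adjacent word pairs in a token list."""
    return Counter(zip(tokens, tokens[1:]))

tokens = "of the people in the house of the lord".split()
counts = bigram_counts(tokens)
print(counts.most_common(1))  # [(('of', 'the'), 2)]
```

Even in this toy corpus the function-word pair of the comes out on top, which is exactly the problem noted above.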
POS filtering

To remove frequent pairings which are uninteresting, we can use a POS filter (Justeson and Katz 1995)
- Only examine word sequences which fit a particular part-of-speech pattern:

  A N, N N, A A N, A N N, N A N, N N N, N P N

  A N     linear function
  N A N   mean squared error
  N P N   degrees of freedom

- Crucially, all other sequences are removed:

  P D     of the
  M V     has been
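A Justeson & Katz-style filter can be sketched as a simple membership test on the tag sequence (the single-letter tags follow the abbreviations used above; the example n-grams are illustrative):

```python
# Keep a tagged n-gram only if its tag sequence matches an allowed pattern.
PATTERNS = {"A N", "N N", "A A N", "A N N", "N A N", "N N N", "N P N"}

def passes_filter(tagged):
    """tagged: list of (word, tag) pairs, tags abbreviated as A/N/P/D/M/V."""
    return " ".join(tag for _, tag in tagged) in PATTERNS

print(passes_filter([("linear", "A"), ("function", "N")]))  # True
print(passes_filter([("of", "P"), ("the", "D")]))           # False
```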
POS filtering (2)

Some results after tag filtering (M&S 1999, table 5.3):

C(w1, w2)   w1       w2        Tag pattern
11487       New      York      A N
7261        United   States    A N
5412        Los      Angeles   N N
3301        last     year      A N

⇒ Fairly simple, but surprisingly effective
- Needs to be refined to handle verb-particle collocations
- Kind of inconvenient to write out the patterns you want
(Pointwise) Mutual Information

Pointwise mutual information (PMI) compares:
- Observed: the actual probability of the two words appearing together (p(w1 w2))
- Expected: the probability of the two words appearing together if they are independent (p(w1) p(w2))

The pointwise mutual information is a measure to do this:

(1) I(w1, w2) = log [ p(w1 w2) / (p(w1) p(w2)) ]

- The higher the value, the more surprising it is
Pointwise Mutual Information Equation

Probabilities (p(w1 w2), p(w1), p(w2)) are calculated as:

(2) p(x) = C(x) / N

- N is the number of words in the corpus
- The number of bigrams ≈ the number of unigrams

(3) I(w1, w2) = log [ p(w1 w2) / (p(w1) p(w2)) ]
             = log [ (C(w1 w2)/N) / ((C(w1)/N) (C(w2)/N)) ]
             = log [ N C(w1 w2) / (C(w1) C(w2)) ]
Mutual Information example

We want to know if Ayatollah Ruhollah is a collocation in a data set we have:
- C(Ayatollah) = 42
- C(Ruhollah) = 20
- C(Ayatollah Ruhollah) = 20
- N = 14,307,668

(4) I(Ayatollah, Ruhollah) = log2 [ (20/N) / ((42/N) × (20/N)) ]
                           = log2 [ N × 20 / (42 × 20) ]
                           ≈ 18.38

To see how good a collocation this is, we need to compare it to others
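The count form of PMI derived above can be sketched directly in code, using the counts from this slide:

```python
import math

def pmi(c_w1, c_w2, c_bigram, n):
    """Pointwise mutual information: log2(N * C(w1 w2) / (C(w1) * C(w2)))."""
    return math.log2(n * c_bigram / (c_w1 * c_w2))

# Counts from the Ayatollah Ruhollah example
print(round(pmi(42, 20, 20, 14_307_668), 2))  # 18.38
```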
Problems for Mutual Information

A few problems:
- Sparse data: infrequent bigrams for infrequent words get high scores
- Tends to measure independence (value of 0) better than dependence
- Doesn't account for how often the words do not appear together (M&S 1999, table 5.15)
Motivating Contingency Tables

What we can instead get at is: which bigrams are likely, out of a range of possibilities?

Looking at the Arthur Conan Doyle story A Case of Identity, we find the following possibilities for one particular bigram:
- sherlock followed by holmes
- sherlock followed by some word other than holmes
- some word other than sherlock preceding holmes
- two words: the first not being sherlock, the second not being holmes
Contingency Tables

We can count up these different possibilities and put them into a contingency table (or 2x2 table):

                B = holmes   B ≠ holmes   Total
A = sherlock         7            0          7
A ≠ sherlock        39         7059       7098
Total               46         7059       7105

The Total row and Total column are the marginals
- Values in this chart are the observed frequencies (fo)
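Filling the four cells from a token stream can be sketched as follows (the token list in the demo is made up, not the actual story text):

```python
def contingency_table(tokens, w1, w2):
    """2x2 bigram table: [[w1 w2, w1 not-w2], [not-w1 w2, not-w1 not-w2]]."""
    a = b = c = d = 0
    for first, second in zip(tokens, tokens[1:]):
        if first == w1 and second == w2:
            a += 1          # w1 followed by w2
        elif first == w1:
            b += 1          # w1 followed by some other word
        elif second == w2:
            c += 1          # some other word preceding w2
        else:
            d += 1          # neither position matches
    return [[a, b], [c, d]]

toks = ["sherlock", "holmes", "said", "sherlock", "smiled", "oh", "holmes"]
print(contingency_table(toks, "sherlock", "holmes"))  # [[1, 1], [1, 3]]
```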
Observed bigram probabilities

Each cell indicates a bigram: divide each cell by the total number of bigrams (7105) to get probabilities:

              holmes    ¬holmes   Total
sherlock      0.00099   0.0       0.00099
¬sherlock     0.00549   0.99353   0.99901
Total         0.00647   0.99353   1.0

Marginal probabilities indicate probabilities for a given word
- e.g., p(sherlock) = 0.00099 and p(holmes) = 0.00647
Expected bigram probabilities

Assuming sherlock & holmes are independent results in:

              holmes              ¬holmes             Total
sherlock      0.00647 × 0.00099   0.99353 × 0.00099   0.00099
¬sherlock     0.00647 × 0.99901   0.99353 × 0.99901   0.99901
Total         0.00647             0.99353             1.0

- This is simply pe(w1, w2) = p(w1) p(w2)
Expected bigram frequencies

Multiplying by 7105 (the total number of bigrams) gives us the expected number of times we should see each bigram:

              holmes   ¬holmes    Total
sherlock      0.05     6.95       7
¬sherlock     45.95    7052.05    7098
Total         46       7059       7105

- Values in this chart are expected frequencies (fe)
Pearson's chi-square test

The chi-square (χ²) test measures how far the observed values are from the expected values:

(5) χ² = Σ (fo − fe)² / fe

(6) χ² = (7 − 0.05)²/0.05 + (0 − 6.95)²/6.95 + (39 − 45.95)²/45.95 + (7059 − 7052.05)²/7052.05
       = 966.05 + 6.95 + 1.05 + 0.01
       ≈ 974.06

Looking this up in a table shows it's unlikely to be chance
- The χ² test does not work well for rare events, i.e., when any fe ≤ 5
- Other tests can be employed using the same tables
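The hand calculation above can be reproduced as a short sketch. Expected cells are computed from the marginals (row total × column total / N) and rounded to two decimals to match the tables above; note that the tiny sherlock-holmes expected cell makes the final value quite sensitive to this rounding:

```python
def chi_square(observed):
    """Pearson chi-square for a 2x2 table, with expected cells derived
    from the marginals and rounded to two decimals as in the slides."""
    n = sum(sum(row) for row in observed)
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            fe = round(row_totals[i] * col_totals[j] / n, 2)
            chi2 += (observed[i][j] - fe) ** 2 / fe
    return chi2

observed = [[7, 0], [39, 7059]]  # the sherlock/holmes table from above
print(round(chi_square(observed), 2))  # 974.06
```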
Other issues

Gries (2009) lists some other points to consider:
- Fertility: # of unique types associated with a word
- Lexical gravity: window-based approaches that find the most informative contextual slots
- Multi-word collocations: breaking down the string into the most informative units for expected frequencies
- Variable n: bottom-up approaches to defining the size of n for n-gram collocates
- Discontinuous n-grams