Stefan Evert, IMS - Uni Stuttgart
Brigitte Krenn, ÖFAI Wien
Association Measures
Reminder: Contingency Tables
General Remarks
• we will only use data from contingency tables
• we will consider each pair type on its own, independently of all other pair types (no distributional information)
• we won't distinguish between relational and positional cooccurrences
Association Measures (AMs)
• goal: assign association score to each pair type = strength of association between components
• high score = strong association
• association in a statistical sense, but there is no precise definition
• positive vs. negative association ("colourless green ideas")
Using Association Scores
• absolute values (cut-off threshold)
• input for higher-order statistics (AMs are first-order statistics)
→ scores should be meaningful
• ranking of collocation candidates
→ only relative scores matter
• rank collocates of given base
→ one marginal frequency fixed, only two free parameters
First Steps: Proportions
• Workshop on Mechanized Documentation (Washington, 1964)
$P_1 = \frac{O_{11}}{R_1}, \qquad P_2 = \frac{O_{11}}{C_1}$
First Steps: Proportions
• proportions between 0 and 1
• high proportion = strong (directional) association
• need to combine two proportions into a single association score
• average (P1 + P2) / 2 is not useful
  • f=1, f1=1, f2=1000 → avg. = 0.5005
  • f=50, f1=100, f2=100 → avg. = 0.5
→ more "conservative" weighting
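The two invented cases above can be checked with a few lines of Python. This is only a sketch: the function names `proportions` and `average` are illustrative, and `f`, `f1`, `f2` are the cooccurrence and marginal frequencies used on the slide.

```python
def proportions(f, f1, f2):
    """Return the two directional proportions (P1, P2) = (f/f1, f/f2)."""
    return f / f1, f / f2


def average(f, f1, f2):
    """Arithmetic mean of the two proportions."""
    p1, p2 = proportions(f, f1, f2)
    return (p1 + p2) / 2


# A single cooccurrence scores slightly higher than a solid 50-out-of-100 pair.
for f, f1, f2 in [(1, 1, 1000), (50, 100, 100)]:
    print(f, f1, f2, round(average(f, f1, f2), 4))
```

The average is dominated by the larger proportion, which is why a single cooccurrence with a hapax can outrank a well-attested pair.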
First Steps: Proportions
• harmonic mean
• geometric mean
• minimum
• Jaccard
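The formulas for these four coefficients did not survive on the slide. The sketch below uses the usual textbook definitions in terms of O11, R1 and C1, which is an assumption about what the slide showed; the function name `coefficients` is illustrative.

```python
import math


def coefficients(o11, r1, c1):
    """Combine P1 = O11/R1 and P2 = O11/C1 into single association scores."""
    p1, p2 = o11 / r1, o11 / c1
    return {
        # harmonic mean of P1 and P2 (this is the Dice coefficient)
        "harmonic": 2 * o11 / (r1 + c1),
        # geometric mean of P1 and P2
        "geometric": o11 / math.sqrt(r1 * c1),
        # the smaller of the two proportions
        "minimum": min(p1, p2),
        # O11 / (O11 + O12 + O21), with O12 = R1 - O11 and O21 = C1 - O11
        "jaccard": o11 / (r1 + c1 - o11),
    }


print(coefficients(1, 1, 1000))    # the hapax case now scores close to 0
print(coefficients(50, 100, 100))  # the well-attested pair keeps a high score
```

All four are more "conservative" than the average: a high score requires both proportions to be high.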
First Steps: Proportions
• coefficients range from 0 to 1
• 1 = total (positive) association
• interpretation of lower scores is less clear
  • positive vs. negative association?
  • which score for no association?
  • what is "no association"?
→ random combinations
Expected Frequencies
• assume that types u and v cooccur only by chance
• f1(u) occs. of u and f2(v) occs. of v spread randomly over N tokens
• each instance of u has a chance of f2(v)/N to cooccur with a v
→ expected # of cooccurrences:
$E_{11} := \frac{f_1(u)\, f_2(v)}{N} = \frac{R_1 C_1}{N}$
Expected Frequencies
• expected frequencies for all cells of the contingency table
• assuming random combinations (statistical independence)
Expected Frequencies
• comparison of expected against observed frequencies
• note that row and column sums are the same for both tables
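A minimal sketch of the expected table, assuming the cell layout (O11, O12, O21, O22) with row sums R1, R2 and column sums C1, C2. The function name `expected_table` and the example counts are illustrative.

```python
def expected_table(o11, o12, o21, o22):
    """Expected frequencies E_ij = R_i * C_j / N under independence."""
    n = o11 + o12 + o21 + o22
    r1, r2 = o11 + o12, o21 + o22   # row sums
    c1, c2 = o11 + o21, o12 + o22   # column sums
    return (r1 * c1 / n, r1 * c2 / n, r2 * c1 / n, r2 * c2 / n)


observed = (50, 50, 50, 850)
expected = expected_table(*observed)
print(expected)

# Row and column sums agree between the observed and the expected table.
print(observed[0] + observed[1], expected[0] + expected[1])
print(observed[0] + observed[2], expected[0] + expected[2])
```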
Mutual Information
• compares O11 with E11
• ratio O11/E11 ranges from 0 to ∞
• 1 = no association (O11=E11)
• usually logarithmic values
• range: −∞ to +∞
• 0 = no assoc., < 0 neg., > 0 pos.
• used in English lexicography
Low-Frequency Pairs & Random Variation
• large amount of low-frequency data (consequence of Zipf's law)
• a simple (invented) example
  • A: f=50, f1=100, f2=100, N=1000 → O11=50, E11=10, MI = log 5
  • B: f=1, f1=1, f2=1, N=1000 → O11=1, E11=0.001, MI = log 1000
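The two cases can be recomputed as follows. The slide does not fix the base of the logarithm; base 2 is used here as an assumption, and the function name `mutual_information` is illustrative.

```python
import math


def mutual_information(o11, r1, c1, n, base=2):
    """MI = log(O11 / E11) with E11 = R1 * C1 / N.

    Raises ValueError for O11 = 0, where the score would be minus infinity.
    """
    e11 = r1 * c1 / n
    return math.log(o11 / e11, base)


mi_a = mutual_information(50, 100, 100, 1000)  # log2(5)
mi_b = mutual_information(1, 1, 1, 1000)       # log2(1000)
print(round(mi_a, 2), round(mi_b, 2))
```

Case B, which rests on a single observation, receives by far the higher score.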
Low-Frequency Pairs & Random Variation
• three problems with case B
  • how meaningful is a single example? (not very much, actually)
  • could well be a spelling mistake or noise from automatic processing
  • we want to make generalisations (from particular corpus to "language")
→ this is the domain of statistics: draw inferences about population (= language) from a sample (= corpus)
The Statistical Model: Random Sample
• assumption: corpus data is a random sample from the language
→ base data is a random sample from all coocs. in the language
The Statistical Model: Random Sample
• random sample of size N is described by random variables Ui and Vi (i = 1..N), representing the labels of the i-th bigram token
• notation: U and V as "prototypes"
• for a given pair type (u,v), the contingency table can be computed from Ui and Vi
→ random variables X11, X12, X21, X22
The Statistical Model: Random Sample
• population parameters for pair type (u,v): one cell probability for each of the cells 11, 12, 21, 22
• observed frequencies O11, O12, O21, O22 represent one particular realisation of the sample
• theory of random samples predicts the distribution of X11, X12, X21, X22 from assumptions about these four population parameters
Two Footnotes
• vector notation for cont. tables:
$\vec{X} = (X_{11}, X_{12}, X_{21}, X_{22})$
$\vec{O} = (O_{11}, O_{12}, O_{21}, O_{22})$
$\vec{k} = (k_{11}, k_{12}, k_{21}, k_{22})$
• population ≠ general language
  • restricted to domain(s), genre(s), ... covered by source corpus
  • e.g. black box in computer science vs. newspapers vs. cooking
The Sampling Distribution
• multinomial sampling distribution
• each individual cell count Xij has a binomial distribution (but these are not independent)
The Sampling Distribution
• given assumptions about the population parameters, we can compute the likelihood of the observed contingency table
• relatively high likelihood = consistent with assumptions
• relatively low likelihood = evidence against assumptions (inversely proportional to likelihood)
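A sketch of this likelihood computation under the multinomial sampling distribution. The function name `multinomial_likelihood` and the two parameter vectors are illustrative assumptions; log-gamma is used instead of factorials to avoid overflow.

```python
import math


def multinomial_likelihood(observed, probs):
    """P(X = observed) for a multinomial sample of size N = sum(observed)."""
    n = sum(observed)
    log_l = math.lgamma(n + 1)           # log N!
    for o, p in zip(observed, probs):
        log_l -= math.lgamma(o + 1)      # minus log O_ij!
        if o > 0:
            if p == 0:
                return 0.0               # impossible under these parameters
            log_l += o * math.log(p)
    return math.exp(log_l)


observed = (50, 50, 50, 850)
independent = (0.01, 0.09, 0.09, 0.81)   # assumed parameters: no association
associated = (0.05, 0.05, 0.05, 0.85)    # assumed parameters: relative frequencies
print(multinomial_likelihood(observed, independent))
print(multinomial_likelihood(observed, associated))
```

The observed table is far less likely under the independence parameters, which counts as evidence against them.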
Adequacy of the Statistical Model
• particular sequence of pair tokens is irrelevant, only the overall frequencies matter (→ sufficiency)
• randomness assumption (random sample from fixed population)
  • independence of pair tokens
  • constancy of population parameters
• violations problematic only when they affect sampling distribution
Adequacy of the Statistical Model
• three causes of non-randomness
  • local dependencies (e.g. syntax) → usually not problematic
  • inhomogeneity of source corpus (speakers, domains, topics, ...) → mixture population
  • repetition / clustering of bigrams can be a serious problem (does not affect segment-based data if clustered within segments)
Making Assumptions about the Population Parameters
• the three population parameters (the joint probability and the two marginal probabilities, indexed 1 and 2) are unknown
• best guess from observation: MLE = maximum-likelihood estimate
Making Assumptions about the Population Parameters
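• conditional probabilities with MLE:
$P(V=v \mid U=u) \approx \frac{O_{11}}{R_1} = P_1, \qquad P(U=u \mid V=v) \approx \frac{O_{11}}{C_1} = P_2$
• Dice coefficient etc. are MLE for population characteristics
• MI is MLE for the logarithm of the joint probability divided by the product of the two marginal probabilities
→ unreliable for small frequencies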
The Null Hypothesis
• null hypothesis H0: no association = independence of instances, i.e. P(U=u ∧ V=v) = P(U=u) · P(V=v)
• not all parameters determined
• MLE maximise probability of observed data under H0
Likelihood Measures
• probability of observed data under H0 (with MLE)
• probability of single cell: X11 should be most "informative"
Likelihood Measures
• small likelihood values = strong association
• computed probabilities are often extremely small
• use negative base-10 logarithm
→ more convenient scale
→ high scores indicate strong association
Problems of Likelihood Measures
• three reasons for low likelihood
  • observed data is inconsistent with the null hypothesis because of strong association
  • association may also be negative (fewer coocs. than expected)
  • observed data is consistent, but probability mass is spread across many similar contingency tables
Problems of Likelihood Measures
• high frequency = low likelihood
• e.g. binomial likelihood
  • O11=1, E11=1 → L = 0.3679
  • O11=1000, E11=1000 → L = 0.0126
  • O11=4, E11=1 → L ≈ 0.0126
• need to "normalise" likelihood
• NB: likelihood association measures often have good empirical results nonetheless
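The effect can be reproduced with a small sketch of the binomial likelihood of the single cell X11. The slide gives no sample size, so N = 1,000,000 is an assumption here, and the function names are illustrative. With this N the first two values match the slide, while the third case comes out near 0.015, the same order of magnitude as the figure quoted above.

```python
import math


def binomial_likelihood(o11, e11, n):
    """P(X11 = O11) for X11 ~ Binomial(N, E11/N); assumes 0 < E11 < N."""
    p = e11 / n
    log_l = (math.lgamma(n + 1) - math.lgamma(o11 + 1) - math.lgamma(n - o11 + 1)
             + o11 * math.log(p) + (n - o11) * math.log1p(-p))
    return math.exp(log_l)


def likelihood_score(o11, e11, n):
    """Negative base-10 logarithm, so that high scores mean low likelihood."""
    return -math.log10(binomial_likelihood(o11, e11, n))


N = 1_000_000  # assumed sample size, not given on the slide
for o11, e11 in [(1, 1), (1000, 1000), (4, 1)]:
    print(o11, e11,
          round(binomial_likelihood(o11, e11, N), 4),
          round(likelihood_score(o11, e11, N), 2))
```

A frequent pair that occurs exactly as often as expected gets about the same score as a rare pair that occurs four times more often than expected.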
Likelihood Ratios
• simplest normalisation technique
• divide maximum probability of data under H0 by unconstrained maximum probability
• suggested by Dunning (1993)
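For a 2×2 table the ratio is commonly reported as G² = −2 log λ, which reduces to 2 · Σ Oij · log(Oij / Eij). The sketch below assumes this standard form; the function name `log_likelihood_ratio` is illustrative.

```python
import math


def log_likelihood_ratio(o11, o12, o21, o22):
    """G2 = -2 log(lambda) = 2 * sum of O_ij * log(O_ij / E_ij).

    Cells with O_ij = 0 contribute nothing (limit of x * log x for x -> 0).
    """
    n = o11 + o12 + o21 + o22
    r1, r2 = o11 + o12, o21 + o22
    c1, c2 = o11 + o21, o12 + o22
    observed = (o11, o12, o21, o22)
    expected = (r1 * c1 / n, r1 * c2 / n, r2 * c1 / n, r2 * c2 / n)
    return 2 * sum(o * math.log(o / e)
                   for o, e in zip(observed, expected) if o > 0)


print(round(log_likelihood_ratio(50, 50, 50, 850), 2))  # strong association
print(round(log_likelihood_ratio(10, 90, 90, 810), 2))  # observed = expected
```

Skipping zero cells is one way to avoid taking the logarithm of zero.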
Statistical Hypothesis Tests
• compute probability of group of outcomes instead of single one
• observed contingency table is grouped with all tables that provide at least the same amount of evidence against H0
• total probability is known as the p-value or significance
• problem: ranking of cont. tables
Asymptotic Tests
• asymptotic tests define the ranking of contingency tables explicitly
• compute test statistic from data
  • higher values = more evidence against H0
• can use test statistic as an AM
• theory: approximation of p-value associated with test statistic (accurate in the limit N → ∞)
Asymptotic Tests
• standard test for independence is Pearson's chi-squared test
• limiting distribution = χ² distribution with df=1
• number of degrees of freedom was subject of a long debate
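A minimal sketch of Pearson's statistic for a 2×2 table, using the standard closed form, together with the asymptotic p-value for df=1. The function names are illustrative. For one degree of freedom the upper tail of the χ² distribution can be written with the complementary error function, so the standard library is enough.

```python
import math


def chi_squared(o11, o12, o21, o22):
    """Pearson's X2 = N * (O11*O22 - O12*O21)^2 / (R1 * R2 * C1 * C2)."""
    n = o11 + o12 + o21 + o22
    r1, r2 = o11 + o12, o21 + o22
    c1, c2 = o11 + o21, o12 + o22
    return n * (o11 * o22 - o12 * o21) ** 2 / (r1 * r2 * c1 * c2)


def p_value_df1(x):
    """P(chi2 > x) for df=1, i.e. P(|Z| > sqrt(x)) for a standard normal Z."""
    return math.erfc(math.sqrt(x / 2))


x2 = chi_squared(50, 50, 50, 850)
print(round(x2, 2), p_value_df1(x2))
```

The closed form gives the same value as summing (Oij − Eij)² / Eij over the four cells.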
Two-Sided Tests
• chi-squared test is two-sided, i.e. no difference between positive and negative association
• ignore small number of pairs with (non-total) negative association
• or convert to one-sided test: reject H0 only when O11 > E11
• p-value is usually divided by 2
Yates' Continuity Correction
• Pearson's chi-squared test approximates discrete binomial distributions of each cell by continuous normal distribution ("normal theory")
• estimating probabilities P(Xij ≥ k) from normal distribution introduces systematic errors
Yates' Continuity Correction
• generic form of Yates' continuity correction for contingency tables
• usefulness is still controversial (criticised as too conservative)
• applicability for chi-squared test is generally accepted
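The generic formula itself is missing from the slide. The sketch below assumes the usual form, in which each absolute difference |Oij − Eij| is reduced by 0.5 before squaring; the function name is illustrative and all expected frequencies are assumed to be positive.

```python
def chi_squared_yates(o11, o12, o21, o22):
    """Chi-squared statistic with Yates' correction: sum (|O - E| - 0.5)^2 / E."""
    n = o11 + o12 + o21 + o22
    r1, r2 = o11 + o12, o21 + o22
    c1, c2 = o11 + o21, o12 + o22
    observed = (o11, o12, o21, o22)
    expected = (r1 * c1 / n, r1 * c2 / n, r2 * c1 / n, r2 * c2 / n)
    return sum((abs(o - e) - 0.5) ** 2 / e
               for o, e in zip(observed, expected))


print(round(chi_squared_yates(50, 50, 50, 850), 2))
```

For this table the corrected value is somewhat lower than the uncorrected statistic, which is the sense in which the correction is conservative.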
Asymptotic Tests
• different form of chi-squared test (comparison of two binomials) is equivalent to independence test
• special equation with Yates' correction
Asymptotic Tests
• can also use log-likelihood ratio as a test statistic (two-sided)
• limiting distribution is found to be χ² distribution with df=1
• more conservative than Pearson's chi-squared test
• Dunning (1993) showed that Pearson's test over-estimates evidence against H0 (simulation)
Something I'd Rather Not Mention
• Church & Hanks: O11 and E11 are both random variables
• H0: expected values are equal
• assume normal distribution with unknown variance
• compare O11 and E11 with Student's t-test, estimating unknown variance from the observed data
Something I'd Rather Not Mention
• one-sided test
• statistical model is questionable
• limiting distribution: t-distribution with df ≈ N
• even more conservative than log-likelihood (low-frequency data)
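The slide does not show the formula. The sketch below assumes the form in which the t-score is usually computed for cooccurrence data, (O11 − E11) / √O11, where the variance is estimated by O11. The function name `t_score` is illustrative.

```python
import math


def t_score(o11, r1, c1, n):
    """t-score (O11 - E11) / sqrt(O11), with E11 = R1 * C1 / N; needs O11 > 0."""
    e11 = r1 * c1 / n
    return (o11 - e11) / math.sqrt(o11)


print(round(t_score(50, 100, 100, 1000), 3))  # case A from the MI example
print(round(t_score(1, 1, 1, 1000), 3))       # case B: a single cooccurrence
```

Unlike MI, the t-score ranks the well-attested pair A far above the single cooccurrence B.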
Exact Tests
• problem: how to establish ranking of contingency tables
• solution: reduce set of alternatives
• if we consider only the cell X11, the difference X11 – E11 gives a sensible ranking: binomial test
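A sketch of the one-sided binomial test: the p-value is the total probability of all outcomes X11 ≥ O11 under H0, with the cell probability estimated as E11/N. Function names are illustrative. The sum runs over all k up to N, which is fine for small examples but slow for large corpora.

```python
import math


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p), computed in log space; needs 0 < p < 1."""
    log_p = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
             + k * math.log(p) + (n - k) * math.log1p(-p))
    return math.exp(log_p)


def binomial_test(o11, r1, c1, n):
    """One-sided p-value P(X11 >= O11) under H0."""
    p = (r1 * c1 / n) / n  # E11 / N
    return sum(binomial_pmf(k, n, p) for k in range(o11, n + 1))


print(binomial_test(50, 100, 100, 1000))  # far more cooccurrences than expected
print(binomial_test(10, 100, 100, 1000))  # exactly as many as expected
```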
Exact Tests
• another solution: marginal frequencies do not provide evidence for or against H0 ("ancillary" statistics)
• condition on fixed row and column sums R1, R2, C1, C2
• conditional hypergeometric distribution does not depend on the two marginal probability parameters (indexed 1 and 2)
Exact Tests
• X11 is the only free parameter
• we can use X11 – E11 for ranking
• Fisher's exact test (Pedersen 1996)
  • computationally expensive
  • numerical difficulties
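A sketch of the one-sided version of Fisher's test, summing hypergeometric probabilities for all tables with the same margins and X11 ≥ O11. It uses exact integer arithmetic, which sidesteps the numerical difficulties at the price of speed; the function name is illustrative, and `math.comb` requires Python 3.8 or later.

```python
from fractions import Fraction
from math import comb


def fisher_exact_one_sided(o11, o12, o21, o22):
    """P(X11 >= O11 | R1, C1, N) under the hypergeometric distribution."""
    r1 = o11 + o12
    c1, c2 = o11 + o21, o12 + o22
    n = c1 + c2
    k_max = min(r1, c1)
    # P(X11 = k) = C(C1, k) * C(C2, R1 - k) / C(N, R1)
    numerator = sum(comb(c1, k) * comb(c2, r1 - k)
                    for k in range(o11, k_max + 1))
    return float(Fraction(numerator, comb(n, r1)))


print(fisher_exact_one_sided(50, 50, 50, 850))
print(fisher_exact_one_sided(1, 1, 1, 1))
```

The big-integer sums grow quickly with the marginal frequencies, so this is only practical for small tables.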
Comparing Hypothesis Tests
• Fisher's test is now widely accepted as most appropriate
• tends to be conservative
• log-likelihood gives good approximation of "correct" p-values (slightly less conservative)
• chi-squared over-estimates
• t-score far too conservative
Other Approaches to Measuring Association
• information-theoretic (MI, entropy) → equivalent to log-likelihood
• combined measures ("boosting")
• conservative estimates instead of MLE (confidence intervals)
• hypothesis tests with different null hypothesis: joint probability = C · (product of the two marginal probabilities)
• mixture of conservative estimates and hypothesis tests?
Implementation
• one-sided vs. two-sided tests
• need special software to obtain p-values for asymptotic tests
• numerical accuracy
• beware of zero frequencies!
Errr.... Help!? Software?
• Ted Pedersen's N-gram Statistics Package (NSP) [Perl, portable, easy to use]
• UCS Toolkit will be available soon from www.collocations.de [Perl/Linux, some prerequisites, for the more ambitious :o) ]
More Association Measures
• lots of association measures
• will be updated
• references
• slides from this course
• under construction
Comparing Association Measures
• mathematical discussion
  • very complex
  • results only for special cases
• numerical simulation
  • computationally expensive
  • Dunning (1993, 1998)
• lazy man's approach
  • construct mock data set where frequencies vary systematically
Comparing Association Measures (N = 100,000)
Comparing Association Measures (N = 10,000,000)
Comparing Association Measures (N = 100,000)