Stefan Evert (IMS, Uni Stuttgart) and Brigitte Krenn (ÖFAI Wien): Association Measures

Association Measures

Reminder: Contingency Tables

General Remarks

• we will only use data from contingency tables

• we will consider each pair type on its own, independently of all other pair types (→ no distributional information)

• we won't distinguish between relational and positional cooccurrences

Association Measures (AMs)

• goal: assign association score to each pair type = strength of association between components

• high score = strong association

• association in a statistical sense, but there is no precise definition

• positive vs. negative association ("colourless green ideas")

Using Association Scores

• absolute values (cut-off threshold)

• input for higher-order statistics (AMs are first-order statistics) → scores should be meaningful

• ranking of collocation candidates → only relative scores matter

• rank collocates of a given base → one marginal frequency is fixed → only two free parameters

First Steps: Proportions

• Workshop on Mechanized Documentation (Washington, 1964)

$$P_1 = \frac{O_{11}}{R_1}, \qquad P_2 = \frac{O_{11}}{C_1}$$

• proportions between 0 and 1

• high proportion = strong (directional) association

• need to combine two proportions into a single association score

• average (P1 + P2) / 2 is not useful
  • f=1, f1=1, f2=1000 → avg. = 0.5005
  • f=50, f1=100, f2=100 → avg. = 0.5

→ need a more "conservative" weighting
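
The two cases can be checked with a few lines of Python. This is only a sketch: the helper name `proportions` is invented for illustration, and it assumes the proportions are P1 = f/f1 and P2 = f/f2 (i.e. O11/R1 and O11/C1).

```python
def proportions(f, f1, f2):
    """Return (P1, P2): the joint frequency f relative to each
    marginal frequency (assumed: P1 = f/f1, P2 = f/f2)."""
    return f / f1, f / f2


# The two example pairs from the slide: (f, f1, f2)
for f, f1, f2 in [(1, 1, 1000), (50, 100, 100)]:
    p1, p2 = proportions(f, f1, f2)
    avg = (p1 + p2) / 2
    print(f"f={f}, f1={f1}, f2={f2}: P1={p1:.3f}, P2={p2:.3f}, avg={avg:.4f}")
```

Both pairs end up with practically the same average, although the first one rests on a single cooccurrence and has P2 = 0.001.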

• harmonic mean

• geometric mean

• minimum

• Jaccard
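
The formulas for these coefficients are not preserved here, so the sketch below uses the standard textbook definitions as an assumption: harmonic and geometric mean of P1 and P2, their minimum, and Jaccard as O11 / (O11 + O12 + O21) = f / (f1 + f2 - f). The function name `combined_scores` is hypothetical.

```python
import math


def combined_scores(f, f1, f2):
    """More conservative combinations of P1 = f/f1 and P2 = f/f2.
    Assumes f > 0 and f <= min(f1, f2)."""
    p1, p2 = f / f1, f / f2
    return {
        "harmonic": 2 * p1 * p2 / (p1 + p2),  # same value as Dice: 2f / (f1 + f2)
        "geometric": math.sqrt(p1 * p2),
        "minimum": min(p1, p2),
        "jaccard": f / (f1 + f2 - f),         # O11 / (O11 + O12 + O21)
    }


for f, f1, f2 in [(1, 1, 1000), (50, 100, 100)]:
    scores = combined_scores(f, f1, f2)
    print((f, f1, f2), {name: round(s, 4) for name, s in scores.items()})
```

Unlike the plain average, all four coefficients score the pair (1, 1, 1000) far below the pair (50, 100, 100).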

• coefficients range from 0 to 1

• 1 = total (positive) association

• interpretation of lower scores is less clear
  • positive vs. negative association?
  • which score for no association?
  • what is "no association"?

→ random combinations

Expected Frequencies

• assume that types u and v cooccur only by chance

• f1(u) occs. of u and f2(v) occs. of v spread randomly over N tokens

• each instance of u has a chance of f2(v)/N to cooccur with a v

→ expected number of cooccurrences:

$$E_{11} := \frac{f_1(u)\, f_2(v)}{N} = \frac{R_1 C_1}{N}$$

• expected frequencies for all cells of the contingency table

• assuming random combinations (→ statistical independence)

• comparison of expected against observed frequencies

• note that row and column sums are the same for both tables
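
A small sketch of both tables, assuming the usual layout O11 = f, R1 = f1, C1 = f2 and expected frequencies E_ij = R_i C_j / N. The function names are made up for illustration.

```python
def contingency_table(f, f1, f2, n):
    """Observed 2x2 table [[O11, O12], [O21, O22]] from the joint
    frequency f, the marginals f1 (= R1) and f2 (= C1), and sample size n."""
    return [[f, f1 - f], [f2 - f, n - f1 - f2 + f]]


def expected_table(observed):
    """Expected frequencies E_ij = R_i * C_j / N under independence."""
    rows = [sum(r) for r in observed]
    cols = [sum(c) for c in zip(*observed)]
    n = sum(rows)
    return [[r * c / n for c in cols] for r in rows]


obs = contingency_table(50, 100, 100, 1000)
exp = expected_table(obs)
print("observed:", obs)
print("expected:", exp)
# Row and column sums agree between the two tables
print([sum(r) for r in obs], [sum(r) for r in exp])
print([sum(c) for c in zip(*obs)], [sum(c) for c in zip(*exp)])
```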

Mutual Information

• compares O11 with E11

• ratio O11/E11 ranges from 0 to ∞

• 1 = no association (O11 = E11)

• usually logarithmic values

• range: −∞ to +∞

• 0 = no assoc., < 0 neg., > 0 pos.

• used in English lexicography

Low-Frequency Pairs & Random Variation

• large amount of low-frequency data (consequence of Zipf's law)

• a simple (invented) example
  • A: f=50, f1=100, f2=100, N=1000 → O11=50, E11=10, MI = log 5
  • B: f=1, f1=1, f2=1, N=1000 → O11=1, E11=0.001, MI = log 1000
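
The two MI values can be reproduced as follows. This is a sketch: the function name is hypothetical, and the base of the logarithm is not stated on the slide, so base 10 is an assumption made only for the printout.

```python
import math


def mutual_information(f, f1, f2, n, base=10):
    """MI = log(O11 / E11) with E11 = f1 * f2 / n.
    The logarithm base is an assumption (default: 10)."""
    e11 = f1 * f2 / n
    return math.log(f / e11, base)


for label, (f, f1, f2, n) in {"A": (50, 100, 100, 1000), "B": (1, 1, 1, 1000)}.items():
    e11 = f1 * f2 / n
    print(f"{label}: O11={f}, E11={e11}, MI={mutual_information(f, f1, f2, n):.3f}")
```

Case B, a single cooccurrence, gets a much higher score than case A, which is what the following slides pick up.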

Low-Frequency Pairs & Random Variation

• three problems with case B
  • how meaningful is a single example? (not very much, actually)
  • could well be a spelling mistake or noise from automatic processing
  • we want to make generalisations (from a particular corpus to "language")

→ this is the domain of statistics: draw inferences about a population (= language) from a sample (= corpus)

The Statistical Model: Random Sample

• assumption: corpus data is a random sample from the language

→ base data is a random sample from all cooccurrences in the language

• a random sample of size N is described by random variables Ui and Vi (i = 1..N), representing the labels of the i-th bigram token

• notation: U and V as "prototypes"

• for a given pair type (u,v), the contingency table can be computed from the Ui and Vi

→ random variables X11, X12, X21, X22

• population parameters τ11, τ12, τ21, τ22 for pair type (u,v)

• observed frequencies O11, O12, O21, O22 represent one particular realisation of the sample

• theory of random samples predicts the distribution of X11, X12, X21, X22 from assumptions about the population parameters τ11, τ12, τ21, τ22

Two Footnotes

• vector notation for contingency tables

$$\vec{X} = (X_{11}, X_{12}, X_{21}, X_{22}), \quad \vec{O} = (O_{11}, O_{12}, O_{21}, O_{22}), \quad \vec{k} = (k_{11}, k_{12}, k_{21}, k_{22})$$

• population ≠ general language
  • restricted to domain(s), genre(s), ... covered by source corpus
  • e.g. black box in computer science vs. newspapers vs. cooking

XXXXX

The Sampling Distribution

• multinomial sampling distribution

• each individual cell count Xij has a binomial distribution (but these are not independent)
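
As an illustration, the multinomial probability of a table can be computed in log space with the standard library. This is a sketch under assumptions: the function names are invented, and `probs` stands for the four cell probabilities of the population (they must sum to 1 and be positive wherever the count is positive).

```python
import math


def multinomial_log_prob(counts, probs):
    """Natural log of the multinomial probability of the cell counts
    (k11, k12, k21, k22) given the corresponding cell probabilities."""
    n = sum(counts)
    log_p = math.lgamma(n + 1)              # log N!
    for k, p in zip(counts, probs):
        log_p -= math.lgamma(k + 1)         # minus log k_ij!
        if k > 0:                           # skip 0 * log(p)
            log_p += k * math.log(p)
    return log_p


def binomial_log_prob(k, n, p):
    """Binomial as the two-cell special case of the multinomial."""
    return multinomial_log_prob((k, n - k), (p, 1 - p))


# Summing the multinomial over all tables with X11 = 1 (N = 4) gives the
# binomial probability of that single cell.
probs, n, k11 = (0.1, 0.2, 0.3, 0.4), 4, 1
total = sum(
    math.exp(multinomial_log_prob((k11, k12, k21, n - k11 - k12 - k21), probs))
    for k12 in range(n - k11 + 1)
    for k21 in range(n - k11 - k12 + 1)
)
print(total, math.exp(binomial_log_prob(k11, n, probs[0])))
```

Working with logarithms avoids underflow for realistic sample sizes, where the probabilities themselves are extremely small.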

The Sampling Distribution

• given assumptions about the population parameters, we can compute the likelihood of the observed contingency table

• relatively high likelihood = consistent with assumptions

• relatively low likelihood = evidence against assumptions (inversely proportional to likelihood)

Adequacy of the Statistical Model

• particular sequence of pair tokens is irrelevant, only the overall frequencies matter (→ sufficiency)

• randomness assumption (random sample from fixed population)
  • independence of pair tokens
  • constancy of population parameters

• violations problematic only when they affect sampling distribution

Adequacy of the Statistical Model

• three causes of non-randomness
  • local dependencies (e.g. syntax) → usually not problematic
  • inhomogeneity of source corpus (speakers, domains, topics, ...) → mixture population
  • repetition / clustering of bigrams → can be a serious problem (does not affect segment-based data if clustered within segments)

Making Assumptions about the Population Parameters

• population parameters (π, π1, π2) are unknown

• best guess from observation: MLE = maximum-likelihood estimate

Making Assumptions about the Population Parameters

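• conditional probabilities with MLE

$$P(V = v \mid U = u) = \frac{\pi}{\pi_1} \approx \frac{O_{11}}{R_1}, \qquad P(U = u \mid V = v) = \frac{\pi}{\pi_2} \approx \frac{O_{11}}{C_1}$$

• Dice coefficient etc. are MLE for population characteristics

• MI is MLE for log(π / (π1 π2))

→ unreliable for small frequencies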
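
A sketch of the plug-in estimates, writing π, π1, π2 for the probabilities P(U=u ∧ V=v), P(U=u) and P(V=v). This reading of the parameters and the function name `mle` are assumptions made for illustration.

```python
def mle(o11, r1, c1, n):
    """Maximum-likelihood estimates from the observed frequencies:
    relative frequencies are plugged in for the unknown probabilities."""
    pi, pi1, pi2 = o11 / n, r1 / n, c1 / n
    return {
        "pi": pi,
        "pi1": pi1,
        "pi2": pi2,
        "P(v|u)": pi / pi1,         # = O11 / R1
        "P(u|v)": pi / pi2,         # = O11 / C1
        "ratio": pi / (pi1 * pi2),  # MI is the logarithm of this ratio
    }


print(mle(50, 100, 100, 1000))  # well-supported pair
print(mle(1, 1, 1, 1000))       # single cooccurrence: estimates rest on one token
```

For the second pair every estimate is determined by a single observation, which is why such values are unreliable for small frequencies.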

The Null Hypothesis

• null hypothesis H0: no association = independence of instances, i.e. P(U=u ∧ V=v) = P(U=u) · P(V=v)

• not all parameters determined

• MLE maximise probability of observed data under H0

Likelihood Measures

• probability of observed data under H0 (with MLE)

• probability of single cell: X11 should be most "informative"

Likelihood Measures

• small likelihood values = strong association

• computed probabilities are often extremely small

• use negative base-10 logarithm → more convenient scale → high scores indicate strong association

Problems of Likelihood Measures

• three reasons for low likelihood
  • observed data is inconsistent with the null hypothesis because of strong association
  • association may also be negative (fewer coocs. than expected)
  • observed data is consistent, but probability mass is spread across many similar contingency tables

Problems of Likelihood Measures

• high frequency = low likelihood

• e.g. binomial likelihood
  • O11=1, E11=1 → L = 0.3679
  • O11=1000, E11=1000 → L = 0.0126
  • O11=4, E11=1 → L ≈ 0.0126

• need to "normalise" likelihood

• NB: likelihood association measures often have good empirical results nonetheless
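
The sample size behind these likelihood values is not given, so the sketch below uses the Poisson approximation to the binomial as an assumption (reasonable when N is large and E11/N is small). The function name is hypothetical.

```python
import math


def poisson_likelihood(o11, e11):
    """Poisson probability of observing exactly o11 cooccurrences when e11
    are expected; used here as an approximation to the binomial likelihood."""
    return math.exp(o11 * math.log(e11) - e11 - math.lgamma(o11 + 1))


for o11, e11 in [(1, 1), (1000, 1000), (4, 1)]:
    lik = poisson_likelihood(o11, e11)
    print(f"O11={o11}, E11={e11}: L={lik:.4f}, -log10(L)={-math.log10(lik):.2f}")
```

Under this approximation the first two cases come out at 0.3679 and 0.0126, and the third lands in the same order of magnitude as the second: a table that matches H0 perfectly at high frequency is about as "unlikely" as a low-frequency table with four times the expected cooccurrences.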

Likelihood Ratios

• simplest normalisation technique

• divide maximum probability of data under H0 by unconstrained maximum probability

• suggested by Dunning (1993)
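
For the multinomial model, the ratio λ of the two maximised likelihoods reduces to the product of (E_ij / O_ij)^O_ij over the four cells, so that -2 log λ = 2 Σ O_ij log(O_ij / E_ij). A minimal sketch of this standard form follows; the function name and argument layout are assumptions.

```python
import math


def log_likelihood_ratio(f, f1, f2, n):
    """G2 = -2 log(lambda) = 2 * sum of O_ij * log(O_ij / E_ij),
    with E_ij = R_i * C_j / n. Assumes 0 < f1 < n and 0 < f2 < n."""
    observed = [f, f1 - f, f2 - f, n - f1 - f2 + f]
    rows = [f1, f1, n - f1, n - f1]
    cols = [f2, n - f2, f2, n - f2]
    g2 = 0.0
    for o, r, c in zip(observed, rows, cols):
        if o > 0:                           # 0 * log(0) is taken as 0
            g2 += o * math.log(o * n / (r * c))
    return 2 * g2


print(log_likelihood_ratio(50, 100, 100, 1000))  # example A
print(log_likelihood_ratio(1, 1, 1, 1000))       # example B
print(log_likelihood_ratio(10, 100, 100, 1000))  # O11 = E11: no evidence against H0
```

In contrast to MI, the well-attested pair A now scores higher than the single cooccurrence B.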

Statistical Hypothesis Tests

• compute probability of group of outcomes instead of single one

• observed contingency table is grouped with all tables that provide at least the same amount of evidence against H0

• total probability is known as the p-value or significance

• problem: ranking of cont. tables

Asymptotic Tests

• asymptotic tests define the ranking of contingency tables explicitly

• compute test statistic from data
  • higher values = more evidence against H0

• can use test statistic as an AM

• theory: approximation of p-value associated with test statistic (accurate in the limit N → ∞)

Asymptotic Tests

• standard test for independence is Pearson's chi-squared test

• limiting distribution = χ² distribution with df=1

• number of degrees of freedom was subject of a long debate
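
A sketch of the test statistic in its standard form, X² = Σ (O_ij − E_ij)² / E_ij, together with the df=1 p-value, which can be written with the complementary error function. Function names are invented for illustration.

```python
import math


def chi_squared(f, f1, f2, n):
    """Pearson's X2 for the 2x2 table given by joint frequency f,
    marginals f1, f2 and sample size n (assumes 0 < f1 < n, 0 < f2 < n)."""
    observed = [f, f1 - f, f2 - f, n - f1 - f2 + f]
    expected = [
        f1 * f2 / n,
        f1 * (n - f2) / n,
        (n - f1) * f2 / n,
        (n - f1) * (n - f2) / n,
    ]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def p_value_df1(x):
    """Upper tail P(chi2 > x) of the chi-squared distribution with df=1."""
    return math.erfc(math.sqrt(x / 2))


x2 = chi_squared(50, 100, 100, 1000)
print(x2, p_value_df1(x2))
```

The p-value refers to the two-sided test; the statistic itself can be used directly as an association score.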

Two-Sided Tests

• chi-squared test is two-sided, i.e. no difference between positive and negative association

• ignore small number of pairs with (non-total) negative association

• or convert to one-sided test: reject H0 only when O11 > E11

• p-value is usually divided by 2

Yates' Continuity Correction

• Pearson's chi-squared test approximates the discrete binomial distribution of each cell by a continuous normal distribution (→ "normal theory")

• estimating probabilities P(Xij ≥ k) from the normal distribution introduces systematic errors

Yates' Continuity Correction

• generic form of Yates' continuity correction for contingency tables

• usefulness is still controversial (criticised as too conservative)

• applicability for chi-squared test is generally accepted
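
Applied to the chi-squared statistic, the correction is commonly written as Σ (|O_ij − E_ij| − 0.5)² / E_ij. The sketch below assumes this standard form; clamping the corrected difference at zero is a design choice of this example, and the function name is hypothetical.

```python
def chi_squared_yates(f, f1, f2, n):
    """Pearson's X2 with Yates' continuity correction: every absolute
    difference |O_ij - E_ij| is reduced by 0.5 (but not below 0)."""
    observed = [f, f1 - f, f2 - f, n - f1 - f2 + f]
    expected = [
        f1 * f2 / n,
        f1 * (n - f2) / n,
        (n - f1) * f2 / n,
        (n - f1) * (n - f2) / n,
    ]
    return sum(
        max(abs(o - e) - 0.5, 0.0) ** 2 / e
        for o, e in zip(observed, expected)
    )


print(chi_squared_yates(50, 100, 100, 1000))  # example A
print(chi_squared_yates(1, 1, 1, 1000))       # example B
```

The corrected value is always at most the uncorrected one, and the reduction is largest in relative terms for low-frequency tables.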

Asymptotic Tests

• different form of chi-squared test (comparison of two binomials) is equivalent to independence test

• special eq. with Yates' correction

Asymptotic Tests

• can also use log-likelihood ratio as a test statistic (two-sided)

• limiting distribution is found to be a χ² distribution with df=1

• more conservative than Pearson's chi-squared test

• Dunning (1993) showed that Pearson's test over-estimates evidence against H0 (simulation)

Something I'd Rather Not Mention

• Church & Hanks: O11 and E11 are both random variables

• H0: expected values are equal

• assume normal distribution with unknown variance

• compare O11 and E11 with Student's t-test, estimating unknown variance from the observed data

Something I'd Rather Not Mention

• one-sided test

• statistical model is questionable

• limiting distribution: t-distribution with df ≈ N

• even more conservative than log-likelihood (low-frequency data)
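
The resulting score is usually given as t = (O11 − E11) / √O11, with the observed frequency serving as the variance estimate. The sketch assumes this common form; the formula itself is not preserved on the slide, and the function name is invented.

```python
import math


def t_score(f, f1, f2, n):
    """t-score (O11 - E11) / sqrt(O11) with E11 = f1 * f2 / n.
    Assumes f > 0."""
    e11 = f1 * f2 / n
    return (f - e11) / math.sqrt(f)


print(t_score(50, 100, 100, 1000))  # example A
print(t_score(1, 1, 1, 1000))       # example B
```

The single cooccurrence in example B cannot score above 1 here, which illustrates how conservative the measure is for low-frequency data.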

Exact Tests

• problem: how to establish ranking of contingency tables

• solution: reduce set of alternatives

• if we consider only the cell X11, the difference X11 – E11 gives a sensible ranking → binomial test

Exact Tests

• another solution: marginal frequencies do not provide evidence for or against H0 (→ "ancillary" statistics)

• condition on fixed row and column sums R1, R2, C1, C2

• conditional hypergeometric distribution does not depend on parameters π1 and π2

Exact Tests

• X11 is the only free parameter

• we can use X11 – E11 for ranking

• Fisher's exact test (Pedersen 1996)
  • computationally expensive
  • numerical difficulties
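
A one-sided version of the test can be sketched with exact integer arithmetic, which sidesteps the numerical difficulties at the price of speed. The function name is hypothetical; `math.comb` requires Python 3.8 or later.

```python
from fractions import Fraction
from math import comb


def fisher_p_value(f, f1, f2, n):
    """One-sided p-value P(X11 >= f) under the hypergeometric distribution
    with fixed margins R1 = f1, C1 = f2 and sample size n."""
    total = comb(n, f2)                      # all ways to place the C1 column tokens
    upper = min(f1, f2)                      # largest possible value of X11
    tail = sum(
        comb(f1, k) * comb(n - f1, f2 - k)   # tables with X11 = k
        for k in range(f, upper + 1)
    )
    return Fraction(tail, total)


print(float(fisher_p_value(50, 100, 100, 1000)))  # example A
print(fisher_p_value(1, 1, 1, 1000))              # example B: exactly 1/1000
```

Returning a `Fraction` keeps the result exact; converting to `float` only at the end avoids intermediate underflow, though very small p-values may still round to 0.0.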

Comparing Hypothesis Tests

• Fisher's test is now widely accepted as most appropriate

• tends to be conservative

• log-likelihood gives good approximation of "correct" p-values (slightly less conservative)

• chi-squared over-estimates

• t-score far too conservative

Other Approaches to Measuring Association

• information-theoretic (MI, entropy) → equivalent to log-likelihood

• combined measures ("boosting")

• conservative estimates instead of MLE (confidence intervals)

• hypothesis tests with different null hypothesis: π = C · π1 · π2

• mixture of conservative estimates and hypothesis tests?

Implementation

• one-sided vs. two-sided tests

• need special software to obtain p-values for asymptotic tests

• numerical accuracy

• beware of zero frequencies!

Errr.... Help!? Software?

• Ted Pedersen's N-gram Statistics Package (NSP) [Perl, portable, easy to use]

• UCS Toolkit will be available soon from www.collocations.de [Perl/Linux, some prerequisites, for the more ambitious :o) ]

More Association Measures

• lots of association measures

• will be updated

• references

• slides from this course

• under construction

Comparing Association Measures

• mathematical discussion
  • very complex
  • results only for special cases

• numerical simulation
  • computationally expensive
  • Dunning (1993, 1998)

• lazy man's approach
  • construct mock data set where frequencies vary systematically

Comparing Association Measures (N = 100,000)

Comparing Association Measures (N = 10,000,000)
