Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien IMS Association Measures.

Date post: 18-Jan-2016
Transcript
Page 1: Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien IMS Association Measures.

Association Measures

Page 2:

Reminder: Contingency Tables

Page 3:

General Remarks

• we will only use data from contingency tables

• we will consider each pair type on its own, independently of all other pair types (→ no distributional information)

• we won't distinguish between relational and positional cooccurrences

Page 4:

Association Measures (AMs)

• goal: assign association score to each pair type = strength of association between components

• high score = strong association

• association in a statistical sense, but there is no precise definition

• positive vs. negative association ("colourless green ideas")

Page 5:

Using Association Scores

• absolute values (cut-off threshold)

• input for higher-order statistics (AMs are first-order statistics) → scores should be meaningful

• ranking of collocation candidates → only relative scores matter

• rank collocates of given base → one marginal frequency fixed → only two free parameters

Page 6:

First Steps: Proportions

• Workshop on Mechanized Documentation (Washington, 1964)

$$P_1 = \frac{O_{11}}{R_1}, \qquad P_2 = \frac{O_{11}}{C_1}$$

Page 7:

First Steps: Proportions

• proportions between 0 and 1

• high proportion = strong (directional) association

• need to combine two proportions into a single association score

• average (P1 + P2) / 2 is not useful

  • f=1, f1=1, f2=1000 → avg. = 0.5005

  • f=50, f1=100, f2=100 → avg. = 0.5

→ more "conservative" weighting

Page 8:

First Steps: Proportions

• harmonic mean

• geometric mean

• minimum

• Jaccard
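The formulas for these coefficients are not preserved in the transcript. The sketch below uses their standard textbook definitions in terms of the two proportions P1 = f/f1 and P2 = f/f2, and applies them to the two invented pairs from the previous slide; the function names are illustrative assumptions, not taken from the slides.

```python
from math import sqrt

def proportions(f, f1, f2):
    """Return (P1, P2) = (f/f1, f/f2) for cooccurrence frequency f and marginals f1, f2."""
    return f / f1, f / f2

def combined_scores(f, f1, f2):
    """Combine the two proportions into a single score in several ways."""
    p1, p2 = proportions(f, f1, f2)
    return {
        "average": (p1 + p2) / 2,
        "harmonic": 2 * p1 * p2 / (p1 + p2),  # same as Dice: 2f / (f1 + f2)
        "geometric": sqrt(p1 * p2),           # same as f / sqrt(f1 * f2)
        "minimum": min(p1, p2),
        "jaccard": f / (f1 + f2 - f),         # cooccurrences / pairs containing u or v
    }

# The two invented pairs from the previous slide.
for f, f1, f2 in [(1, 1, 1000), (50, 100, 100)]:
    print((f, f1, f2), combined_scores(f, f1, f2))
```

For the pair with f=1, f1=1, f2=1000, all four "conservative" coefficients stay close to 0, whereas the plain average is about 0.5.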

Page 9:

First Steps: Proportions

• coefficients range from 0 to 1

• 1 = total (positive) association

• interpretation of lower scores is less clear

  • positive vs. negative association?

  • which score for no association?

  • what is "no association"?

→ random combinations

Page 10:

Expected Frequencies

• assume that types u and v cooccur only by chance

• f1(u) occs. of u and f2(v) occs. of v spread randomly over N tokens

• each instance of u has a chance of f2(v)/N to cooccur with a v

→ expected number of cooccurrences:

$$E_{11} := \frac{f_1(u)\, f_2(v)}{N} = \frac{R_1 C_1}{N}$$

Page 11:

Expected Frequencies

• expected frequencies for all cells of the contingency table

• assuming random combinations (→ statistical independence)

Page 12:

Expected Frequencies

• comparison of expected against observed frequencies

• note that row and column sums are the same for both tables
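The tables on these slides are not preserved, so here is a minimal sketch of how the expected table could be computed, assuming the usual 2×2 layout (row sums R1, R2, column sums C1, C2) and E_ij = R_i · C_j / N. The counts in the example are invented for illustration.

```python
def expected_table(o11, o12, o21, o22):
    """Expected cell counts under independence: E_ij = R_i * C_j / N."""
    r1, r2 = o11 + o12, o21 + o22  # row sums
    c1, c2 = o11 + o21, o12 + o22  # column sums
    n = r1 + r2                    # sample size
    return (r1 * c1 / n, r1 * c2 / n, r2 * c1 / n, r2 * c2 / n)

# Invented observed table: O11=50, O12=50, O21=50, O22=850 (N=1000).
observed = (50, 50, 50, 850)
e11, e12, e21, e22 = expected_table(*observed)
print(e11, e12, e21, e22)

# Row and column sums agree between the observed and the expected table.
print(e11 + e12, observed[0] + observed[1])
print(e11 + e21, observed[0] + observed[2])
```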

Page 13:

Mutual Information

• compares O11 with E11

• ratio O11/E11 ranges from 0 to ∞

• 1 = no association (O11=E11)

• usually logarithmic values

• range: −∞ to +∞

• 0 = no assoc., < 0 neg., > 0 pos.

• used in English lexicography

Page 14:

Low-Frequency Pairs & Random Variation

• large amount of low-frequency data (consequence of Zipf's law)

• a simple (invented) example

  • A: f=50, f1=100, f2=100, N=1000 → O11=50, E11=10, MI = log 5

  • B: f=1, f1=1, f2=1, N=1000 → O11=1, E11=.001, MI = log 1000
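The two cases can be checked with a few lines of code. This is a minimal sketch; the slides do not state the base of the logarithm, so base 2 is an assumption here, and the function name is illustrative.

```python
from math import log2

def mutual_information(f, f1, f2, n):
    """MI score log(O11 / E11), with E11 = f1 * f2 / N (base 2 assumed)."""
    e11 = f1 * f2 / n
    return log2(f / e11)

mi_a = mutual_information(50, 100, 100, 1000)  # case A: log2(5)
mi_b = mutual_information(1, 1, 1, 1000)       # case B: log2(1000)
print(mi_a, mi_b)
```

Whatever the base, the single cooccurrence in case B receives a much higher MI score than the fifty cooccurrences in case A.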

Page 15:

Low-Frequency Pairs & Random Variation

• three problems with case B

  • how meaningful is a single example? (not very much, actually)

  • could well be a spelling mistake or noise from automatic processing

  • we want to make generalisations (from particular corpus to "language")

→ this is the domain of statistics: draw inferences about population (= language) from a sample (= corpus)

Page 16:

The Statistical Model: Random Sample

• assumption: corpus data is a random sample from the language

→ base data is a random sample from all coocs. in the language

Page 17:

The Statistical Model: Random Sample

• random sample of size N is described by random variables Ui and Vi (i = 1..N), representing the labels of the i-th bigram token

• notation: U and V as "prototypes"

• for a given pair type (u,v), contingency table can be computed from Ui and Vi

→ random variables X11, X12, X21, X22

Page 18:

The Statistical Model: Random Sample

• population parameters τ11, τ12, τ21, τ22 for pair type (u,v)

• observed frequencies O11, O12, O21, O22 represent one particular realisation of the sample

• theory of random samples predicts distribution of X11, X12, X21, X22 from assumptions about the population parameters τ11, τ12, τ21, τ22

Page 19:

The Statistical Model: Random Sample

Page 20:

Two Footnotes

• vector notation for cont. tables

$$\vec{X} = (X_{11}, X_{12}, X_{21}, X_{22}), \qquad \vec{O} = (O_{11}, O_{12}, O_{21}, O_{22}), \qquad \vec{k} = (k_{11}, k_{12}, k_{21}, k_{22})$$

• population ≠ general language

  • restricted to domain(s), genre(s), ... covered by source corpus

  • e.g. black box in computer science vs. newspapers vs. cooking

Page 21:

The Sampling Distribution

• multinomial sampling distribution

• each individual cell count Xij has a binomial distribution (but these are not independent)

Page 22:

The Sampling Distribution

• given assumptions about the population parameters, we can compute the likelihood of the observed contingency table

• relatively high likelihood = consistent with assumptions

• relatively low likelihood = evidence against assumptions (inversely proportional to likelihood)
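As a rough illustration of the likelihood computation, the sketch below evaluates the multinomial probability of an observed 2×2 table for a given set of cell probabilities. It works in log space to avoid underflow; the function name and the example counts are assumptions made for this illustration.

```python
from math import exp, lgamma, log

def multinomial_log_likelihood(observed, probs):
    """log P(X = observed) under a multinomial with the given cell probabilities."""
    n = sum(observed)
    ll = lgamma(n + 1)                 # log N!
    for k, p in zip(observed, probs):
        ll -= lgamma(k + 1)            # minus log k!
        if k > 0:
            if p == 0.0:
                return float("-inf")   # impossible outcome under these parameters
            ll += k * log(p)
    return ll

observed = (50, 50, 50, 850)  # invented table, N = 1000

# Assumption 1: independence, with marginal probabilities 0.1 and 0.1.
ll_indep = multinomial_log_likelihood(observed, (0.01, 0.09, 0.09, 0.81))
# Assumption 2: cell probabilities equal to the observed relative frequencies.
ll_free = multinomial_log_likelihood(observed, (0.05, 0.05, 0.05, 0.85))
print(ll_indep, ll_free)
```

The table is far less likely under the independence assumption than under the parameters estimated directly from the data, which is the sense in which low likelihood counts as evidence against an assumption.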

Page 23:

Adequacy of the Statistical Model

• particular sequence of pair tokens is irrelevant, only the overall frequencies matter (→ sufficiency)

• randomness assumption (random sample from fixed population)

  • independence of pair tokens

  • constancy of population parameters

• violations problematic only when they affect sampling distribution

Page 24:

Adequacy of the Statistical Model

• three causes of non-randomness

  • local dependencies (e.g. syntax): usually not problematic

  • inhomogeneity of source corpus (speakers, domains, topics, ...) → mixture population

  • repetition / clustering of bigrams can be a serious problem (does not affect segment-based data if clustered within segments)

Page 25:

Making Assumptions about the Population Parameters

• population parameters (π, π1, π2) are unknown

• best guess from observation: MLE = maximum-likelihood estimate

Page 26:

Making Assumptions about the Population Parameters

• conditional probabilities with MLE

$$P(V = v \mid U = u) \approx \frac{O_{11}}{R_1} = P_1, \qquad P(U = u \mid V = v) \approx \frac{O_{11}}{C_1} = P_2$$

• Dice coefficient etc. are MLE for population characteristics

• MI is MLE for log(π / (π1 π2))

→ unreliable for small frequencies

Page 27:

The Null Hypothesis

• null hypothesis H0: no association = independence of instances, i.e. P(U=u ∧ V=v) = P(U=u) · P(V=v)

• not all parameters determined

• MLE maximise probability of observed data under H0

Page 28:

Likelihood Measures

• probability of observed data under H0 (with MLE)

• probability of single cell: X11 should be most "informative"

Page 29:

Likelihood Measures

• small likelihood values = strong association

• computed probabilities are often extremely small

• use negative base-10 logarithm → more convenient scale → high scores indicate strong association

Page 30:

Problems of Likelihood Measures

• three reasons for low likelihood

  • observed data is inconsistent with the null hypothesis because of strong association

  • association may also be negative (fewer coocs. than expected)

  • observed data is consistent, but probability mass is spread across many similar contingency tables

Page 31:

Problems of Likelihood Measures

• high frequency = low likelihood

• e.g. binomial likelihood

  • O11=1, E11=1 → L = 0.3679

  • O11=1000, E11=1000 → L = 0.0126

  • O11=4, E11=1 → L ≈ 0.0126

• need to "normalise" likelihood

• NB: likelihood association measures often have good empirical results nonetheless
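The figures above can be approximated with a short script. The slides name the binomial likelihood but do not give N, so this sketch substitutes the Poisson limit of the binomial for large N; that substitution, and the function name, are assumptions made here.

```python
from math import exp, lgamma, log, log10

def poisson_likelihood(o11, e11):
    """P(X11 = o11) for a Poisson distribution with mean e11 (large-N limit of the binomial)."""
    return exp(o11 * log(e11) - e11 - lgamma(o11 + 1))

for o11, e11 in [(1, 1), (1000, 1000), (4, 1)]:
    likelihood = poisson_likelihood(o11, e11)
    score = -log10(likelihood)  # negative base-10 logarithm: high score = low likelihood
    print(o11, e11, round(likelihood, 4), round(score, 2))
```

Under this approximation the first two cases come out at about 0.3679 and 0.0126, and the third at about 0.015, the same order of magnitude as the second. A high-frequency pair that matches its expected frequency exactly thus scores about as "strongly" as a pair with four times the expected number of cooccurrences.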

Page 32:

Likelihood Ratios

• simplest normalisation technique

• divide maximum probability of data under H0 by unconstrained maximum probability

• suggested by Dunning (1993)
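The slide's formula is not preserved. For the multinomial model, taking −2 times the natural log of the likelihood ratio gives the familiar statistic 2 · Σ O_ij · log(O_ij / E_ij). The sketch below computes that form, which is an assumption about how the score is reported; the function name and counts are illustrative.

```python
from math import log

def log_likelihood_ratio(o11, o12, o21, o22):
    """-2 log(lambda) = 2 * sum O_ij * log(O_ij / E_ij), with 0 * log 0 taken as 0."""
    r1, r2 = o11 + o12, o21 + o22
    c1, c2 = o11 + o21, o12 + o22
    n = r1 + r2
    observed = (o11, o12, o21, o22)
    expected = (r1 * c1 / n, r1 * c2 / n, r2 * c1 / n, r2 * c2 / n)
    # Cells with O_ij = 0 contribute nothing and are skipped to avoid log(0).
    return 2 * sum(o * log(o / e) for o, e in zip(observed, expected) if o > 0)

print(log_likelihood_ratio(50, 50, 50, 850))  # strongly associated invented table
print(log_likelihood_ratio(10, 90, 90, 810))  # table that matches independence exactly
```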

Page 33:

Statistical Hypothesis Tests

• compute probability of group of outcomes instead of single one

• observed contingency table is grouped with all tables that provide at least the same amount of evidence against H0

• total probability is known as the p-value or significance

• problem: ranking of cont. tables

Page 34:

Asymptotic Tests

• asymptotic tests define ranking of contingency tables explicitly

• compute test statistic from data

  • higher values = more evidence against H0

• can use test statistic as an AM

• theory: approximation of p-value associated with test statistic (accurate in the limit N → ∞)

Page 35:

Asymptotic Tests

• standard test for independence is Pearson's chi-squared test

• limiting distribution = χ² distribution with df=1

• number of degrees of freedom was subject of a long debate
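The test statistic itself is missing from the transcript. A minimal sketch using the textbook definition X² = Σ (O_ij − E_ij)² / E_ij follows, together with the equivalent closed form for 2×2 tables; the function names and the example counts are assumptions.

```python
def chi_squared(o11, o12, o21, o22):
    """Pearson's X^2 = sum over cells of (O - E)^2 / E."""
    r1, r2 = o11 + o12, o21 + o22
    c1, c2 = o11 + o21, o12 + o22
    n = r1 + r2
    observed = (o11, o12, o21, o22)
    expected = (r1 * c1 / n, r1 * c2 / n, r2 * c1 / n, r2 * c2 / n)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi_squared_2x2(o11, o12, o21, o22):
    """Closed form for 2x2 tables: N * (O11*O22 - O12*O21)^2 / (R1*R2*C1*C2)."""
    r1, r2 = o11 + o12, o21 + o22
    c1, c2 = o11 + o21, o12 + o22
    n = r1 + r2
    return n * (o11 * o22 - o12 * o21) ** 2 / (r1 * r2 * c1 * c2)

print(chi_squared(50, 50, 50, 850), chi_squared_2x2(50, 50, 50, 850))
```

Both forms assume that no row or column sum is zero.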

Page 36:

Two-Sided Tests

• chi-squared test is two-sided, i.e. no difference between positive and negative association

• ignore small number of pairs with (non-total) negative association

• or convert to one-sided test: reject H0 only when O11 > E11

• p-value is usually divided by 2

Page 37:

Yates' Continuity Correction

• Pearson's chi-squared test approximates discrete binomial distributions of each cell by continuous normal distribution (→ "normal theory")

• estimating probabilities P(Xij ≥ k) from normal distribution introduces systematic errors

Page 38:

Yates' Continuity Correction

Page 39:

Yates' Continuity Correction

Page 40:

Yates' Continuity Correction

• generic form of Yates' continuity correction for contingency tables

• usefulness is still controversial (criticised as too conservative)

• applicability for chi-squared test is generally accepted
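The corrected formula is not preserved in the transcript. In its usual textbook form, the correction shrinks each absolute difference |O_ij − E_ij| by 0.5 before squaring. The sketch below applies that form to the chi-squared statistic; the function name, the clamping at zero, and the counts are assumptions made for illustration.

```python
def chi_squared_yates(o11, o12, o21, o22):
    """Chi-squared statistic with Yates' correction: sum of (|O - E| - 0.5)^2 / E."""
    r1, r2 = o11 + o12, o21 + o22
    c1, c2 = o11 + o21, o12 + o22
    n = r1 + r2
    observed = (o11, o12, o21, o22)
    expected = (r1 * c1 / n, r1 * c2 / n, r2 * c1 / n, r2 * c2 / n)
    # Differences smaller than 0.5 are clamped to 0 so the correction never adds evidence.
    return sum(max(0.0, abs(o - e) - 0.5) ** 2 / e for o, e in zip(observed, expected))

print(chi_squared_yates(50, 50, 50, 850))
```

For the invented table used here the corrected statistic is about 192.6, slightly below the uncorrected value of about 197.5, which is the sense in which the correction is conservative.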

Page 41:

Asymptotic Tests

• different form of chi-squared test (comparison of two binomials) is equivalent to independence test

• special eq. with Yates' correction

Page 42:

Asymptotic Tests

• can also use log-likelihood ratio as a test statistic (two-sided)

• limiting distribution is found to be χ² distribution with df=1

• more conservative than Pearson's chi-squared test

• Dunning (1993) showed that Pearson's test over-estimates evidence against H0 (simulation)

Page 43:

Something I'd Rather Not Mention

• Church & Hanks: O11 and E11 are both random variables

• H0: expected values are equal

• assume normal distribution with unknown variance

• compare O11 and E11 with Student's t-test, estimating unknown variance from the observed data

Page 44:

Something I'd Rather Not Mention

• one-sided test

• statistical model is questionable

• limiting distribution: t-distribution with df ≈ N

• even more conservative than log-likelihood (low-frequency data)
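The slide's formula is not preserved. In the collocation literature the t-score is commonly written as (O11 − E11) / √O11, where the unknown variance is estimated by O11. The sketch below assumes that form and reuses the two invented pairs A and B from the low-frequency example.

```python
from math import sqrt

def t_score(f, f1, f2, n):
    """t-score (O11 - E11) / sqrt(O11), with E11 = f1 * f2 / N (commonly cited form)."""
    e11 = f1 * f2 / n
    return (f - e11) / sqrt(f)

print(t_score(50, 100, 100, 1000))  # pair A: about 5.66
print(t_score(1, 1, 1, 1000))       # pair B: about 1.0
```

Unlike MI, the t-score ranks the well-attested pair A far above the single-occurrence pair B.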

Page 45:

Exact Tests

• problem: how to establish ranking of contingency tables

• solution: reduce set of alternatives

• if we consider only the cell X11, the difference X11 – E11 gives a sensible ranking → binomial test

Page 46:

Exact Tests

• another solution: marginal frequencies do not provide evidence for or against H0 (→ "ancillary" statistics)

• condition on fixed row and column sums R1, R2, C1, C2

• conditional hypergeometric distribution does not depend on parameters π1 and π2

Page 47:

Exact Tests

• X11 is the only free parameter

• we can use X11 – E11 for ranking

• Fisher's exact test (Pedersen 1996)

• computationally expensive

• numerical difficulties
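A minimal sketch of the one-sided test follows, assuming the p-value is the hypergeometric probability of seeing at least O11 cooccurrences given fixed row and column sums. It uses exact integer arithmetic to avoid the numerical problems, at the cost of speed; the function name is illustrative and `math.comb` requires Python 3.8 or later.

```python
from fractions import Fraction
from math import comb

def fisher_one_sided(o11, r1, c1, n):
    """Exact P(X11 >= o11) under the hypergeometric distribution with fixed margins."""
    c2 = n - c1
    k_max = min(r1, c1)  # largest possible value of X11
    # comb(c2, r1 - k) is 0 whenever r1 - k > c2, so impossible tables drop out.
    tail = sum(comb(c1, k) * comb(c2, r1 - k) for k in range(o11, k_max + 1))
    return Fraction(tail, comb(n, r1))

p = fisher_one_sided(50, 100, 100, 1000)  # invented pair: f=50, f1=100, f2=100, N=1000
print(float(p))
```

The binomial coefficients grow very large even for this small example, which hints at why the test is expensive for realistic corpus sizes.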

Page 48:

Comparing Hypothesis Tests

• Fisher's test is now widely accepted as most appropriate

• tends to be conservative

• log-likelihood gives good approximation of "correct" p-values (slightly less conservative)

• chi-squared over-estimates

• t-score far too conservative

Page 49:

Other Approaches to Measuring Association

• information-theoretic (MI, entropy) → equivalent to log-likelihood

• combined measures ("boosting")

• conservative estimates instead of MLE (confidence intervals)

• hypothesis tests with different null hypothesis: π = C · π1 π2

• mixture of conservative estimates and hypothesis tests?

Page 50:

Implementation

• one-sided vs. two-sided tests

• need special software to obtain p-values for asymptotic tests

• numerical accuracy

• beware of zero frequencies!
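For the special case of a χ² distribution with df = 1, the p-value can be obtained from the standard library alone, because the upper tail equals erfc(√(x/2)). The sketch below relies on that identity; the function names and the handling of the one-sided case are assumptions made for illustration.

```python
from math import erfc, sqrt

def chi2_p_value_df1(statistic):
    """Upper-tail p-value of a chi-squared distribution with one degree of freedom."""
    return erfc(sqrt(statistic / 2.0))

def one_sided_p_value(statistic, o11, e11):
    """Halve the two-sided p-value when O11 > E11; otherwise use the opposite tail."""
    p = chi2_p_value_df1(statistic)
    return p / 2.0 if o11 > e11 else 1.0 - p / 2.0

print(chi2_p_value_df1(3.841458820694124))  # about 0.05
print(chi2_p_value_df1(2000.0))             # underflows to 0.0 in double precision
```

The second call shows the accuracy issue: for very strong associations the p-value is too small to represent, so rankings are better based on the test statistic or on a log-scale p-value.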

Page 51:

Errr.... Help!? Software?

• Ted Pedersen's N-gram Statistics Package (NSP) [Perl, portable, easy to use]

• UCS Toolkit will be available soon from www.collocations.de [Perl/Linux, some prerequisites, for the more ambitious :o) ]

Page 52:

More Association Measures

• lots of association measures

• will be updated

• references

• slides from this course

• under construction

Page 53:

Comparing Association Measures

• mathematical discussion

  • very complex

  • results only for special cases

• numerical simulation

  • computationally expensive

  • Dunning (1993, 1998)

• lazy man's approach

  • construct mock data set where frequencies vary systematically
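One way such a mock data set could look is sketched below. The grid of frequencies is entirely hypothetical, and the two measures compared (MI with base-2 logarithm, and the t-score in its commonly cited form) stand in for whichever measures are being studied.

```python
from math import log2, sqrt

N = 100_000  # assumed sample size for the mock data

def mock_pairs(n=N):
    """Hypothetical grid: cooccurrence frequency f and marginals f1 = f2 = f * m."""
    return [(f, f * m, f * m, n) for f in (1, 10, 100) for m in (1, 10, 100)]

def mi(f, f1, f2, n):
    return log2(f * n / (f1 * f2))       # log2(O11 / E11)

def t_score(f, f1, f2, n):
    return (f - f1 * f2 / n) / sqrt(f)   # (O11 - E11) / sqrt(O11)

pairs = mock_pairs()
best_by_mi = max(pairs, key=lambda p: mi(*p))
best_by_t = max(pairs, key=lambda p: t_score(*p))
print(best_by_mi, best_by_t)
```

On this grid MI puts the single-occurrence pair at the top, while the t-score prefers the most frequent pair with total association, so the two measures disagree about the best candidate.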

Page 54:

Comparing Association Measures (N = 100,000)

Page 55:

Comparing Association Measures (N = 100,000)

Page 56:

Comparing Association Measures (N = 100,000)

Page 57:

Comparing Association Measures (N = 100,000)

Page 58:

Comparing Association Measures (N = 100,000)

Page 59:

Comparing Association Measures (N = 10,000,000)

Page 60:

Comparing Association Measures (N = 10,000,000)

Page 61:

Comparing Association Measures (N = 100,000)

Page 62:

Comparing Association Measures (N = 100,000)

Page 63:

Comparing Association Measures (N = 100,000)

Page 64:

Comparing Association Measures (N = 100,000)

