Relative entropy and naive discriminative learning
Harald Baayen
in collaboration with
Petar Milin, Peter Hendrix, Dusica Filipovic-Markovic, and Marco Marelli
San Diego, January 15–16, 2011
overview
I Milin, Filipovic-Durdevic & Moscoso del Prado (2009)
I Experiment 1: replication with primed self-paced reading
I Modeling with naive discriminative learning
I Experiment 2: relative entropy in syntax (lex. dec.)
I Experiment 3: relative entropy in syntax (eye-tracking)
I Relative entropy, random intercepts, and stem support
Milin et al. 2009
I {p}: the probability distribution of exponents
of a given lemma
I {q}: the probability distribution of exponents
across all lemmata in an inflectional class
I relative entropy: RE = Σ_i p_i log2(p_i / q_i)
I greater relative entropy, longer lexical decision latencies
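As a concrete illustration, the measure can be computed in a few lines of R (the language the model is implemented in); the probability distributions below are invented for illustration and are not the Serbian data of Milin et al.

# relative entropy RE = sum_i p_i * log2(p_i / q_i)
# p: exponent probabilities of one lemma; q: exponent probabilities of the
# whole inflectional class (both vectors are made-up illustrative values)
p <- c(nom = 0.50, gen = 0.20, dat = 0.10, acc = 0.20)
q <- c(nom = 0.35, gen = 0.30, dat = 0.15, acc = 0.20)

relative.entropy <- function(p, q) {
  ok <- p > 0                        # terms with p_i = 0 contribute 0 by convention
  sum(p[ok] * log2(p[ok] / q[ok]))
}
relative.entropy(p, q)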
Replication study using primed self-paced reading
I weighted relative entropy: WRE = Σ_i [ p_i w_i / (Σ_j p_j w_j) ] log2(p_i / q_i)
I weights w_i = f(target_i) / f(prime_i)
I a greater WRE predicts longer latencies
I but interactions with masculine gender and nominative case
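A corresponding sketch for the weighted measure, reading the formula as a reweighting of the p distribution; all frequencies below are invented for illustration.

# weighted relative entropy: p_i is reweighted by w_i = f(target_i) / f(prime_i)
# before comparison with q (illustrative numbers only)
p <- c(nom = 0.50, gen = 0.20, dat = 0.10, acc = 0.20)     # lemma's exponents
q <- c(nom = 0.35, gen = 0.30, dat = 0.15, acc = 0.20)     # inflectional class
w <- c(nom = 120, gen = 30, dat = 10, acc = 60) /          # target frequencies
     c(nom =  80, gen = 45, dat = 25, acc = 50)            # prime frequencies

pw  <- p * w / sum(p * w)          # renormalized, weighted probabilities
wre <- sum(pw * log2(p / q))
wre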
Interactions with weighted relative entropy
[Figure: partial effects on self-paced reading latency (ms) of target lemma frequency, prime word frequency, normalized Levenshtein distance, cosine similarity, and weighted relative entropy; the WRE effect is shown separately for targets in the nominative (TRUE vs. FALSE) and for targets of masculine gender (TRUE vs. FALSE).]
Modeling (weighted) relative entropy effects
sources of inspiration
I recent work by Michael Ramscar on the Rescorla-Wagner
equations in language acquisition
I old work by Fermin Moscoso del Prado Martin
(PhD thesis, chapter 10)
I discussions with Jim Blevins
Models of morphological processing:
the ‘standard’ model (Rastle, Davis)
[Diagram: a form layer with the letters and letter pairs of 'winner' (w, i, n, e, r; #w, wi, in, nn, ne, er, r#), an intermediate morphological layer (win, er), and a semantic layer (WINNER).]
Our approach: a-morphous morphology
[Diagram: the same form layer (letters and letter pairs of 'winner') maps directly onto the semantic layer (WIN, AGENT); no morphological representations mediate between form and meaning.]
orthographic cues
I letters and letter pairs as cues for meanings
I legal scrabble words beginning with qa
I qaid (Muslim tribal chief)
I qanat (gently sloping underground tunnel for irrigation)
I qat (leaf of the shrub Catha edulis)
I our model is based on a generalization of this idea
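A minimal sketch of cue extraction along these lines; the function name and the handling of the boundary marker are my own choices, not taken from the model's actual code.

# extract orthographic cues: single letters plus letter pairs,
# with '#' marking the word boundary (as in the 'winner' example above)
orthographic.cues <- function(word) {
  padded   <- strsplit(paste0("#", word, "#"), "")[[1]]
  bigrams  <- paste0(padded[-length(padded)], padded[-1])
  unigrams <- strsplit(word, "")[[1]]
  unique(c(unigrams, bigrams))
}
orthographic.cues("winner")
# "w" "i" "n" "e" "r" "#w" "wi" "in" "nn" "ne" "er" "r#"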
naive discriminative learning
I Links between orthography (cues) and semantics (outcomes)
are established through discriminative learning
I Rescorla-Wagner equations for discriminative learning
(Rescorla & Wagner, 1972)
I Equilibrium equations for the Rescorla-Wagner equations
(Danks, 2003)
I The activation for a given meaning outcome is the sum of all
associative links between the (active) input letters and letter
pairs and that meaning
Rescorla-Wagner equations
V_i^(t+1) = V_i^t + ΔV_i^t

with

ΔV_i^t =
  0                                          if ABSENT(C_i, t)
  α_i β_1 (λ − Σ_{PRESENT(C_j, t)} V_j)      if PRESENT(C_j, t) & PRESENT(O, t)
  α_i β_2 (0 − Σ_{PRESENT(C_j, t)} V_j)      if PRESENT(C_j, t) & ABSENT(O, t)
I if a cue is reliable, its connection strength will increase
I if a cue is unreliable, its connection strength will decrease
I if many cues are relevant simultaneously, the contribution of a
single cue from the set will be small
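A sketch of a single Rescorla-Wagner update in R; the function name, the default parameter values, and the tiny example at the end are mine, chosen only to make the equations concrete.

# one Rescorla-Wagner learning step for a single outcome O
# V: named vector of cue-to-O weights; cues.present: cues active in this event;
# outcome.present: whether O occurs in this event
rw.update <- function(V, cues.present, outcome.present,
                      alpha = 0.1, beta1 = 0.1, beta2 = 0.1, lambda = 1) {
  Vtotal <- sum(V[cues.present])                  # summed support of the present cues
  if (outcome.present) {
    delta <- alpha * beta1 * (lambda - Vtotal)    # cue and outcome co-occur
  } else {
    delta <- alpha * beta2 * (0 - Vtotal)         # cue present, outcome absent
  }
  V[cues.present] <- V[cues.present] + delta      # weights of absent cues are unchanged
  V
}

# two learning events for the outcome 'hand', with single letters as cues
V <- setNames(rep(0, 5), c("h", "a", "n", "d", "s"))
V <- rw.update(V, c("h", "a", "n", "d"), TRUE)        # the word 'hand'
V <- rw.update(V, c("h", "a", "n", "d", "s"), TRUE)   # the word 'hands'
V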
Example lexicon
Word     Frequency   Lexical Meaning   Number
hand         10      hand
hands        20      hand              plural
land          8      land
lands         3      land              plural
and          35      and
sad          18      sad
as           35      as
lad         102      lad
lads         54      lad               plural
lass        134      lass
The Rescorla-Wagner equations applied
[Figure: development of the connection weights over learning time t (0 to 10,000 trials) for the cue-outcome pairs h → hand, s → plural, a → as, and s → as.]
a shortcut straight to the adult stable state
I equilibrium equations (Danks, 2003): when the system is in a
stable state, the connection weights to a given meaning can
be estimated by solving a set of linear equations
⎡ Pr(C0|C0)  Pr(C1|C0)  ...  Pr(Cn|C0) ⎤ ⎡ V0 ⎤   ⎡ Pr(O|C0) ⎤
⎢ Pr(C0|C1)  Pr(C1|C1)  ...  Pr(Cn|C1) ⎥ ⎢ V1 ⎥ = ⎢ Pr(O|C1) ⎥
⎢    ...        ...     ...     ...    ⎥ ⎢ ...⎥   ⎢   ...    ⎥
⎣ Pr(C0|Cn)  Pr(C1|Cn)  ...  Pr(Cn|Cn) ⎦ ⎣ Vn ⎦   ⎣ Pr(O|Cn) ⎦
V_i: the association strength of the i-th cue C_i to outcome O
I the association strengths V_j jointly provide the best prediction of the
outcome, given the conditional co-occurrence
probabilities characterizing the input space
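A sketch of this equilibrium computation in R, estimating the conditional probabilities from frequency-weighted co-occurrence counts and solving the linear system with solve(); the function and variable names are mine, and the toy events are invented (and assume the cue co-occurrence matrix is non-singular).

# equilibrium association strengths to one outcome O (Danks, 2003):
# solve C %*% V = o, with C[i, j] = Pr(Cj | Ci) and o[i] = Pr(O | Ci)
equilibrium.weights <- function(freqs, cue.sets, outcome.present) {
  cues <- sort(unique(unlist(cue.sets)))
  n <- length(cues)
  cooc <- matrix(0, n, n, dimnames = list(cues, cues))   # cue co-occurrence counts
  o.count <- setNames(numeric(n), cues)                  # cue-outcome co-occurrences
  for (e in seq_along(cue.sets)) {
    idx <- cues %in% cue.sets[[e]]
    cooc[idx, idx] <- cooc[idx, idx] + freqs[e]
    if (outcome.present[e]) o.count[idx] <- o.count[idx] + freqs[e]
  }
  C <- cooc / diag(cooc)      # row i divided by the frequency of cue i: Pr(Cj | Ci)
  o <- o.count / diag(cooc)   # Pr(O | Ci)
  solve(C, o)                 # the equilibrium weights V
}

# toy usage: 'h' occurs only in events with the meaning HAND, so it ends up
# carrying all of the association strength (V = 1 for h, 0 for a and s)
freqs    <- c(10, 20, 5)
cue.sets <- list(c("h", "a"), c("h", "a", "s"), c("a", "s"))
has.hand <- c(TRUE, TRUE, FALSE)
equilibrium.weights(freqs, cue.sets, has.hand)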
from weights to meaning activations
I the activation a_i of meaning i is the sum
of its incoming connection strengths: a_i = Σ_j V_ji
I the greater the meaning activation,
the shorter the response latencies
I simplest case: RT_sim,i ∝ −a_i
I a log transformation may be required to remove the right skew
from the distribution of simulated RTs: RT_sim,i ∝ log(1 / a_i)
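Continuing the sketch: given a cue-to-meaning weight matrix, activations and simulated RTs follow directly. The weight values below are invented for illustration, not estimated from a corpus.

# rows of W are cues, columns are meanings; the activation of a meaning is the
# sum of the weights of the cues that are active in the input
W <- matrix(c(0.60, 0.00,
              0.10, 0.00,
              0.20, 0.05,
              0.05, 0.35,
              0.00, 0.45),
            nrow = 5, byrow = TRUE,
            dimnames = list(c("h", "a", "nd", "s", "s#"),
                            c("HAND", "PLURAL")))

active.cues <- c("h", "a", "nd", "s", "s#")   # some of the cues active in 'hands'
a <- colSums(W[active.cues, ])                # meaning activations a_i
log(1 / a)                                    # log-transformed simulated RTs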
the naive discriminative reader
I basic engine is parameter-free, and driven completely and only
by the language input
I the model is computationally undemanding: building the
weight matrix from a lexicon of 11 million phrases takes 10
minutes on my desktop
I implementation in R
from weights to meaning activations
I for Serbian case-inflected nouns, sum over lexical meanings
and grammatical meanings
I for priming, we use Ratcliff & McKoon's compound cue theory:
S = Σ_{i=1..10} a_{P,i}^w · a_{T,i}^(1−w)      (0 ≤ w ≤ 0.5)
I this introduces a free parameter for the prime duration
I we also use one free parameter to model the time required to
plan and execute a second fixation for longer words
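A sketch of the compound-cue combination; the activation vectors and the value of w below are illustrative only (w would be tied to the prime duration, as noted above).

# compound cue for prime-target pairs: prime and target activations of the
# same ten meanings are combined as a weighted geometric mean and then summed
compound.cue <- function(a.prime, a.target, w = 0.3) {   # 0 <= w <= 0.5
  sum(a.prime^w * a.target^(1 - w))
}

a.prime  <- c(0.40, 0.10, 0.05, 0.20, 0.02, 0.15, 0.08, 0.03, 0.12, 0.06)
a.target <- c(0.55, 0.08, 0.04, 0.25, 0.01, 0.10, 0.06, 0.02, 0.18, 0.07)
compound.cue(a.prime, a.target)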
Observed and simulated latencies (r = 0.24)
[Figure: simulated RTs and log observed RTs as a function of word length, prime condition (DD, DS, SD, SS), target form frequency, prime form frequency, and weighted relative entropy.]
no effect of RE in the simulation for masculine nouns
Activation of case meanings
[Figure: activations of the case and number meanings (nominative, genitive, dative, accusative, instrumental, locative, singular, plural) for the Serbian forms žena, ženama, žene, ženi, ženom, and ženu.]
Summary Experiment 1
I relative entropy effects persist in sentential reading
I they are modified, but not destroyed by priming
I the interaction with masculine gender follows from the
distributional properties of the lexical input
I the interaction with nominative case remains unaccounted for
(functions and meanings?)
I frequency effects for complex words and paradigmatic effects
can arise without representations for complex words or
representational structures for paradigms
Experiment 2: Relative entropy in syntax
phrase             phrasal     phrasal       preposition   prepositional   prepositional
                   frequency   probability                 frequency       probability
on a plant           28608       0.279       on              177908042       0.372
in a plant           52579       0.513       in              253850053       0.531
under a plant         7346       0.072       under            10746880       0.022
above a plant            0       0.000       above             2517797       0.005
through a plant          0       0.000       through           3632886       0.008
behind a plant         760       0.007       behind            3979162       0.008
into a plant         13289       0.130       into             25279478       0.053
relative entropy of the phrasal distribution with respect to the overall prepositional distribution, computed over 40 spatial prepositions: the prepositional relative entropy
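With the numbers in the table above, the measure can be computed as follows (restricted here to the seven prepositions shown; the actual measure sums over all 40 spatial prepositions):

# prepositional relative entropy for 'plant':
# p = phrasal probabilities, q = overall prepositional probabilities
p <- c(on = 0.279, `in` = 0.513, under = 0.072, above = 0.000,
       through = 0.000, behind = 0.007, into = 0.130)
q <- c(on = 0.372, `in` = 0.531, under = 0.022, above = 0.005,
       through = 0.008, behind = 0.008, into = 0.053)
ok <- p > 0                              # 0 * log2(0/q) is taken to be 0
sum(p[ok] * log2(p[ok] / q[ok]))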
training data
I the model is trained on 11,172,554 two- and three-word
phrases from the British National Corpus, comprising
26,441,155 word tokens
I the last word of each phrase is one of 24,710 monomorphemic
words, or a bimorphemic compound, derived word, or inflected
word containing one of these 24,710 monomorphemic words
constructions sampled
Preposition + Article + Noun about a ballet
Preposition + Possessive Pron. + Noun about her actions
Preposition + X + Noun about actual costs
Preposition + Noun about achievements
X’s + Noun protege’s abilities
Article + Noun a box
Article + X + Noun the abdominal appendages
Possessive Pronoun + Noun their abbots
Article + X’s + Noun the accountant’s bill
Pronoun + Auxiliary + Verb they are arrested
Pronoun + Verb he achieves
Auxiliary + Verb is abandoning
Article + Adjective the acute
processing of monomorphemic words
I stimuli: 1289 monomorphemic nouns
I lexical decision latencies from the English Lexicon Project
I simulated lexical decision latencies
I predictors
I Family Size
I Inflectional Entropy
I Written Frequency
I Number of Morphologically Complex Synonyms
I Neighborhood Density
I Mean Bigram Frequency
I Noun-Verb Ratio
I Length
I Prepositional Relative Entropy
results
correlation between the observed and simulated response latencies: r = 0.55, t(1287) = 23.83, p < 0.001
[Figure: observed regression coefficients plotted against the coefficients expected from the simulation for MeanBigramFrequency, WrittenFrequency, FamilySize, Length, NounToVerbRatio, InflectionalEntropy, ComplexSynsetsCount, PrepositionalRE, and Ncount; r = 0.7, p = 0.04.]
Summary Experiment 2
I lexical paradigmatic effects (family size, inflectional entropy)
modeled successfully without representations for inflections
and derivations
I the phrasal paradigmatic effect is also modelled correctly,
without representations for phrases
I the paradigmatic distributional properties of a word can affect
single-noun reading
Other results obtained
I phrasal frequency effects
I phonaestheme effects
I corn-corner effects (pseudoderived words)
I family size effects, whole-word frequency effects, and base
frequency effects for complex words
I the interaction between first-constituent frequency and
whole-word frequency in compound words (Kuperman et al.,
2009)
I interaction of regularity by tense in English
intermezzo: strong connectivity
I mediated priming (Balota & Lorch, 1986)
I cat → cab → taxi
I lion → tiger → stripes
I priming chains for compounds?
I tea trolley → trolley bus
I tea trolley → trolley bus → bus stop
spreading activation: weak connectivity
[Network diagram: spreading activation through shared compound constituents (soup, kitchen, pea, garden, maid, flour, nut, city, party, truck, mill, butter, case, house, shell, flower, hop, market, rock, roof, tea, winter, bar, chamber, dairy, hand, mere, milk, nurse, parlour, corn, betel, cob, coco, dough, earth, ginger, ground, kola, monkey, pig, thumb, wing); these constituents are only weakly connected.]
spreading activation: strong connectivity
[Network diagram: spreading activation through shared compound constituents (box, brush, cock, field, fly, gear, hair, horse, net, oil, paint, palm, paper, piece, shirt, silk, tail, wood, work, worm); these constituents form a strongly connected component.]
is strong connectivity advantageous?
I is strong connectivity advantageous?
I possibly yes — more integrated learning
I possibly no — it might cause confusion (secondary family size)
I this kind of connectivity should be beyond what the naive
discriminative reader can handle — but it isn’t
lexical connectivity
[Figure: contour plots of observed RTs and simulated RTs as a function of head family size and secondary productivity, separately for words in and not in the strongly connected component.]
Experiment 3: More on relative entropy in syntax
I reading aloud combined with eye tracking
I first experiment: reading aloud single words
(e.g., table)
I second experiment: reading aloud prepositional phrases
(e.g., on the + table)
Experiment 3: single words, total fixation time
[Figure: total fixation time as a function of relative entropy (computed with the indefinite article), plotted for frequency levels 0.69, 4.73, 5.78, 6.62, and 10.3.]
Experiment 3: phrases, total fixation time
[Figure: log first fixation duration as a function of relative entropy (computed with the definite article), plotted for frequency levels 0.69, 4.73, 5.78, 6.62, and 10.3.]
Naive discriminative and mixed-effects classifiers
Word Form   Frequency   Case   Lemma   Relative Entropy   Ranef   Stem Support (Nominative)   Stem Support (Genitive)   Exponent Support
AQEa 10 nom A 0.134 -1.121 -0.014 0.260 0.353
AQEi 20 gen A 0.134 -1.121 -0.014 0.260 0.740
AQEu 30 acc A 0.134 -1.121 -0.014 0.260 0.595
AQEa 40 acc A 0.134 -1.121 -0.014 0.260 0.127
ABCa 15 nom B 0.053 -0.676 0.037 0.260 0.353
ABCi 22 gen B 0.053 -0.676 0.037 0.260 0.740
ABCu 28 acc B 0.053 -0.676 0.037 0.260 0.595
ABCa 35 acc B 0.053 -0.676 0.037 0.260 0.127
APQa 20 nom C 0.010 -0.288 0.087 0.260 0.353
APQi 24 gen C 0.010 -0.288 0.087 0.260 0.740
APQu 26 acc C 0.010 -0.288 0.087 0.260 0.595
APQa 30 acc C 0.010 -0.288 0.087 0.260 0.127
ZPEa 30 nom D 0.007 0.243 0.162 0.260 0.353
ZPEi 26 gen D 0.007 0.243 0.162 0.260 0.740
ZPEu 24 acc D 0.007 0.243 0.162 0.260 0.595
ZPEa 25 acc D 0.007 0.243 0.162 0.260 0.127
EPBa 35 nom E 0.039 0.583 0.210 0.260 0.353
EPBi 28 gen E 0.039 0.583 0.210 0.260 0.740
EPBu 22 acc E 0.039 0.583 0.210 0.260 0.595
EPBa 20 acc E 0.039 0.583 0.210 0.260 0.127
DPBa 40 nom F 0.139 1.269 0.289 0.260 0.353
DPBi 30 gen F 0.139 1.269 0.289 0.260 0.740
DPBu 20 acc F 0.139 1.269 0.289 0.260 0.595
DPBa 10 acc F 0.139 1.269 0.289 0.260 0.127
stem support, random intercepts, and unsigned
relative entropy
[Figure: relative entropy against the random intercept (R-squared = 0.989, F(2,3) = 230.9, p = 0.0005); relative entropy against stem support for the nominative (R-squared = 0.993, F(2,3) = 348, p = 0.0003); random intercept against stem support for the nominative (R-squared = 0.998, F(1,4) = 2590, p < 0.0001).]
the main trend depends on the balance
[Figure: stem support and relative entropy plotted against the random intercept, and simulated RT plotted against relative entropy, for six simulated paradigms with exponent frequency distributions c(10,20,30,40)*20, c(15,24,32,40)*10, c(20,28,33,40)*3, c(25,32,35,40)*2, c(30,34,37,40)*1, c(35,37,38,40)*1.]
trend depends on the position of the prototype
[Figure: the same three panels for paradigms with exponent frequency distributions c(10,20,30,40)*1, c(15,24,32,40)*1, c(20,28,33,40)*2, c(25,32,35,40)*3, c(30,34,37,40)*10, c(35,37,38,40)*20; the signs of the trends reverse.]
trend depends on the position of the prototype
I in a complex system, the same measure can have slopes with
opposite signs depending on the distributional properties of
the language input
I this may help explain the changes in sign of RE in the
eye-tracking+naming study
I our distributional measures provide partial and
potentially distorting views on the complex structure
arising from simple principles of learning
Discussion
I Our model shows morphological effects in the absence of
morphological representations, including paradigmatic effects
I This is consistent with a-morphous views on morphology
(e.g.: Anderson, 1992; Blevins, 2003)
I The model is a classifier (for the dative alternation, it
outperforms mixed models)
I relative entropies are functionally equivalent to unsigned
random intercepts in a mixed-effects model
I relative entropies capture the total association strengths from
stems to grammatical meanings
Discussion
I Our model is similar in spirit to the reading part of the
triangle model (Seidenberg & Gonnermann, 2000)
I Both models map orthography onto semantics without
morphological representations
I Our computational engine, however, is much simpler than
that of the triangle model: we do not assume hidden layers or
use back-propagation to estimate connection weights.
I Furthermore, our model is more radically a-morphous in that
there is no hidden layer that can covertly represent
morphology.
Discussion
I Our model is also similar in spirit to the Bayesian Reader
(Norris, 2006)
I Both models map forms onto ‘central’ representations without
intercession by morphemes
I Our computational engine, however, is much simpler than that
of the Bayesian reader: the complexity of the Bayesian reader
is quadratic in the number of orthographic ‘units’, whereas
our model is linear in the number of elementary meanings
Summary
I Discriminative learning provides a good fit to a wide range
of experimental data
I The model is trained on realistic input, it is as sparse as
possible in its number of representations, and it is
computationally efficient
I The model does not make an a priori distinction between
phrasal learning and morphological learning, and therefore can
straightforwardly handle gradient phenomena at the interface
of morphology and syntax (cf. construction morphology, Booij
2010)