HELSINKI UNIVERSITY OF TECHNOLOGY
ADAPTIVE INFORMATICS RESEARCH CENTRE
Morpho Challenge in Pascal Challenges Workshop
Venice, 12 April 2006
Morfessor in the Morpho Challenge
Mathias Creutz and Krista Lagus
Helsinki University of Technology (HUT), Adaptive Informatics Research Centre
Challenge for NLP: too many words
• E.g., Finnish words often consist of lengthy sequences of morphemes — stems, suffixes and prefixes:
  – kahvi + n + juo + ja + lle + kin (coffee + of + drink + -er + for + also)
  – nyky + ratkaisu + i + sta + mme (current + solution + -s + from + our)
  – tietä + isi + mme + kö + hän (know + would + we + INTERR + indeed)
• Huge number of word forms, few examples of each
• By splitting we get fewer basic units, each with more examples
• Important to know the inner structure of words
Source: Creutz & Lagus, 2005 tech. rep.
Solution approaches
• Hand-made morphological analyzers (e.g., based on Koskenniemi's TWOL = two-level morphology)
  + accurate
  – labour-intensive construction, commercial, coverage, updating when languages change, addition of new languages
• Data-driven methods, preferably minimally supervised (e.g., John Goldsmith's Linguistica)
  + adaptive, language-independent
  – lower accuracy
  – many existing algorithms assume few morphemes per word, unsuitable for compounds and multiple affixes
Goal: segmentation
• Learn representations of
  – the smallest individually meaningful units of language (morphemes)
  – and their interaction
  – in an unsupervised and data-driven manner from raw text
  – making as general and as language-independent assumptions as possible
  (→ Morfessor)
• Evaluate
  – against a gold-standard morphological analysis of word forms (→ Hutmegs)
  – integrated in NLP applications (e.g. speech recognition)
Further challenges in morphology learning
• Beyond segmentation: allomorphy (“foot – feet, goose – geese”)
• Detection of semantic similarity (“sing – sings – singe – singed”)
• Learning of paradigms (e.g., John Goldsmith’s Linguistica)
[Figure: a paradigm in the style of Linguistica’s signatures, joining the stems believ, hop, liv, mov, us with the suffixes e, ed, es, ing]
Linguistic evaluation using Hutmegs (Helsinki University of Technology Morphological Evaluation Gold Standard)
• Hutmegs contains gold standard segmentations obtained by processing the morphological analyses of FinTWOL and CELEX
  – 1.4 million Finnish word forms (FinTWOL, from Lingsoft Inc.)
    • Input: ahvenfileerullia (perch filet rolls)
    • FinTWOL: ahven#filee#rulla N PTV PL
    • Hutmegs: ahven + filee + rull + i + a
  – 120 000 English word forms (CELEX, from LDC)
    • Input: housewives
    • CELEX: house wife, NNx, P
    • Hutmegs: house + wive + s
• Publicly available; see M. Creutz and K. Lindén. 2004. Morpheme Segmentation Gold Standards for Finnish and English.
Morfessor models in the Challenge
• M1: Morfessor Baseline (2002)
  – Program code available since 2002
  – Provided as a baseline model for the Morpho Challenge
  – Improves speech recognition; experiments since 2003
  – No model of morphotactics
• M2: Morfessor Categories ML (2004)
  – Category-based modeling (HMM) of morphotactics
  – No speech recognition experiments before this challenge
  – No public software yet
• M3: Morfessor Categories MAP (2005)
  – More elegant mathematically
Avoiding overlearning by controlling model complexity
• When using powerful machine learning methods, overlearning is always a problem
• Occam’s razor: given two equally accurate theories, choose the one that is less complex
We have used:
• Heuristic control affecting the size of the lexicon (M2)
• Deriving a cost function that incorporates a measure of model size, using
  – MDL (Minimum Description Length) (M1)
  – MAP learning (Maximum A Posteriori) (M3)
Morfessor Baseline (M1)
• Originally called the ”Recursive MDL method”
• Optimizes roughly:
    argmax_M P(M | corpus), where P(M | corpus) ∝ P(M) P(corpus | M)
  Here M = (lexicon, grammar); the Baseline has no grammar component, so this reduces to
    P(lexicon) P(corpus | lexicon),
  where P(lexicon) is a product over the letters spelling out the morphs of the lexicon, and P(corpus | lexicon) is a product over the morph tokens of the corpus.
+ The MDL-based cost function optimizes the size of the model
– Morph contextual information is not utilized:
  • undersegmentation of frequent strings (“forthepurposeof”)
  • oversegmentation of rare strings (“in + s + an + e”)
  • syntactic / morphotactic violations (“s + can”)
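A minimal Python sketch of such a two-part code-length cost, assuming maximum-likelihood morph and letter probabilities; the function name and corpus format are illustrative, and the real Morfessor Baseline cost includes further terms (e.g. for coding morph frequencies):

```python
import math
from collections import Counter

def baseline_cost(segmented_corpus):
    """Total code length: -log P(lexicon) - log P(corpus | lexicon).

    `segmented_corpus` is a list of words, each a list of morph strings.
    A simplified sketch, not the full Morfessor Baseline cost.
    """
    tokens = [m for word in segmented_corpus for m in word]
    morph_counts = Counter(tokens)
    n = len(tokens)

    # Corpus cost: each morph token is coded with its ML probability.
    corpus_cost = -sum(c * math.log(c / n) for c in morph_counts.values())

    # Lexicon cost: spell out each distinct morph letter by letter,
    # with ML letter probabilities estimated from the lexicon itself.
    letters = "".join(morph_counts)          # distinct morphs, concatenated
    letter_counts = Counter(letters)
    total = len(letters)
    lexicon_cost = -sum(c * math.log(c / total)
                        for c in letter_counts.values())

    return corpus_cost + lexicon_cost        # minimize this
```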
Search for the optimal model (M1)
• Greedy search by recursive binary splitting, e.g.:
    reopened → reopen + ed → re + open + ed
    openminded → open + minded → open + mind + ed
  yielding the morphs re, open, mind, ed
• All words in the corpus (e.g. conferences, opening, words, openminded, reopened) are randomly shuffled and resplit in turn
• Repeat until the model probability converges
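A sketch of this search loop in Python (not the actual Morfessor implementation); here `cost_of` is a stand-in for evaluating the effect of a candidate local segmentation on the global cost, which Morfessor computes with the rest of the model held fixed:

```python
import random

def split_recursively(word, cost_of):
    """Try every split point; keep a binary split if it lowers the cost,
    then recurse on both halves. `cost_of(morphs)` stands in for the
    change in the global model cost (simplified here)."""
    best, best_cost = [word], cost_of([word])
    for i in range(1, len(word)):
        candidate = [word[:i], word[i:]]
        if cost_of(candidate) < best_cost:
            best, best_cost = candidate, cost_of(candidate)
    if len(best) == 2:
        return (split_recursively(best[0], cost_of)
                + split_recursively(best[1], cost_of))
    return best

def train(words, cost_of, epochs=10):
    """Reprocess the words in random order until (approximate) convergence."""
    segmentation = {w: [w] for w in words}
    for _ in range(epochs):    # stand-in for 'model probability converged?'
        random.shuffle(words)  # randomly shuffle words, as on the slide
        for w in words:
            segmentation[w] = split_recursively(w, cost_of)
    return segmentation
```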
Challenge Results: Comparison to gold standard splitting (F-measures)
[Bar chart: F-measures (0–90) on Finnish, Turkish and English for the submissions "Sumaa", Bernhard(B), A3, Bordag(C), Bordag(Lsv), Rehman, "RePortS", Bonnier, A8, "Pacman", Johnsen, "Swordfish", a "Cheat" reference and the Morfessor Baseline (M1). The winning method, Bernhard’s, is highlighted.]
Morfessor Categories – ML & MAP (M2, M3)
• Lexicon / Grammar dualism
  – Word structure captured by a regular expression: word = ( prefix* stem suffix* )+
  – Morph sequences (words) are generated by a Hidden Markov model (HMM), e.g. for # over simpl ific ation s #:
    transition probabilities such as P(STM | PRE) and P(SUF | SUF), and emission probabilities such as P(’over’ | PRE) and P(’s’ | SUF)
  – Lexicon: morpheme properties and contextual properties
  – Morph segmentation is initialized using M1
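The following sketch shows how such an HMM scores one tagged word; the `trans` and `emit` tables are hypothetical stand-ins, not probabilities from the trained model:

```python
import math

def word_log_prob(tagged_morphs, trans, emit):
    """Log probability of one word under the category HMM.
    tagged_morphs: [(morph, category), ...]
    trans[(prev_cat, cat)] = P(cat | prev_cat), with '#' as word boundary
    emit[(morph, cat)]     = P(morph | cat)
    All probability tables here are hypothetical stand-ins."""
    logp, prev = 0.0, '#'
    for morph, cat in tagged_morphs:
        logp += math.log(trans[(prev, cat)])  # e.g. P(STM | PRE)
        logp += math.log(emit[(morph, cat)])  # e.g. P('over' | PRE)
        prev = cat
    logp += math.log(trans[(prev, '#')])      # close at the word boundary
    return logp

# The slide's example: # over simpl ific ation s #
word = [('over', 'PRE'), ('simpl', 'STM'), ('ific', 'SUF'),
        ('ation', 'SUF'), ('s', 'SUF')]
```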
Morph lexicon (M2, M3)

Morph distributional features and forms, with example entries:

  String | Frequency | Right perplexity | Left perplexity | Length
  over   |     14029 |              136 |               1 |      4
  simpl  |        41 |                4 |               1 |      5
  s      |     17259 |                1 |            4618 |      1
  ...
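Left and right perplexity can be computed from a segmented corpus as the perplexity of the distribution of morphs seen next to a given morph; a sketch follows, where details such as boundary handling may differ from Morfessor’s actual implementation:

```python
import math
from collections import Counter

def perplexity(context_counts):
    """Perplexity (exp of entropy) of the empirical context distribution."""
    total = sum(context_counts.values())
    if total == 0:
        return 1.0
    entropy = -sum((c / total) * math.log(c / total)
                   for c in context_counts.values())
    return math.exp(entropy)

def left_right_perplexity(segmented_corpus, morph):
    """Perplexities of the morphs occurring immediately to the left and
    right of `morph`; '#' marks the word boundary."""
    left, right = Counter(), Counter()
    for word in segmented_corpus:
        padded = ['#'] + word + ['#']
        for i in range(1, len(padded) - 1):
            if padded[i] == morph:
                left[padded[i - 1]] += 1
                right[padded[i + 1]] += 1
    return perplexity(left), perplexity(right)
```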
How morph distributional features affect morph categories
• Prior probability distributions for category membership of a morph, e.g., P(PRE | ’over’)
• Assume asymmetries between the categories:
[Plots: suffix-likeness as a graded threshold curve of left perplexity, prefix-likeness as a graded threshold curve of right perplexity, and stem-likeness as a graded threshold curve of morph length, each rising from 0 towards 1]
How distributional features affect categories (2)
• There is an additional non-morpheme category for cases where none of the proper classes is likely:
    P(NON | ’over’) = [1 − Prefixlike(’over’)] · [1 − Stemlike(’over’)] · [1 − Suffixlike(’over’)]
• Distribute the remaining probability mass proportionally, e.g.,
    P(PRE | ’over’) = Prefixlike(’over’)^q / [Prefixlike(’over’)^q + Stemlike(’over’)^q + Suffixlike(’over’)^q] · [1 − P(NON | ’over’)]
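Combining this slide with the previous one, a sketch of the full category-prior computation; the sigmoid form of the likeness curves and all threshold, steepness and q values are assumptions for illustration:

```python
import math

def graded(x, threshold, steepness=1.0):
    """Graded threshold (sigmoid) used for the likeness curves;
    the functional form and parameters are assumptions."""
    return 1.0 / (1.0 + math.exp(-steepness * (x - threshold)))

def category_priors(left_perp, right_perp, length,
                    q=2.0, perp_thresh=10.0, len_thresh=4.0):
    """P(category | morph) from distributional features, following the
    slide's formulas; q and the thresholds are hypothetical values."""
    pre = graded(right_perp, perp_thresh)  # prefix-likeness
    suf = graded(left_perp, perp_thresh)   # suffix-likeness
    stm = graded(length, len_thresh)       # stem-likeness

    p_non = (1 - pre) * (1 - stm) * (1 - suf)   # non-morpheme category
    z = pre**q + stm**q + suf**q                # proportional sharing
    return {'PRE': pre**q / z * (1 - p_non),
            'STM': stm**q / z * (1 - p_non),
            'SUF': suf**q / z * (1 - p_non),
            'NON': p_non}
```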
MAP vs. ML optimization
Morfessor Categories-ML (M2):
    argmax_Lexicon P(Corpus | Lexicon)
  with the lexicon size controlled heuristically.
Morfessor Categories-MAP (M3):
    argmax_Lexicon P(Lexicon | Corpus) = argmax_Lexicon P(Corpus | Lexicon) · P(Lexicon)
  where the prior P(Lexicon) covers the morph entries with their distributional features as well as the HMM transition and emission probabilities.
Hierarchical structures in lexicon (M3)
• Maintain the hierarchy of splittings for each word, e.g.:
    straightforwardness = straightforward + ness (suffix)
    straightforward = straight (stem) + forward
    forward = for + ward (non-morphemes)
• Ability to also code efficiently common substrings which are not morphemes (e.g. syllables in foreign names)
• Bracketed output (see the sketch below)
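A minimal sketch of such a hierarchical lexicon entry and its bracketed rendering; the class and category names are illustrative rather than Morfessor’s internals:

```python
class MorphNode:
    """A lexicon entry: either a leaf morph with a category, or a
    binary split into two sub-structures (a sketch, not Morfessor's
    internal representation)."""
    def __init__(self, form, category=None, left=None, right=None):
        self.form, self.category = form, category
        self.left, self.right = left, right

    def bracketed(self):
        """Render the hierarchy of splittings as bracketed output."""
        if self.left is None:
            return self.form
        return f"[ {self.left.bracketed()} {self.right.bracketed()} ]"

# straightforwardness = straightforward + ness; straightforward =
# straight + forward; forward = for + ward (two non-morphemes)
forward = MorphNode('forward',
                    left=MorphNode('for', 'NON'),
                    right=MorphNode('ward', 'NON'))
word = MorphNode('straightforwardness',
                 left=MorphNode('straightforward',
                                left=MorphNode('straight', 'STM'),
                                right=forward),
                 right=MorphNode('ness', 'SUF'))
print(word.bracketed())   # [ [ straight [ for ward ] ] ness ]
```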
Example segmentations (M3)

  Finnish                            English
  [ aarre kammio ] issa              [ accomplish es ]
  [ aarre kammio ] on                [ accomplish ment ]
  bahama laiset                      [ beautiful ly ]
  bahama [ saari en ]                [ insur ed ]
  [ epä [ [ tasa paino ] inen ] ]    [ insure s ]
  maclare n                          [ insur ing ]
  [ nais [ autoili ja ] ] a          [ [ [ photo graph ] er ] s ]
  [ sano ttiin ] ko                  [ present ly ]
  töhri [ mis istä ]                 found
  [ [ voi mme ] ko ]                 [ re siding ]
                                     [ [ un [ expect ed ] ] ly ]
Challenge Results: Comparison to gold standard splitting (F-measures)
[Bar chart: F-measures (0–90) on Finnish, Turkish and English, now with the Morfessor Categories models M2 and M3 alongside M1: Baseline, the other submissions ("Sumaa", Bernhard(B), A3, Bordag(C), Bordag(Lsv), Rehman, "RePortS", Bonnier, A8, "Pacman", Johnsen, "Swordfish") and the committee references Cheat, CheatB and CheatC. The winner, Bernhard, is highlighted.]
Speech recognition results: Finnish
[Bar charts: letter error rate (≈ 0.8–2 %) and word error rate (≈ 6–14 %) for the Morfessor models M1: Baseline, M2 and M3, the committees ("Cheat", "CheatB", "CheatC") and the competitors ("Sumaa", BernhardA, BernhardB, BordagComp, BordagLSV, Rehman, Bonnier, "Pacman", "Swordfish").]
Speech recognition results: Turkish
[Bar charts: letter error rate (≈ 12–17 %) and word error rate (≈ 30–55 %) for the Morfessor models M1: Baseline, M2 and M3, the committees (Cheat, CheatB, CheatC) and the competitors (Sumaa, BernhardA, BernhardB, BordagComp, BordagLSV, Rehman, Bonnier, Pacman, Swordfish).]
Discussion
• This was the first time our Category methods were evaluated in speech recognition, with nice results!
• The comparison between the Morfessors and the challenge participants is not quite fair
• Possibilities to extend the M3 model:
  – add word contextual features for “meaning”
  – more fine-grained categories
  – beyond concatenative phenomena (e.g., goose – geese)
  – allomorphy (e.g., beauty, beauty + ’s, beauti + es, beauti + ful)
Questions for the Morpho Challenge
• How language-general are the methods, in fact?
  – Norwegian, French, German, Arabic, ...
• Did we, or can we, succeed in inducing ”basic units of meaning”?
  – Evaluation in other NLP problems: MT, IR, QA, TE, ...
  – Application of morphs to non-NLP problems? Machine vision, image analysis, video analysis, ...
• Will there be another Morpho Challenge?
My notes
– describe our own methods briefly
– discuss the differences between our methods in relation to their properties
– be humble; point out why the comparison is unfair (prior experience; our own data; and speech recognition too is a familiar application, so that group’s earlier research may have influenced our method development indirectly)
+ discussion of the differences between our methods?
+ example segmentations from all of our systems?
+ discussion material from Mikko’s paper and our paper
+ the first results figure is now confusing, and its colors differ from the other figures: change the colors and duplicate the figure, highlight the winner more clearly
Discussion
• Possibilities to extend the model:
  – rudimentary features used for “meaning”
  – more fine-grained categories
  – beyond concatenative phenomena (e.g., goose – geese)
  – allomorphy (e.g., beauty, beauty + ’s, beauti + es, beauti + ful)
• Already useful in applications:
  – automatic speech recognition (Finnish, Turkish)
Overview of methods
• Machine learning methodology
• Features used: information on
  – morph contexts (Bernhard, Morfessor)
  – word contexts (Bordag)