Date post: 20-Dec-2015
HELSINKI UNIVERSITY OF TECHNOLOGY, LABORATORY OF COMPUTER AND INFORMATION SCIENCE, ADAPTIVE INFORMATICS RESEARCH CENTRE. Unsupervised Segmentation of Words into Morphemes: Morpho Challenge Workshop 2006. Mikko Kurimo, Mathias Creutz, Krista Lagus

HELSINKI UNIVERSITY OF TECHNOLOGY

LABORATORY OF COMPUTER AND INFORMATION SCIENCE

ADAPTIVE INFORMATICS RESEARCH CENTRE

Unsupervised Segmentation of Words into Morphemes

Morpho Challenge Workshop 2006

Mikko Kurimo, Mathias Creutz, Krista Lagus

Opening – Welcomes

Welcome to the Morpho Challenge workshop, everybody!

• challenge participants
• workshop speakers
• other PASCAL researchers
• others interested in the topic

Motivation

To design a statistical machine learning algorithm that segments words into the smallest meaning-bearing units of language, morphemes.

Get basic vocabulary units suitable for different tasks:

• Speech and text understanding
• Machine translation
• Information retrieval
• Statistical language modelling

Rule-based systems can split simple cases such as read + ing, but have difficulties with complicated words and languages.

Workshop 12 April, final timetable

0900 Opening
0910 Introduction and evaluation report
0950 Invited talk by Richard Sproat
1050 Break
1120 Morfessor baseline by Krista Lagus
1150 Competitors' presentations
1230 Lunch
1400 Competitors (contd.)
1500 Discussion
1530 Conclusion

Morning session

09:10 Mikko Kurimo: "Introduction and evaluation report"

09:50 Prof. Richard Sproat (invited talk), University of Illinois at Urbana-Champaign: "Computational Morphology and its Implications for the Theoretical Morphology"

10:50 – 11:20 Coffee break

Noon session

11:20 Krista Lagus: "Morfessor in MorphoChallenge"

11:50 Delphine Bernhard: "Morphological segmentation for the automatic acquisition of semantic relationships in the context of MorphoChallenge 2005"

12:10 Stefan Bordag: "Two-step approach to unsupervised morpheme segmentation"

12:30 – 14:00 Lunch

Afternoon session

14:00 Lars Johnsen: "Learning morphology on tokens"

14:20 Samarth Keshava and Emily Pitler: "Reports - Quick and Simple Unsupervised Learning of Morphemes"

14:40 Eric Atwell (Mikko Kurimo): "Combinatory Hybrid Elementary Analysis of Text"

15:00 Discussion

15:30 Conclusion

Discussion topics for afternoon

• New ways to evaluate the obtained units?
• New evaluation languages: German, Norwegian, French, Estonian, Arabic, ...?
• Other application evaluations: SLU, IR, MT, ...?
• New organizer partners?
• MorphoChallenge2?
• Journal special issue?
• 2nd Morpho Challenge workshop?
• ?

Opening - Thanks

Thanks to all who made Morpho Challenge possible!

• PASCAL network, coordinators, challenge program organizers
• Morpho Challenge organizing committee
• Morpho Challenge program committee
• Morpho Challenge participants
• Morpho Challenge evaluation team
• Challenge workshop organizers

Let’s start.

It is my pleasure to welcome the first speaker, who is...

Morpho Challenge – Introduction and evaluation report

Mikko Kurimo, Mathias Creutz, Matti Varjokallio (Helsinki, FI)

Ebru Arisoy, Murat Saraclar (Istanbul, TR)

Contents

1. Motivation
2. Call for participation
3. Rules
4. Datasets
5. Participants
6. Results of competition 1, word segmentation
7. Results of competition 2, language modeling
8. Conclusion

Motivation

To design a statistical machine learning algorithm that segments words into the smallest meaning-bearing units of language, morphemes.

Get basic vocabulary units suitable for different tasks:

• Speech and text understanding
• Machine translation
• Information retrieval
• Statistical language modelling

Motivation

The scientific goals of this challenge are:
• To learn of the phenomena underlying word construction in natural languages
• To discover approaches suitable for a wide range of languages
• To advance machine learning methodology

Call for participation

• Part of the EU Network of Excellence PASCAL’s Challenge Program
• Participation is open to all and free of charge
• Word sets are provided for three languages: Finnish, English, and Turkish
• Implement an unsupervised algorithm that segments the words of each language!
• No language-specific tweaking parameters, please
• Write a paper that describes your algorithm

Rules

• Segmented words are submitted to the organizers
• Two different evaluations are made
• Competition 1: Comparison to a linguistic morpheme segmentation "gold standard"
• Competition 2: Speech recognition experiments, where statistical n-gram language models utilize the morphemes instead of entire words

Datasets

• Word lists are downloadable at our home page
• Each word in the list is preceded by its frequency
• Finnish: newspapers, books, newswires: 1.6M/32M
• Turkish: web, newspapers, sports news: 0.6M/17M
• English: Gutenberg, Gigaword, Brown: 170k/24M
• Small gold standard sample in each language
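
A frequency-prefixed word list like the one described on this slide could be read as sketched below. This is only an illustration: it assumes one entry per line with the count and the word separated by whitespace, which the slide does not spell out, and the function name `read_word_list` and the sample entries are made up. The sketch also reports the number of distinct word forms and the total of their frequencies, which is one plausible reading of the paired size figures above.

```python
from collections import Counter
import io


def read_word_list(lines):
    """Parse lines of the assumed form '<frequency> <word>' into a Counter."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        freq, word = line.split(None, 1)  # split on the first run of whitespace
        counts[word] += int(freq)
    return counts


# Hypothetical three-line sample standing in for a downloaded word list.
sample = io.StringIO("3 reading\n1 boulevard\n2 cupbearers\n")
counts = read_word_list(sample)

num_types = len(counts)            # distinct word forms
num_tokens = sum(counts.values())  # sum of all frequencies
print(num_types, num_tokens)
```

For a real list, an open file handle can be passed in place of the `io.StringIO` sample.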

Participants

A1 Choudri and Dang, Univ. Leeds, UK
A2 a, b, Bernhard, TIMC-IMAG, F
A3 'A.A.', Ahmad and Allendes, Univ. Leeds, UK
A4 'comb', 'lsv', Bordag, Univ. Leipzig, D
A5 Rehman and Hussain, Univ. Leeds, UK
A6 'RePortS', Pitler and Keshava, Univ. Yale, USA
A7 Bonnier, Univ. Leeds, UK
A8 Kitching and Malleson, Univ. Leeds, UK
A9 'Pacman', Manley and Williamson, Univ. Leeds, UK
A10 Johnsen, Univ. Bergen, NO
A11 'Swordfish', Jordan, Healy and Keselj, Univ. Dalhousie, CA
A12 'Cheat', Atwell and Roberts, Univ. Leeds, UK
M1-3 Morfessor, Categories-ML, MAP, Helsinki Univ. Tech, FI

Competition 1: Word segmentation

• Two samples: boule_vard, cup_bearer_s'
• Gold standard: boulevard, cup_bear_er_s_'
• 2 correct hits (H), 1 insertion (I), 2 deletions (D)
• Precision = H / (H + I) = 2 / (2 + 1) = 0.67
• Recall = H / (H + D) = 2 / (2 + 2) = 0.50
• F-measure = harmonic mean of precision and recall = 2H / (2H + I + D) = 4 / (4 + 1 + 2) = 0.57
• A secret (random) 10% subset of words evaluated
• Morfessor Baseline: 54.2% FI, 51.3% TR, 66.0% EN
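
The boundary counting in the worked example can be reproduced with a short Python sketch. It is not the organizers' evaluation script: the function names are hypothetical, and it assumes that `_` marks a morpheme boundary and that a proposed and a gold segmentation are compared by the character positions of their boundaries.

```python
def boundaries(segmented, sep="_"):
    """Return the set of character offsets at which a boundary is placed."""
    positions = set()
    offset = 0
    for ch in segmented:
        if ch == sep:
            positions.add(offset)
        else:
            offset += 1
    return positions


def score(proposed, gold, sep="_"):
    """Count hits, insertions and deletions, then derive P, R and F."""
    hits = insertions = deletions = 0
    for p, g in zip(proposed, gold):
        if p.replace(sep, "") != g.replace(sep, ""):
            raise ValueError(f"not the same word: {p!r} vs {g!r}")
        bp, bg = boundaries(p, sep), boundaries(g, sep)
        hits += len(bp & bg)        # boundary in both
        insertions += len(bp - bg)  # proposed but not in gold
        deletions += len(bg - bp)   # in gold but not proposed
    precision = hits / (hits + insertions) if hits + insertions else 0.0
    recall = hits / (hits + deletions) if hits + deletions else 0.0
    total = 2 * hits + insertions + deletions
    f_measure = 2 * hits / total if total else 0.0
    return hits, insertions, deletions, precision, recall, f_measure


# The two sample words from the slide.
h, i, d, p, r, f = score(["boule_vard", "cup_bearer_s'"],
                         ["boulevard", "cup_bear_er_s_'"])
print(h, i, d, round(p, 2), round(r, 2), round(f, 2))  # 2 1 2 0.67 0.5 0.57
```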

Results: F-measure in Finnish data

[Figure: bar chart of F-measure on the Finnish data; axis ticks from 20 to 75 in steps of 5; legend: Choudri, BernhA, BernhB, BordagC, Rehman, Bonnier, Manley, Jordan, Atwell.]

F-measure with reference algorithms

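[Figure: bar chart of F-measure on the Finnish data with reference algorithms; axis ticks from 20 to 75 in steps of 5; legend: Choudri, BernhA, BernhB, BordagC, Rehman, Bonnier, Manley, Jordan, Atwell, Morfess., MorfML, MorfMAP, C-All, C-Top5.]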

F-measure in Turkish data

[Figure: bar chart of F-measure on the Turkish data; axis ticks from 20 to 75 in steps of 5; legend: Choudri, BernhA, BernhB, BordagC, Rehman, Bonnier, Manley, Jordan, Atwell.]

F-measure with reference algorithms

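[Figure: bar chart of F-measure on the Turkish data with reference algorithms; axis ticks from 20 to 75 in steps of 5; legend: Choudri, BernhA, BernhB, BordagC, Rehman, Bonnier, Manley, Jordan, Atwell, Morfess., MorfML, MorfMAP, C-All, C-Top5.]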

F-measure in English data

[Figure: bar chart of F-measure on the English data; axis ticks from 30 to 80 in steps of 5; legend: Choudri, BernhA, BernhB, Ahmad, BordagC, Rehman, Pitler, Bonnier, Kitching, Manley, Johnsen, Jordan, Atwell.]

F-measure with reference algorithms

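[Figure: bar chart of F-measure on the English data with reference algorithms; axis ticks from 30 to 80 in steps of 5; legend: Choudri, BernhA, BernhB, Ahmad, BordagC, Rehman, Pitler, Bonnier, Kitching, Manley, Johnsen, Jordan, Atwell, Morfess., MorfML, MorfMAP, C-All, C-Top5.]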

F-measure, the 3 languages task

[Figure: bar chart of F-measure grouped by language (Finnish, Turkish, English); axis ticks from 20 to 75 in steps of 5; legend: Choudri, BernhA, BernhB, BordagC, Rehman, Bonnier, Manley, Jordan, Atwell.]

...with reference algorithms

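[Figure: bar chart of F-measure grouped by language (Finnish, Turkish, English), with reference algorithms; axis ticks from 20 to 75 in steps of 5; legend: Choudri, BernhA, BernhB, BordagC, Rehman, Bonnier, Manley, Jordan, Atwell, Morfess., MorfML, MorfMAP.]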

Competition 2: Language modeling

• A statistical N-gram LM trained for the obtained morphemes using a large text corpus
• Growing N-gram model for Finnish by HUT tools
• 4-gram model for Turkish using SRILM
• Free lexicon size (40,000 – 700,000)
• ~10M N-grams (Finnish) or 50-70M bytes (Turkish)
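
To make the idea of an N-gram model over morphemes concrete, here is a toy Python sketch. It does not use the HUT tools or SRILM and is not how the evaluation models were built: it simply rewrites a text as a morpheme sequence and computes unsmoothed relative-frequency bigram estimates. The word-boundary marker `<w>`, the function names, and the tiny segmentation dictionary are all assumptions made for illustration.

```python
from collections import Counter

WORD_BOUNDARY = "<w>"  # hypothetical token marking the end of each word


def to_morph_tokens(sentence, segmentation):
    """Replace each word by its morphemes, followed by a boundary token."""
    tokens = []
    for word in sentence.split():
        tokens.extend(segmentation.get(word, [word]))  # unsegmented fallback
        tokens.append(WORD_BOUNDARY)
    return tokens


def ngram_counts(tokens, n):
    """Count all n-grams (as tuples) in the token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def mle_prob(ngram, counts_n, counts_history):
    """Unsmoothed estimate: count(ngram) / count(history)."""
    history = counts_history[ngram[:-1]]
    return counts_n[ngram] / history if history else 0.0


# Made-up segmentation, as a segmentation algorithm might produce it.
segmentation = {"reading": ["read", "ing"], "reads": ["read", "s"]}
tokens = to_morph_tokens("reading reads reading", segmentation)

unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)
p_ing_given_read = mle_prob(("read", "ing"), bigrams, unigrams)
print(tokens)
print(p_ing_given_read)
```

A real language model would add smoothing and back-off and would be trained on a large corpus; the sketch only shows how the segmentation changes the units the model counts.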

Evaluation by speech recognition

• Realistic benchmark application: Continuous reading of large-vocabulary texts (books and news)
• Letter error rate LER% = (sub + ins + del) / letters
• Baseline systems using LMs of Morfessor’s segments
• Finnish recognizer made at HUT (HUT tools): speaker-dependent, running speed 10-15 xRT, baseline 1.31% LER
• Turkish recognizer made at Bogazici Univ. (HTK and AT&T tools): speaker-independent, running speed 2-3 xRT, baseline 13.7% LER
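
The letter error rate formula above can be sketched in Python using a standard Levenshtein alignment, where the minimum number of substitutions, insertions and deletions is divided by the number of letters in the reference text. This is an illustrative sketch, not the scoring tool used in the evaluation: the function names are hypothetical, and treating every character of the reference (including spaces) as a letter is an assumption.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, insertions and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from ref
                            curr[j - 1] + 1,      # insertion into ref
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def letter_error_rate(ref, hyp):
    """LER% = 100 * (sub + ins + del) / number of reference letters."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


ler = letter_error_rate("kitten", "sitting")
print(ler)  # 50.0
```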

Speech recognition letter error rate (LER)

[Figure: bar chart of letter error rate, groups Finnish*10 and Turkish*1; axis ticks from 11 to 16 in steps of 0.5; legend: Choudri, BernhA, BernhB, BordagC, Rehman, Bonnier, Manley, Jordan, Atwell.]

LER for reference algorithms

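[Figure: bar chart of letter error rate with reference algorithms, groups Finnish*10 and Turkish*1; axis ticks from 10 to 16 in steps of 0.5; legend: Choudri, BernhA, BernhB, BordagC, Rehman, Bonnier, Manley, Jordan, Atwell, Morfess., MorfML, MorfMAP, C-All, C-Top5, Rover.]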

LER for grammatic rules and words, too

[Figure: bar chart of letter error rate, groups Finnish*10 and Turkish*1; axis ticks from 10 to 16 in steps of 0.5; legend: Choudri, BernhA, BernhB, BordagC, Rehman, Bonnier, Manley, Jordan, Atwell, Morfess., MorfML, MorfMAP, C-All, C-Top5, Rover, GoldStd, Words.]

Update for Turkish results NEW

[Figure: bar chart of letter error rate for Turkish, groups "Turkish pruned" and "Turkish full LM"; axis ticks from 10 to 15 in steps of 0.5; legend: Choudri, BernhA, BernhB, BordagC, Rehman, Bonnier, Manley, Jordan, Atwell, Morfess., MorfML, MorfMAP, C-All, C-Top5, Rover, GoldStd, Words.]

Conclusion

The scientific goals of this challenge are:
• To learn of the phenomena underlying word construction in natural languages
• To discover approaches suitable for a wide range of languages
• To advance machine learning methodology

Conclusion

• 14 different unsupervised segmentation algorithms
• 12 participating research groups
• Evaluations for 3 languages
• Full report and papers in the proceedings
• Website: http://www.cis.hut.fi/morphochallenge2005

Acknowledgments

• Text and speech data providers in all languages!
• Finnish and Turkish evaluation teams
• Funding from PASCAL, Finnish Academy, Lang. Tech. Grad school, HUT, and Bogazici Univ.
• LM and ASR tools in HUT, SRI, and AT&T
• Competition participants!

The second speaker today:

Professor Richard Sproat, University of Illinois at Urbana-Champaign:

"Computational Morphology and its Implications for the Theoretical Morphology"

Richard Sproat

Professor of Linguistics and Electrical and Computer Engineering at the University of Illinois and head of the Computational Linguistics Lab at the Beckman Institute.

Received his Ph.D. from MIT in 1985 and has since also worked at AT&T Bell Labs.

A well-known expert in language and computational linguistics, including syntax, morphology, computational morphology, articulatory and acoustic phonetics, text processing, text-to-speech synthesis, writing systems, and text-to-scene conversion.

