LDMT MURI Data Collection and Linguistic Annotations November 4, 2011 Jason Baldridge, UT Austin Ulf Hermjakob, USC/ISI

LDMT MURI

Data Collection and Linguistic Annotations

November 4, 2011

Jason Baldridge, UT Austin

Ulf Hermjakob, USC/ISI

Purpose

Collect and build data
• Monolingual text
• Bilingual text
• Linguistic annotations

to support work on machine translation for
• Kinyarwanda-English
• Malagasy-English

Overview

• Source, type and size of data
• Language consultants
• Kinyarwanda data
• Malagasy data
• Annotation
• An idea
• Accomplishments, challenges, future releases

Text sources

• Bible (highly multilingual parallel corpus)
• Dictionaries, phrasebooks
• Interview transcripts
• Newspapers

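Kinyarwanda Data Resources

[Diagram of data resources with word counts; legend: 1.0 Release, 2.0 Release. Layout not preserved; labels as extracted:]
• ENGLISH monolingual (huge): GWord (8b)
• KINYARWANDA monolingual (7m): News (7m)
• BILINGUAL (16k)
• Paired source labels: Pbook (0.9k) / Pbook (0.7k), KGMC (5.8k) / KGMC (4.8k), Dict (9k) / Dict (8k)
• ENG treebank: PTB (1m)
• Other boxes: ENG text, KIN text, KIN treebank, wordalign, NONE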

Kinyarwanda Data Resources

[Diagram of data resources with word counts; legend: 1.0 Release, 2.0 Release. Layout not preserved; labels as extracted:]
• ENGLISH monolingual (huge): GWord (8b)
• KINYARWANDA monolingual (7m): News (7m)
• BILINGUAL (285k)
• Paired source labels: KGMC (270k) / KGMC (225k), Pbook (0.9k) / Pbook (0.7k), Dict (9k) / Dict (8k), KGMC (5.8k) / KGMC (4.8k), KGMC (2.9k) / KGMC (3.8k), BBC (0.3k) / BBC (0.3k) (shown twice), IGT (0.1k) / IGT (0.06k), IGT (0.06k) / IGT (0.1k)
• ENG treebank: PTB (1m)
• Other boxes: ENG text, KIN text, KIN treebank, wordalign, NONE
• NOTE: no gold morph-split text

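Malagasy Data Resources

[Diagram of data resources with word counts; legend: 1.0 Release, 2.0 Release. Layout not preserved; labels as extracted:]
• ENGLISH monolingual (huge): Gword (8b)
• MALAGASY monolingual (zero)
• BILINGUAL (730k): Bible (730k) / Bible (725k)
• ENG treebank: PTB (1m)
• Other boxes: ENG text, MLG text, MLG treebank, wordalign, NONE, none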

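Malagasy Data Resources

[Diagram of data resources with word counts; legend: 1.0 Release, 2.0 Release. Layout not preserved; labels as extracted:]
• ENGLISH monolingual (huge): Gword (8b)
• MALAGASY monolingual (zero)
• BILINGUAL (732k): Bible (730k) / Bible (725k), News (2.1k) / News (2.3k)
• ENG treebank: PTB (1m)
• Other boxes: ENG text, MLG text, MLG treebank, wordalign, NONE, none, News (2.1k) / News (2.3k) (shown a second time)
• NOTE: no gold morph-split text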

Quality of Original Texts

• Perfectly clean: English Bible
• Reasonably edited: Newspapers (kin/mlg)
• Uneven editing: Genocide protocols
  – Spelling errors
  – Missing/sloppy punctuation
  – Untranslated text (missing or still in source language)

Example: the Kinyarwanda word ikaragiro (which means dairy) is repeatedly translated as diary: “... over there, the houses that belong to the diary.”

Native speaker consultants

• UT reached out to speakers of both languages
• Kinyarwanda
  – Several speakers near Austin
  – Most would like some payment
  – One has helped with translation and consultation

• Malagasy speakers
  – Many speakers from around US and Canada
  – Most would like some payment
  – Two have helped with translations

Native speaker consultants

• At this point, UT needs access to paid informants.
  – Need texts from other genres translated
  – Need to ask questions about meanings of some sentences for linguistic analysis

• The CMU-Rwanda initiative may provide us with a further avenue for obtaining consultants for Kinyarwanda.
  – Also a potential source of data

KGMC Transcripts

• Collaboration between Kigali Genocide Memorial Center and the Human Rights Documentation Initiative at UT Austin Library
  – http://www.lib.utexas.edu/hrdi/
  – http://www.kigalimemorialcentre.org

• Transcriptions of survivor testimonies filmed for the Genocide Archive Rwanda
  – http://www.genocidearchiverwanda.org.rw/index.php/Welcome_to_Genocide_Archive_Rwanda
  – http://www.genocidearchiverwanda.org.rw/index.php/Kmc00005-sub2-eng-glifos

KGMC Data

• 48 translated transcripts
  – all translated into English
  – 33 into French

• 41 untranslated transcripts (only Kinyarwanda)

KGMC Data

• Original format: Microsoft Word, in tables

KGMC Data normalization

• Converted to XML using a semi-automatic process
• Each language represented side-by-side
• Script to process the MS Word format
  – Iteratively modified based on output and error detection
  – Needed to handle missing data and misalignments between time spans across translations

• Final manual verification and correction of each file.
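
A minimal sketch of the side-by-side, time-span-keyed representation described on this slide, assuming each transcript has already been read out of the Word tables into per-language lists of (time span, text) rows. The element names (`transcript`, `segment`, `text`), the language codes, and the placeholder rows are all hypothetical; the actual KGMC XML schema is not shown in this transcript.

```python
import xml.etree.ElementTree as ET


def merge_by_timespan(rows_by_lang):
    """Group rows from each language by time span.

    rows_by_lang maps a language code to a list of (span, text) pairs.
    Returns (segments, missing): segments is a list of (span, {lang: text});
    missing lists the (span, lang) gaps to route to manual verification.
    """
    spans = sorted({span for rows in rows_by_lang.values() for span, _ in rows})
    lookup = {lang: dict(rows) for lang, rows in rows_by_lang.items()}
    segments, missing = [], []
    for span in spans:
        texts = {}
        for lang in rows_by_lang:
            if span in lookup[lang]:
                texts[lang] = lookup[lang][span]
            else:
                missing.append((span, lang))
        segments.append((span, texts))
    return segments, missing


def to_xml(segments):
    """Serialize merged segments with all languages side by side (hypothetical schema)."""
    root = ET.Element("transcript")
    for span, texts in segments:
        seg = ET.SubElement(root, "segment", span=span)
        for lang, text in sorted(texts.items()):
            ET.SubElement(seg, "text", lang=lang).text = text
    return ET.tostring(root, encoding="unicode")


# Placeholder rows, not real transcript content.
S1, S2 = "00:00:05-00:00:12", "00:00:12-00:00:20"
rows = {
    "kin": [(S1, "KIN-1"), (S2, "KIN-2")],
    "eng": [(S1, "ENG-1")],  # the English row for S2 is missing
}
segments, missing = merge_by_timespan(rows)
xml_text = to_xml(segments)
```

Here `missing` collects the gaps (a time span present in one language but not another) so that they can be checked by hand, in the spirit of the final manual verification step.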

Example XML

[Slide shows an example of the XML format; not preserved in this transcript]

Malagasy Bible

• Online version of 1865 Malagasy Bible
  – http://www.madapourchrist.org/

• Preparation:
  – Convert HTML to text
  – Align with the NET Bible (New English Translation) using verses
  – Currently have 686 chapters aligned

• Obvious problem: 150-year-old Malagasy text
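
The verse-based alignment step can be sketched roughly as follows, assuming both Bibles have been converted to text lines of the form `BOOK CHAPTER:VERSE<TAB>text`. That line format, the function names, and the placeholder verse texts are assumptions made for illustration, not the project's actual pipeline.

```python
def read_verses(lines):
    """Parse 'BOOK CH:VS<TAB>text' lines into {(book, chapter, verse): text}."""
    verses = {}
    for line in lines:
        ref, _, text = line.rstrip("\n").partition("\t")
        book, _, chapter_verse = ref.rpartition(" ")
        chapter, _, verse = chapter_verse.partition(":")
        verses[(book, int(chapter), int(verse))] = text.strip()
    return verses


def align_verses(src, tgt):
    """Pair verses present on both sides; report keys found on only one side."""
    shared = sorted(src.keys() & tgt.keys())
    pairs = [(key, src[key], tgt[key]) for key in shared]
    unmatched = sorted(src.keys() ^ tgt.keys())
    return pairs, unmatched


def fully_aligned_chapters(pairs, unmatched):
    """Chapters in which every verse has a counterpart on the other side."""
    incomplete = {(book, chapter) for book, chapter, _ in unmatched}
    seen = {(book, chapter) for (book, chapter, _), _, _ in pairs}
    return sorted(seen - incomplete)


# Placeholder verse texts, not real Bible content.
mlg = read_verses(["GEN 1:1\tMLG-1-1", "GEN 1:2\tMLG-1-2", "GEN 2:1\tMLG-2-1"])
eng = read_verses(["GEN 1:1\tENG-1-1", "GEN 1:2\tENG-1-2",
                   "GEN 2:1\tENG-2-1", "GEN 2:2\tENG-2-2"])
pairs, unmatched = align_verses(mlg, eng)
chapters = fully_aligned_chapters(pairs, unmatched)
```

Counting only chapters with no unmatched verses is one possible reading of "chapters aligned"; differences in verse numbering between editions would show up in `unmatched`.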

Malagasy Dictionary

• Online dictionary of Malagasy
  – http://malagasyworld.org

• 63k words
  – English definitions for 8,000 words
  – French definitions for 10,000 words

• Includes parts-of-speech, mostly coarse-grained (noun, verb, adjective, etc.)

Malagasy Dictionary

• Scraped and processed to produce clean XML

Malagasy texts

• Texts from six webpages
  – 3 from Lakroa: http://www.lakroa.mg/
  – 3 from Lagazette: http://www.lagazette-dgi.com/

• Translated into English by native speakers to create a small parallel corpus for initial analysis and annotation.

Morphological analysis

• UT Austin obtained and adapted XFST analyzer created by Dalrymple, Liakata and Mackie 2006.

• Applied it to the Malagasy website texts from Lakroa and Lagazette, hand-selecting the correct analysis for each word.

• These need to be integrated with the standard tokenization and data organization.

• Kinyarwanda morph analyzer in development.

Syntactic annotations

• Did initial pilot annotations with example sentences from the linguistics literature.

• Annotated KGMC (kin) and Lagazette and Lakroa (mlg) texts with phrase structures.
  – Used a fairly standard set of labels and structures
  – Trees created for both the source language sentences and their English translations

Example KGMC tree

[Two slides showing example KGMC trees; figures not preserved in this transcript]

Syntactic annotations

• Phrase structures were created before standardizing the tokenization; had to be grafted back onto correct tokens.

• Current trees are still pilot annotations! Need to do many things, including:
  – reconsider the choice of node labels
  – add head markers (enable easy conversion to dependency analyses)
  – review and incorporate feedback from others
  – graft some existing trees to standard tokenization
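
To illustrate why head markers make conversion to dependency analyses easy, here is a small sketch. The tree encoding is hypothetical: leaves are `(tag, word)` pairs and internal nodes are `(label, head_index, children)`, where `head_index` says which child is the head. The labels and the choice of heads in the English example are illustrative, not the project's annotation scheme.

```python
def to_dependencies(tree):
    """Convert a head-marked phrase structure tree to dependencies.

    Returns a list of (word, position, head_position) triples with 1-based
    token positions; head_position 0 marks the root of the sentence.
    """
    words, heads = [], {}

    def walk(node):
        if isinstance(node[1], str):        # leaf: (tag, word)
            words.append(node[1])
            return len(words)               # position of this token
        _label, head_index, children = node
        child_heads = [walk(child) for child in children]
        head = child_heads[head_index]
        for position in child_heads:        # non-head children attach to the head
            if position != head:
                heads[position] = head
        return head

    heads[walk(tree)] = 0
    return [(words[i - 1], i, heads[i]) for i in range(1, len(words) + 1)]


# Illustrative English tree; heads chosen by hand for this example.
tree = ("S", 1, [
    ("NP", 1, [("DT", "the"), ("NN", "houses")]),
    ("VP", 0, [
        ("VBP", "belong"),
        ("PP", 0, [("IN", "to"),
                   ("NP", 1, [("DT", "the"), ("NN", "dairy")])]),
    ]),
])
dependencies = to_dependencies(tree)
```

Each phrase passes its head word upward, and every non-head child attaches to that head, so no language-specific head-finding rules are needed once the markers are in the trees.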

Overview

• Source, type and size of data
• Language consultants
• Kinyarwanda data
• Malagasy data
• Annotation
• An idea: data-driven dictionary development
• Accomplishments, challenges, future releases

Data-driven Dictionary Development

• Current dictionary size is moderate
  – 6,632 entries with 3,890 distinct Kin. words/phrases
  – many relatively common words not covered

• Idea: increase dictionary size using translators
  – based on data analysis of monolingual corpora
  – using NLP techniques to leverage process

• Goals
  – Additional bitext for direct use in MT training
  – Improved resource for morphological analyzers

Data-driven Dict. Dev. (Example)

• Monolingual Kinyarwanda corpus contains
  – ikinini (43 occ.), ibinini (96 occ.); not in dictionary

• Automatically predict lexical form(s), POS
  – ikinini (noun, plural: ibinini)

• Elicit English translation: pill, tablet
  – providing examples from corpus in context

• Generate dictionary entry as well as MT bitext
  – ikinini=pill, ikinini=tablet, ibinini=pills, ibinini=tablets
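
The workflow in this example could be sketched as below. This is an illustration under stated assumptions, not the project's method: the prefix table holds only the iki-/ibi- alternation seen in the slide's example, the `min_count` threshold is arbitrary, and English plurals are formed naively by appending "s".

```python
from collections import Counter

# Singular/plural prefix pairs; only iki-/ibi- is taken from the slide's example.
PREFIX_PAIRS = [("iki", "ibi")]


def propose_nouns(counts, dictionary, min_count=10):
    """Propose (singular, plural) noun candidates that are not yet in the dictionary."""
    proposals = []
    for word in sorted(counts):
        for singular_prefix, plural_prefix in PREFIX_PAIRS:
            if not word.startswith(singular_prefix):
                continue
            plural = plural_prefix + word[len(singular_prefix):]
            frequent = counts[word] + counts[plural] >= min_count
            if plural in counts and word not in dictionary and frequent:
                proposals.append((word, plural))
    return proposals


def make_bitext(singular, plural, translations):
    """Expand elicited translations into dictionary/bitext pairs (naive English plural)."""
    entries = [(singular, t) for t in translations]
    entries += [(plural, t + "s") for t in translations]
    return entries


# Corpus counts from the slide; the dictionary is empty for this illustration.
counts = Counter({"ikinini": 43, "ibinini": 96})
proposals = propose_nouns(counts, dictionary=set())
entries = make_bitext("ikinini", "ibinini", ["pill", "tablet"])
```

The translations passed to `make_bitext` stand in for what a translator would supply after seeing corpus examples in context; that elicitation step is manual and is not automated here.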

Accomplishments

• Released monolingual, bilingual, and tree-banked data for Kinyarwanda and Malagasy
  – Data release v1.0 in February 2011
  – Data release v2.0 in October 2011

• Tools that can be shared
  – Tokenizer for Kinyarwanda and Malagasy
  – Diagnostic tools to check encoding, character sets, tokenization, tree well-formedness etc.
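
As a rough idea of what diagnostic checks of this kind might look like, the sketch below tests a bracketed tree string for balanced parentheses and flags characters that often indicate encoding damage. It assumes Penn Treebank-style bracketing and is not the project's actual tooling; both function names are hypothetical.

```python
import unicodedata


def tree_is_well_formed(bracketed):
    """True if parentheses balance and the string holds exactly one top-level tree."""
    depth, top_level = 0, 0
    for ch in bracketed:
        if ch == "(":
            if depth == 0:
                top_level += 1
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:       # closing bracket with nothing open
                return False
    return depth == 0 and top_level == 1


def suspicious_characters(text):
    """Return the sorted set of characters that suggest encoding problems.

    Flags U+FFFD (replacement character) and anything in Unicode category C
    (control, format, surrogate, private use, unassigned), except newline and tab.
    """
    bad = set()
    for ch in text:
        if ch in "\n\t":
            continue
        if ch == "\ufffd" or unicodedata.category(ch).startswith("C"):
            bad.add(ch)
    return sorted(bad)


ok = tree_is_well_formed("(S (NP (NN houses)) (VP (VBP belong)))")
truncated = tree_is_well_formed("(S (NP (NN houses))")
flagged = suspicious_characters("dairy\ufffd\x00")
```

A fuller check would also validate node labels and tokenization against the release conventions, which are not described in this transcript.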

Challenges

• Need for more and better annotation tools to annotate faster and assure consistency
  – sentence segmentation, treebanking, ...

• Need guidelines, workflow for data acquisition and annotation process

• Need reliable language experts for Kinyarwanda and Malagasy

• Need more data: Wikipedia, LDS, mlg/fre

Data release v2.1 (target: Dec. 2011)

• Full sentence-level segmentation on Kinyarwanda-English text

• Release tokenizers, morph analyzers, diagnostic tools

Data release v3.0 (target: May 2012)

Highest priority

• In-domain bilingual test sets
  – 500 sentences (300 newswire, 200 conversation)
  – Naturally occurring, source texts on both sides
  – Multiple translations if possible

• Large, modern Malagasy monolingual corpora
• Head markings (syntactic)
• Word alignment

Data release v3.0 (target: May 2012)

Next priority

• Increase size of Kinyarwanda-English dictionary
• More Malagasy-English news bitext
• Typo correction
• Bible in Kinyarwanda (?)
• Malagasy-English dictionary
• Morphological gold standard

