
SYNERGY - A Named Entity Recognition System for Resource-scarce Languages such as Swahili using Online Machine Translation

Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking

May 18th, 2010

Language Technologies Institute, Carnegie Mellon University

Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking SYNERGY

NER for Resource-scarce languages: Overview

- Named Entity Recognition (NER) is the task of finding names in text and, optionally, classifying them into persons, organizations, locations, etc.
- Example: Wolff, currently a journalist in Argentina, played with Del Bosque in the final years of the seventies in Real Madrid.
- Vital first step for many problems such as parsing, co-reference resolution, translation, indexing and semantic analysis.
- State-of-the-art NER systems rely on machine learning.
- Machine learning requires large amounts of labeled training data, which is costly and time-consuming to produce.
- Upshot: For many widely spoken languages, e.g. Swahili, no NER systems are freely available.


Motivation

- Online Machine Translation (MT) systems such as Google Translate and Microsoft Bing Translator support many resource-scarce languages for which NER systems don’t exist.
- Google Translate supports Swahili ⇔ English translation.
- Quality of translation is far from perfect.
- However, this might still be good enough for NER. Why?
- Observation: Named Entities (NEs) are preserved during Swahili ⇔ English translation.


Motivation: Named Entity Preservation - Example

All Named Entities shown in Red:

- Sample Swahili sentence: “Lundenga aliwataja mawakala ambao wameshatuma maombi kuwa ni kutoka mikoa ya Iringa Dodoma Mbeya Mwanza na Ruvuma.”
- English version from Google Translate: “Lundenga mentioned shatuma agents who have a prayer from the regions of Dodoma Iringa Mwanza Mbeya and Ruvuma.”
- Sample English sentence: “President Obama advised German Chancellor Angela Merkel.”
- Swahili version from Google Translate: “Rais Obama wanashauriwa Ujerumani Angela Merkel.”
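
The preservation claim can be checked mechanically. The sketch below is illustrative only: the function name and the hand-picked list of NE tokens are assumptions, not part of SYNERGY. It counts how many source-side NE tokens reappear verbatim in the translated sentence.

```python
def preserved_fraction(source_nes, translated_sentence):
    """Return the share of source NE tokens that appear verbatim in the translation."""
    # Tokenize on whitespace and strip surrounding punctuation and quotes.
    translated_tokens = {
        tok.strip('.,"“”').lower() for tok in translated_sentence.split()
    }
    if not source_nes:
        return 1.0
    kept = [ne for ne in source_nes if ne.lower() in translated_tokens]
    return len(kept) / len(source_nes)


# NE tokens picked by hand from the Swahili sample sentence (an assumption
# made for this illustration, since the slide's red highlighting is lost).
swahili_nes = ["Lundenga", "Iringa", "Dodoma", "Mbeya", "Mwanza", "Ruvuma"]
english = (
    "Lundenga mentioned shatuma agents who have a prayer from the "
    "regions of Dodoma Iringa Mwanza Mbeya and Ruvuma."
)
print(preserved_fraction(swahili_nes, english))  # 1.0
```

Word order changes across translation (Iringa and Dodoma swap places), so the check deliberately ignores position and looks only at token identity.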


SYNERGY: Big Picture

1. Swahili text is translated to English.
2. The best off-the-shelf NER systems are applied to the resulting English text.
3. English NEs are mapped back to words in the Swahili text using word alignment.
4. Post-processing using dictionaries improves performance.

- Observation: SYNERGY addresses the problem of NER for a new language by breaking it into three relatively easier problems.
- Observation: Labeled training data is no longer needed.
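
The four steps compose into a simple pipeline. The following is a minimal sketch under stated assumptions: every function name is hypothetical, and the toy stand-ins replace the real translator, NER systems, aligner and dictionaries only so that the example runs on its own.

```python
from typing import Callable, List, Set, Tuple


def synergy_pipeline(
    swahili_tokens: List[str],
    translate: Callable[[List[str]], List[str]],
    english_ner: Callable[[List[str]], Set[int]],
    align: Callable[[List[str], List[str]], List[Tuple[int, int]]],
    post_process: Callable[[List[str], Set[int]], Set[int]],
) -> Set[int]:
    """Return the positions of Swahili tokens labeled as named entities."""
    english_tokens = translate(swahili_tokens)          # step 1: MT
    english_ne_positions = english_ner(english_tokens)  # step 2: English NER
    links = align(english_tokens, swahili_tokens)       # step 3: word alignment
    swahili_ne_positions = {s for e, s in links if e in english_ne_positions}
    return post_process(swahili_tokens, swahili_ne_positions)  # step 4


# Toy stand-ins (hypothetical, for illustration only).
TOY_LEXICON = {"Rais": "President"}


def toy_translate(tokens):
    return [TOY_LEXICON.get(t, t) for t in tokens]


def toy_ner(tokens):
    return {i for i, t in enumerate(tokens) if t in {"Obama", "Angela", "Merkel"}}


def toy_align(english, swahili):
    # Pretend the alignment is one-to-one and monotone.
    return [(i, i) for i in range(min(len(english), len(swahili)))]


def toy_post_process(tokens, positions):
    return positions  # no dictionary corrections in this toy run


result = synergy_pipeline(
    ["Rais", "Obama", "Angela", "Merkel"],
    toy_translate, toy_ner, toy_align, toy_post_process,
)
print(sorted(result))  # [1, 2, 3]
```

Passing each stage in as a callable mirrors the point made on the slide: any stage can be swapped for a better off-the-shelf component without touching the others.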


Comparison to existing NER systems

- No Swahili NER system available, so how do we compare SYNERGY and its MT-based approach?
- Extend SYNERGY to perform NER for another language for which freely available NER systems do exist.
- We use Arabic, because it is typically considered a ‘hard’ language for which to do NER.


Resources - Machine Translation

- Google Translate for Swahili and Arabic text. Microsoft Bing Translator is not yet ready for Arabic and doesn’t support Swahili.
- Would like to use other systems, but for these languages, none are freely available.
- Advantage: As soon as new languages become available on any MT system, SYNERGY can be used with few changes.
- Can also use MT systems automatically generated from parallel data using freely available toolkits such as Moses.


Resources - Named Entities

- Swahili labeled data: Helsinki Corpus of Swahili.
- Arabic labeled data: ANERcorp.
- NER: Stanford’s CRF-based NER system and UIUC’s LBJ NE Tagger, the state of the art in English NER.
- Use only off-the-shelf NERs to show that a good NER system can be developed for new languages quickly.
- Post-processing: Swahili-English dictionary by the Kamusi Project and the Linguistic Data Consortium’s Arabic TreeBank.


Architecture: 1 - Translation to English

- Perl interface to Google Translate.
- Translate each sentence individually.
- Many sentences are too long for the Google Translate API and need to be split. Named Entities are unaffected by this.
- Produces a translated English document for each Swahili document.
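
Splitting over-long sentences before translation can be sketched as follows. This is an assumption-laden illustration: SYNERGY's interface was written in Perl, the slides do not state the API's length limit, and `split_for_translation` and `max_chars` are hypothetical names.

```python
from typing import List


def split_for_translation(sentence: str, max_chars: int) -> List[str]:
    """Split a sentence at word boundaries into chunks of at most max_chars.

    A single word longer than max_chars is kept whole in its own chunk.
    """
    chunks: List[str] = []
    current = ""
    for word in sentence.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


print(split_for_translation("a bb ccc dddd", 6))  # ['a bb', 'ccc', 'dddd']
```

Because the cut always falls between words, no individual name token is broken apart, and joining the chunks with spaces restores the original token sequence.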


Architecture: 2 - NER on Translated English Text

- SYNERGY uses a combination of Stanford and UIUC NER systems called Union.
- Why? Observation: Stanford and UIUC NER systems don’t always make the same mistakes.
- Idea: Exploit this to improve performance.

System     Precision  Recall  F1
Stanford   0.962      0.963   0.963
UIUC       0.968      0.964   0.966
Union      0.952      0.985   0.968

Table: Performance of NER systems on CoNLL 2003 Test Set
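
One plausible reading of Union is a token-level merge of the two systems' outputs. The slides do not say how disagreements on entity type are resolved, so the rule below (keep the first system's label, fall back to the second) is an assumption, as are the function name and the toy labels.

```python
from typing import List


def union_tags(tags_a: List[str], tags_b: List[str]) -> List[str]:
    """Merge two label sequences: a token is an NE if either system tags it.

    "O" marks a non-entity token. When both systems tag a token, the label
    from the first system is kept (an assumed tie-breaking rule).
    """
    if len(tags_a) != len(tags_b):
        raise ValueError("label sequences must have the same length")
    return [a if a != "O" else b for a, b in zip(tags_a, tags_b)]


# Toy labels for "President Obama advised Angela Merkel".
system_a = ["O", "PER", "O", "O", "PER"]
system_b = ["O", "PER", "O", "PER", "O"]
print(union_tags(system_a, system_b))  # ['O', 'PER', 'O', 'PER', 'PER']
```

Accepting every entity that either system proposes recovers names that one system misses, which matches the higher recall and slightly lower precision reported for Union in the table.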

Architecture: 3 - Word Alignment

- Alignment of English NEs with words in the original Swahili document to discover Swahili NEs.
- Most challenging component of SYNERGY.
- We try two different algorithms:
  1. Window Based Alignment.
  2. GIZA++ Based Alignment.


Architecture: 3 - Word Alignment - Window Based

- Initial approach:
  1. Translate each English NE word back to a Swahili word.
  2. Search in the original Swahili document for a match within a window of words.
- Produces very few matches.
- Modification:
  1. Create a new English document by translating each Swahili word one at a time.
  2. Search in this new document for English NE words within a window of words.
  3. Keep pointers from each English word in the new document to the Swahili word that produces it.
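
The modified window-based approach can be sketched as below. The details are assumptions: the window is centered on the NE's position in the translated document, the closest match wins, and `translate_word` is a hypothetical word-level translator, here backed by a toy lexicon.

```python
from typing import Callable, Dict, List


def window_align(
    swahili_tokens: List[str],
    english_tokens: List[str],
    english_ne_positions: List[int],
    translate_word: Callable[[str], str],
    window: int = 3,
) -> Dict[int, int]:
    """Map each English NE position to the Swahili position that produced it."""
    # Word-by-word gloss: gloss[i] comes from swahili_tokens[i], so the index
    # itself is the pointer back to the Swahili word.
    gloss = [translate_word(tok).lower() for tok in swahili_tokens]
    links: Dict[int, int] = {}
    for pos in english_ne_positions:
        target = english_tokens[pos].lower()
        lo = max(0, pos - window)
        hi = min(len(gloss), pos + window + 1)
        candidates = [i for i in range(lo, hi) if gloss[i] == target]
        if candidates:
            # Prefer the match closest to the NE's own position.
            links[pos] = min(candidates, key=lambda i: abs(i - pos))
    return links


# Toy lexicon (illustrative); unknown words pass through unchanged.
TOY_WORDS = {"mikoa": "regions", "ya": "of", "na": "and"}
swahili = "mikoa ya Iringa Dodoma Mbeya Mwanza na Ruvuma".split()
english = "regions of Dodoma Iringa Mwanza Mbeya and Ruvuma".split()
alignment = window_align(
    swahili, english, [2, 3, 4, 5, 7], lambda w: TOY_WORDS.get(w, w)
)
print(alignment)  # {2: 3, 3: 2, 4: 5, 5: 4, 7: 7}
```

The window is what lets the search tolerate the local reordering seen in the sample sentence, where Iringa/Dodoma and Mbeya/Mwanza swap places in translation.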


Architecture: 3 - Word Alignment - GIZA++ Based

- GIZA++ is the state of the art for word alignment in Machine Translation.
  1. Takes an input language corpus and an output language corpus, and automatically generates word classes for both.
  2. For each sentence in the input language corpus, finds the most probable alignment of the corresponding output language sentence.
- SYNERGY: The translated English documents serve as the input corpus and the Swahili documents as the output corpus.
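
Once an aligner has produced word links, projecting NE labels onto the Swahili side is straightforward. The sketch below assumes the alignment has already been parsed into (English index, Swahili index) pairs; it does not read GIZA++'s actual output format, and the function name, labels and links are illustrative.

```python
from typing import List, Tuple


def project_labels(
    english_labels: List[str],
    links: List[Tuple[int, int]],
    swahili_length: int,
) -> List[str]:
    """Copy each English NE label to the Swahili token it is aligned with."""
    swahili_labels = ["O"] * swahili_length
    for e_idx, s_idx in links:
        if english_labels[e_idx] != "O":
            swahili_labels[s_idx] = english_labels[e_idx]
    return swahili_labels


# English: President Obama advised German Chancellor Angela Merkel
# Swahili: Rais Obama wanashauriwa Ujerumani Angela Merkel
english_labels = ["O", "PER", "O", "O", "O", "PER", "PER"]  # toy labels
links = [(0, 0), (1, 1), (2, 2), (3, 3), (5, 4), (6, 5)]    # toy alignment
print(project_labels(english_labels, links, 6))
# ['O', 'PER', 'O', 'O', 'PER', 'PER']
```

English tokens with no link, such as "Chancellor" in this toy example, simply contribute nothing to the Swahili labeling.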


Architecture: 4 - Post-Processing

- Compile POS-tagged English, Swahili and Arabic dictionaries.
- Divide each into Named Entity (NE) and Non-Named Entity (NNE) sections. A word may occur in both sections.
  1. If a word labeled as a Named Entity occurs in the NNE section of a dictionary but not in its NE section, the label is removed.
  2. If a word not labeled as a Named Entity occurs in the NE section of a dictionary but not in its NNE section, the word is labeled as a Named Entity.
- Yields significant improvement in NER performance.
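
The two dictionary rules translate directly into code. In this sketch the dictionary sections are plain sets of lowercased words and a single generic "NE" label is used, since the slides do not say which entity type rule 2 assigns; these choices and all names are assumptions.

```python
from typing import List, Set


def post_process(
    tokens: List[str],
    labels: List[str],
    ne_section: Set[str],
    nne_section: Set[str],
) -> List[str]:
    """Apply the two dictionary rules; "O" marks a non-entity token."""
    fixed = []
    for token, label in zip(tokens, labels):
        word = token.lower()
        in_ne = word in ne_section
        in_nne = word in nne_section
        if label != "O" and in_nne and not in_ne:
            label = "O"   # rule 1: remove a label the dictionary contradicts
        elif label == "O" and in_ne and not in_nne:
            label = "NE"  # rule 2: add a label the dictionary supports
        fixed.append(label)
    return fixed


tokens = ["Lundenga", "aliwataja", "mawakala", "Iringa"]
labels = ["NE", "O", "NE", "O"]  # toy system output with two errors
print(post_process(tokens, labels, {"iringa"}, {"mawakala"}))
# ['NE', 'O', 'O', 'NE']
```

Words found in both sections, or in neither, keep whatever label the aligner gave them, so the dictionaries only override the system when they are unambiguous.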


Results

Version          Precision  Recall  F1
Window w/o PP    0.676      0.694   0.685
Window with PP   0.818      0.704   0.757
GIZA++ w/o PP    0.534      0.900   0.670
GIZA++ with PP   0.754      0.886   0.815

Table: Results for Swahili NER (PP = post-processing)

System                     Precision  Recall  F1
SYNERGY Window w/o PP      0.680      0.502   0.578
SYNERGY Window with PP     0.848      0.600   0.703
SYNERGY GIZA++ w/o PP      0.530      0.702   0.603
SYNERGY GIZA++ with PP     0.761      0.817   0.788
Benajiba and Rosso, 2008   0.869      0.727   0.792

Table: Results for Arabic NER (PP = post-processing)
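
The F1 column is consistent with the standard definition of F1 as the harmonic mean of precision and recall. The slides do not state the formula, so treating it as the one used is an assumption; the short check below recomputes the Swahili F1 values from the tabulated precision and recall.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the Swahili results table.
swahili_rows = {
    "Window w/o PP": (0.676, 0.694),
    "Window with PP": (0.818, 0.704),
    "GIZA++ w/o PP": (0.534, 0.900),
    "GIZA++ with PP": (0.754, 0.886),
}
for version, (precision, recall) in swahili_rows.items():
    print(f"{version}: F1 = {f1_score(precision, recall):.3f}")
```

The recomputed values show the trade-off in the table: GIZA++ gains far more recall than it loses in precision, and post-processing then restores most of the precision.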


Error Analysis

- Three possible types of errors:
  - Named Entities that are lost during translation from the source language to English.
  - Errors made by the English NER module of SYNERGY.
  - A correctly recognized English NE word that gets mapped to the wrong word in the source language document during alignment.
- Analysis of the distribution of SYNERGY’s errors requires NE gold standard data for the English translations of our Swahili and Arabic test sets in addition to native gold standard data.
- Since no other known systems employ an MT-based approach to NER, such data is not currently available.


Example - Swahili

True NEs in Bold, System output in Red.

- Sample Swahili sentence: “Lundenga aliwataja mawakala ambao wameshatuma maombi kuwa ni kutoka mikoa ya Iringa Dodoma Mbeya Mwanza na Ruvuma.”
- Translated English sentence: “Lundenga mentioned shatuma agents who have a prayer from the regions of Dodoma Iringa Mwanza Mbeya and Ruvuma.”
- NE labeled English sentence: “Lundenga mentioned shatuma agents who have a prayer from the regions of Dodoma Iringa Mwanza Mbeya and Ruvuma.”
- After GIZA++ alignment and post-processing: “Lundenga aliwataja mawakala ambao wameshatuma maombi kuwa ni kutoka mikoa ya Iringa Dodoma Mbeya Mwanza na Ruvuma.”


Observations

- Initial motivation validated by SYNERGY’s results: Although there are many inaccuracies in both the Swahili and Arabic translations, a vast majority of NE words are preserved across translation and successfully recognized by an English NER system.
- Performing English NER helps us avoid difficulties inherent in native Swahili and Arabic NER, e.g. ambiguous function words, recognizing clitics, etc.


Future work - Co-reference Resolution

- Determining whether two entity mentions in a text refer to the same real-world entity.
- Three types of entity mentions:
  1. Named mentions (i.e. NEs) (e.g. Barack Obama)
  2. Nominal mentions (e.g. President of United States)
  3. Pronominal mentions (e.g. he)
- Poor results. Why?
- Answer: Unlike Named Entities, nominal and pronominal mentions are not preserved during MT.
- Possible area of future research.


Conclusion

- Best-case F1 = 0.788 for Arabic and F1 = 0.815 for Swahili.
- SYNERGY’s F1 score for Arabic comes very close to the state-of-the-art.
- SYNERGY’s F1 score for Swahili is a good first effort in the field of Swahili NER.
- Freely available at http://www.cs.cmu.edu/~encore/. We hope it will be a valuable tool for researchers wishing to work with Swahili text.
- Intend to use SYNERGY to perform NER for other resource-scarce languages supported by online translators.
- Related Talk: “ENCORE: Experiments with a Synthetic Entity Co-reference Resolution Tool” by Bo Lin, LREC Workshop on Resources and Evaluation for Entity Resolution and Entity Management (W16), May 22, 2010.


And we’re done.

Thank You!


Example - Arabic

- Arabic sentence (in Buckwalter encoding): “n$yr AlY h*h AlZAhrp l>nhA t$kl Aljw Al*y yEml fyh fryq Alr}ys jwrj bw$ wrAys wAlsfyr jwn bwltwn fy AlAmm AlmtHdp , wAl*y symyz Alkvyr mn AlEnASr Alty qd yqdmhA bED AlAwrwbyyn lrdE tmAdy AlwlAyAt AlmtHdp fy AlAstvmAr bqrAr AlEdwAn AlAsrA}yly ElY lbnAn.”
- NE labeled English sentence: “Refer to this phenomenon because it is the atmosphere with a team of President George W. Bush Rice and Ambassador John Bolton at the United Nations which will recognize a lot of elements that might make some Europeans to deter the persistence of the United States decision to invest in the Israeli aggression on Lebanon.”
- After GIZA++ and post-processing: “n$yr AlY h*h AlZAhrp l>nhA t$kl Aljw Al*y yEml fyh fryq Alr}ys jwrj bw$ wrAys wAlsfyr jwn bwltwn fy AlAmm AlmtHdp , wAl*y symyz Alkvyr mn AlEnASr Alty qd yqdmhA bED AlAwrwbyyn lrdE tmAdy AlwlAyAt AlmtHdp fy AlAstvmAr bqrAr AlEdwAn AlAsrA}yly ElY lbnAn.”
