SYNERGY: A Named Entity Recognition System for Resource-scarce
Languages such as Swahili using Online Machine Translation
Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking
May 18th, 2010
Language Technologies Institute, Carnegie Mellon University
Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking SYNERGY
NER for Resource-scarce languages: Overview
• Named Entity Recognition (NER) is the task of finding names in text and, optionally, classifying them into persons, organizations, locations, etc.
• Example: Wolff, currently a journalist in Argentina, played with Del Bosque in the final years of the seventies in Real Madrid.
• Vital first step for many problems such as parsing, co-reference resolution, translation, indexing and semantic analysis.
• State-of-the-art NER systems rely on machine learning.
• This requires large amounts of labeled training data, which is costly and time-consuming to produce.
• Upshot: For many widely spoken languages, e.g. Swahili, no NER systems are freely available.
Motivation
• Online Machine Translation (MT) systems such as Google Translate and Microsoft Bing Translator support many resource-scarce languages for which NER systems don’t exist.
• Google Translate supports Swahili ⇔ English translation.
• Quality of translation is far from perfect.
• However, this might still be good enough for NER. Why?
• Observation: Named Entities (NEs) are preserved during Swahili ⇔ English translation.
Motivation: Named Entity Preservation - Example
All Named Entities shown in Red:
• Sample Swahili sentence: “Lundenga aliwataja mawakala ambao wameshatuma maombi kuwa ni kutoka mikoa ya Iringa Dodoma Mbeya Mwanza na Ruvuma.”
• English version from Google Translate: “Lundenga mentioned shatuma agents who have a prayer from the regions of Dodoma Iringa Mwanza Mbeya and Ruvuma.”
• Sample English sentence: “President Obama advised German Chancellor Angela Merkel.”
• Swahili version from Google Translate: “Rais Obama wanashauriwa Ujerumani Angela Merkel.”
SYNERGY: Big Picture
1. Swahili text is translated to English.
2. The best off-the-shelf NER systems are applied to the resulting English text.
3. English NEs are mapped back to words in the Swahili text using word alignment.
4. Post-processing using dictionaries to improve performance.
• Observation: SYNERGY addresses the problem of NER for a new language by breaking it into three relatively easier problems.
• Observation: No longer need labeled training data.
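The four steps can be sketched as a small Python function. This is only an illustration under stated assumptions: the function name, the callable parameters and the toy stand-ins below are hypothetical, and none of them is the actual SYNERGY code or a real translation API.

```python
from typing import Callable, List, Set, Tuple

def synergy_pipeline(
    swahili_tokens: List[str],
    translate: Callable[[List[str]], List[str]],
    tag_english: Callable[[List[str]], Set[int]],
    align: Callable[[List[str], List[str]], List[Tuple[int, int]]],
    post_process: Callable[[List[str], Set[int]], Set[int]],
) -> Set[int]:
    """Return indices of Swahili tokens labeled as Named Entities."""
    english_tokens = translate(swahili_tokens)          # step 1: MT to English
    english_ne = tag_english(english_tokens)            # step 2: English NER
    links = align(english_tokens, swahili_tokens)       # step 3: word alignment
    swahili_ne = {s for e, s in links if e in english_ne}
    return post_process(swahili_tokens, swahili_ne)     # step 4: dictionaries

# Toy stand-ins (hypothetical) so the sketch runs end to end.
toy_lexicon = {"aliwataja": "mentioned", "mawakala": "agents"}

def toy_translate(tokens: List[str]) -> List[str]:
    return [toy_lexicon.get(t, t) for t in tokens]

def toy_tagger(tokens: List[str]) -> Set[int]:
    # Pretend that capitalized tokens are names.
    return {i for i, t in enumerate(tokens) if t[:1].isupper()}

def toy_align(english: List[str], swahili: List[str]) -> List[Tuple[int, int]]:
    # Pretend that word order is preserved.
    return [(i, i) for i in range(min(len(english), len(swahili)))]

def no_post_processing(tokens: List[str], ne: Set[int]) -> Set[int]:
    return ne

demo_result = synergy_pipeline(
    ["Lundenga", "aliwataja", "mawakala"],
    toy_translate, toy_tagger, toy_align, no_post_processing,
)
```

Passing each stage in as a callable mirrors the point of the slide: any MT system, English tagger or aligner can be swapped in without touching the rest.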
Comparison to existing NER systems
• No Swahili NER system available, so how do we compare SYNERGY and its MT-based approach?
• Extend SYNERGY to perform NER for another language for which freely available NER systems do exist.
• We use Arabic, because it is typically considered a ‘hard’ language for which to do NER.
Resources - Machine Translation
• Google Translate for Swahili and Arabic text. Microsoft Bing Translator not yet ready for Arabic, doesn’t support Swahili.
• Would like to use other systems, but for these languages, none freely available.
• Advantage: As soon as new languages become available on any MT system, SYNERGY can be used with few changes.
• Can also use MT systems automatically generated from parallel data using freely available toolkits such as Moses.
Resources - Named Entities
• Swahili labeled data: Helsinki Corpus of Swahili.
• Arabic labeled data: ANERcorp.
• NER: Stanford’s CRF-based NER system and UIUC’s LBJ NE Tagger. State-of-the-art in English NER.
• Use only off-the-shelf NERs to show that a good NER system can be developed for new languages quickly.
• Post-Processing: Swahili-English dictionary by Kamusi Project and the Linguistic Data Consortium’s Arabic TreeBank.
Architecture: 1 - Translation to English
• Perl interface to Google Translate.
• Translate each sentence individually.
• Many sentences too long for Google Translate API, need to be split. Named Entities unaffected by this.
• Produces a translated English document for each Swahili document.
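One simple way to split an over-long sentence is to cut at whitespace so that each piece stays under a length limit. The sketch below assumes a character limit; the function name and the limit of 40 are hypothetical, and the slides do not say how the split points were actually chosen.

```python
from typing import List

def split_for_translation(sentence: str, max_chars: int) -> List[str]:
    """Split a sentence at spaces into chunks of at most max_chars characters.

    A single word longer than max_chars is kept whole in its own chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    length = 0
    for word in sentence.split():
        extra = len(word) + (1 if current else 0)  # +1 for the joining space
        if current and length + extra > max_chars:
            chunks.append(" ".join(current))
            current, length = [word], len(word)
        else:
            current.append(word)
            length += extra
    if current:
        chunks.append(" ".join(current))
    return chunks

demo_sentence = ("Lundenga aliwataja mawakala ambao wameshatuma maombi kuwa ni "
                 "kutoka mikoa ya Iringa Dodoma Mbeya Mwanza na Ruvuma.")
demo_chunks = split_for_translation(demo_sentence, 40)
```

Joining the chunks with single spaces restores the original sentence, so the translated pieces can be concatenated in order.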
Architecture: 2 - NER on Translated English Text
• SYNERGY uses a combination of Stanford and UIUC NER systems called Union.
• Why? Observation: Stanford and UIUC NER systems don’t always make the same mistakes.
• Idea: Exploit this to improve performance.
System    Precision  Recall  F1
Stanford  0.962      0.963   0.963
UIUC      0.968      0.964   0.966
Union     0.952      0.985   0.968
Table: Performance of NER systems on CoNLL 2003 Test Set
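A token-level union of two taggers can be sketched as follows. The slides do not say how Union resolves cases where the two systems assign different types, so this sketch assumes the first system's label wins; the function name and the "O" convention for non-entities are also assumptions.

```python
from typing import List

def union_labels(primary: List[str], secondary: List[str]) -> List[str]:
    """Label a token as an entity if either tagger does.

    "O" marks a non-entity token. When both taggers assign a label,
    the primary tagger's label is kept (an assumption of this sketch).
    """
    if len(primary) != len(secondary):
        raise ValueError("label sequences must cover the same tokens")
    return [p if p != "O" else s for p, s in zip(primary, secondary)]

stanford_out = ["PER", "O", "O", "LOC"]   # illustrative labels, not real output
uiuc_out = ["PER", "O", "ORG", "O"]
combined = union_labels(stanford_out, uiuc_out)
```

Taking a union can only add entity labels, which is consistent with the table: recall rises while precision drops slightly.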
Architecture: 3 - Word Alignment
• Alignment of English NEs with words in original Swahili document to discover Swahili NEs.
• Most challenging component of SYNERGY.
• We try two different algorithms:
1. Window Based Alignment.
2. GIZA++ Based Alignment.
Architecture: 3 - Word Alignment - Window Based
• Initial Approach:
1. Translate each English NE word back to a Swahili word.
2. Search in the original Swahili document for a match within a window of words.
• Produces very few matches.
• Modification:
1. Create a new English document by translating each Swahili word one at a time.
2. Search in this new document for English NE words within a window of words.
3. Keep pointers from each English word in the new document to the Swahili word that produces it.
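The modified window search can be sketched as below. The slides do not specify how the window is positioned or how ties are broken, so this sketch assumes the window is centered on the English NE's own position and that the nearest match wins; `translate_word` is a hypothetical word-level translator, replaced here by a toy lexicon.

```python
from typing import Callable, List, Set

def window_align(
    swahili_tokens: List[str],
    english_tokens: List[str],
    english_ne: Set[int],
    translate_word: Callable[[str], str],
    window: int = 5,
) -> Set[int]:
    """Map English NE positions to Swahili token positions."""
    # Word-by-word English document; glosses[i] points back to swahili_tokens[i].
    glosses = [translate_word(w).lower() for w in swahili_tokens]
    swahili_ne: Set[int] = set()
    for e in sorted(english_ne):
        target = english_tokens[e].lower()
        lo = max(0, e - window)
        hi = min(len(glosses), e + window + 1)
        candidates = [i for i in range(lo, hi) if glosses[i] == target]
        if candidates:
            # Keep the match closest to the English position.
            swahili_ne.add(min(candidates, key=lambda i: abs(i - e)))
    return swahili_ne

toy_word_lexicon = {"aliwataja": "mentioned", "mikoa": "regions", "ya": "of"}
swahili = ["Lundenga", "aliwataja", "mikoa", "ya", "Iringa", "Dodoma"]
english = ["Lundenga", "mentioned", "regions", "of", "Dodoma", "Iringa"]
aligned = window_align(
    swahili, english, {0, 4, 5},
    lambda w: toy_word_lexicon.get(w, w), window=2,
)
```

In the toy example, "Dodoma" and "Iringa" swap places in translation, and the window still recovers both.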
Architecture: 3 - Word Alignment - GIZA++ Based
• GIZA++ is state-of-the-art for word alignment in Machine Translation.
1. Takes an input language corpus and an output language corpus, and automatically generates word classes for both.
2. For each sentence in the input language corpus, finds the most probable alignment of the corresponding output language sentence.
• SYNERGY: The translated English documents serve as the input corpus and the Swahili documents as the output corpus.
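Once an alignment is available, projecting the English labels onto the Swahili sentence is a simple lookup. The sketch below assumes the alignment has already been parsed into (English index, Swahili index) pairs; reading GIZA++'s own output files is not shown, and the function name and labels are illustrative.

```python
from typing import List, Tuple

def project_entities(
    english_labels: List[str],
    links: List[Tuple[int, int]],
    swahili_length: int,
) -> List[str]:
    """Copy each English entity label to the Swahili token it is aligned to."""
    swahili_labels = ["O"] * swahili_length
    for e_idx, s_idx in links:
        if english_labels[e_idx] != "O":
            swahili_labels[s_idx] = english_labels[e_idx]
    return swahili_labels

# English: Lundenga mentioned regions of Dodoma Iringa
# Swahili: Lundenga aliwataja mikoa ya Iringa Dodoma
english_labels = ["PER", "O", "O", "O", "LOC", "LOC"]
links = [(0, 0), (1, 1), (2, 2), (4, 5), (5, 4)]  # "of" left unaligned
projected = project_entities(english_labels, links, swahili_length=6)
```

Unaligned Swahili tokens keep the "O" label, so alignment gaps lower recall rather than introduce spurious entities.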
Architecture: 4 - Post-Processing
• Compile POS-tagged English, Swahili and Arabic dictionaries.
• Divide each into Named Entity (NE) and Non-Named Entity (NNE) sections. A word may occur in both sections.
1. If a word labeled as a Named Entity occurs in the NNE section of a dictionary but not in its NE section, the label is removed.
2. If a word not labeled as a Named Entity occurs in the NE section of a dictionary but not in its NNE section, the word is labeled as a Named Entity.
• Yields significant improvement in NER performance.
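The two dictionary rules translate directly into set lookups. In this sketch the dictionary sections are plain Python sets and words are lowercased before lookup; both choices, and the function name, are assumptions made for illustration.

```python
from typing import List, Set

def apply_dictionary_rules(
    tokens: List[str],
    ne_indices: Set[int],
    ne_section: Set[str],
    nne_section: Set[str],
) -> Set[int]:
    """Apply the two post-processing rules and return the corrected NE indices."""
    result = set(ne_indices)
    for i, token in enumerate(tokens):
        word = token.lower()
        in_ne, in_nne = word in ne_section, word in nne_section
        if i in result and in_nne and not in_ne:
            result.discard(i)   # rule 1: listed only as a non-entity
        elif i not in result and in_ne and not in_nne:
            result.add(i)       # rule 2: listed only as an entity
    return result

tokens = ["Lundenga", "aliwataja", "mawakala", "Iringa"]
corrected = apply_dictionary_rules(
    tokens,
    ne_indices={0, 2},                       # "mawakala" wrongly labeled
    ne_section={"iringa"},
    nne_section={"aliwataja", "mawakala"},
)
```

Words that appear in both sections, or in neither, are left as the aligner labeled them, which matches the conditions stated in the two rules.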
Results
Version          Prec   Recl   F1
Window w/o PP    0.676  0.694  0.685
Window with PP   0.818  0.704  0.757
GIZA++ w/o PP    0.534  0.900  0.670
GIZA++ with PP   0.754  0.886  0.815
Table: Results for Swahili NER (PP = post-processing)
System                     Prec   Recl   F1
SYNERGY Window w/o PP      0.680  0.502  0.578
SYNERGY Window with PP     0.848  0.600  0.703
SYNERGY GIZA++ w/o PP      0.530  0.702  0.603
SYNERGY GIZA++ with PP     0.761  0.817  0.788
Benajiba and Rosso, 2008   0.869  0.727  0.792
Table: Results for Arabic NER (PP = post-processing)
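The F1 column can be checked against the other two, assuming the standard definition of F1 as the harmonic mean of precision and recall (the slides do not state the formula). Recomputing from the rounded figures may differ from a reported value in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

swahili_best = round(f1_score(0.754, 0.886), 3)  # GIZA++ with PP, Swahili
arabic_best = round(f1_score(0.761, 0.817), 3)   # GIZA++ with PP, Arabic
```

Both values reproduce the best-case rows of the tables (0.815 and 0.788).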
Error Analysis
• Three possible types of errors:
  - Named Entities that are lost during translation from the source language to English.
  - Errors made by the English NER module of SYNERGY.
  - A correctly recognized English NE word that gets mapped to the wrong word in the source language document during alignment.
• Analysis of distribution of SYNERGY’s errors requires NE gold standard data for the English translations of our Swahili and Arabic test sets in addition to native gold standard data.
• Since no other known systems employ an MT-based approach to NER, such data is not currently available.
Example - Swahili
True NEs in Bold, System output in Red.
• Sample Swahili sentence: “Lundenga aliwataja mawakala ambao wameshatuma maombi kuwa ni kutoka mikoa ya Iringa Dodoma Mbeya Mwanza na Ruvuma.”
• Translated English sentence: “Lundenga mentioned shatuma agents who have a prayer from the regions of Dodoma Iringa Mwanza Mbeya and Ruvuma.”
• NE labeled English sentence: “Lundenga mentioned shatuma agents who have a prayer from the regions of Dodoma Iringa Mwanza Mbeya and Ruvuma.”
• After GIZA++ alignment and post-processing: “Lundenga aliwataja mawakala ambao wameshatuma maombi kuwa ni kutoka mikoa ya Iringa Dodoma Mbeya Mwanza na Ruvuma.”
Observations
• Initial motivation validated by SYNERGY’s results: Although there are many inaccuracies in both the Swahili and Arabic translations, a vast majority of NE words are preserved across translation and successfully recognized by an English NER system.
• Performing English NER helps us avoid difficulties inherent in native Swahili and Arabic NER, e.g. ambiguous function words, recognizing clitics, etc.
Future work - Co-reference Resolution
• Determining whether two entity mentions in a text refer to the same entity in the real world.
• Three types of entity mentions:
1. Named mentions (i.e. NEs) (e.g. Barack Obama)
2. Nominal mentions (e.g. President of the United States)
3. Pronominal mentions (e.g. he)
• Poor results. Why?
• Answer: Unlike Named Entities, nominal and pronominal mentions are not preserved during MT.
• Possible area of future research.
Conclusion
• Best-case F1 = 0.788 for Arabic and F1 = 0.815 for Swahili.
• SYNERGY’s F1 score for Arabic comes very close to the state-of-the-art.
• SYNERGY’s F1 score for Swahili is a good first effort in the field of Swahili NER.
• Freely available at http://www.cs.cmu.edu/~encore/. We hope it will be a valuable tool for researchers wishing to work with Swahili text.
• Intend to use SYNERGY to perform NER for other resource-scarce languages supported by online translators.
• Related Talk: “ENCORE: Experiments with a Synthetic Entity Co-reference Resolution Tool” by Bo Lin, LREC Workshop on Resources and Evaluation for Entity Resolution and Entity Management (W16), May 22, 2010.
And we’re done.
Thank You!
Example - Arabic
• Arabic sentence (in Buckwalter encoding): “n$yr AlY h*h AlZAhrp l>nhA t$kl Aljw Al*y yEml fyh fryq Alr}ys jwrj bw$ wrAys wAlsfyr jwn bwltwn fy AlAmm AlmtHdp , wAl*y symyz Alkvyr mn AlEnASr Alty qd yqdmhA bED AlAwrwbyyn lrdE tmAdy AlwlAyAt AlmtHdp fy AlAstvmAr bqrAr AlEdwAn AlAsrA}yly ElY lbnAn.”
• NE labeled English sentence: “Refer to this phenomenon because it is the atmosphere with a team of President George W. Bush Rice and Ambassador John Bolton at the United Nations which will recognize a lot of elements that might make some Europeans to deter the persistence of the United States decision to invest in the Israeli aggression on Lebanon.”
• After GIZA++ and post-processing: “n$yr AlY h*h AlZAhrp l>nhA t$kl Aljw Al*y yEml fyh fryq Alr}ys jwrj bw$ wrAys wAlsfyr jwn bwltwn fy AlAmm AlmtHdp , wAl*y symyz Alkvyr mn AlEnASr Alty qd yqdmhA bED AlAwrwbyyn lrdE tmAdy AlwlAyAt AlmtHdp fy AlAstvmAr bqrAr AlEdwAn AlAsrA}yly ElY lbnAn.”