Third Joint EM+ / CNGL Workshop, Luxembourg 14 October 2011
User-focused task-oriented MT evaluation for wikis: a case study
Federico Gaspari, Antonio Toral, and Sudip Kumar Naskar
School of Computing
Dublin City University
Dublin 9, Ireland
{fgaspari, atoral, snaskar}@computing.dcu.ie
Outline
• Introduction: the CoSyne project
• Related work
• Evaluation
o framework, scenario, questionnaire
• Results and discussion
• Conclusions
• Future work
Introduction: CoSyne
• Aim: Synchronisation of multilingual wikis
• Consortium
o 7 partners from Germany, Italy, the Netherlands and Ireland
• 3 academic partners
o University of Amsterdam (UvA)
o Fondazione Bruno Kessler (FBK)
o Dublin City University (DCU)
• 1 research organization
o Heidelberg Institute for Theoretical Studies (HITS)
• 3 end-users
o Deutsche Welle (DW)
o Netherlands Institute for Sound and Vision (NISV)
o Vereniging Wikimedia Nederland (VWN)
• Techniques used by the CoSyne system:
o MT
o Textual entailment
o Document structure modelling
o Overlap synchronisation
o Insertion point detection
• CoSyne MT system developed by UvA (Martzoukos and Monz, 2010)
• Language pairs covered in year 1: DE / IT / NL ↔ EN
• Focus of this user evaluation
o CoSyne MT software to translate wiki entries DE→EN and NL→EN
Related work
• MT quality evaluation
o fluency
o adequacy
• Automatic MT evaluation metrics, esp. for SMT (Toral et al., 2011)
o BLEU (Papineni et al., 2002), METEOR (Banerjee & Lavie, 2005), etc.
o no insight into the nature and severity of errors (e.g. for post-editing)
o weak correlation with human judgement (Lin & Och, 2004)
• Usefulness of MT output and users’ level of satisfaction
• Post-editing
o effort (e.g. Allen, 2003; O’Brien, 2007; Specia & Farzindar, 2010)
o gains vs. translating from scratch (e.g. O’Brien, 2005; Specia, 2011)
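The n-gram overlap idea behind metrics such as BLEU can be sketched as follows. This is only an illustration of clipped (modified) n-gram precision against a single reference, assuming whitespace tokenisation; full BLEU also combines several n-gram orders and applies a brevity penalty, both omitted here. The function names are hypothetical and not taken from any toolkit.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple of tokens) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n=1):
    """Clipped n-gram precision of a candidate against one reference."""
    cand = ngram_counts(candidate.split(), n)
    ref = ngram_counts(reference.split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs
    # in the reference, so repeating a correct word does not help.
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


if __name__ == "__main__":
    reference = "the cat is on the mat"
    print(modified_precision("the the the the the the the", reference))
    print(modified_precision("the cat is on the mat", reference, n=2))
```

The result is a single number per sentence: it does not say which words were wrong or how serious an error is, which is the limitation noted above for post-editing.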
Evaluation framework
• User-focused task-oriented evaluation of MT in/for wikis
o in close collaboration with end-users (DW, NISV)
• Accompanied by diagnostic evaluation
o providing useful feedback to MT developers (UvA)
• Pilot study conducted just before month 18 of 36-month project
o full-scale final evaluation planned at the very end of the project
Evaluation scenario
• Protocol for evaluation agreed between DCU and end-users
• DW and NISV staff involved: editors, translators, project managers
o German-English and Dutch-English as their working languages
o final users of the CoSyne system for wiki content synchronization
• Evaluation conducted on typical wiki entries for end-users
• Users asked to focus only on linguistic quality and level of usefulness of MT (disregarding other components of the CoSyne system)
Figure: Deutsche Welle (DW): KalenderBlatt / Today in History
Figure: Netherlands Institute for Sound and Vision (NISV): wiki
• Time-tracking system was implemented
• Post-editing changes performed by the participants were logged
• Before the evaluation
o participants given presentation and demo of the CoSyne system
o preliminary experimentation with the CoSyne system for 1-3 hours
Evaluation questionnaire
• Written questionnaire administered on paper
o available at http://www.computing.dcu.ie/~atoral/cosyne/quest.pdf
• Questions grouped into 6 parts focusing on different aspects
• Approximately 50 items using different formats
o Likert scale, multiple choice and open questions
• Part A: basic demographic information about the respondents
• Part B: previous use of MT
• Part C: users' evaluation of the CoSyne MT system
• Part D: post-editing work
• Part E: general comments and feedback
• (Part F: usability and interaction design of the overall CoSyne system)
Results: demographics
• 10 users: 6 from DW, 4 from NISV
• 6 men and 4 women across DW and NISV
• Variety of roles: editors, authors, translators and project managers
• Average age: 34 (youngest 20, oldest 46)
• Average work experience: just over 3 years (min. 3 months, max. 10 years)
Results: background
• All (4) NISV staff were native speakers of Dutch
• 5 DW users were native speakers of German, plus 1 native speaker of Romanian fluent in German
• 80% of the participants self-rated their knowledge of English as upper-intermediate, 20% defined it as intermediate or excellent
o None of the respondents considered themselves bilingual
Results: previous use of MT
• 80% had used MT before our experiment
o 7 for personal reasons, 6 for work (commonly for both purposes)
o all but one had used Google Translate, 1 had tried Babel Fish, 2 both
• Language combinations used
o 4 from EN into other languages
o 6 into EN from a range of source languages
o 5 language combinations not involving English
• 75% used MT for assimilation purposes vs. 25% for dissemination
• 62.5% had post-edited raw MT to obtain high-quality translations
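The percentages reported for previous use of MT can be converted back into head counts. A minimal sketch, assuming that the 80% refers to all 10 participants and that the remaining shares refer to the 8 participants with prior MT experience (the slides suggest this base but do not state it explicitly); the names below are illustrative only.

```python
from fractions import Fraction


def head_count(percentage, base):
    """Convert a reported percentage into a whole number of respondents."""
    value = Fraction(str(percentage)) / 100 * base
    # A share that does not map to a whole person suggests the wrong base.
    if value.denominator != 1:
        raise ValueError(f"{percentage}% of {base} is not a whole number")
    return int(value)


PARTICIPANTS = 10  # all users in the pilot study
prior_mt_users = head_count(80, PARTICIPANTS)

counts = {
    "prior MT users": prior_mt_users,
    "assimilation": head_count(75, prior_mt_users),
    "dissemination": head_count(25, prior_mt_users),
    "post-edited raw MT": head_count(62.5, prior_mt_users),
}

if __name__ == "__main__":
    for label, n in counts.items():
        print(f"{label}: {n}")
```

Under these assumptions every share maps to a whole number of respondents, which is consistent with the figures being computed over the 8 prior MT users rather than over all 10 participants.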
• Materials translated with MT by the 8 respondents
o for study purposes (academic papers and uni-related texts): 3
o business correspondence, personal or professional emails: 2
o contracts and technical documents: 2
o online articles: 2
o websites: 2 (“the translations of Dutch sites to English were hilarious!”, but not using the CoSyne MT system!)
o Wikipedia content: 1
• Overall the 8 respondents had a predominantly negative-to-neutral impression of MT quality before taking part in the evaluation of the CoSyne MT system, based on a 5-point Likert scale (average 2.8 / 5)
Figure: quality of previously used MT systems on a 5-point scale (1 = very poor to 5 = very good)
Results: CoSyne MT system
Figure: quality and usefulness of the CoSyne MT system on a 5-point scale (1 = very poor to 5 = very good)
• Average quality is medium (3 / 5), better than previous experience (2.8)
• Usefulness slightly higher than medium (3.3 / 5; cf. 2.8)
Figure: "Is CoSyne MT faster than translating wiki entries into English from scratch?" on a 7-point scale (1 = strongly disagree to 7 = strongly agree)
• Average value higher than mid-point of the scale (4.6 / 7)
• In line with e.g. Plitt & Masselot (2010) and Flournoy & Rueppel (2010)
• From DE almost twice as good as from NL (due to style of wiki texts?)
Figure: MT quality broken down into accuracy, correctness, comprehensibility, readability and style on a 7-point scale (1 = poor to 7 = excellent)
• We did not explain to users the subtle differences involved
• Only accuracy is approx. average (3.6 / 7), other criteria lower
• None of the average values particularly poor (DE always better than NL)
Results: post-editing CoSyne
Figure: amount of work, in terms of time and effort, to post-edit the MT output, on a 7-point scale (1 = short/small to 7 = long/large)
Figure: need to refer to the source language while post-editing (frequency), on a 7-point scale (1 = never to 7 = always)
Figure: severity of errors over post-editing operations (insertion, deletion, substitution, reordering) on a 7-point scale (1 = irrelevant to 7 = very serious)
Figure: frequency of errors over post-editing operations (insertion, deletion, substitution, reordering) on a 7-point scale (1 = absent to 7 = frequent)
Results: final comments
• Positive aspects:
o good to have draft translation to work upon
o integration in the wiki environment
o potential to speed up the translation task
• Weaknesses:
o translation quality needs improving, due to: wrong translation of pronouns; verbs frequently dropped; incorrect word order; mistranslated compounds; limited lexical coverage (OOV items are an issue)
• Good potential of the CoSyne system based on first prototype
Conclusions
• User-focused task-oriented questionnaire-based evaluation for MT used in wikis, supported by post-editing
• Evaluation of the first Y1 prototype of the CoSyne MT system for DE→EN and NL→EN
• Quality of the CoSyne MT system perceived by the users higher than that of previously used MT systems
• Post-editing effort is considered high, but users found it less time-consuming than translating from scratch
• Translations from German rated better than those from Dutch
o contrasts with earlier findings (Toral et al., 2011)
o further investigation into this discrepancy (meta-evaluation)
Future work
• Extend analysis looking into the post-editing logs, considering actual post-editing time (to estimate costs)
• Involve more users after pilot stage
• Include a control group (translating manually or using other MT software)
• Investigate correlation between the post-editing carried out by the users and the results provided by TER and TERp (ins, del…)
• Use our linguistically-aware diagnostic evaluation tool (DELiC4MT) to monitor performance of the MT system on specific issues flagged up by the users
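The planned comparison with TER and TERp relies on counting edit operations between the raw MT output and its post-edited version. A minimal sketch, assuming whitespace tokenisation and a plain word-level edit distance: it counts insertions, deletions and substitutions only, so unlike real TER it has no shift operation and a reordering shows up as one deletion plus one insertion. The function name and the example sentences are hypothetical, not taken from the post-editing logs.

```python
def edit_operations(mt_output, post_edited):
    """Count word-level ins/del/sub needed to turn MT output into the post-edit."""
    hyp, ref = mt_output.split(), post_edited.split()
    rows, cols = len(hyp) + 1, len(ref) + 1

    # dist[i][j] = minimum edits to turn hyp[:i] into ref[:j]
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # delete a word from the MT output
                dist[i][j - 1] + 1,         # insert a word from the post-edit
                dist[i - 1][j - 1] + cost,  # keep or substitute
            )

    # Walk back through the table to recover one optimal sequence of edits.
    counts = {"ins": 0, "del": 0, "sub": 0}
    i, j = len(hyp), len(ref)
    while i > 0 or j > 0:
        same = i > 0 and j > 0 and hyp[i - 1] == ref[j - 1]
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (0 if same else 1):
            if not same:
                counts["sub"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            counts["del"] += 1
            i -= 1
        else:
            counts["ins"] += 1
            j -= 1
    return counts


if __name__ == "__main__":
    print(edit_operations("he the house bought", "he bought the house"))
```

Dividing the total number of edits by the length of the post-edited text would give a rate comparable in spirit to TER, again without shifts.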
Thank you for your attention! Questions?