Third Joint EM+ / CNGL Workshop, Luxembourg 14 October 2011

User-focused task-oriented MT evaluation for wikis: a case study

Federico Gaspari, Antonio Toral, and Sudip Kumar Naskar

School of Computing

Dublin City University

Dublin 9, Ireland

{fgaspari, atoral, snaskar}@computing.dcu.ie


Outline

• Introduction: the CoSyne project

• Related work

• Evaluation

o framework, scenario, questionnaire

• Results and discussion

• Conclusions

• Future work

Introduction: CoSyne

• Aim: Synchronisation of multilingual wikis

• Consortium

o 7 partners from Germany, Italy, the Netherlands and Ireland

• 3 academic partners

o University of Amsterdam (UvA)

o Fondazione Bruno Kessler (FBK)

o Dublin City University (DCU)

• 1 research organization

o Heidelberg Institute for Theoretical Studies (HITS)

• 3 end-users

o Deutsche Welle (DW)

o Netherlands Institute for Sound and Vision (NISV)

o Vereniging Wikimedia Nederland (VWN)


Introduction: CoSyne

• Techniques used by the CoSyne system:

o MT

o Textual entailment

o Document structure modelling

o Overlap synchronisation

o Insertion point detection

• CoSyne MT system developed by UvA (Martzoukos and Monz, 2010)

• Language pairs covered in year 1: DE / IT / NL ↔ EN

• Focus of this user evaluation

o CoSyne MT software to translate wiki entries DE→EN and NL→EN


Related work

• MT quality evaluation

o fluency

o adequacy

• Automatic MT evaluation metrics, esp. for SMT (Toral et al., 2011)

o BLEU (Papineni et al., 2002), METEOR (Banerjee & Lavie, 2005), etc.

o no insight into the nature and severity of errors (e.g. for post-editing)

o weak correlation with human judgement (Lin & Och, 2004)

• Usefulness of MT output and users’ level of satisfaction

• Post-editing

o effort (e.g. Allen, 2003; O’Brien, 2007; Specia & Farzindar, 2010)

o gains vs. translating from scratch (e.g. O’Brien, 2005; Specia, 2011)
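A minimal sketch of why a single automatic score gives little insight into the nature and severity of errors. It implements only the clipped (modified) n-gram precision that BLEU (Papineni et al., 2002) builds on, not full BLEU, which also combines several n-gram orders and applies a brevity penalty. The function names and the example sentences are invented for illustration and do not come from the CoSyne evaluation.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n=1):
    """Clipped n-gram precision of a candidate against one reference.

    Each candidate n-gram is credited at most as many times as it
    occurs in the reference (the "clipping" step).
    """
    cand_counts = Counter(ngrams(candidate.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


# Hypothetical sentences, for illustration only.
reference = "she did not read the book"
dropped_negation = "she did read the book"    # serious meaning error
wrong_article = "she did not read a book"     # minor error

print(modified_precision(dropped_negation, reference))
print(modified_precision(wrong_article, reference))
```

In this toy case the candidate that drops the negation receives the higher unigram precision (1.0 against 5/6), although a post-editor would regard it as the more serious error.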


Evaluation framework

• User-focused task-oriented evaluation of MT in/for wikis

o in close collaboration with end-users (DW, NISV)

• Accompanied by diagnostic evaluation

o providing useful feedback to MT developers (UvA)

• Pilot study conducted just before month 18 of the 36-month project

o full-scale final evaluation planned at the very end of the project


Evaluation scenario

• Protocol for evaluation agreed between DCU and end-users

• DW and NISV staff involved: editors, translators, project managers

o German-English and Dutch-English as their working languages

o final users of the CoSyne system for wiki content synchronization

• Evaluation conducted on typical wiki entries for end-users

• Users asked to focus only on linguistic quality and level of usefulness of MT (disregarding other components of the CoSyne system)


Evaluation scenario

Deutsche Welle (DW): KalenderBlatt / Today in History

Netherlands Institute for Sound and Vision (NISV): wiki


Evaluation scenario

• Time-tracking system was implemented

• Post-editing changes performed by the participants were logged

• Before the evaluation

o participants given presentation and demo of the CoSyne system

o preliminary experimentation with the CoSyne system for 1-3 hours


Evaluation questionnaire

• Written questionnaire administered on paper

o available at http://www.computing.dcu.ie/~atoral/cosyne/quest.pdf

• Questions grouped into 6 parts focusing on different aspects

• Approximately 50 items using different formats

o Likert scale, multiple choice and open questions

• Part A: basic demographic information about the respondents

• Part B: previous use of MT

• Part C: users' evaluation of the CoSyne MT system

• Part D: post-editing work

• Part E: general comments and feedback

• (Part F: usability and interaction design of the overall CoSyne system)


Results: demographics

• 10 users: 6 from DW, 4 from NISV

• 6 men and 4 women across DW and NISV

• Variety of roles: editors, authors, translators and project managers

• Average age: 34 (youngest 20, oldest 46)

• Average work experience: just over 3 years (min. 3 months, max. 10 years)


Results: background

• All (4) NISV staff were native speakers of Dutch

• 5 DW users were native speakers of German, plus 1 native speaker of Romanian fluent in German

• 80% of the participants self-rated their knowledge of English as upper-intermediate, 20% defined it as intermediate or excellent

o None of the respondents considered themselves bilingual


Results: previous use of MT

• 80% had used MT before our experiment

o 7 for personal reasons, 6 for work (commonly for both purposes)

o all but one had used Google Translate, 1 had tried Babel Fish, 2 both

• Language combinations used

o 4 from EN into other languages

o 6 into EN from a range of source languages

o 5 language combinations not involving English

• 75% used MT for assimilation purposes vs. 25% for dissemination

• 62.5% had post-edited raw MT to obtain high-quality translations
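The percentages on this slide only yield whole numbers of people if 75%, 25% and 62.5% are taken over the 8 participants who had used MT before, rather than over all 10. A small sketch that converts the reported shares back into head counts; the function name is hypothetical, and only figures reported on the slides are used.

```python
from fractions import Fraction


def share_to_count(percent: str, base: int) -> int:
    """Convert a percentage (given as a string) of `base` people into a head count.

    Raises ValueError when the result is not a whole number of people,
    which signals that `base` is probably the wrong denominator.
    """
    count = Fraction(percent) / 100 * base
    if count.denominator != 1:
        raise ValueError(f"{percent}% of {base} is not a whole number of people")
    return int(count)


participants = 10
mt_users = share_to_count("80", participants)    # had used MT before
assimilation = share_to_count("75", mt_users)    # used MT for assimilation
dissemination = share_to_count("25", mt_users)   # used MT for dissemination
post_editors = share_to_count("62.5", mt_users)  # had post-edited raw MT

# 7 used MT for personal reasons and 6 for work, out of the same 8 people,
# so at least this many must have used it for both purposes.
both_at_least = max(0, 7 + 6 - mt_users)

print(mt_users, assimilation, dissemination, post_editors, both_at_least)
```

The lower bound of 5 users with both purposes is consistent with the slide's remark "commonly for both purposes".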


• Materials translated with MT by the 8 respondents

o for study purposes (academic papers and uni-related texts): 3

o business correspondence, personal or professional emails: 2

o contracts and technical documents: 2

o online articles: 2

o websites: 2 (“the translations of Dutch sites to English were hilarious!”, but not using the CoSyne MT system!)

o Wikipedia content: 1



• Overall the 8 respondents had a predominantly negative-to-neutral impression of MT quality before taking part in the evaluation of the CoSyne MT system, based on a 5-point Likert scale (average 2.8 / 5)

Quality of previously used MT systems on a 5-point scale (1 = very poor to 5 = very good)


Results: CoSyne MT system

Quality and usefulness of the CoSyne MT system on a 5-point scale (1 = very poor to 5 = very good)

• Average quality is medium (3 / 5), better than previous experience (2.8)

• Usefulness slightly higher than medium (3.3 / 5) (cf. 2.8)



Is CoSyne MT faster than translating wiki entries into English from scratch? On a 7-point scale (1 = strongly disagree to 7 = strongly agree)

• Average value higher than mid-point of the scale (4.6 / 7)

• In line with e.g. Plitt & Masselot (2010) and Flournoy & Rueppel (2010)

• From DE almost twice as good as from NL (due to style of wiki texts?)



MT quality broken down into accuracy, correctness, comprehensibility, readability and style, on a 7-point scale (1 = poor to 7 = excellent)

• We did not explain to users the subtle differences involved

• Only accuracy is approx. average (3.6 / 7), other criteria lower

• None of the average values particularly poor (DE always better than NL)
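The results mix 5-point and 7-point scales, so "medium" means 3 on one and 4 on the other. A short sketch that places each reported average relative to the midpoint of its own scale, assuming every scale runs from 1 to its number of points; the helper names are hypothetical and the figures are the averages reported on the slides.

```python
def scale_midpoint(points: int) -> float:
    """Midpoint of a rating scale running from 1 to `points`."""
    return (1 + points) / 2


def position(average: float, points: int) -> str:
    """Say whether an average lies above, at, or below the scale midpoint."""
    mid = scale_midpoint(points)
    if average > mid:
        return "above"
    if average < mid:
        return "below"
    return "at"


# (label, reported average, number of scale points)
reported = [
    ("quality of previously used MT", 2.8, 5),
    ("CoSyne MT quality", 3.0, 5),
    ("CoSyne MT usefulness", 3.3, 5),
    ("faster than translating from scratch", 4.6, 7),
    ("accuracy", 3.6, 7),
]

for label, average, points in reported:
    print(f"{label}: {average} / {points} is {position(average, points)} "
          f"the midpoint ({scale_midpoint(points)})")
```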


Results: post-editing CoSyne

Amount of work, in terms of time and effort, to post-edit the MT output, on a 7-point scale (1 = short/small to 7 = long/large)

Need to refer to source language while post-editing (frequency), on a 7-point scale (1 = never to 7 = always)


Results: post-editing CoSyne

Severity of errors over post-editing operations (insertion, deletion, substitution, reordering), on a 7-point scale (1 = irrelevant to 7 = very serious)

Frequency of errors over post-editing operations (insertion, deletion, substitution, reordering), on a 7-point scale (1 = absent to 7 = frequent)
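The questionnaire asked users to rate these operations subjectively. As a rough illustration of how the same categories could be counted from a pair of raw and post-edited sentences, here is a sketch using Python's standard `difflib`. It assumes whitespace tokenisation, the function name and example sentences are hypothetical, and it is not the logging mechanism used in the CoSyne evaluation.

```python
from collections import Counter
from difflib import SequenceMatcher


def count_edit_operations(mt_output: str, post_edited: str) -> Counter:
    """Count word-level insertions, deletions and substitutions
    needed to turn the raw MT output into the post-edited text."""
    mt_tokens = mt_output.split()
    pe_tokens = post_edited.split()
    matcher = SequenceMatcher(a=mt_tokens, b=pe_tokens, autojunk=False)
    counts = Counter()
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "insert":
            counts["insertion"] += j2 - j1
        elif tag == "delete":
            counts["deletion"] += i2 - i1
        elif tag == "replace":
            # A replaced span may differ in length on the two sides;
            # count the longer side as the number of substitutions.
            counts["substitution"] += max(i2 - i1, j2 - j1)
    return counts


# Hypothetical sentence pair, for illustration only.
raw = "he gave book to the teacher"
edited = "she gave the book to the teacher"
ops = count_edit_operations(raw, edited)
print(dict(ops))
```

Reordering is not detected by this approach: a moved word shows up as a deletion plus an insertion, so a shift-aware method would be needed for the fourth category.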


Results: final comments

• Positive aspects:

o good to have draft translation to work upon

o integration in the wiki environment

o potential to speed up the translation task

• Weaknesses:

o translation quality needs improving, due to: wrong translation of pronouns; verbs frequently dropped; incorrect word order; mistranslated compounds; limited lexical coverage (OOV items are an issue)

• Good potential of the CoSyne system based on first prototype


Conclusions

• User-focused task-oriented questionnaire-based evaluation for MT used in wikis, supported by post-editing

• Evaluation of the first Y1 prototype of the CoSyne MT system for DE→EN and NL→EN

• Quality of the CoSyne MT system perceived by the users as higher than that of previously used MT systems

• Post-editing effort is considered high, but users found it less time-consuming than translating from scratch

• Translations from German rated better than those from Dutch

o contrasts with earlier findings (Toral et al., 2011)

o further investigation into this discrepancy (meta-evaluation)


Future work

• Extend analysis looking into the post-editing logs, considering actual post-editing time (to estimate costs)

• Involve more users after pilot stage

• Include a control group (translating manually or using other MT software)

• Investigate correlation between the post-editing carried out by the users and the results provided by TER and TERp (ins, del…)

• Use our linguistically-aware diagnostic evaluation tool (DELiC4MT) to monitor performance of the MT system on specific issues flagged up by the users
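TER-style metrics count the edits needed to turn MT output into a reference, relative to the reference length. A rough sketch of a comparable word-level edit rate between raw MT output and its post-edited version follows. It assumes whitespace tokenisation and plain Levenshtein distance, so unlike TER it does not model shifts (reordering), and it is not the TER or TERp tool itself. Function names and the example sentence pair are hypothetical.

```python
def word_edit_distance(hyp_tokens, ref_tokens):
    """Levenshtein distance over words (insertions, deletions, substitutions)."""
    prev = list(range(len(ref_tokens) + 1))
    for i, hyp_word in enumerate(hyp_tokens, start=1):
        curr = [i]
        for j, ref_word in enumerate(ref_tokens, start=1):
            cost = 0 if hyp_word == ref_word else 1
            curr.append(min(prev[j] + 1,          # delete a word from the MT output
                            curr[j - 1] + 1,      # insert a word from the post-edit
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1]


def edit_rate(mt_output: str, post_edited: str) -> float:
    """Word edits divided by the length of the post-edited text."""
    hyp = mt_output.split()
    ref = post_edited.split()
    if not ref:
        raise ValueError("post-edited text is empty")
    return word_edit_distance(hyp, ref) / len(ref)


# Hypothetical sentence pair, for illustration only.
raw = "the translation are good"
edited = "the translation is very good"
print(word_edit_distance(raw.split(), edited.split()), edit_rate(raw, edited))
```

Per-sentence rates of this kind could then be set against the logged post-editing times or the users' effort ratings to look for the correlation mentioned above.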


Thank you for your attention! Questions?
