How translators work in real life: SCATE observations

Frieda Steurs

Iulianna van der Lek-Ciudin

Tom Vanallemeersch

What & Why

Improve translation efficiency and consistency

Underexploited translation resources

Poor integration of speech recognition

Overloaded interfaces

March 2014 - February 2018

Consortium

Centre for Computational Linguistics, University of Leuven

Industrial Advisory Committee

Today’s focus

Methods: survey, contextual inquiries

Methods

Survey: Dec 2014 – Feb 2015

46 questions

187 complete responses (75% from EU)

73% freelance translators

25% in-house translators

Few terminologists, interpreters, project managers, post-editors

Contextual Inquiries: Nov 2014 - June 2015

16 professionals at their workplaces (BE, NL, LU)

Semi-structured interviews, observations, think-aloud, post-interviews

Whom did we observe?

Organization type: Small translation agency, Medium-size translation/interpreting agency, Public institution, Freelance

Language pairs: EN-NL/NL-EN, FR-NL, EN-FR, EN-RO

Translation experience: 2-5 years vs. 5+ years

Domains of expertise: Legal, ICT, Medical, Marketing

Main TEnT: Trados Studio 2014, Trados Studio 2011, Trados Workbench, Déjà Vu X3, memoQ 2014, Wordbee

Experience with TEnT: <1 year (2) vs. 5+ years

Main findings and implications

Needs and shortcomings of tools

Observation of terminological strategies

Translators’ Linguistic Resources

Resource: Translation Memories
State of the art:
• Heavily used
• Concordance, term look-up features, term extraction
• Term extraction rarely used
• Alignment
• No support for comparable corpora (possible to upload monolingual documents for reference)
Opportunities:
• Syntactic concordance
• Bilingual/multilingual term extraction
• More focus on monolingual corpora
• Features to compile and query comparable corpora

Resource: Online Translation Memories
State of the art:
• Perform look-up during translation
• Automatic insertion
• Concordance searches
• Moderate-low quality control
Opportunities:
• More advanced filtering techniques
• QA tools



Resource: Local term bases
State of the art:
• Usage is still low (SCATE survey: 52%)
• Automatic term recognition
• Basic categories
• TBX not adopted by all tool developers
• Users prefer to exchange data in CSV, Excel
Opportunities:
• Improve usability
• More flexibility and customization to suit users with different needs
• A unified interface for online/local term bases
• Support for ontologies

Resource: Online term banks
State of the art:
• Perform look-up (exact/fuzzy) during translation
Opportunities:
• Advanced pre-filtering techniques, better look-up interfaces

Resource: Online dictionaries, search engines
State of the art:
• Consulted either online or via a WebSearch feature in CAT
Opportunities:
• Concordance-like searches directly from the translation editor



Resource: Machine Translation
State of the art:
• Usage is still low (SCATE survey: 27%)
• Consulted online
• Via API in CAT
• Segment assembly (Déjà Vu, memoQ)
• Autocompletion suggestions (SDL Trados)
• Adaptive MT (MateCat, Lilt)
Opportunities:
• Improve confidence estimation
• Interfaces for post-editing
• Train own MT engine with own TMs, TBs

Term collection

• Manually (88%)

• Semi-automatically via term extraction programs (22%)

Term storage

• CAT TB (52%)

• MS Excel (43%)

• MS Word (27%)

Most frequent form/canonical form

The language equivalents (56%)

Term research

• Online resources (94%)

• Personal resources (85%)

• Client’s resources (64%)

SCATE Users’ survey 2014-2015

187 survey participants

139 perform terminology activities

Search engines

Google

Bing

Online dictionaries

Oxford

Proz.com

Van Dale

TermWiki Search

TermCoord glossary links

Term banks

IATE

TERMIUM Plus

EuroTermBank

FAOTERM

WTOTERM

Monolingual Corpora

EUR-Lex

Global Web-based English (GloWbE)

British National Corpus (BNC)

Corpus of Contemporary American English (COCA)

Parallel corpora

Linguee

Europarl

Glosbe

TAUS Search


Most used online terminology resources

Reasons for NOT managing terminology

No knowledge about terminology management theory and principles

It is the responsibility of somebody else

It has no added value

It is a time-consuming task

Term bases are complex

Reliance on the translation memories


Systematic terminology management (in-house translation departments of large organizations)
• Collect terms and concepts from global field
• Construct a concept system
• Create well-structured definitions
• Create term entries

Ad-hoc terminology management (medium & small LSPs, freelancers)
• Identify terms in isolated contexts
• Create initial term entries
• Add definition, context ….

Adapted from Handbook of Terminology Management Vol 1.

Terminology strategies

Institution: in-house translation departments

Translators / terminologists

In-house terminology coordination

Systematic and ad-hoc terminology management

Term extraction – not a standard practice!


Terminology tools:
• IATE database
• Eur-Lex
• Quest Metasearch (Bilingual)
• Euramis Concordance
• DGT Vista
• Electronic dictionaries, glossaries
• Term extraction tools: SynchroTerm, SDL MultiTerm Extract, TermTreffer
• External corpus query tools, e.g. TextStat

Translation tools:
• SDL Trados Studio
• In-house MT
• Voice recognition


Adapted after TermCoord documentation


Proactive terminology management

Preparation of “TermFolders” for important legislative procedures:
• Desktop research
• Manual collection of web links and relevant documents
• Manual identification and extraction of term candidates

….


Time-consuming; no GlobalSearch

DIY corpora tools? SCATE?


Small and medium-size LSPs, freelancers

Mainly ad-hoc, basic terminology management due to:
o Time pressure
o Lack of financial compensation
o Over-reliance on translation memories
o A general lack of knowledge and awareness of the benefits of terminology management
o Unfamiliarity with corpus compilation and query tools

Ad-hoc terminology strategies during translation

Identify problem
• LGP (language for general purposes), terminology, phraseology, names of entities, typography/punctuation…
• Highlight or copy/paste SL term

Search for a solution
• Local resources: Concordance, Term Look-up, Find & Replace, Global search
• Online resources via WebSearch or other integrated widgets
• MT via plugins, if available & allowed
• Online resources: Google -> Top hits (Bookmark link?)
• Contact client via e-mail or an online query spreadsheet
• Contact subject matter experts

Insert translation
• One click
• Copy/paste

Save terms
• Term base / Excel

Implications

For translators, project managers, terminologists, interpreters, translators’ educators:
• Basic knowledge of terminology theory and practice
• Terminology management tools
• Preparation of glossaries before the start of the project with the help of:
  o Corpus compilation and query tools (BootCaT, AntConc, Sketch Engine)
  o Term extraction tools (SynchroTerm, Similis)
• More focus on comparable corpora

Implications

For software developers:
• Focus more on usability and personalization
• Unified interfaces between local and online resources
• More sophisticated search functionalities
• Integrate online resources that are actually used by the users
• More focus on comparable corpora

SCATE approach

SCATE research

Improvement of bilingual and multilingual term extraction techniques from comparable corpora

Integration of a syntactic concordancer for parallel corpora: e.g. Poly-Gretel

Multilingual term extraction from comparable corpora

A gold standard for Automatic Terminology Extraction

Compilation – Annotation – Evaluation

# words    Heart failure    Wind energy    Corruption    Corruption (parallel)
English    48,843           324,842        454,904       179,229
French     55,383           358,853        547,072       230,874
Dutch      50,850           315,605        476,179       223,495

Annotation: 4 labels (Term, Common Term, Out of Domain Term and Named Entity) with elaborate and practical guidelines

Evaluation: inter-annotator agreement between 3 annotators after 2 iterations (av. F-score = 0.895; av. Cohen’s kappa = 0.927)

Future work: linking the annotations in the comparable medical corpus across all 3 languages
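
The slide reports agreement as an average F-score and an average Cohen's kappa but does not say how they were computed. The sketch below shows one common way to compute both for a single pair of annotators; averaging over all pairs would give figures of the kind quoted above. The data layout (one label per token, and a set of (term, label) pairs per annotator), the label "O" for unannotated tokens and the toy examples are all assumptions made for illustration, not SCATE data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[lab] * freq_b[lab] for lab in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (p_o - p_e) / (1 - p_e)


def pairwise_f_score(annotations_a, annotations_b):
    """F-score of annotator B's annotations measured against annotator A's."""
    set_a, set_b = set(annotations_a), set(annotations_b)
    overlap = len(set_a & set_b)
    if overlap == 0:
        return 0.0
    precision = overlap / len(set_b)
    recall = overlap / len(set_a)
    return 2 * precision * recall / (precision + recall)


# Toy token-level labels; "O" (hypothetical) marks an unannotated token.
annotator_1 = ["Term", "Term", "O", "O"]
annotator_2 = ["Term", "O", "O", "O"]
kappa = cohen_kappa(annotator_1, annotator_2)

# Toy (term, label) annotations for the same text.
terms_1 = {("heart failure", "Term"), ("patient", "Common Term"),
           ("ejection fraction", "Term")}
terms_2 = {("heart failure", "Term"), ("ejection fraction", "Term"),
           ("hospital", "Common Term")}
f_score = pairwise_f_score(terms_1, terms_2)
```

Kappa corrects the raw agreement for the agreement expected by chance, which matters when one label (here "O") dominates. The F-score only looks at the annotated items themselves.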


Bilingual lexicon induction from comparable corpora

Techniques for extracting word representations:

o multilingual topic models

o multilingual word embedding models

o character-level representations

Comparable corpora -> Cross-lingual semantic word representations -> Bilingual lexicon

Best results
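
The last step of the pipeline above, going from cross-lingual word representations to a bilingual lexicon, can be sketched as a nearest-neighbour search. This is a minimal illustration and not the SCATE implementation. It assumes source and target words already have vectors in one shared space, which is what a multilingual embedding or topic model would provide. The function names and the three-dimensional toy vectors are invented for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def induce_lexicon(source_vectors, target_vectors, top_k=1):
    """Map each source word to its top_k most similar target words."""
    lexicon = {}
    for src_word, src_vec in source_vectors.items():
        ranked = sorted(
            target_vectors,
            key=lambda tgt: cosine(src_vec, target_vectors[tgt]),
            reverse=True,
        )
        lexicon[src_word] = ranked[:top_k]
    return lexicon


# Hand-made toy vectors in a shared space (not real embeddings).
english = {"wind": [0.9, 0.1, 0.0], "heart": [0.0, 0.2, 0.9]}
dutch = {
    "wind": [0.8, 0.2, 0.1],
    "hart": [0.1, 0.1, 0.95],
    "energie": [0.5, 0.8, 0.0],
}
lexicon = induce_lexicon(english, dutch)
```

With real data the target vocabulary is large, so the exhaustive comparison here would normally be replaced by a vectorised or approximate nearest-neighbour search.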

Poly-Gretel

Bilingual syntactic concordancer

Query parallel corpora

Available online at: http://gretel.ccl.kuleuven.be/poly-gretel/ebs/input.php?1477144000

Target audience:

Computer-assisted language learning (CALL)

Translators

Translation studies and comparative linguistics

Example query:

EN noun + report ↔ NL verslag + prep + noun

EN-NL constituents are automatically aligned

Example query:

EN noun + report ↔ NL noun

Many compounds are possible
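
To give a feel for what such a query matches, the sketch below runs the first example pattern over one hand-tagged sentence pair. It is a deliberate simplification. Poly-Gretel queries syntactically analysed, aligned parallel corpora, whereas this code only matches flat sequences of part-of-speech tags and lemmas and performs no alignment. The token format, tag names, function names and the sentence pair are assumptions made for the example.

```python
def element_matches(token, element):
    """Check one token against one pattern element, e.g. ("pos", "NOUN")."""
    key, value = element
    return token[key] == value


def find_pattern(tokens, pattern):
    """Return the word forms of every contiguous token span matching pattern."""
    matches = []
    width = len(pattern)
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        if all(element_matches(tok, elem) for tok, elem in zip(window, pattern)):
            matches.append([tok["form"] for tok in window])
    return matches


def tok(form, lemma, pos):
    """Build a token record (hypothetical format)."""
    return {"form": form, "lemma": lemma, "pos": pos}


# One hand-tagged sentence pair (toy data).
en_sentence = [tok("the", "the", "DET"), tok("activity", "activity", "NOUN"),
               tok("report", "report", "NOUN")]
nl_sentence = [tok("het", "het", "DET"), tok("verslag", "verslag", "NOUN"),
               tok("van", "van", "ADP"), tok("activiteiten", "activiteit", "NOUN")]

# EN noun + report  <->  NL verslag + prep + noun
en_pattern = [("pos", "NOUN"), ("lemma", "report")]
nl_pattern = [("lemma", "verslag"), ("pos", "ADP"), ("pos", "NOUN")]

en_hits = find_pattern(en_sentence, en_pattern)
nl_hits = find_pattern(nl_sentence, nl_pattern)
```

A flat pattern like this fails as soon as a determiner or adjective intervenes, which is one reason a concordancer that queries syntactic structure is more robust for this kind of search.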

More about SCATE https://www.arts.kuleuven.be/ling/ccl/projects/scate
