Ch. Boitet, GETA, CLIPS ACIDCA ’2000, Monastir, 22-24/3/2000 1
ACIDCA ’2000, Monastir, 21-24/3/2000
Christian Boitet
GETA, CLIPS, IMAG, Grenoble
Handling texts and corpuses in Ariane-G5,
a complete environment for multilingual MT
Outline
• Introduction
• Multilingual MT-R (for revisors): linguistic methodology & basic software
  • Goals and linguistic methodology
  • Ariane-G5, an MT shell for building multilingual MT-R systems
  • What has been and is done with Ariane-G5: MT-R, MT-A (for authors), MT of speech
• Representation of input documents
• Structuration of corpuses
• Functionalities during processing
MULTILINGUAL MT-R: GOALS AND LINGUISTIC METHODOLOGY
Produce RAW translation GOOD ENOUGH to be revised
Specialize to SUBLANGUAGES and use MULTILEVEL TRANSFER (semantic + traces)
HEURISTIC PROGRAMMING
MULTILINGUAL MT-R: BASIC DIAGRAM
[Figure: transfer diagram. Texts: Source Language Text, Target Language 1 Text, Target Language 2 Text. Intermediate structures: umc-structure, uma-structure, gma-structure. Phases: Morphological Analysis, Structural Analysis, Abstraction, paraphrase choice, Structural Generation, Syntactic Generation, Morphological Generation.]
Ariane-G5 (1978-99) : structure
Ariane-G5: 2 specialized DB
DB of lingware components
• Declaration of variables (= typed attributes), templates…
• Dictionaries
• Grammars (rules = transitions of abstract automata)
DB of texts
• Corpuses
• Source texts
• Intermediate results
• Translations (± revisions)
relative to “variants” =>
What has been and is done with Ariane-G5:
MT-R (for revisors)
• Large, operational systems: RU—>FR, FR—>EN
• Prototypes: EN—>MY, TH, FR
• Lots of mockups
MT-A (for authors)
• LIDIA mockups: FR—>DE, EN, RU (adding CH)
MT of speech (for task-oriented dialogues)
• CSTAR demo system (EN, DE, KR, IT, FR, JP)
MT-R examples of translation (1)
French-English in aeronautics (before human revision)
MT-R examples of translation (2)
MT-A example of a disambiguation dialogue
Le capitaine a rapporté des tasses et des assiettes bleues
—> The captain has brought back blue bowls and plates / bowls and blue plates
Question 1
OO des tasses bleues et des assiettes bleues
O des assiettes bleues et des tasses
Question 2
OO capitaine de marine
O capitaine d’aviation
O capitaine d’artillerie
O capitaine d’infanterie
O capitaine de cavalerie
O …
Interaction in source for the “quality MT for all”
Example scenario: multilingual e-mail (UNL)
[Figure: scenario diagram with numbered steps 1 to 10. Components: e-mail tool (nicknames + language preferences), e-mail server, enconversion server, analysis server, interactive disambiguation server, deconversion servers (decoding servers), addressees’ e-mail servers.]
Other future possibility: production of multilingual “self-explaining documents”
[Figure: diagram, labels translated from French. Texts: source-language text, target-language text 1, target-language text 2. Structures: MMC, UMC, UMA, GMA. Operations: interactive disambiguation, simulated "mute" disambiguation (DMS), ambiguity marks and dialogue (m.a.&d.), paraphrase choice, back-translation (back-translation 1). Actor: user.]
Speech Translation: advantages of an Interchange Format
• N target languages for the cost of one analysis
• Translating into one’s language from N source languages with one generation
• Using the same generation to “backgenerate”
[Figure labels: Analysis into IF, IF, Backgeneration]
Interchange Format: example
la semaine du 12 nous avons des chambres simples et doubles disponibles
(the week of the 12th, we have single and double rooms available)
give-information+availability+room (room-type=(single ; double), time=(week, md12))
• Dialogue act: give-information
• Concepts: availability, room
• Arguments: room-type=(single ; double), time=(week, md12)
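As a rough illustration of the three parts of an IF expression (dialogue act, concepts, arguments), the Python sketch below splits the example expression. The function name `parse_if` and the grammar it assumes (parts joined by `+`, arguments written `name=(values)`, values separated by `;` or `,`) are inferred from this single example and are not taken from the CSTAR specification.

```python
import re

def parse_if(expr):
    """Split an IF expression into (dialogue act, concepts, arguments)."""
    # Everything before the first "(" is the "+"-joined head.
    head, _, tail = expr.partition("(")
    parts = [p.strip() for p in head.split("+") if p.strip()]
    dialogue_act, concepts = parts[0], parts[1:]
    arguments = {}
    # Assumption: each argument looks like name=(v1 ; v2) or name=(v1, v2).
    for name, values in re.findall(r"([\w-]+)=\(([^()]*)\)", tail):
        arguments[name] = [v.strip() for v in re.split(r"[;,]", values)]
    return dialogue_act, concepts, arguments

act, concepts, args = parse_if(
    "give-information+availability+room "
    "(room-type=(single ; double), time=(week, md12))"
)
```

Here `act` holds the dialogue act, `concepts` the list of concepts, and `args` a dictionary from argument names to their values.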
Interface of CLIPS++ CSTAR-II demonstrator
[Screenshot panes: Recognition, IF, Backgeneration (to check the “understanding”), Generation]
Hardware architecture of the CLIPS++ CSTAR-II demonstrator
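[Figure: hardware architecture diagram. Sites: Montpellier, Grenoble. Links: RNIS (ISDN), Ethernet. Component labels: Reco, FIF, Contrôle (control), IFF, Synthèse (synthesis), VC, IU.]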
Steps in translating a text
• Build its hierarchical structure
  Chapters, sections, paragraphs, [sentences]
• Segment into translation units
  According to current length parameter [min..max]
• Translate each segment
  Adding segment results to text results for desired phases
• Revise (manually) the whole translation, keep the revisions
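The segmentation step can be sketched as a greedy grouping of sentences into translation units. This is only an illustration: the function name `segment_units`, the choice of characters as the unit of length, and the greedy strategy are assumptions, not a description of how Ariane-G5 applies its [min..max] parameter.

```python
def segment_units(sentences, min_len, max_len):
    """Group sentences into units whose length aims at [min_len..max_len]."""
    units, current = [], ""
    for sentence in sentences:
        candidate = (current + " " + sentence).strip()
        if current and len(candidate) > max_len:
            # Adding this sentence would overflow: close the current unit.
            units.append(current)
            current = sentence
        else:
            current = candidate
        if len(current) >= min_len:
            # The unit is long enough: emit it and start a new one.
            units.append(current)
            current = ""
    if current:
        units.append(current)  # trailing unit, possibly shorter than min_len
    return units

units = segment_units(["Aa.", "Bb.", "A long sentence here.", "C."], 6, 25)
```

In this sketch a single sentence longer than `max_len` is kept whole as one unit, since cutting inside a sentence would need linguistic knowledge.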
Representations of input documents
3 main questions:
• how to represent the writing system,
• separate formatting tags from the text or not,
• how to handle non-textual elements (figures, icons, or formulas) contained in utterances
Topics:
• Transliterations of textual elements
• Keeping formatting tags in the texts
• Non-textual elements
Transliterations of textual elements
• Facilitate string-matching operations
• Diminish the size of dictionaries
  lisp Lisp LISP —> LISP *LISP **LISP
• Represent diacritics
  François va à ACIDCA’2000 —> *FRANC!4OIS VA A!2 **ACIDCA'2000
• Make some processing easier for some tools
  kataba —> ktb$aaa, katub —> ktb$au- or ktb$-ua
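The case and diacritic conventions shown in these examples can be imitated with a short Python sketch. Only two diacritic codes are attested on the slide (`!4` for the cedilla, `!2` for the grave accent), so the table below is deliberately partial and the sketch raises an error on any other diacritic rather than guessing a code. The function names and the rule for single-letter words are assumptions.

```python
import unicodedata

# Only these two codes appear in the examples; a real table would be larger.
DIACRITIC_CODES = {
    "\u0327": "!4",  # combining cedilla: ç -> C!4
    "\u0300": "!2",  # combining grave accent: à -> A!2
}

def transliterate_word(word):
    """Upper-case a word, marking capitalization with * or ** prefixes."""
    letters = [c for c in word if c.isalpha()]
    if len(letters) > 1 and all(c.isupper() for c in letters):
        prefix = "**"  # word written entirely in capitals
    elif letters and letters[0].isupper():
        prefix = "*"   # word with an initial capital
    else:
        prefix = ""
    out = []
    # NFD separates base letters from their combining diacritics.
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch):
            if ch not in DIACRITIC_CODES:
                raise ValueError("no code known for " + unicodedata.name(ch))
            out.append(DIACRITIC_CODES[ch])
        else:
            out.append(ch.upper())
    return prefix + "".join(out)

def transliterate(text):
    return " ".join(transliterate_word(w) for w in text.split())
```

Because the three spellings of a word collapse to one upper-case form plus a prefix, a dictionary only needs a single entry for it, which is the size reduction mentioned above.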
Transliterations of textual elements (2)
Represent writing systems using non-Roman characters
"мать" (mother) —> "MATQ" and not "MAT6"
а я е э и ы о ё у ю й
A YA E YE I YI O E!1 U YU J
з ж к ч с ш т щ ь ъ
Z ZH K KH S SH T TH Q W
今日は京都へ行きます。 (Today theme Kyoto dest go.) —>
KYOU <kj k1=kon k2=nichi> WA <hg ha> KYOUTO <kj k1=higashi k2=toukyo-no-tou> E <hg he> IKI <kj k1=iku> MASU.
Keeping formatting tags in the texts
If the translation units get larger, almost all tags become “inside tags”
Tags often have a linguistic role
For example, a sentence may contain
• a bullet list
• or a numbered list
which are normally linguistically homogeneous.
<P>For example, a sentence may contain</P>
<UL>
 <LI>a bullet list
 <LI>or a numbered list
</UL>
<P>which are normally linguistically homogeneous. </P>
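One way to keep formatting tags inside the text of a translation unit is to treat each tag as a single token in the same sequence as the words. The Python sketch below illustrates that idea only; the function name `tokenize_with_tags` and the token labels are hypothetical, and the tag pattern is a simplification that ignores comments and attributes containing `>`.

```python
import re

# A tag is anything between "<" and ">" (simplified on purpose).
TAG = re.compile(r"(<[^<>]+>)")

def tokenize_with_tags(text):
    """Return (kind, value) tokens, keeping each tag as one token."""
    tokens = []
    # With a capturing group, re.split keeps the tags in the result.
    for piece in TAG.split(text):
        if TAG.fullmatch(piece):
            tokens.append(("tag", piece))
        else:
            tokens.extend(("word", w) for w in piece.split())
    return tokens

tokens = tokenize_with_tags(
    "<P>a sentence may contain</P><UL> <LI>a bullet list</UL>"
)
```

Since the tags stay in sequence with the words, a later phase can still see that the list items belong to the sentence that introduces them.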
Non-textual elements
Formulas, figures, icons, brand names, anchors, links… are often best replaced by tags or special occurrences
The situation may be recursive (text inside figures)
*IF x2+5y>3 , x+y IS CONVENIENT .
*IF <relation 1> , <entity 2> IS CONVENIENT .
*IF $$R-1 , $$E-2 IS CONVENIENT .
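The replacement of non-textual elements by special occurrences, and their restoration after translation, can be sketched as follows. The functions `mask` and `unmask` are hypothetical, and the sketch assumes the non-textual fragments have already been located and typed (`R` for a relation, `E` for an entity, following the example); detecting them automatically is a separate problem.

```python
import re

def mask(text, elements):
    """Replace each (fragment, kind) by a numbered placeholder like $$R-1."""
    mapping = {}
    for index, (fragment, kind) in enumerate(elements, start=1):
        if fragment not in text:
            raise ValueError("fragment not found: " + fragment)
        placeholder = f"$${kind}-{index}"
        text = text.replace(fragment, placeholder, 1)  # first occurrence only
        mapping[placeholder] = fragment
    return text, mapping

def unmask(text, mapping):
    """Put the original fragments back in place of the placeholders."""
    return re.sub(r"\$\$[A-Z]-\d+", lambda m: mapping[m.group(0)], text)

source = "*IF x2+5y>3 , x+y IS CONVENIENT ."
masked, mapping = mask(source, [("x2+5y>3", "R"), ("x+y", "E")])
```

The translation engine then only sees `masked`, and `unmask` is applied to its output so that the formulas reappear untouched in the target text.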
Structuration of corpuses
• Motivations for corpuses
• Segmentation and structuration
• Representation of texts, intermediate results, translations and revisions
Motivations for corpuses
Corpus = collection of texts sharing some factual characteristics:
• natural language
• transliteration and method for handling formatting information and non-textual elements
• segmentation method
• structuration method
some management information:
• source (journal/volume, book/chapter…)
• usage destination (send back, postedit, tests…)
Segmentation and structuration
• "segmentation" = input texts —> words, sentences… (best done by the morphological analyzer) & units of translation
• "structuration" = segmentation —> higher level units: paragraphs, sections, etc.
  + production of a corresponding tree structure
• In Ariane-G5, up to 7 hierarchical separators for a given corpus
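A tree structure driven by an ordered list of hierarchical separators can be sketched recursively. The limit of 7 levels comes from the slide; the function name `build_tree`, the separator strings used in the example, and the nested-list representation are assumptions made for illustration.

```python
MAX_LEVELS = 7  # limit stated for a given corpus in Ariane-G5

def build_tree(text, separators):
    """Split text by separators, highest level first, into nested lists."""
    if len(separators) > MAX_LEVELS:
        raise ValueError("at most 7 hierarchical separators are allowed")
    if not separators:
        return text.strip()  # leaf: a lowest-level segment
    head, rest = separators[0], separators[1:]
    parts = [p for p in text.split(head) if p.strip()]
    return [build_tree(p, rest) for p in parts]

# Hypothetical separators: a blank line for paragraphs, "." for sentences.
tree = build_tree("a. b.\n\nc.", ["\n\n", "."])
```

Each nesting level of the result corresponds to one separator, so a corpus with chapter, section and paragraph separators would yield a three-level tree above the sentences.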
Representation of texts, intermediate results, translations and revisions
• Corpus = list of text files + descriptor
• Text = (transliterated) text + descriptor
  (+ non-textual elements replaced by tags or spec. occs)
• Intermediate result = list of decorated trees + descriptor
  (lingware variant + interval processed)
• Translation = (transliterated) text + descriptor
  (transliterated form may reduce morph. gen. size)
• Revision = (transliterated) text + descriptor
  (usually another, more natural transliteration)
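These definitions map naturally onto a few record types. The Python sketch below is one possible reading; all class and field names are hypothetical and do not reflect the actual Ariane-G5 file formats.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Descriptor:
    language: str
    transliteration: str                        # name of the transliteration used
    notes: dict = field(default_factory=dict)   # e.g. source, usage destination

@dataclass
class Text:
    content: str            # transliterated text, non-textual elements masked
    descriptor: Descriptor

class Translation(Text):
    """Same shape as a text: transliterated text + descriptor."""

class Revision(Text):
    """Same shape again, usually in a more natural transliteration."""

@dataclass
class IntermediateResult:
    trees: list                  # decorated trees, one per translation unit
    lingware_variant: str        # which variant of the lingware produced them
    interval: Tuple[int, int]    # first and last unit processed

@dataclass
class Corpus:
    texts: List[Text]
    descriptor: Descriptor

corpus = Corpus(
    texts=[Text("*FRANC!4OIS VA A!2 **ACIDCA'2000", Descriptor("FR", "upper"))],
    descriptor=Descriptor("FR", "upper"),
)
```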
Functionalities during processing
• Ensuring coherence between lingware and results
• Stopping & restarting processing of a text
• Reusing intermediate results
  • recovery from interruptions
  • debugging
  • multitarget translation (analysis ≈ 2/3 of translation time)
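Reuse of intermediate results and coherence with the lingware can both be approximated by storing each result under a key that includes the lingware variant. The sketch below is an assumption about how such a store could work, not a description of Ariane-G5 internals; `ResultKey` and `ResultStore` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class ResultKey:
    text_id: str
    phase: str                 # e.g. "analysis"
    variant: str               # lingware variant that produced the result
    interval: Tuple[int, int]  # units processed

class ResultStore:
    def __init__(self):
        self._results = {}

    def get_or_compute(self, key, compute):
        """Return the stored result for key, computing it only if missing."""
        if key not in self._results:
            self._results[key] = compute()
        return self._results[key]

calls = []

def analyse():
    calls.append("analysis")
    return "uma-structure"

store = ResultStore()
key = ResultKey("text1", "analysis", "v1", (1, 10))
for target in ("EN", "DE"):
    # Both targets share the same analysis result.
    store.get_or_compute(key, analyse)
# A changed lingware variant gives a new key, so stale results are not reused.
store.get_or_compute(ResultKey("text1", "analysis", "v2", (1, 10)), analyse)
```

With analysis taking about two thirds of translation time, sharing it across targets is where most of the saving in multitarget translation would come from.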
Conclusion and perspectives (1)
Text & corpus handling in complete MT systems is quite complex and interesting…
• handling texts and corpuses is not a straightforward problem,
• it suggests many interesting technological and scientific issues
Conclusion and perspectives (2)
but more is coming: Synergy MT systems <—> TA (Translation Aids)
unification of the representations of texts in both worlds:
• MT: revised texts structured as input texts, => the text data base will become a kind of multilevel translation memory (texts, translations/revisions, intermediate results)
• TA: translation memories from "bags" to structured translation memories (keeping the sequential context)
both: multiple-layer translation memories
• lemmatized forms
• "concrete" syntactic trees & "abstract" logico-semantic trees
• formatting tags
Conclusion and perspectives (3)
Structuration may be used to « distribute the work » to MT and TA by segmenting according to the « best engine »
some sublanguages are good for MT, bad for TA
• weather bulletins
others are good for TA, bad for MT
• weather related warnings, slightly modified versions of already translated documents
and others are best kept for specialists
• Fine-tune legal sentences