
Ch. Boitet, GETA, CLIPS ACIDCA ’2000, Monastir, 22-24/3/2000

ACIDCA ’2000, Monastir, 21-24/3/2000

Christian Boitet

GETA, CLIPS, IMAG, Grenoble

Handling texts and corpuses in Ariane-G5,

a complete environment for multilingual MT


Outline

Introduction

Multilingual MT-R (for revisors): linguistic methodology & basic software
• Goals and linguistic methodology
• Ariane-G5, an MT shell for building multilingual MT-R systems
• What has been and is done with Ariane-G5: MT-R, MT-A (for authors), MT of speech

Representation of input documents

Structuration of corpuses

Functionalities during processing


MULTILINGUAL MT-R: GOALS AND LINGUISTIC METHODOLOGY

Produce RAW translation GOOD ENOUGH to be revised

Specialize to SUBLANGUAGES and use MULTILEVEL TRANSFER (semantic + traces)

HEURISTIC PROGRAMMING


MULTILINGUAL MT-R: BASIC DIAGRAM

[Diagram. Texts: Source Language Text, Target Language 1 Text, Target Language 2 Text. Intermediate representations: umc-structure, uma-structure, gma-structure. Process labels: Morphological Analysis, Abstraction, Structural Analysis, Structural Generation, Morphological Generation, Syntactic Generation, paraphrase choice.]


Ariane-G5 (1978-99): structure


Ariane-G5: 2 specialized DB

relative to “variants” =>

DB of lingware components
• Declaration of variables (= typed attributes), templates…
• Dictionaries
• Grammars (rules = transitions of abstract automata)

DB of texts
• Corpuses
• Source texts
• Intermediate results
• Translations (± revisions)


What has been and is done with Ariane-G5:

MT-R (for revisors)
• Large, operational systems: RU—>FR, FR—>EN
• Prototypes: EN—>MY, TH, FR
• Lots of mockups

MT-A (for authors)
• LIDIA mockups: FR—>DE, EN, RU (adding CH)

MT of speech (for task-oriented dialogues)
• CSTAR demo system (EN, DE, KR, IT, FR, JP)


MT-R examples of translation (1)

French-English in aeronautics (before human revision)


MT-R examples of translation (2)


MT-A example of a disambiguation dialogue

Le capitaine a rapporté des tasses et des assiettes bleues

—> The captain has brought back blue bowls and plates / bowls and blue plates

Question 1
(•) des tasses bleues et des assiettes bleues
( ) des assiettes bleues et des tasses

Question 2
(•) capitaine de marine
( ) capitaine d’aviation
( ) capitaine d’artillerie
( ) capitaine d’infanterie
( ) capitaine de cavalerie
( ) …


Interaction in source for the “quality MT for all”

Example scenario: multilingual e-mail (UNL)

[Diagram, steps numbered 1 to 10. Components: e-mail tool (nicknames + language preferences), e-mail server, enconversion server, analysis server, interactive disambiguation server, deconversion servers (decoding servers), addressees’ e-mail servers.]


Other future possibility: production of multilingual “self-explaining documents”

[Diagram labels: source language text; target language 1 text; target language 2 text; MMC, UMC, UMA and GMA structures; interactive disambiguation; simulated “mute” disambiguation (DMS); ambiguity marks and dialogue (m.a.&d.); paraphrase choice; back-translation; back-translation 1; user.]


Speech Translation: advantages of an Interchange Format

• N target languages for the cost of one analysis
• Translating into one’s language from N source languages with one generation
• Using the same generation to “backgenerate”

[Diagram labels: Analysis into IF, IF, Backgeneration]


Interchange Format: example

la semaine du 12 nous avons des chambres simples et doubles disponibles
(the week of the 12th we have single and double rooms available)

give-information+availability+room (room-type=(single ; double), time=(week, md12))

Dialogue act: give-information
Concepts: availability, room
Arguments: room-type=(single ; double), time=(week, md12)
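The three parts of the expression above can be separated mechanically. The following is a minimal sketch, assuming the surface syntax is exactly the one shown on the slide (dialogue act and concepts joined by `+`, followed by one parenthesised argument list). The names `parse_if` and `split_top_level` and the layout of the returned dictionary are illustrative, not part of any CSTAR tool.

```python
def split_top_level(text, sep=","):
    """Split on `sep`, ignoring separators nested inside parentheses."""
    parts, depth, current = [], 0, []
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == sep and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    tail = "".join(current).strip()
    if tail:
        parts.append(tail)
    return parts


def parse_if(expr):
    """Return the dialogue act, concepts and arguments of an IF expression."""
    head, _, rest = expr.partition("(")
    labels = [label.strip() for label in head.split("+")]
    body = rest.rstrip()
    if body.endswith(")"):          # drop the parenthesis closing the argument list
        body = body[:-1]
    arguments = {}
    for item in split_top_level(body):
        name, _, value = item.partition("=")
        arguments[name.strip()] = value.strip()
    # Assumption: the first label is the dialogue act, the others are concepts.
    return {"dialogue_act": labels[0], "concepts": labels[1:], "arguments": arguments}


parsed = parse_if(
    "give-information+availability+room"
    "(room-type=(single ; double), time=(week, md12))"
)
```

Argument values are kept as raw strings here; a fuller reader would parse them recursively.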


Interface of CLIPS++ CSTAR-II demonstrator

[Screenshot labels: Recognition; IF; Backgeneration (to check “understanding”); Generation]


Hardware architecture of the CLIPS++ CSTAR-II demonstrator

[Diagram labels: Montpellier, Grenoble, RNIS (ISDN), Ethernet, Reco (recognition), Contrôle (control), Synthèse (synthesis), FIF, IFF, VC, IU.]


Steps in translating a text

Build its hierarchical structure
• Chapters, sections, paragraphs, [sentences]

Segment into translation units
• According to current length parameter [min..max]

Translate each segment
• Adding segment results to text results for desired phases

Revise (manually) the whole translations, keep the revisions
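The segmentation step can be illustrated as follows. This is only a sketch under stated assumptions: length is counted in words, sentences are already available as strings, and the grouping policy (close a unit as soon as it reaches the minimum, never exceed the maximum unless a single sentence is too long) is one plausible reading of the [min..max] parameter, not the documented Ariane-G5 behaviour. The function name `segment` is hypothetical.

```python
def segment(sentences, min_len, max_len):
    """Group consecutive sentences into units of min_len..max_len words.

    A unit is closed once it holds at least min_len words, or earlier if
    adding the next sentence would push it past max_len. A single sentence
    longer than max_len becomes a unit on its own.
    """
    units, current, size = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and size + n > max_len:
            units.append(current)
            current, size = [], 0
        current.append(sentence)
        size += n
        if size >= min_len:
            units.append(current)
            current, size = [], 0
    if current:                      # leftover sentences form a last, short unit
        units.append(current)
    return units


units = segment(["a b c", "d e", "f g h i j k", "l"], min_len=4, max_len=6)
```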


Representations of input documents

3 main questions:
• how to represent the writing system,
• separate formatting tags from the text or not,
• how to handle non-textual elements (figures, icons, or formulas) contained in utterances

Transliterations of textual elements

Keeping formatting tags in the texts

Non-textual elements


Transliterations of textual elements

Facilitate string-matching operations

Diminish the size of dictionaries
lisp, Lisp, LISP —> LISP, *LISP, **LISP

Represent diacritics
François va à ACIDCA’2000 —> *FRANC!4OIS VA A!2 **ACIDCA'2000

Make some processing easier for some tools
kataba —> ktb$aaa, katub —> ktb$au- or ktb$-ua
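The case and diacritic conventions in the examples above can be reproduced with a short function. This is a sketch inferred from the two examples only: `*` marks an initial capital, `**` a fully capitalised word, and a diacritic becomes a `!n` code after its base letter. Only the codes attested on the slide (`!2` for the grave accent, `!4` for the cedilla) are included; other marks are left untouched because their codes are not given here. The name `transliterate` is hypothetical.

```python
import unicodedata

# Combining marks whose codes appear in the slide's example.
MARK_CODES = {
    "\u0300": "!2",  # grave accent, as in "à" -> "A!2"
    "\u0327": "!4",  # cedilla, as in "ç" -> "C!4"
}


def transliterate_word(word):
    letters = [c for c in word if c.isalpha()]
    if len(letters) >= 2 and all(c.isupper() for c in letters):
        prefix = "**"                # fully capitalised word
    elif letters and word[0].isupper():
        prefix = "*"                 # initial capital only
    else:
        prefix = ""
    out = []
    # NFD splits an accented letter into base letter + combining mark.
    for ch in unicodedata.normalize("NFD", word):
        out.append(MARK_CODES.get(ch, ch.upper()))
    return prefix + "".join(out)


def transliterate(text):
    return " ".join(transliterate_word(w) for w in text.split())


example = transliterate("François va à ACIDCA'2000")
```

With this coding, the three spellings lisp, Lisp and LISP share the single dictionary form LISP, which is the point made about dictionary size.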


Transliterations of textual elements (2)

Represent writing systems using non-Roman characters

"мать" (mother) —> "MATQ" and not "MAT6"

а я е э и ы о ё у ю й
A YA E YE I YI O E!1 U YU J

з ж к ч с ш т щ ь ъ
Z ZH K KH S SH T TH Q W

今日は京都へ行きます。 (Today theme Kyoto dest go.) —>

KYOU <kj k1=kon k2=nichi> WA <hg ha> KYOUTO <kj k1=higashi k2=toukyo-no-tou> E <hg he> IKI <kj k1=iku> MASU.


Keeping formatting tags in the texts

If the translation units get larger, almost all tags become “inside tags”

Tags often have a linguistic role

For example, a sentence may contain
• a bullet list
• or a numbered list
which are normally linguistically homogeneous.

<P>For example, a sentence may contain</P><UL> <LI>a bullet list <LI>or a numbered list</UL><P>which are normally linguistically homogeneous. </P>
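One way to keep tags in the text is to treat each tag as a token of its own, so that the analyser sees the list items in the context of the sentence that contains them. The following sketch assumes simple, well-delimited tags such as those in the example; the function name `tokenize_with_tags` and the token layout are illustrative.

```python
import re

TAG = re.compile(r"<[^<>]+>")


def tokenize_with_tags(markup):
    """Return (kind, value) tokens, kind being "tag" or "text", in document order."""
    tokens, pos = [], 0
    for match in TAG.finditer(markup):
        text = markup[pos:match.start()].strip()
        if text:
            tokens.append(("text", text))
        tokens.append(("tag", match.group()))
        pos = match.end()
    tail = markup[pos:].strip()
    if tail:
        tokens.append(("text", tail))
    return tokens


tokens = tokenize_with_tags(
    "<P>For example, a sentence may contain</P><UL> <LI>a bullet list "
    "<LI>or a numbered list</UL><P>which are normally linguistically homogeneous. </P>"
)
```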


Non-textual elements

Formulas, figures, icons, brand names, anchors, links…are often best replaced by tags or special occurrences

The situation may be recursive (text inside figures)

*IF x2+5y>3 , x+y IS CONVENIENT .

*IF <relation 1> , <entity 2> IS CONVENIENT .

*IF $$R-1 , $$E-2 IS CONVENIENT .
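The replacement by special occurrences shown above can be sketched as a pair of functions, one masking the non-textual elements before translation and one restoring them afterwards. The sketch assumes the elements have already been located and typed (`R` for relation, `E` for entity, as in the example); `mask_elements` and `unmask` are hypothetical names.

```python
def mask_elements(text, elements):
    """Replace each (fragment, kind) pair by a special occurrence $$<kind>-<n>."""
    table = {}
    for number, (fragment, kind) in enumerate(elements, start=1):
        placeholder = f"$${kind}-{number}"
        text = text.replace(fragment, placeholder, 1)
        table[placeholder] = fragment
    return text, table


def unmask(text, table):
    """Put the original fragments back in place of the special occurrences."""
    # Longest placeholders first, so that $$E-10 is not mistaken for $$E-1.
    for placeholder in sorted(table, key=len, reverse=True):
        text = text.replace(placeholder, table[placeholder])
    return text


source = "*IF x2+5y>3 , x+y IS CONVENIENT ."
masked, table = mask_elements(source, [("x2+5y>3", "R"), ("x+y", "E")])
```

Keeping the table alongside the text is what allows the formulas to be reinserted unchanged in each target language.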


Structuration of corpuses

Motivations for corpuses

Segmentation and structuration

Representation of texts, intermediate results, translations and revisions


Motivations for corpuses

Corpus = collection of texts sharing some factual characteristics:

• natural language

• transliteration and method for handling formatting information and non-textual elements

• segmentation method

• structuration method

some management information:

• source (journal/volume, book/chapter…)

• usage destination (send back, postedit, tests…)


Segmentation and structuration

"segmentation" = input texts —> words, sentences… & units of translation
(words, sentences… best done by the morphological analyzer)

"structuration" = segmentation —> higher level units (paragraphs, sections, etc.) + production of a corresponding tree structure

In Ariane-G5, up to 7 hierarchical separators for a given corpus
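Structuration with an ordered list of separators can be sketched as a recursive split that yields a tree. The separator strings `//S` and `//P` below are invented for the illustration and are not the separators actually used in Ariane-G5; only the limit of 7 levels comes from the slide.

```python
MAX_SEPARATORS = 7  # limit stated for a given corpus


def structure(text, separators):
    """Split `text` recursively on separators, highest level first.

    Returns nested lists whose leaves are the lowest-level text units.
    """
    if len(separators) > MAX_SEPARATORS:
        raise ValueError("at most 7 hierarchical separators per corpus")
    if not separators:
        return text.strip()
    first, rest = separators[0], separators[1:]
    parts = [part for part in text.split(first) if part.strip()]
    return [structure(part, rest) for part in parts]


tree = structure(
    "Title one //P first para //P second //S Next //P third",
    ["//S", "//P"],
)
```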


Representation of texts, intermediate results, translations and revisions

Corpus = list of text files + descriptor

Text = (transliterated) text + descriptor
(+ non-textual elements replaced by tags or spec. occs)

Intermediate result = list of decorated trees + descriptor
(lingware variant + interval processed)

Translation = (transliterated) text + descriptor
(transliterated form may reduce morph. gen. size)

Revision = (transliterated) text + descriptor
(usually another, more natural transliteration)
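These definitions map naturally onto a few record types. The sketch below is only an illustration of the shapes listed above: the class and field names are invented, trees are left as opaque objects, and `usable_for` is a hypothetical helper showing how the descriptor of an intermediate result (lingware variant + interval processed) could be consulted.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class Text:
    """Also fits translations and revisions: (transliterated) text + descriptor."""
    content: str
    descriptor: Dict[str, Any] = field(default_factory=dict)


@dataclass
class IntermediateResult:
    trees: List[Any]                 # decorated trees, opaque in this sketch
    lingware_variant: str            # descriptor: variant that produced the trees
    interval: Tuple[int, int]        # descriptor: first and last unit processed

    def usable_for(self, variant: str, unit: int) -> bool:
        first, last = self.interval
        return variant == self.lingware_variant and first <= unit <= last


@dataclass
class Corpus:
    texts: List[Text] = field(default_factory=list)
    descriptor: Dict[str, Any] = field(default_factory=dict)


corpus = Corpus(descriptor={"language": "FR"})
corpus.texts.append(Text("*FRANC!4OIS VA A!2 **ACIDCA'2000", {"usage": "tests"}))
result = IntermediateResult(trees=["tree-1", "tree-2"], lingware_variant="v1", interval=(1, 2))
```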


Functionalities during processing

Ensuring coherence between lingware and results

Stopping & restarting processing of a text

Reusing intermediate results
• recovery from interruptions
• debugging
• multitarget translation (analysis ≈ 2/3 of translation time)
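The benefit of reusing the analysis for multitarget translation can be worked out from the 2/3 figure. The sketch assumes that one full translation costs 1, that the remaining 1/3 (transfer and generation) is paid once per target language, and that it costs the same for every target; `relative_cost` is a hypothetical helper.

```python
from fractions import Fraction


def relative_cost(n_targets, analysis_share=Fraction(2, 3)):
    """Cost with one shared analysis, relative to n independent translations."""
    shared = analysis_share + n_targets * (1 - analysis_share)
    independent = n_targets          # each full translation costs 1
    return shared / independent


cost_for_three = relative_cost(3)
```

Under these assumptions, translating into 3 target languages costs 2/3 + 3 × 1/3 = 5/3 instead of 3, that is 5/9 of the cost without reuse, and the ratio tends towards 1/3 as the number of targets grows.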


Conclusion and perspectives (1)

Text & corpus handling in complete MT systems is quite complex and interesting…
• handling texts and corpuses is not a straightforward problem,
• it suggests many interesting technological and scientific issues



but more is coming: Synergy MT systems <—> TA (Translation Aids)

unification of the representations of texts in both worlds:

• MT: revised texts structured as input texts, => the text data base will become a kind of multilevel translation memory (texts, translations/revisions, intermediate results)

• TA: translation memories from "bags" to structured translation memories (keeping the sequential context)

both: multiple-layer translation memories

• lemmatized forms

• "concrete" syntactic trees & "abstract" logico-semantic trees

• formatting tags



Structuration may be used to “distribute the work” to MT and TA by segmenting according to the “best engine”

some sublanguages are good for MT, bad for TA

• weather bulletins

others are good for TA, bad for MT

• weather related warnings, slightly modified versions of already translated documents

and others are best kept for specialists

• Fine-tune legal sentences