Post on 30-Dec-2015
Creating Translation Context with Disambiguation
Tadej Štajner – Jožef Stefan Institute
Yves Savourel – ENLASO Corporation
Localization World – London – June 2013
Context: A Shortcoming
• Traditionally, translation tools have been strong on code handling and re-use of existing translations.
• But they have been weaker at providing context or linguistic resources for the translators.
• Things are improving and are bound to improve even more.
New Factors
• Component-based processing is becoming widespread (i.e. source text goes through several preparation steps: TM, MT, etc.)
• Web services allow a single process to tap into many different resources; specialization becomes easier.
• Now ITS 2.0 provides a common way to carry various information across tools/services.
ITS: Internationalization Tag Set
• A set of common internationalization and localization-related features (called “data categories”) for XML, and now with ITS 2.0 also for HTML5
• ITS 2.0 is being finalized at the W3C: http://www.w3.org/TR/its20/
ITS and “Context”
• ITS 2.0 offers several data categories that can help with contextual information: Localization Note, Terminology, Id Value, Domain and Text Analysis.
• A quick glance at the first four, then an in-depth look at Text Analysis.
Localization Note
Comments put in the source document and meant to be seen by the translators.
<msg its:locNote="%s is for On or Off">Click the %s button</msg>
Terminology
Annotates a “term” in the content and, optionally, provides additional related information.
<p>We need a new <span its-term="yes" its-term-info-ref="http://en.wikipedia.org/wiki/Motherboard">motherboard</span>.</p>
Id Value
Provides a way to associate unique IDs with parts of the content during translation. Can be useful for software text, where IDs are often descriptive.
<its:idValueRule selector="//msg" idValue="@name"/>
...
<msg name="FILENOTFOUND">Not found</msg>
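The effect of the rule above can be sketched in Python. This is only an illustration using the standard xml.etree.ElementTree module: the wrapper element `messages`, the second `msg` entry and the function name `collect_id_values` are hypothetical, and a real ITS processor would evaluate `selector` and `idValue` as full XPath expressions, which this sketch does not attempt.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; only the FILENOTFOUND entry comes from the slide.
SAMPLE = """<messages>
  <msg name="FILENOTFOUND">Not found</msg>
  <msg name="DISKFULL">Disk full</msg>
</messages>"""


def collect_id_values(xml_text, element="msg", id_attribute="name"):
    """Approximate selector="//msg" idValue="@name" without a full XPath engine."""
    root = ET.fromstring(xml_text)
    ids = {}
    # root.iter() visits matching elements at any depth, similar to //msg.
    for node in root.iter(element):
        id_value = node.get(id_attribute)
        if id_value is not None:
            ids[id_value] = node.text or ""
    return ids


id_map = collect_id_values(SAMPLE)
print(id_map)
```

A translation tool could then show the ID next to each string, so a translator seeing "Not found" also sees that it belongs to FILENOTFOUND.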
Domain
Identifies the general topic area of the content to translate. Can be useful for selecting MT engines.
<its:domainRule selector="/h:html" domainPointer="/h:html/h:head/h:meta[@name='keywords']/@content"/>
...
<meta name="keywords" content="automotive"/>
Text Analysis
• Annotates content with lexical or conceptual information.
• Useful for many things:
– Term suggestion
– General context information
– Suggestion of things not to translate
– Automated transliteration of proper names
– Etc.
Text Analysis: An Example
Enrycher is an example of a component that generates Text Analysis annotations and can be easily integrated with translation tools or localization processes.
Motivation (2)
• There are specific rules to translate (or transliterate) proper names
• Solution: figure out what is actually being mentioned and see if any existing translated expression exists for that entity
Motivation (3)
• Examples: personal names, product names, geographic names, chemical compounds, protein names
• Names and phrases appear in situations without sufficient context (UI labels, etc.)
ITS 2.0 Text Analysis
• Supports text analysis agents that enhance content by suggesting or identifying concepts, identified by IRIs.
• A Text Analysis annotation marks a text fragment with:
– entity type
– entity identifier
– confidence
Text Analysis in ITS 2.0 – what can it tell us?
• Does a text fragment represent some entity?
– London is lovely in the summer.
– Out of 73 known entities named London, we mean a particular one: http://dbpedia.org/resource/London
• … a particular type of entity?
– London is a phrase, representing a location
• … and with what confidence?
ITS 2.0 Text Analysis
<!DOCTYPE html>
<div its-annotators-ref="text-analysis|http://enrycher.ijs.si/mlw/toolinfo.xml#enrycher">
  <span its-ta-ident-ref="http://dbpedia.org/resource/London" its-ta-class-ref="http://schema.org/Place">London</span> is the
  <span its-ta-ident-ref="http://purl.org/vocabularies/princeton/wn30/synset-capital-noun-3.rdf">capital</span> of
  <span its-ta-ident-ref="http://dbpedia.org/resource/United_Kingdom" its-ta-class-ref="http://schema.org/Place">United Kingdom</span>.
</div>
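A consuming tool might read these annotations back out of the HTML. The following is a minimal sketch using Python's standard html.parser module; the class name `TextAnalysisCollector` is hypothetical. It assumes well-formed markup with no void elements (such as `<br>`) inside the annotated content, and it reads `its-ta-confidence` only if the producer supplied it.

```python
from html.parser import HTMLParser

# The annotated sentence from the slide, as one HTML string.
SAMPLE = (
    '<div its-annotators-ref="text-analysis|http://enrycher.ijs.si/mlw/toolinfo.xml#enrycher">'
    '<span its-ta-ident-ref="http://dbpedia.org/resource/London" '
    'its-ta-class-ref="http://schema.org/Place">London</span> is the '
    '<span its-ta-ident-ref="http://purl.org/vocabularies/princeton/wn30/synset-capital-noun-3.rdf">'
    'capital</span> of '
    '<span its-ta-ident-ref="http://dbpedia.org/resource/United_Kingdom" '
    'its-ta-class-ref="http://schema.org/Place">United Kingdom</span>.</div>'
)


class TextAnalysisCollector(HTMLParser):
    """Collect the text, identifier, class and confidence of annotated elements."""

    def __init__(self):
        super().__init__()
        self.annotations = []
        self._stack = []  # one entry per open element: a dict if annotated, else None

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if "its-ta-ident-ref" in attr_map or "its-ta-class-ref" in attr_map:
            self._stack.append({
                "text": "",
                "ident": attr_map.get("its-ta-ident-ref"),
                "class": attr_map.get("its-ta-class-ref"),
                "confidence": attr_map.get("its-ta-confidence"),
            })
        else:
            self._stack.append(None)

    def handle_data(self, data):
        # Text belongs to every annotated element that is currently open.
        for entry in self._stack:
            if entry is not None:
                entry["text"] += data

    def handle_endtag(self, tag):
        if self._stack:
            entry = self._stack.pop()
            if entry is not None:
                self.annotations.append(entry)


collector = TextAnalysisCollector()
collector.feed(SAMPLE)
collector.close()
for annotation in collector.annotations:
    print(annotation["text"], annotation["ident"], annotation["class"])
```

The collected entries are plain dictionaries, so a later step can filter by class (for example, only places) or look up existing translations by identifier.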
Producing these annotations
• Manual annotation
• Automated NLP Techniques
– Named entity extraction & disambiguation
– Word sense disambiguation
Use cases
• Informing a human agent (e.g. a translator) that a certain fragment of text is subject to specific translation rules:
– proper names
– officially regulated translations.
• Informing a software agent (e.g. a CMS) about the conceptual type of a textual entity in order to enable special processing or indexing
Named entity disambiguation – behind the scenes
• A difficult problem:
– A name can refer to many entities, an entity can have many names
– Which interpretation is correct?
• Humans are pretty good at this
1. We have prior knowledge on the ‘usual’ meanings
2. We can glean the meaning from the context
3. Things that are related appear together
Named entity disambiguation – behind the scenes (2)
1. Prior knowledge: what is the most frequent meaning of ‘London’?
2. Context: someone using the word ‘London’ in the context of ‘Canada’ is likely referring to the other London, in Ontario
Named entity disambiguation – behind the scenes (3)
3. Relational similarity: things connected in the knowledge graph tend to appear together
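The first two signals can be combined in a toy scorer. This is a sketch for illustration only, not Enrycher's actual algorithm: the prior probabilities, context word sets, the `context_weight` parameter and the function name `disambiguate` are all invented, the IRI for London, Ontario is an assumption about DBpedia naming, and relational similarity over a knowledge graph is not modelled.

```python
import math

# Hypothetical toy knowledge base: priors and context words are invented.
CANDIDATES = {
    "London": [
        {
            "iri": "http://dbpedia.org/resource/London",
            "prior": 0.90,  # assumed share of mentions that mean this entity
            "context": {"england", "thames", "uk", "kingdom"},
        },
        {
            "iri": "http://dbpedia.org/resource/London,_Ontario",  # assumed IRI
            "prior": 0.10,
            "context": {"canada", "ontario"},
        },
    ]
}


def disambiguate(mention, sentence, context_weight=3.0):
    """Pick the candidate with the best log-prior plus weighted context overlap."""
    words = {word.strip(".,!?").lower() for word in sentence.split()}
    best_iri, best_score = None, -math.inf
    for candidate in CANDIDATES.get(mention, []):
        overlap = len(words & candidate["context"])
        score = math.log(candidate["prior"]) + context_weight * overlap
        if score > best_score:
            best_iri, best_score = candidate["iri"], score
    return best_iri


print(disambiguate("London", "London is lovely in the summer."))
print(disambiguate("London", "I moved to London, Canada last year."))
```

With no contextual evidence the most frequent meaning wins; a single matching context word ('Canada') is enough here to outweigh the prior, which mirrors points 1 and 2 above.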
Building blocks of Enrycher
• Token-level analysis
– Sentence splitting
– Tokenization
– Lemmatization
– Part-of-speech tagging
• Entity-level analysis
– Named entity extraction
– Co-reference resolution
– Anaphora resolution
– Named entity disambiguation
• Document-level analysis
– Sentiment analysis
– Topic classification
– Keyword extraction
(not used here)
Using Enrycher
• An HTTP service endpoint: send HTML5 in, get enriched HTML5+ITS 2.0 out
• Multilingual: supports English and Slovene
• See http://enrycher.ijs.si/mlw/, or try it from the command line:
$ curl -d "<p>Welcome to London</p>" http://enrycher.ijs.si/mlw/en/entityIdent.html5its2
<p>Welcome to <span its-ta-ident-ref="http://dbpedia.org/resource/London" its-ta-class-ref="http://schema.org/Place">London</span></p>
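The same call can be sketched from Python with the standard urllib module. The function names `build_request` and `enrich` are hypothetical, the UTF-8 encoding of request and response is an assumption, and nothing here guarantees that the endpoint is still reachable; the network is only contacted if `enrich` is called.

```python
import urllib.request

# Endpoint taken from the curl example; its availability is not assumed here.
ENDPOINT = "http://enrycher.ijs.si/mlw/en/entityIdent.html5its2"


def build_request(html_fragment, endpoint=ENDPOINT):
    """Prepare a POST request carrying the HTML fragment as the body, like curl -d."""
    return urllib.request.Request(
        endpoint,
        data=html_fragment.encode("utf-8"),  # assumption: the service accepts UTF-8
        method="POST",
    )


def enrich(html_fragment, timeout=10):
    """Send the fragment and return the annotated HTML (performs a network call)."""
    with urllib.request.urlopen(build_request(html_fragment), timeout=timeout) as response:
        return response.read().decode("utf-8")  # assumption: UTF-8 response


request = build_request("<p>Welcome to London</p>")
print(request.get_method(), request.full_url)
```

Separating request construction from sending keeps the sketch testable offline and makes it easy to swap in another language path, such as a Slovene endpoint, if one is known.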
Enrycher Integrated in Okapi
• The Okapi Framework is an open-source, cross-platform set of components designed to help build localization processes.
• One of its components is a client of the Enrycher services.
• Text Analysis annotations can be applied to any document in a format supported by the Okapi filters.
[Diagram: One example of usage of the Enrycher Web services. Labels: Input File, Extraction Step, Enrycher Step, Enrycher Server, Term Extraction Step, Terms, Other Steps…, Trans-Kit Creation Step, XLIFF, Translation Kit]
Enrycher Step
• Converts batches of segments (in Okapi’s internal format) into HTML paragraphs and sends them to the Enrycher service.
• Converts the annotated paragraphs back into Okapi’s internal format.
• Subsequent steps can use the Text Analysis metadata, e.g. for XLIFF output or OmegaT comments.
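The round trip described above can be sketched in Python. This is not Okapi's API (Okapi is a Java framework): plain strings stand in for Okapi's internal segment format, `fake_enrycher` is a hypothetical stand-in for the remote service, and the names `segments_to_html` and `ParagraphSplitter` are invented for illustration.

```python
from html import escape
from html.parser import HTMLParser


def segments_to_html(segments):
    """Wrap each plain-text segment in a <p> element, escaping markup characters."""
    return "\n".join(f"<p>{escape(segment, quote=False)}</p>" for segment in segments)


def fake_enrycher(html_text):
    """Hypothetical stand-in for the service: annotates the word 'London' only."""
    span = '<span its-ta-ident-ref="http://dbpedia.org/resource/London">London</span>'
    return html_text.replace("London", span)


class ParagraphSplitter(HTMLParser):
    """Recover the inner markup of each <p>, keeping inline tags such as <span>."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._current = None  # list of markup pieces while inside a <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._current = []
        elif self._current is not None:
            self._current.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "p":
            if self._current is not None:
                self.paragraphs.append("".join(self._current))
            self._current = None
        elif self._current is not None:
            self._current.append(f"</{tag}>")

    def handle_data(self, data):
        if self._current is not None:
            self._current.append(escape(data, quote=False))


segments = ["Welcome to London", "Price < 10"]
splitter = ParagraphSplitter()
splitter.feed(fake_enrycher(segments_to_html(segments)))
splitter.close()
print(splitter.paragraphs)
```

Because one paragraph is produced per segment and the order is preserved, the annotated paragraphs can be matched back to their segments by position.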
Term Extraction Step
• The Term Extraction Step offers various simple ways to guess terms in source content.
• One of its methods is to re-use the content annotated with Text Analysis metadata to feed the list of term candidates.
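As a rough illustration of that method, annotated fragments can be counted and ranked as term candidates. This sketch does not reflect how Okapi's Term Extraction Step is implemented: the annotation dictionaries, the `min_count` parameter and the function name `term_candidates` are hypothetical.

```python
from collections import Counter

# Hypothetical annotations, in the shape a Text Analysis reader might produce.
ANNOTATIONS = [
    {"text": "London", "ident": "http://dbpedia.org/resource/London"},
    {"text": "capital", "ident": None},
    {"text": "London", "ident": "http://dbpedia.org/resource/London"},
    {"text": "United Kingdom", "ident": "http://dbpedia.org/resource/United_Kingdom"},
]


def term_candidates(annotations, min_count=1):
    """Rank annotated fragments by frequency; ties keep first-seen order."""
    counts = Counter(a["text"] for a in annotations if a["text"].strip())
    return [(term, count) for term, count in counts.most_common() if count >= min_count]


candidates = term_candidates(ANNOTATIONS)
print(candidates)
```

Raising `min_count` trims one-off mentions, which may be useful when the annotated content is large.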
Questions?
• Enrycher: http://enrycher.ijs.si/
• Okapi Framework: http://okapi.opentag.com/
• ITS 2.0: http://www.w3.org/TR/its20/