
World Patent Information 35 (2013) 115–125


Applicability and application of machine translation quality metrics in the patent field

Laura Rossi a,*, Dion Wiggins b

a LexisNexis Univentio, Galileiweg 8, 2333 BD Leiden, The Netherlands
b Asia Online Pte Ltd, 30 Robinson Road, #02-01 Robinson Towers, Singapore 048546

Keywords: Machine translation, MT; Quality metrics; BLEU; Human evaluation; Patent translation

* Corresponding author. Tel.: +31 88 639 0000, +31 88 639 0065. E-mail address: [email protected] (L. Rossi).

0172-2190/$ – see front matter © 2012 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.wpi.2012.12.001

Abstract

This article provides a guide for readers interested in deepening their knowledge of the strengths and weaknesses of the different machine translation (MT) quality metrics, and presents a methodology and tooling developed respectively by LexisNexis® and its MT provider Asia Online™ as part of a human quality assessment framework for patent translation. The methodology is designed specifically to compensate for the shortcomings of automated evaluation.

© 2012 Elsevier Ltd. All rights reserved.

1. Introduction

The constant increase in the volumes of translation work, pressure on turnaround times and cost reductions [1] has raised general public awareness of the possibilities provided by machine translation (MT) technology to disclose foreign content. Some credit for this awareness can be given to the, until recently, free offerings of some big companies' tools such as Google Translate™ and Microsoft® Translator.

The patent information industry discovered some time ago the power of MT for unlocking patent publications available only in their original language. The ability to search for inventions in other countries and in foreign languages, as well as to scan through their content, has become an essential and frequently used tool [20–24].

The adoption of MT technology for patent translations has led to the need to measure the quality of translation output, both between versions of a single system and across multiple translation systems. In recent times, there has been much discussion within the patent information industry about BLEU scores and other automated quality metrics. However, there is some concern over the following:

- Is there awareness and understanding of what MT automated measurements are meant for and of how they can be influenced?

- Are evaluators of MT output aware of the correlation between the usage of specific MT technologies and settings and the results of automated metrics?


- Are automated metrics enough for defining the acceptability level of a system, or is it human usage and perception, in the end, which determines the suitability of MT results to a specific purpose?

- Is it possible to define absolute criteria of 'good quality' that humans can use for scoring translations, or do these criteria have to be made dependent on and be prioritized according to the final goal of the translation effort?

- How suitable, if at all, is the patent text typology in its current form and style for translation and evaluation via automated tools?

2. Patents and machine translation

2.1. Patent language and intrinsic limitations in MT final quality

Patents are highly technical in nature and make use of established legal and conventional formulations. This could lead some people to conclude that patents are a good candidate text typology for MT. In reality, patent data presents numerous challenges to MT researchers, developers and commercial users. Those challenges are not so much represented by the specificity and technicality of the terms, although the novel nature of patents is a daily source of neologisms, but more by the extreme syntactic complexity of patent sentences, which make massive use of nominal style, relative clauses, formal constructions and long-distance relationships among constituents. In the following sentences from German application DE19927708A1, these typical features are quite well exemplified. For better analysis, a double pipe is used to separate the different clauses in the sentences.


In this first sentence, it can be observed how the SVO (subject–verb–object, in this case an indirect object) basic structure used by the German language is inverted in the main clause for stylistic reasons, and how the distance between subject (underlined) and verb (bold) in the subsequent subordinate clauses is quite long and made more complex by the use of in-between participial constructions (i.e. mit einem vorzugsweise mit einer größeren Flächenerstreckung ausgeführtem Borstenfeld).

Dem Benutzer der erfindungsgemäßen Bürste steht daher aufgrund der mehreren Borstenfelder mit Borsten mit verschiedenen Härten eine Bürste zur Verfügung, || bei der ein Borstenfeld mit härteren Borsten zur Beseitigung härterer Beläge vorhanden ist, || während der Benutzer eine Reinigung großflächiger Bereiche der Zahnprothesen mit einem vorzugsweise mit einer größeren Flächenerstreckung ausgeführtem Borstenfeld der erfindungsgemäßen Bürste mit weicheren Borsten durchführen kann, || so daß eine Beschädigung der Zahnprothesen durch den großflächigen Einsatz von harten Borsten ausgeschlossen ist.

In this second example it can be seen how clauses are made dependent on each other via relative pronouns (underlined) and made more complex by the use of coordination (bold) referring respectively to the preceding relative and consecutive clause.

Die erfindungsgemäße Bürste kann ein zweites Borstenfeld besitzen, || welches ein drittes Borstenfeld weitgehend einschließt, || welches Borsten aufweist, || die weicher sind als die Borsten des ersten Borstenfeldes || und härter sind als die Borsten des zweiten Borstenfeldes, || so daß ein kleineres Borstenfeld mit härteren Borsten von einem größeren Borstenfeld mit weicheren Borsten umgeben ist || und damit auch ein etwas härteres Borstenfeld zu einer großflächigen Reinigung von Zahnprothesen zur Verfügung steht, || bei dessen Einsatz es aufgrund des dieses Borstenfeld umgebenden Borstenfeldes mit weichen Borsten nicht zu einer Beschädigung der Zahnprothese durch die weichen Borsten kommen kann.

The freedom allowed to patent drafters in their explanations and the still scarce usage of patent-specific authoring tools, such as LexisNexis® PatentOptimizer™, which help unify language and style, create consistency and minimize errors, make patent publications, from the MT perspective, a 'jungle' of convoluted text.

Patent language breaks most of the rules dictated by years of research in the Controlled Language space, meant to simplify the lexical, syntactic and textual aspects of a document for both better human intelligibility and enhanced machine processing. Table 1 shows some examples of this non-compliance of patents with machine-translation-oriented Controlled Language rules (for a more comprehensive overview of Controlled Language rules, see O'Brien [2]).

The historical variability of patents adds to this complexity in different ways:

- Very old documents are available in digital form only thanks to the use of OCR (optical character recognition) technologies. OCR quality depends on different factors, among which are the quality of the initial scanned images and the support for a specific source language. Scans performed with older scanning methods can have a lower degree of accuracy compared to modern ones [3]. As a consequence, the OCRed versions of such documents that get submitted to MT may contain mistakes.

- Language evolves and changes occur in grammar, spelling, style and usage of words. The following examples from Dutch patent NL41884, published in 1937, show respectively an obsolete grammar construction ('der wond' is 'van de wond' in modern Dutch) and samples of old-fashioned spelling ('versch vleesch', 'vleeschbrij' and 'Nederlandsch' would be 'vers vlees', 'vleesbrij' and 'Nederlands' in modern Dutch).

- Voor het dichtnaaien van door operaties veroorzaakte wonden maakt de chirurg gebruik van een bijzonder naaimateriaal, hetwelk ten doel heeft de randen der wond met elkander in aanraking te houden, totdat zij geheel is dichtgegroeid.

- Ook is reeds voorgesteld versch vleesch, pezen, huiden of derg. op een andere manier voor te behandelen, nl. tot een vleeschbrij fijn te maken en deze massa door mondstukken tot draden te persen. (Verg. als voorbeeld het Nederlandsch O.S. 23164)

The impact on quality: This complexity needs to be taken into consideration when building a reference test set for evaluation of MT results and when analyzing the results of automated or human quality metrics. The results of the metrics could in fact say more about the quality and intelligibility of the source text than about the quality of the MT systems producing them, if the reference test sets are not built through a careful and conscious analysis, as explained in Section 3.1.4.

Moreover, MT evaluators who are busy with benchmarking different systems should be aware of the fact that some systems are built to support these specific patent patterns, others have the potential and the flexibility to be customized in order to do so, and some cannot grow in any other way than by extending their dictionary coverage.

2.2. Quality as a goal-dependent variable

In the patent field, MT is used as a support tool for performing novelty, validity, infringement or state-of-the-art searches, and to provide a first understanding of the content of retrieved publications.

The impact on quality: Considering the preceding statement, it becomes clear that two aspects are key to the evaluation of MT results, that is to say searchability and readability.

We define MT searchability as the combination of:

- terminology completeness
- terminology correctness

With readability we refer to the fluency of a translated text determined by:

- the order of phrases in a sentence
- the correctness of their logical connections.

Morphological and morphosyntactic errors (verb tense, subject–verb concordance, noun–adjective concordance, singular versus plural form) are not really relevant in terms of general 'gisting' and cross-language retrieval.

2.3. Interdependencies between quality evaluation process and MT approach

Most of the patent offices and providers of patent search tools have adopted MT by following either of these approaches or sometimes a combination of the two:

1. Using MT to translate keywords or phrases from the language spoken by the user to different languages, so as to be able to look for the same concept across different authorities publishing in different languages. This can be defined as 'query MT' and focuses on using MT for searching purposes.

2. Using MT at the tail of the process to make a publication readable for a searcher who is not familiar with its original language, therefore only after the same publication has been retrieved by using criteria other than text search (i.e. patent classification, publication dates, kind codes, etc.). This can be defined as 'document MT', and focuses on providing the user with a first 'gisting' of the publication content.


Table 1. Compliance of patent language with common Controlled Language rules.

- Rule: Sentences should be short (around 25–30 words). In patents: Patent sentences are usually very long; a single sentence can be up to 500 or more words.

- Rule: Sentences should be grammatically complete (they should not be written in telegraphic or nominal style, and they should not lack a subject or a verb). In patents: Patents often make use of nominal or telegraphic style, especially in titles.

- Rule: Sentences should use the active form (passive forms are more difficult for an MT system to deal with). In patents: Passive forms are preferred in patents as in other scientific texts.

- Rule: Sentences should have a simple syntax (the usage of gerunds, past participles, relative clauses and other implicit constructions is a source of ambiguity for MT). In patents: Patents make use of formal and legal language, which is rich in gerunds, participles, relative and other subordinate clauses.

- Rule: Sentences should express one idea only (combining more ideas in the same sentence has the natural consequence of complicating its grammatical structure). In patents: Patent abstracts, some parts of descriptions and claims concatenate many ideas in the same sentence.

- Rule: Spelling should be correct (incorrectly spelt words are not matched against any of the words which might be found in a rule-based MT system dictionary or in a statistical machine translation phrase table). In patents: Spelling mistakes are inevitable in the authoring of such long and complex texts.

- Rule: Lexical choice should be limited to a pre-defined dictionary, synonymy should be limited, polysemy avoided (when words are known to the MT systems, quality and fluency both increase; the more different the words chosen in the source text, the higher the chance that some of those are not known to the system). In patents: The patent terminology scope covers all areas of human knowledge and is exponentially growing, due to the 'innovative' nature of patents. What is described in patents is 'new' by definition, and neologisms are often created in order to express specific new concepts. It is also not uncommon to find multiple concepts in the same domain indicating the same thing (synonymy), as well as the same concept across different domains with different meanings (polysemy).

- Rule: Punctuation should be used carefully and meaningfully (commas, colons, semicolons and full stops should be strictly used according to the rules of the source language). In patents: Use of punctuation is often not correct in patent texts, causing ambiguity in interpretation.



When deciding on its MT strategy, LexisNexis® first identified some main shortcomings of the approaches described above:

1. Query MT, which we describe as the application of MT technology only to a list of user-inputted key terms or phrases, does not use the sentence context for disambiguating specific terms: context is required by MT systems to choose semantically appropriate translations in the presence of polysemic words and homographs. MT alone might not be sufficient in this approach to perform the required disambiguation. As also stated by Magdy and Jones [4], 'when using MT for CLIR' (Cross Language Information Retrieval) 'longer queries are preferable since they tend to be more grammatical, therefore better translations can be achieved using an MT system taking context into account, leading to better retrieval effectiveness'. This is also demonstrated by the rich research in Cross Language Information Retrieval, which explores complementary or alternative ways to MT for optimizing search across multiple languages [5–7]. The Cross Language Evaluation Forum (CLEF) offers a good overview of the new trends in this field (see http://clef2011.org/index.php?page=pages/labs_program.html).

2. Query MT is used for searching purposes and does not deliver any translation of the entire content of a publication. The result of a search with query MT is a document in its original language.

3. Document MT cannot be used for searching purposes, but only for a rough understanding of the content. When using document MT only after retrieving a publication, the searcher has to fully rely on other criteria and cannot make use of the strength of full text search, other than, mainly, for English-language titles or abstracts. The potential of full text search, as a complementary tool to classification search, is very well known to the patent searchers' community [8].


4. Even when used in combination to cover their respective shortcomings, the results of query or document MT would not be useful to LexisNexis for integration with its existing semantic search technology.

5. MT could be used to translate the search criteria into the target language, search and then return the results with a real-time translation, but this would require the ability to search accurately in languages such as Chinese, where there are additional complexities [9,10].

Considering the above, LexisNexis opted for a third approach: the translation into English and storage in its database of all text elements (descriptions and claims, but also titles and abstracts where these are not already available in English) for all publications. 'Batch translation' combines the strengths of query and document MT by providing:

- the possibility of searching for a patent using English as a common language among all authorities

- the possibility of applying semantic search on English text across authorities

- the possibility of getting a first rough idea of the content immediately at retrieval

- the ability to check upfront whether a publication cannot be translated because of any technical issue, and to solve the issue before many users encounter it by clicking on the real-time document translation button

- the ability to review areas of improvement for translation.

The impact on quality: Batch translation, however, has the drawback of being time-consuming and therefore less flexible in the incremental implementation of enhancements: replacing a batch of translations with a better output when a new version of the system is released requires careful planning. It soon becomes clear that, especially in the case of batch translation, due to the considerable amount of time needed to complete it, quality measurements have to be defined and regularly performed in order to monitor development changes before deployment. The same methodologies can be applied to document MT, but are less useful for judging the quality of query MT, which compares more to the way multilingual indexing for cross-language retrieval works and could better be tested by standard measures of precision and recall.

The MT strategy one adopts has consequences for the way MT quality measurements should be performed and integrated in a production environment.

3. Measuring MT quality

3.1. Automated evaluation

3.1.1. Fundamental assumptions

The fundamental assumption most MT automated scores are based upon is that a machine-translated text is regarded as better the more closely it resembles a human-translated reference.

For the calculation of automated scores, the machine-translated candidate is therefore compared to one or more human-translated reference texts.

Automated metrics were originally built for monitoring developments on a specific MT system for a given language pair, and for confirming that changes applied to that system were bringing it to a higher level of quality. Their scores cannot be used as such to make any absolute statements about the general quality level or to compare MT quality across different language pairs. They can be useful, however, in benchmarking different systems, provided that:

- the same domain and language pairs are taken into consideration

- the same test set is used to calculate the scores

- all of the criteria for building the test set are valid for all of the systems

- any bias of the metrics towards either the rule-based or statistical MT technology is correctly considered in the final evaluation.

3.1.2. Metrics

The following overview aims at providing a short explanation of how the most well-known and widespread automated quality metrics currently used by the MT industry are calculated, what they actually measure and how they should be interpreted.

- BLEU [11]: a modified precision metric that measures how many words in the machine-translated candidate text 'overlap' with the reference text(s). To be more precise, the modified precision algorithm does not take into consideration single words only, but n-grams, that is to say contiguous sequences of n word items in a text. The choice of n-gram comparison instead of unigram comparison gives a specific importance to words which are in the same sequential order as in the reference text. MT could in fact generate fully incomprehensible translations with all necessary and correct words, but in the wrong order. Taking n-grams into account for the calculation avoids high scores for these cases. BLEU is generally considered to be much more a measure of lexical 'accuracy', which partially covers 'fluency' as well, but it is no measure of text intelligibility or grammar correctness. The score is normally measured on a scale from 0 to 1, where 1 is a perfect match. To simplify communication, the score is often expressed as a value between 0 and 100, which should not be confused with a percentage of accuracy. (A toy illustration of the n-gram idea is sketched after this list.)

- NIST [12]: the NIST evaluation is very similar to the calculation of BLEU, measuring both 'adequacy' and 'fluency', but adding the concept of 'informativeness'. The n-gram co-occurrence score is modified by weighing more heavily n-grams that are considered 'more informative' (i.e. less frequently occurring n-grams). Once again, the focus is on lexical correctness and concept rendering, more than on text intelligibility.

- METEOR [13]: this metric tries to address some weaknesses of the BLEU score by measuring both precision and recall, assigning penalties to non-adjacent text chunks rather than by n-gram brevity as BLEU does, and using stemming algorithms and a synonym dictionary for taking into account word variations. The calculation of recall is meant to address the 'completeness' of translation compared to the reference text(s), next to its 'adequacy' that is measured by the precision count. The fact that METEOR requires language-specific components such as stemming algorithms and synonym dictionaries makes its usage restricted to some languages only, for which these components are available and easy to integrate.

- F-Measure [14,15]: F-Measure is defined as the harmonic mean of precision and recall. In its application on MT results described in Ref. [14], precision and recall are calculated by using the maximum matching size (MMS) for a particular bitext, which is another approach to compute the intersection between the candidate and the reference text that avoids double-counting. Moreover, in the same application, longer 'runs' (contiguous sequences of matching words) are rewarded. As for the preceding metrics, even F-Measure is an indicator of 'accuracy', 'fluency' and 'completeness' (thanks to the calculation of the recall), but still no measure of understandability.

- TER [16]: TER, or Translation Edit Rate, measures the minimum number of edits required to change a machine-translated segment (called the hypothesis) in order to make it match one of the reference texts, normalized by the average length of the references. Edits include insertion, deletion, substitution and shifts of words, and are of equal cost. Intuitively, for this measure, the lower the score, the better the quality. The TER score is limited by its lack of notions of semantic equivalence: a hypothesis might convey the exact same meaning as the reference text, but still be given a high TER score because of its very different syntactic structure and word choice.
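To make the two core ideas concrete, the following toy sketch computes a BLEU-style modified n-gram precision with brevity penalty and a TER-style normalized edit distance for a single sentence pair. It is illustrative only and is not the official implementation of either metric (real BLEU is computed at corpus level and usually with smoothing, and real TER also allows block shifts); Python is assumed here purely for readability.

```python
"""Toy single-sentence versions of the BLEU and TER ideas (illustration only)."""
from collections import Counter
import math


def ngrams(tokens, n):
    """Counter of all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu_like(candidate, reference, max_n=4):
    """BLEU-style score in [0, 1]: modified n-gram precision x brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Modified precision: each reference n-gram may be matched only once.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        log_precisions.append(math.log(max(overlap, 1e-9) / total))
    # Brevity penalty discourages candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(log_precisions) / max_n)


def ter_like(candidate, reference):
    """TER-style score: word-level edit distance / reference length (no shifts)."""
    cand, ref = candidate.split(), reference.split()
    prev = list(range(len(ref) + 1))
    for i, cw in enumerate(cand, 1):
        curr = [i]
        for j, rw in enumerate(ref, 1):
            cost = 0 if cw == rw else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / max(len(ref), 1)


if __name__ == "__main__":
    ref = "the filter of the pump is cleaned first"
    hyp = "the pump of the filter is cleaned first"
    # High lexical overlap and few edits, yet the meaning is inverted:
    print(f"BLEU-like: {bleu_like(hyp, ref):.3f}")
    print(f"TER-like : {ter_like(hyp, ref):.3f}")
```

The example reuses the 'filter of the pump' inversion cited later in Table 4: both scores look reasonable although the meaning is reversed, which is exactly the blind spot the human evaluation in Section 3.2 is meant to cover.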

3.1.3. Reference texts and their impact on metrics

As explained in Section 3.1.1, the output of automated metrics cannot be considered as a measure of absolute quality. The scores are strongly influenced by a series of factors that have to be taken into consideration when creating a reference test set for evaluation and when interpreting the results.

All of the above metrics compare a human translation to raw machine-translated output. While each has its strengths and weaknesses, many of the metrics become less meaningful if the test set is not carefully selected for the specific purpose of measurement. For example, if a test set is full of sentences that are 500 words long, then the complexity of the sentences will override any ability to understand the metrics. Some of those factors are listed below, together with a short explanation of the impact they might have on the scores.

- Blindness of test set: a test set must be blind. This means the sentences included in the test set must be removed from the training material in the case, for example, of statistical machine translation systems; if they are not removed, one would be evaluating recall of an existing text and not the machine translation results (a minimal filtering sketch follows this list).

- Language pair: assuming a comparable amount and comparable quality of training materials for statistical machine translation and a unique target language (i.e. English), automated metrics give very different scores for the translation output from different source languages. Asian to English language combinations can score much lower than Western European to English language combinations. This can indeed be a reflection of the fact that state-of-the-art rule-based and statistical MT technologies still face strictly linguistic issues. These have to do especially with word reordering between languages with very different sentence structures and with disambiguation of information where the source language lacks morphological richness (i.e. indication of singular and plural or subject–verb concordance). Expectations on scores cannot realistically be represented by any default value across language pairs.

- Number of reference texts: there are multiple ways of expressing the same concept, multiple words for defining an object and multiple construction modes for the same sentence. In many cases none of these ways can be considered to be better than the others. MT systems might be penalized by automated scores, even with a fully correct output, when said output is very different in sentence construction or synonym choice from the reference text. If multiple reference texts are available, scores might be higher, because more of these possibilities can be covered.

- Quality of human translated references: there is almost no doubt that human translation is a better product than machine translation; nevertheless, even human translators can make mistakes. Some translators, moreover, are better than others in interpreting the source text and rendering it in the target language. Although the test set should comprise only 'gold standard' human translations, quality can vary considerably.

- Syntactic complexity of the sentences in the test set: MT performs at its best with source text that is 'controlled' in vocabulary, grammar and style. Patents knowingly make use of very convoluted sentence structures and neologisms [17]. When choosing an oversimplified test set for the evaluation of MT results, the output of the evaluation might be biased. The system will probably not perform as expected on more complex, 'everyday' patent sentences.

- Length of the sentences in the test set: similarly to the previous point, shorter, simple sentences have a higher chance of scoring higher than longer, complex ones. The longer the sentence, the more ways the same concepts can be expressed in the translation (and, as a result, the less chance that the words in the MT segment will match the reference).

- Number of sentences in the test set: a short test set has little relevance as there is not enough information to measure. In order for the test set to be statistically significant, a test set for automated measurements has to comprise at least 1000 sentences.

- Text typology used in the test set, in terms of domain and style: the test set has to correspond, in domain and style, to the goal of the specifically customized/trained MT system. If different MT systems have been trained for supporting documents belonging to, e.g., different IPC Sections, each different domain should have its own test set. It is also interesting to note that, if a statistical patent MT system is trained on only titles and abstracts, different results might be obtained when evaluating a test set containing sentences from only titles and abstracts versus a test set also containing sentences from descriptions and claims. LexisNexis chose to include samples from all of the text elements (titles, abstracts, description, claims), despite the fact that the systems were mostly trained only on available bilingual elements (titles, abstracts, sometimes claims), since this approach would give a fair idea of the overall quality obtainable when translating the complete full text.

- Volume, quality and domain-pertinence of training materials: especially with regard to statistical MT, the amount of training material available for a language pair, its quality and domain-pertinence will have an impact on final evaluation results.

- Translatability of test set segments: the test set should not contain large numbers of segments with non-translatable items such as mathematical formulas. Test set segments should be based primarily around words that are to be translated.
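The blindness requirement in the first factor above is easy to automate. A minimal sketch, assuming the training data is available as (source, target) sentence pairs; the function name and data layout are illustrative, not part of the LexisNexis workflow:

```python
def make_blind_training_set(training_pairs, test_sources):
    """Drop every training pair whose source sentence also occurs in the test set,
    so that automated scores measure translation quality rather than recall.

    training_pairs: iterable of (source_sentence, target_sentence) tuples
    test_sources:   iterable of source sentences belonging to the evaluation set
    """
    held_out = {s.strip() for s in test_sources}
    return [(src, tgt) for src, tgt in training_pairs if src.strip() not in held_out]


# Hypothetical usage with two training pairs and a one-sentence test set:
train = [("die Pumpe ist sauber", "the pump is clean"),
         ("der Filter ist neu", "the filter is new")]
test = ["die Pumpe ist sauber"]
print(make_blind_training_set(train, test))  # only the second pair survives
```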

3.1.4. Building a reference text for patents

Building reference texts for patent translations is not as easy as it might seem, and it might be costly as well:

- Ideally one would need to create a test set for each technical domain in order to make sure that the MT quality level is constant across all of the domains. IPC classification offers a good reference matrix to identify domains of expertise: how deep, however, would one need to go in the classification tree? IPC Section seems to be a fair compromise and a realistic goal; if a single test set is created for 'patents' in general, it would be good practice to make sure that it contains random sentences from publications belonging to all of the IPC Sections in order to average the scores of the different domains. The availability of bilingual materials for statistical machine translation training or of glossaries to customize rule-based solutions varies a lot, in fact, per domain and per authority, as does the complexity of some technical texts (i.e. the translation of compounds in the chemical space).

- For the same reason as for domains, all text elements should be covered by a test set: style and syntactic complexity vary considerably among titles, abstracts, descriptions and claims. If the test set is used to perform any human quality evaluation as well, one might want to consider including sentences showing some specific technical features, such as mathematical or chemical formulas and invention part number references. Only a combination of all of these elements can give a realistic expectation of general quality.

- For the building of a test set, high-quality human translations are needed. Where they already exist in the patent space, they are mostly for titles and abstracts only, and are often not one-to-one translations but rewritings of the same content in another form. Moreover, they are sometimes clearly written by non-native English speakers, and this makes them not apt for the purpose. Where human translations do not exist for this purpose, they need to be created by specialized translators.

- For simplicity, the historical variable is not considered. MT is a relatively young technology, and therefore developed for language in its most modern form. Spelling and words might have changed over time, and older ways of writing might be 'unknown' to the system. Unless the translation system has been built by keeping this variable in mind, older documents should not be selected for the reference test set. If segments from older documents are included in the test set, one should account for possible lower scores.

The test set used by LexisNexis for a specific language pair consists of:

- 1000 sentences in the source language, where the sentences:
  - are selected across different domains of expertise (different IPC Sections)
  - are selected across different text types (sentences belonging to titles, abstracts, description and claims)
  - are selected among the most recent documents, for eliminating the historical variable, out of simplification reasons
  - vary in syntactical complexity and length
  - contain some patent-specific patterns, such as chemical elements/formulas or mathematical formulas

- a single human reference translation per sentence, where the translation:
  - is not necessarily originated by the same translator
  - is sometimes derived from existing translations (i.e. high-quality abstracts)
  - has sometimes been newly generated (i.e. description or claim sentences)
  - is rigorously eliminated from the training corpus for statistical MT, thus avoiding any bias.

3.1.4.1. Automated metrics at LexisNexis

At LexisNexis, case-insensitive BLEU has been adopted as the automated score, since it is the most well-known and widespread in the IP world. Also, while capitalization is important, LexisNexis is more concerned with the vocabulary used in the translations than with capitalization. F-Measure and TER are sometimes calculated next to the BLEU, in order to confirm the validity of the BLEU results.

The scores are used first of all as an internal development tool and as a means of internally benchmarking the LexisNexis system against other commercial or free products, when possible.

A software suite developed by Asia Online™ enables LexisNexis to generate reports based on the selected quality metrics (Fig. 1).

Fig. 1. Asia Online automated quality metrics tool.
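The reporting suite itself is proprietary, but a comparable case-insensitive corpus BLEU, with TER as a cross-check, can be reproduced with open-source tooling. A minimal sketch assuming the sacrebleu Python package (an assumption for illustration; it is not the Asia Online software shown in Fig. 1), with made-up example sentences:

```python
# pip install sacrebleu
from sacrebleu.metrics import BLEU, TER

# One hypothesis stream and one reference stream (a single sentence each here).
hypotheses = ["The brush has a second bristle field with softer bristles."]
references = [["The brush may have a second field of softer bristles."]]

bleu = BLEU(lowercase=True)   # case-insensitive scoring, as adopted at LexisNexis
ter = TER()

print(bleu.corpus_score(hypotheses, references))   # prints a corpus-level BLEU score
print(ter.corpus_score(hypotheses, references))    # lower TER means fewer edits needed
```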

3.1.5. When is quality 'good enough'?

Before beginning with development or enhancement of a system, an 'expected score' is defined, based on the considerations of Section 3.1.3. The language pair, the type of test set and the comparison with the average scores reached at that moment in time by other MT systems of different technology (both rule-based and statistical) on the same test sets help LexisNexis to build a realistic reference framework. The score reached by the newly developed or enhanced system has to be considerably higher than the one reached by the existing system in order to justify a change in the production process.

During development of the first Japanese–English engine, for example, LexisNexis defined a BLEU score of 35 as the minimum expected quality level for the type of test set that had been built. The quality delivered by other commercial and free MT systems for Japanese at the time of development was rated between a minimum score of 9 and a maximum score of 18 BLEU points.

Whereas the 'expected score' aims at defining a threshold of acceptability for entering production with a newly developed or enhanced system, we believe it still remains an automated metric. As such, the outcome of this one metric alone will not be enough to decide production-readiness. Automated measures can be used as a good indication of terminology correctness and terminology completeness (and therefore of searchability). However, they are not as reliable in quantifying readability aspects. Common readability metrics, such as Flesch–Kincaid [18], have been considered, but they are not much of a help, since they are based on the core concepts of word length and sentence length, which, by definition, are typical of patents. Moreover, they apply to only one language and they cannot provide any information on whether the target text of a translation respects the logical connections of the source sentence constituents. This can only be determined by people skilled in both source and target language, or at least people with knowledge of the target language and provided with a reference translation. The current state of the art in translation quality measurement research does not provide any well-consolidated, fully automated tool to perform such tasks.

3.2. Human evaluation

With the goal of testing the correlation between quality acceptability in terms of automated scores and of human perception, and in consideration of the shortcomings of automated evaluations, especially with regard to semantic rendering of the content and word order/readability, LexisNexis has developed a human evaluation framework aiming at analyzing the following items:

1. Terminology
2. Missing information
3. Added, non-pertinent information
4. Word order

The first three categories (which cover respectively correctness of terminology, completeness of information and redundancy of information) aim at assessing the adequacy of the system for searching purposes. The second and third categories, in particular, address issues that are quite common in the statistical machine translation space, namely the dropping of important pieces of information or the adding of non-pertinent ones in the machine-translated text.

The fourth category (word order) aims at assessing general text readability. The fluency and readability of a text are mostly determined, from our human perspective, by the order of the phrases in a sentence and the correctness of the logical connectors.

LexisNexis felt the need to specify even further what the four categories would exactly cover, in order to avoid any ambiguity in the interpretation of the evaluation rules:

1. Terminology:
   a. Wrong terminology does not mean synonyms or morphological variants (singular versus plural), but only wrong semantic renderings of a specific term (i.e. the reference human translation is 'traffic control method', whereas MT produces 'communication control means').
   b. Unknown terminology includes words or phrases flagged as 'unknown' and left in the source language or transliterated.
   c. Clear terminology means that the terminology choice performed by the system does not generate any ambiguous interpretations and correctly transfers the meaning of the source term.

2. Missing information:
   a. Missing information identifies text which is present in the source or reference English translation, but not in the machine translation, and which is essential to the transfer of meaning between the source and target language and, therefore, to the intelligibility of the machine-translated text (missing articles, for example, do not fall into this category; missing prepositions could, however, be recognized as 'missing information' whenever their presence is necessary to transfer the correct meaning in the translated text).

3. Added/non-pertinent information:
   a. Added/non-pertinent information refers to an issue that is quite common in statistical machine translation engines and that originates from incorrect word segmentation: some non-pertinent words or phrases are added to the translation, but they do not correspond to any text in the source and reference translation (i.e. the reference human translation is 'Items K1–K4 are sorted in ascending order', whereas MT produces 'Items K1–K4 obstinacy are sorted in ascending order'; the word 'obstinacy' is non-pertinent and incorrectly added to the MT output).

4. Word order:
   a. Wrong word order does not mean 'different from reference text', but 'causing the message to be semantically different from the one conveyed by the original or reference text'.

LexisNexis decided, in the first evaluation phases, to ignore morphology and morphosyntax errors, because they were not relevant in terms of gisting and cross-language retrieval.

Human evaluation is performed on the same test set for which automated scores are calculated. Due to time constraints, only text generated with the system developed at LexisNexis is analyzed. Third-party translations are not included.

A minimum of one internal and one external evaluator is required in order for the quality assessment to be as objective as possible. Human evaluators are selected among people with knowledge of translation processes and of technical documentation and, preferably, with experience in the MT field. Translators are not chosen for this task, unless familiar with MT technologies, since they are often too strict in judging the linguistics of MT results, possibly losing the primary focus of searchability and 'gisting'.

3.2.1. Evaluation tooling

In order to facilitate the scoring operations, Asia Online provides a quality assessment tool that enables the users to define the error categories in a configuration file, to specify whether a specific error category should be measured with reference to the machine-translated candidate text (i.e. added/non-pertinent information) or to the source/reference text (i.e. missing term), and whether the error count per category needs to take into consideration single words or to group them into a single 'error' (see Fig. 2).

When the evaluator underlines the erroneous words in the translation table, the software tool automatically computes the number of errors, based on the information of the configuration file, and assigns the underlined words a color coding. One or more error categories can be assigned to the same words, if applicable.


Fig. 2. Asia Online human quality assessment tool – configuration file.

Fig. 3. Asia Online human quality assessment tool – editing window.


LexisNexis evaluates categories 1, 2 and 3 (Terminology, Missing Information, Added/Non-Pertinent Information) via the quality assessment tool; word order is evaluated separately in an Excel® file, because of the difficulty of manually and consistently identifying 'shifts' in sentences. The evaluators are required to score each sentence for word order as defined in Section 3.2.2.

Moreover, three extra categories are included in the configuration file, with the goal of excluding sentences so marked from the calculation of an average score:

5. Spelling mistake/possible OCR mistake
6. Error in reference text
7. Cannot understand reference text

Category 5 is required for excluding OCR materials unintentionally included in the test set. This ensures that the MT system is not penalized unfairly. The same is valid for category 6 (errors in the reference text): even the human translation reference might contain mistakes. Finally, category 7 is needed to avoid bad scoring for MT even when the quality of the reference human translation is so compromised that it causes the evaluators not to understand it. That would point, in fact, to an unclear source text, more than to an error of the MT system.

Figure 3 shows the editing window of the Asia Online human quality assessment tool. The columns display, from left to right, the numerical index of the sentences in the test set, the original Japanese text, the human-translated English reference, the machine translation of the same text, the total number of words in the machine-translated text, and the counts of words identified by the evaluators as errors according to the defined categories, with the same color coding as in Fig. 2.

3.2.2. Evaluation scores

Every sentence is assigned a score per defined category (except for categories 5, 6 and 7, which are just meant for excluding erroneous source, training materials or reference translations from the final averages), according to the scheme in Table 2 (inspired by a Microsoft MT evaluation white paper [19] and modified for this specific usage).


Table 2. Sentence scoring per error category.

- Unacceptable: 1
- Possibly acceptable: 2
- Acceptable: 3
- Ideal: 4

Table 3. Correlation of error percentages to sentence scores (for PW, PM and PA).

- ≥30%: score 1
- ≥20% and <30%: score 2
- ≥10% and <20%: score 3
- <10%: score 4

Table 4. Guidelines for human scoring of sentence word order (WO).

- Score 1: Word order is so compromised that it is impossible to understand the message other than by reading the reference text, even if terminology is correctly rendered, or word order is so compromised that text interpretation is misleading (i.e. specifications are wrong, as in 'A of B' versus 'B of A', e.g. 'the filter of the pump' versus 'the pump of the filter').
- Score 2: Word order is correct only in sentence chunks, but overall sentence structure is still compromised. It is possible to understand what the sentence is about; however, the interrelations of phrases are not completely coherent and understandable.
- Score 3: Errors in word order amount to more or less between 10% and 20% of shifted words on total sentence length.
- Score 4: MT and reference text convey the same message.

Table 5. Acceptance level based on human assessment.

- Unacceptable: average of final scores <3
- Acceptable: average of final scores ≥3


In order to make the scores as objective as possible, the following is agreed.

Given per sentence:

- t = total number of words in the machine-translated text
- w = number of wrong words
- m = number of missing words
- a = number of added words
- T = total number of words (including correction factor m)
- P = percentage weight of one word in the translated text
- PW = percentage of wrong words
- PM = percentage of missing words
- PA = percentage of added words
- WO = word order

the following is calculated:

- T = t + m
- P = 1/T
- PW = w/T
- PM = m/T
- PA = a/T

The error percentages are then coupled to a score as in Table 3. With regard to the human scoring of word order (WO), some guidelines have been developed for the evaluators to follow (see Table 4).

An average score is calculated per error category with the goal of performing root-cause analysis (i.e. identifying the lack of training materials or domain terminology in some fields).

A final score is assigned to each sentence as the lowest score among PW, PM, PA and WO, and an average final score for the test set is then calculated. Only this general score is used for further calculations. The general acceptance level for LexisNexis is computed as the average of the final scores given by the single evaluators.

Sentences containing spelling/OCR mistakes, errors in the reference text or having an incomprehensible reference translation (error categories 5, 6 and 7) are eliminated from the scoring set.

Finally, when the lowest score among PW, PM, PA and WO is <3, the corresponding sentence is flagged as 'Unacceptable' (see Table 5). An acceptance percentage is calculated and root causes are analyzed by looking at the average scores of the single evaluation categories.
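A minimal sketch of this per-sentence arithmetic, assuming the error counts have already been produced by the evaluators; variable names mirror the definitions above and the thresholds follow Tables 3 and 5 (the example numbers are hypothetical):

```python
def percentage_score(p):
    """Map an error percentage (PW, PM or PA) to a 1-4 score as in Table 3."""
    if p >= 0.30:
        return 1
    if p >= 0.20:
        return 2
    if p >= 0.10:
        return 3
    return 4


def sentence_scores(t, w, m, a, wo):
    """Score one sentence.

    t  - total number of words in the machine-translated text
    w  - number of wrong words
    m  - number of missing words
    a  - number of added words
    wo - word-order score (1-4) assigned by the evaluator per Table 4
    """
    T = t + m                                # total including correction factor m
    pw, pm, pa = w / T, m / T, a / T
    scores = {"PW": percentage_score(pw),
              "PM": percentage_score(pm),
              "PA": percentage_score(pa),
              "WO": wo}
    scores["final"] = min(scores.values())   # the lowest category score wins
    return scores


# Hypothetical sentence: 50 words, 3 wrong, 2 missing, 0 added, word order scored 2.
s = sentence_scores(t=50, w=3, m=2, a=0, wo=2)
print(s)   # final score 2, i.e. below 3 and therefore 'Unacceptable' per Table 5
```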

The results of this human evaluation help determine the production-readiness of a system in case of inconsistent results between automated and human evaluation (see Table 6; a compact restatement of the rule is sketched after the table).

- Should the BLEU score reach the defined threshold but the human evaluation be lower than the established level, the quality is considered Unacceptable. In this case, an analysis will be initiated on the possible root causes, and corrective actions will be planned.

- Should the human evaluation reach the threshold but the BLEU score be lower than the established level, quality can be considered Acceptable or Unacceptable based on the distance of the BLEU score from the expected score (LexisNexis defined a range of 5 BLEU points under the threshold as a reasonable variance range).

Table 6. Acceptance matrix based on automated and human judgment.

- BLEU Acceptable, human evaluation Acceptable. Final evaluation: Acceptable. Initiated actions: engine is production-ready. Production-ready: Yes.

- BLEU Unacceptable, human evaluation Unacceptable. Final evaluation: Unacceptable. Initiated actions: root-cause analysis, corrective actions. Production-ready: No.

- BLEU Acceptable, human evaluation Unacceptable. Final evaluation: Unacceptable. Initiated actions: root-cause analysis, corrective actions. Production-ready: No.

- BLEU Unacceptable, human evaluation Acceptable, BLEU reasonably near the threshold (max 5 points lower). Final evaluation: Acceptable. Initiated actions: engine is production-ready, root-cause analysis, corrective actions. Production-ready: Yes.

- BLEU Unacceptable, human evaluation Acceptable, BLEU more than 5 points lower than the threshold. Final evaluation: Unacceptable. Initiated actions: root-cause analysis, corrective actions. Production-ready: No.
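The same rule can be stated as a couple of comparisons. A minimal sketch, assuming a BLEU threshold defined upfront per Section 3.1.5, a human acceptance threshold of 3 (average of final scores, Table 5) and the 5-point variance range above; the function name is illustrative:

```python
def production_ready(bleu, human_avg, bleu_threshold, human_threshold=3.0, variance=5.0):
    """Combine automated and human judgments as in Table 6.

    bleu           - BLEU score of the engine on the test set
    human_avg      - average of the final human-evaluation scores
    bleu_threshold - expected BLEU score defined before development
    variance       - tolerated BLEU shortfall when the human evaluation is acceptable
    """
    human_ok = human_avg >= human_threshold
    if not human_ok:
        return False                                # root-cause analysis, corrective actions
    if bleu >= bleu_threshold:
        return True                                 # engine is production-ready
    return bleu >= bleu_threshold - variance        # acceptable if BLEU is reasonably near


# Hypothetical JP-EN example: threshold 35, measured BLEU 32, human average 3.4.
print(production_ready(bleu=32, human_avg=3.4, bleu_threshold=35))   # True: within 5 points
```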


Table 7. Sample human evaluation on a test set of 1000 sentences for JP–EN.

Searchability:
- Average score based on PM: 3.866
- Average score based on PA: 3.942
- Average score based on PW: 3.956
- Average score based on PM, PA and PW: 3.921
- Average lowest score based on PM, PA and PW: 3.776
- Final average score based on PM, PA and PW, where sentences with spelling mistakes, errors in the reference text or incomprehensible reference texts have been excluded: 3.921
- Final average lowest score based on PM, PA and PW, where sentences with spelling mistakes, errors in the reference text or incomprehensible reference texts have been excluded: 3.773
- Acceptance percentage for searchability categories: 94.36%

Readability:
- Average score based on word order: 2.495
- Average score based on PM, PA, PW and WO: 3.565
- Average lowest score based on PM, PA, PW and WO: 2.438
- Final average score based on PM, PA, PW and WO, where sentences with spelling mistakes, errors in the reference text or incomprehensible reference texts have been excluded: 3.575
- Final average lowest score based on PM, PA, PW and WO, where sentences with spelling mistakes, errors in the reference text or incomprehensible reference texts have been excluded: 2.470
- Acceptance percentage for searchability + readability: 45.04%

Fig. 4. Feedback section for MT on the Espacenet website (Source: the EPO, Espacenet, Patent Translate).



Table 7 shows an example of results obtained by humanly evaluating a test set of 1000 sentences for JP–EN, for which a BLEU score of around 38 was calculated. The human evaluation was performed by two evaluators (one internal and one external), working independently and not sharing results throughout the whole evaluation time. The results were consistent between the evaluators and showed a good correlation with the BLEU score for error categories related to terminology correctness and completeness (related to searchability), with an acceptance rate of 94.36%. Both evaluators agreed on the fact that the engine excelled in transferring terminology from the source to the target: the translation of noun phrases was of very high quality, usually pertinent and therefore very useful for retrieval. However, the separate evaluation of the word order category was very useful to point out remaining issues in readability and in deciding how to prioritize the corrective actions. A linear relation was observed between the word order score and the length of the sentences in the test set: the longer the sentence, the lower the score.

The evaluation resulted in action items and a corrective plan for the improvement of the word order issue. Even after the improvements, word order still represents the major challenge for translation from Japanese to English.

4. Conclusions

In no case, either in or outside the patent world, can MT beconsidered a substitute for human translation evenwhen its qualityalmost approximates the quality delivered by a human translator(which is a realistic goal at present). Being based on statistics,a definite set of grammar rules, the coding of dictionaries ora combination of these, MT cannot be considered to be fully andinfallibly able, at the current state of the art, of performing the textinterpretation and disambiguation that are needed for renderingthe meaning of a text and the logical connections among itsconstituents. This limitation is even accentuated by the complexfeatures typical of patent language explained in Section 2.1.

Where humans have difficulties in deciphering patents,a machine has even more difficulties. Whatever automatedmeasurement is adopted it will fail to evaluate the intelligibilityaspect, because no automated metric is able, at the moment, tomeasure whether the logical connections among the constituentsin the source language are retained in the target language. Theresults of automated evaluation, moreover, might be biased bydifferent factors, which need to be taken into consideration whendeveloping a test set.

LexisNexis and Asia Onlineworked together to develop a humanevaluation assessment framework to address the shortcomings ofautomated evaluation. The error categories meant to assesssearchability (terminology completeness and correctness) showa correlation with the results delivered by the BLEU score, whena threshold BLEU score for acceptance is defined upfront based onthe defined criteria and observations. However, intelligibility andreadability can only be assessed and scored by humans.

Despite the very good and useful tools provided by Asia Online, which greatly facilitate the work of human evaluators, human quality assessment still remains a costly and time-consuming operation, and performing it at this level of detail has not always proven to be feasible at LexisNexis.

Moreover, some specific enhancements (such as terminology coverage when resolving unknown words) cannot be measured by spot checking a test set of 1000 sentences, and only make sense when calculated on a large scale: LexisNexis therefore adopted additional automated metrics for evaluating translation completeness (i.e. % non-translated paragraphs, % of absolute and unique unknown words), as well as measures of system performance, system stability and technical output validity (i.e. XML validation).
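Large-scale completeness and validity checks of this kind lend themselves to straightforward automation. The sketch below is an illustration only, under assumed conditions: the heuristic of flagging a paragraph as non-translated when the output is identical to the input, the list of untranslated tokens and the data structures are all assumptions, not LexisNexis' production pipeline.

```python
# Illustrative completeness and output-validity checks (assumed data formats).
import xml.etree.ElementTree as ET

def completeness_metrics(source_paragraphs, target_paragraphs, untranslated_tokens):
    """% non-translated paragraphs plus absolute and unique unknown-word counts."""
    untranslated = sum(1 for s, t in zip(source_paragraphs, target_paragraphs)
                       if s.strip() == t.strip())
    pct_untranslated = 100.0 * untranslated / max(len(source_paragraphs), 1)
    return pct_untranslated, len(untranslated_tokens), len(set(untranslated_tokens))

def is_valid_xml(xml_string):
    """Technical output validity: is the delivered XML at least well-formed?"""
    try:
        ET.fromstring(xml_string)
        return True
    except ET.ParseError:
        return False

src = ["装置は回転軸を備える。", "第二の実施形態"]
tgt = ["The device comprises a rotating shaft.", "第二の実施形態"]  # left untranslated
print(completeness_metrics(src, tgt, ["実施形態", "実施形態"]))      # (50.0, 2, 1)
print(is_valid_xml("<doc><p>The device comprises a rotating shaft.</p></doc>"))  # True
```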

We believe the European Patent Office's recent approach of asking users to rate translations on the Espacenet website is another demonstration of the importance of human feedback in the MT quality evaluation process, and a much easier and more feasible way of using evaluators' input for improving the MT output (see Fig. 4). It hints, moreover, that MT quality should be defined as the degree of usability of its output for a community-determined purpose.

Acknowledgments

1. Kindly reviewed by: Richard Garner, Product Director – IP Research Solutions, LexisNexis, and Eric D.F.D. van Stegeren, Senior Director Global Program Management & MD, LexisNexis.

2. LexisNexis is a registered trademark of Reed Elsevier Properties Inc., used under license. PatentOptimizer is a trademark of LexisNexis, a division of Reed Elsevier Inc. Other products or services may be trademarks or registered trademarks of their respective companies.

References

[1] Krikke J. Machine translation – inching toward human quality. IEEE Intelligent Systems 2006;21(2):4-6.
[2] O'Brien S. Controlling controlled English: an analysis of several controlled language rule sets. In: Proceedings of EAMT/CLAW, Dublin; 2003.
[3] Booth JM, Gelb J. Optimizing OCR accuracy on older documents: a study of scan mode, file enhancement, and software products. USGPO. Available from: http://www.gpo.gov/pdfs/fdsys-info/documents/WhitePaper-OptimizingOCRAccuracy.pdf; 2006 [accessed 12.09.12].
[4] Magdy W, Jones GJF. An efficient method for using machine translation technologies in cross-language patent search. In: 20th ACM conference on information and knowledge management (CIKM 2011), Glasgow; 2011.
[5] Kadri Y, Nie JY. A comparative study for query translation using linear combination and confidence measure. In: IJCNLP 2008: third international joint conference on natural language processing, Hyderabad; 2008.
[6] Oliveira F, Wong F, Leong KS, Tong CK, Dong MC. Query translation for cross-language information retrieval by parsing constraint synchronous grammar. In: Proceedings of 2007 international conference on machine learning and cybernetics, Hong Kong; 2007.
[7] Oard D. A comparative study of query and document translation for cross-language information retrieval. In: Proceedings of the 3rd conference of the association for machine translation in the Americas, Langhorne; 1998.
[8] Adams S. The text, the full text and nothing but the text: part 2 – the main specification, searching challenges and survey of availability. World Patent Information 2010;32(2):120-8.
[9] Zhang Y, Sun L, Du L, Sun Y. Query translation in Chinese-English cross-language information retrieval. In: Proceedings of the 2000 joint SIGDAT conference on empirical methods in NLP and very large corpora, Hong Kong; 2000.
[10] Gao J, Nie JY, Xun E, Zhang J, Zhou M, Huang C. Improving query translation for cross-language information retrieval using statistical models. In: Proceedings of the 24th annual international ACM SIGIR conference on research and development in information retrieval, New Orleans; 2001.
[11] Papineni K, Roukos S, Ward T, Zhu WJ. BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the association for computational linguistics (ACL), Philadelphia; 2002.
[12] Doddington G. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. Available from: http://www.itl.nist.gov/iad/mig//tests/mt/doc/ngram-study.pdf [accessed 10.01.12].
[13] Banerjee S, Lavie A. METEOR: an automated metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the ACL 2005 workshop on intrinsic and extrinsic evaluation measures for MT and/or summarization, Ann Arbor; 2005.
[14] Turian JP, Shen L, Melamed ID. Evaluation of machine translation and its evaluation. Available from: http://nlp.cs.nyu.edu/publication/papers/turian-summit03eval.pdf [accessed 10.01.12].
[15] Carterette B, Voorhees EM. Overview of information retrieval evaluation. In: Lupu M, Mayer K, Tait J, Trippe AJ, editors. Current challenges in patent information retrieval. Springer; 2011. p. 69-85.
[16] Snover M, Dorr B, Schwartz R, Micciulla L, Makhoul J. A study of translation edit rate with targeted human annotation. In: Proceedings of the 7th conference of the association for machine translation in the Americas, Cambridge; 2006.
[17] Verberne S, D'hondt E, Oostdijk N, Koster CHA. Quantifying the challenges in parsing patent claims. In: Proceedings of the 1st international workshop on advances in patent information retrieval (AsPIRe), Milton Keynes; 2010.
[18] Kincaid JP, Fishburne RP, Rogers RL, Chissom BS. Derivation of new readability formulas (automated readability index, fog count and Flesch reading ease formula) for navy enlisted personnel. Research Branch Report 8-75, Naval Technical Training, U.S. Naval Air Station Memphis, Millington, TN; 1975.
[19] Coughlin D. Correlating automated and human assessments of machine translation quality. In: MT Summit IX, New Orleans; 2003.
[20] Fujii A, Utiyama M, Yamamoto M, Utsuro T. Evaluating effects of machine translation accuracy on cross-lingual patent retrieval. In: Proceedings of the 32nd annual international ACM SIGIR conference on research and development in information retrieval, Boston; 2009.
[21] Fujii A, Utiyama M, Yamamoto M, Utsuro T. Toward the evaluation of machine translation using patent information. In: Proceedings of the 8th conference of the association for machine translation in the Americas, Waikiki; 2008.
[22] Wang D. Chinese to English automatic patent machine translation at SIPO. World Patent Information 2009;31(2):137-9.
[23] Cavalier T. Perspectives on machine translation of patent information. World Patent Information 2001;23(4):367-71.
[24] Jin Y. A hybrid-strategy method combining semantic analysis with rule-based MT for patent machine translation. In: Proceedings of the 6th international conference on natural language processing and knowledge engineering, Beijing; 2010.

Laura Rossi graduated cum laude in English and German Languages and Literature for Management and Tourism in 2002 at the Università Cattolica del Sacro Cuore of Milan, with a thesis on the application of machine translation technology in the software localization field. After that she joined the localization world by working for two multinational companies in the Netherlands (Océ-Technologies B.V. and Medtronic, Inc.), where she covered different roles in a variety of areas: from rule-based machine translation customization, terminology extraction, terminology management and controlled language to the localization and release process workflow for technical documentation, training materials, marketing materials and software applications. In 2009 Laura embraced a new challenge at LexisNexis Univentio, where she now works as Business Systems Analyst in the field of translation technologies. She is instrumental for Univentio's direction of high-scale/high-quality machine translation and she leads the introduction of new languages and technology. Her thorough knowledge of the field is also utilized within the corporation to introduce and implement new Machine Translation projects.

Dion Wiggins is a highly experienced ICT industry visionary, entrepreneur, analyst and consultant in the fields of software development, architecture and management, as well as having an in-depth understanding of Asian ICT markets. Previously Dion was Vice President and Research Director for Gartner based in Hong Kong, where he was the most senior analyst. Dion's research reports on ICT in China helped change the way the world views this market. Dion is also a well-known pioneer of the Asian Internet industry, being the founder of one of Asia's first ever ISPs (Asia Online in Hong Kong). In his role at Gartner and in various other consulting positions prior to that, Dion advised hundreds of enterprises on their ICT strategy. Dion was a founder of The ActiveX Factory, where he was the recipient of the Chairman's Commendation Award, presented by Microsoft's Bill Gates, for the best showcase of software developed in the Philippines. The US Government has recognized Dion as being in the top 5% of his field worldwide and he is a former holder of a US O1 Extraordinary Ability Visa.
