IRF Symposium 2007, Vienna, Austria, November 8-9, 2007, Marriott Hotel
Presentation: Machine Translation Chinese-English: Some Experiments
Dr. Barrou DIALLO, Head of Research, EPO
2
EPO Research: The Case of Machine Translation
Our Vision & Mission
MT versus Patents
The Chinese language case
Our Experiments
Our Accomplishments
Perspectives
3
Our Vision & Mission (1/3)
R&D center as a source of Efficiency:
• Efficient Reading
• Accurate Searching
• Fast Granting
Our Vision: Turning Technology into IP Business
4
Our Vision & Mission (2/3)
The EPO Research Department
• Merged in March 2007 into a new Information Management structure; became "horizontal"
• Located in The Hague, Netherlands
• Large portfolio of academic contacts (labs, universities)
• Entry point for testing and evaluating industrial solutions since 1990
• Partnerships with international institutions (WIPO, EC)
• Strong background in mathematics, algorithms, and data structures
• Network of active users and testers inside the EPO
5
Our Vision & Mission (3/3)
• Coordinating research initiatives across departments
• Technology watch and green-field research
• Performing quantitative analysis
• Identifying and communicating business opportunities
• Providing users with sensible options and courses of action
• Ensuring a smooth transition from research to development
• Communicating practices and experiences
• Reporting and advising on technical solutions to decision-makers
Helping to address challenges
7
MT versus Patents: A Strategic Domain Foreseen 5 Years Ago
Lessons learned from the European Machine Translation Programme:
• Needs less investment than expected
• Can re-use existing data and knowledge
• Mature enough to improve efficiency
• Satisfies patent professionals
• Offers a key technology for future language challenges
9
Chinese language case (1/)
• Issue 1: Sentence + Word Segmentation
• Issue 2: Text Reordering
• Issue 3: Alignment + System training
• Issue 4: Translation with proper terms
• Issue 5: Regeneration
10
Example: The Re-ordering Issue
Many years of research on the subject:
• [Brown & al. 93] set the foundations of the SMT approach (use of Bayes' theorem).
• The [Knight 99] approach (Model 3) to word re-ordering does bring some improvement to the target sentence, but it is rather oriented towards French or English structures.
• [Chiang 05] proposes to re-order Chinese sentences by using hierarchical phrase pairs, i.e. phrases that contain subphrases. This produces better results than the traditional phrase-based approach.
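As a reminder of the Bayes-rule formulation that [Brown & al. 93] introduced, the system searches for the English sentence that is most probable given the Chinese source sentence. The symbols e (English candidate), c (Chinese source) and ê (chosen translation) are notation picked for this note; they do not appear on the slides.

```latex
\hat{e} \;=\; \arg\max_{e} P(e \mid c)
        \;=\; \arg\max_{e} \frac{P(c \mid e)\, P(e)}{P(c)}
        \;=\; \arg\max_{e} P(c \mid e)\, P(e)
```

P(e) plays the role of the language model and P(c | e) that of the translation model. The denominator P(c) is dropped because it does not depend on e. Word re-ordering is handled inside the translation model, which is where the approaches listed here differ.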
11
The Re-ordering Issue
Re-ordering: the phrase-based approach
"Australia is diplomatic relations with North Korea is one of the few countries"
12
Re-ordering: Hierarchical-phrase approach (1/2)
Step 1
Step 2
13
Re-ordering: Hierarchical-phrase approach (2/2)
Step 3
"Australia is one of the few countries that have diplomatic relations with North Korea."
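A minimal sketch of how hierarchical phrase pairs can produce the re-ordered sentence, assuming three rules in the spirit of [Chiang 05]. The pinyin tokens, the rule names, the `Leaf`/`Apply` classes and the hand-built derivation are all illustrative assumptions; a real system learns its rules from aligned data and searches for the derivation. The three nested rule applications may correspond to the three steps shown on the slides.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Leaf:
    """A plain phrase pair: source words and their translation."""
    src: str
    tgt: str


@dataclass(frozen=True)
class Apply:
    """A hierarchical rule applied to already-translated sub-phrases."""
    rule: str
    children: Tuple["Node", ...]


Node = Union[Leaf, Apply]

# Each rule pairs a source template with a target template.
# {0} and {1} are the sub-phrase slots (X1, X2); the target side may swap them.
RULES = {
    "yu_X1_you_X2": ("yu {0} you {1}", "have {1} with {0}"),
    "X1_de_X2": ("{0} de {1}", "the {1} that {0}"),
    "X1_zhiyi": ("{0} zhiyi", "one of {0}"),
    "glue3": ("{0} {1} {2}", "{0} {1} {2}"),  # monotone concatenation
}


def realize(node: Node) -> Tuple[str, str]:
    """Return the (source, target) strings generated by a derivation."""
    if isinstance(node, Leaf):
        return node.src, node.tgt
    src_template, tgt_template = RULES[node.rule]
    parts = [realize(child) for child in node.children]
    source = src_template.format(*(p[0] for p in parts))
    target = tgt_template.format(*(p[1] for p in parts))
    return source, target


# Hand-built derivation (an assumption, not the output of a decoder).
relations = Apply("yu_X1_you_X2", (Leaf("Beihan", "North Korea"),
                                   Leaf("bangjiao", "diplomatic relations")))
countries = Apply("X1_de_X2", (relations, Leaf("shaoshu guojia", "few countries")))
one_of = Apply("X1_zhiyi", (countries,))
sentence = Apply("glue3", (Leaf("Aozhou", "Australia"), Leaf("shi", "is"), one_of))

source, target = realize(sentence)
print(source)
print(target)
```

Because each rule carries its own re-ordering, the long-distance swap between the relative clause and "few countries" is expressed once, in the `X1_de_X2` rule, rather than word by word.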
14
Solution? A semi-automatic approach
Computer-Assisted Translation (CAT):
• Use high-quality, manually aligned texts from the bi-text repositories and translation memories of international organizations.
• Use a bilingual ontology to align words or phrases that are not present in the training corpora. Ontologies of patent vocabulary are available in English; a manual Chinese translation of the central concepts could be added gradually, by IPC category.
• Use syntactic rules to improve lexical choices and collocation processing, e.g. the Univ. of Geneva process (Chomsky syntactic parser for English), to guarantee a well-formed final English sentence.
16
Comparison of MT systems: An empirical approach (1/3)
3 systems on the test bench:
• Rule-based system (Systran)
• Statistical system (Language Weaver)
• Hybrid system (CCID prototype)
1 evaluation grid:
• Scores of 1-4
• Usability & Readability criteria
17
Comparison of MT systems (2/3)
Scale: Poor (1), Medium (2), Good (3), Excellent (4)
[Chart: Rule-based MT, Hybrid MT and Statistical MT placed on this scale; positions not recoverable]
18
Comparison of MT systems: An empirical approach (3/3)
• No MT system performs properly; CAT (Computer-Aided Translation) seems necessary
• The hybrid system seems more promising
• Post-editors needed for checking outputs?
• No statistical significance can be reported; further investigations are needed!
19
Readability Tests on Human Translations: Flesch et al.
Designed to indicate how difficult a reading passage is to understand.
There are two tests: Flesch Reading Ease and Flesch-Kincaid Grade Level.
These tests have become a standard and are bundled with popular word-processing programs.
20
Readability Tests on Human Translations: Flesch et al.
Flesch Reading Ease score: 206.835 - (1.015 x ASL) - (84.6 x ASW)
Rates text on a 100-point scale; the higher the score, the easier it is to understand the document (60 to 70 for standard docs).
Flesch-Kincaid Grade Level score: (0.39 x ASL) + (11.8 x ASW) - 15.59
Rates text on a U.S. school grade level. A score of 8.0 means that an eighth grader can understand the document (7.0 to 8.0 for standard docs).
Where:
ASL = average sentence length (# of words / # of sentences)
ASW = average number of syllables per word (# of syllables / # of words)
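The two formulas can be sketched in a few lines of Python. The formulas themselves are the ones given on the slide; the sentence splitter and the vowel-group syllable counter are rough assumptions made for illustration, so scores will differ somewhat from those reported by word processors.

```python
import re


def count_syllables(word: str) -> int:
    """Crude estimate: one syllable per group of consecutive vowels (assumption)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def text_stats(text: str):
    """Return (ASL, ASW) for an English text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    asl = len(words) / len(sentences)   # average sentence length
    asw = syllables / len(words)        # average syllables per word
    return asl, asw


def flesch_reading_ease(asl: float, asw: float) -> float:
    return 206.835 - (1.015 * asl) - (84.6 * asw)


def flesch_kincaid_grade(asl: float, asw: float) -> float:
    return (0.39 * asl) + (11.8 * asw) - 15.59


asl, asw = text_stats("The cat sat. The dog ran.")
print(round(flesch_reading_ease(asl, asw), 2))
print(round(flesch_kincaid_grade(asl, asw), 2))
```

Long sentences raise ASL and technical vocabulary raises ASW, so patent claims written as one very long sentence are pushed towards low (even negative) Reading Ease scores and very high grade levels.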
21
Human Translation assessment: Example (1/2)
CN1926077 The Making and Using Methods of Plant/Soil Activated Liquid
Abstract
In the mineral composition ion water of concentrated sulfuric acid, which add the vegetal leavening confected by enzyme and microbe used to produce enzyme and the muscovado made by sugarcane together, under the aerobic condition, the selective preference is, do the commensalisms cultivation at about 25 Centigrade. After decomposing the sugar, before rot and ferment, the selective preference is, spreading on the leaf surface or pouring in the soil during the alcohol fermenting stage.
Flesch Reading Ease score: 13/100
Flesch-Kincaid Grade level: 17
Score: 7/10
Comments: The Abstract and parts of the claims are convoluted/badly structured in parts and some spelling mistakes.
What's important? Figures or comments?
22
Human Translation assessment: Example (2/2)
CN2354381
Claims
1. A time switch of gas appliances, composing of mechanical gear timer and fuel
gas valve, wherein it also comprises round upper cover board subassembly and lower cover board subassembly, a valve switch knob (4) fixed on the upper end of the valve switch spigot shaft (7) is installed on the front of the upper cover board, the valve switch spigot shaft (7) penetrates through the upper cover board (6) and the lower cover board (29), a timer hollow shaft (8) is installed out of the valve switch spigot shaft (7), the timer hollow shaft (8) penetrates through uthe pper cover board (6), a round time knob (5) is installed between the upper end valve switch knob of the timer hollow shaft and the upper cover board (6), a time indicating dial (3) interlocking with the timer hollow shaft (8) is installed between the round time knob (5) and the upper cover board (6); a mechanical gear timer is installed on the reverse side of the upper cover board (6), an unlocking cam(9) is installed out of the timer hollow shaft (8) in the central part;
Flesch-Kincaid Grade level: 49
Flesch Reading Ease score: -45
Score: 9/10
Comments: Long convoluted sentences. Diagrammatical explanations. Minor grammatical and typo errors.
23
Human vs machine: unfair competition?
Original text:
一种利用相位锁定一捷变频率调制输出信号到梳式发生器形成输出的任何选定信道的装置和方法。跟踪输入信号的相位误差,该输入信号被调制成载波输出频率,和该调制过的输出频率,利用减去该输入信号的方法锁定到梳式发生器输出,并消除该相位误差。
Systran:
One kind to combs the type generator using a phase lock agility frequency modulation output signal to form the output any to designate channel's installment and the method. The track input signal's phase error, this input signal is modulated the carrier output frequency, with should modulate the output frequency, the use subtracts this input signal the method to lock combs the type generator output, and eliminates this phase error
Human translation:
An apparatus and method is disclosed which phase locks a frequency-agile modulated output signal to any selected channel of a comb generated output. The phase error of an input signal is tracked, the input signal is modulated up to a carrier output frequency, and the modulated output frequency is locked to the comb generator output by subtracting the input signal and negating the phase error.
Is such an MT useful?
25
Our Accomplishments (June 2006)
Chinese patents showing priority documents:
• 105000 CN documents with US priorities
• 15000 CN documents with EP priorities
• 15000 CN documents with GB priorities
• 15000 CN documents with EP priorities
• 400 CN documents with WO priorities
A sufficient source for starting up an alignment?
[Chart: # of aligned sentences]
26
Manual Data Cleaning: Dirty texts generate XML failures
CN86103346
Spherical particles of vinyl resins having high bulk density can be prepared by the suspension polymerization process by using as a dispersant an alkyl hydroxy cellulose having a viscosity of from about 1000 to about 100,000 cps. A suitable dispersant is a hydroxypropyl methyl cellulose polymer having the formula: <IMAGE> +TR <IMAGE> where n is from about 300 to about 1500.
Use of XMLSpy Professional to check text
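To illustrate why a stray marker such as <IMAGE> causes a failure, here is a small sketch that uses Python's standard XML parser as a stand-in for the XMLSpy check. The `<abstract>` wrapper element, the `xml_failure` helper and the choice to escape the markers are assumptions made for the example, not the procedure used at the EPO.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def xml_failure(text: str):
    """Wrap text in an element; return the parser error message, or None if well-formed."""
    try:
        ET.fromstring("<abstract>" + text + "</abstract>")
    except ET.ParseError as err:
        return str(err)
    return None


# Fragment of the CN86103346 abstract shown above.
dirty = "having the formula: <IMAGE> +TR <IMAGE> where n is from about 300 to about 1500."

# The unclosed <IMAGE> markers are read as opening tags, so parsing fails.
print(xml_failure(dirty))

# One possible cleaning step: escape the angle brackets so the markers become plain text.
cleaned = escape(dirty)
print(xml_failure(cleaned))
```

Escaping keeps the marker visible in the text; deleting it or replacing it with a placeholder would be equally valid cleaning choices, depending on what the aligner expects.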
27
Methodology of Word Alignment
[OCH93]
28
First Example of alignment
29
Second example of alignment
30
TMX Formatting of aligned texts
<?xml version="1.0" ?>
<!DOCTYPE tmx SYSTEM "tmx14.dtd">
<tmx version="1.4">
  <header creationtoolversion="1.0.0" datatype="plaintext" segtype="sentence" adminlang="EN-US" srclang="EN" o-tmf="txt" creationtool="MetaReadAlign">
  </header>
  <body>
    <tu>
      <tuv xml:lang="EN"><seg>In a preferred embodiment, a low-band isolator network, coupled to the antenna element, provides signal isolation between high-band and low-band signal paths during high-band operation.</seg></tuv>
      <tuv xml:lang="ZH"><seg>NOT DISPLAYABLE</seg></tuv>
    </tu>
  </body>
</tmx>
Provides compatibility with industry standards
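A TMX file of this shape can be generated with Python's standard library, as in the sketch below. The header attribute values are copied from the slide; the `build_tmx` function and the sample sentence pair are hypothetical and only illustrate the structure, not the MetaReadAlign tool itself.

```python
import xml.etree.ElementTree as ET

# ElementTree serializes attributes in this namespace with the "xml:" prefix.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs, src_lang="EN", tgt_lang="ZH"):
    """Return a TMX 1.4 document (as a string) with one <tu> per aligned sentence pair."""
    tmx = ET.Element("tmx", {"version": "1.4"})
    ET.SubElement(tmx, "header", {
        "creationtool": "MetaReadAlign",
        "creationtoolversion": "1.0.0",
        "datatype": "plaintext",
        "segtype": "sentence",
        "adminlang": "EN-US",
        "srclang": src_lang,
        "o-tmf": "txt",
    })
    body = ET.SubElement(tmx, "body")
    for src_text, tgt_text in pairs:
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, src_text), (tgt_lang, tgt_text)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


# Hypothetical aligned pair, for illustration only.
document = build_tmx([("A time switch for gas appliances.", "一种燃气具定时开关。")])
print(document)
```

Building the tree through an XML library, rather than by string concatenation, also escapes any special characters in the segments, which avoids the well-formedness failures that dirty texts cause.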
QUALITY CONTROL PANEL BEFORE ALIGNMENT
Evaluation record CN85108669
Welcome EvaluatorX
Rating options:
• 100% match
• >70% match
• <50% match
• partial translation
• bad translation
• total mismatch
Annotations:
• Radio buttons, multiple entries possible (e.g. partial translation, 100% match), default value "100% match"
• Entries saved on server
• Save Status: saves the status for next time
• Transmit Evaluation
• Reset: resets the complete evaluation process (everything is reset and lost)
• Record Evaluated, Proceed with next: saves the selected buttons for this record and jumps to the next record
• Record Status (evaluated/not evaluated): allows browsing
33
Acknowledgments
EPO Staff experts in Research & Development
Jan Mannekens
Betty Yang
CrossLanguage
Metaread
University of Geneva
Questions?
34
References
[Brown & al. 93] Brown, Della Pietra, Della Pietra, Mercer: The Mathematics of Statistical Machine Translation: Parameter Estimation, Computational Linguistics, vol. 19, no. 2, 1993
[Knight 99] Kevin Knight: A Statistical MT Tutorial Workbook, April 1999
[Chiang 05] David Chiang: A Hierarchical Phrase-Based Model for Statistical Machine Translation, Proceedings of the 43rd Annual Meeting of the ACL, 2005