Google Summer of Code 2011: UOC & Apertium

Post on 18-Nov-2014

Summary of the UOC participation in the Google Summer of Code 2012 together with Apertium.

Lluís Villarejo, Learning Technologies

March 2012

Pre and post editing environment for Apertium

What is GSoC?
• It's a global program that offers student developers stipends to write code for various open source software projects.
• Since 2005
• Inspire young developers to participate in OSS projects.
• Give students more exposure to real-world software development scenarios.
• Get more open source code created and released.
• Help open source projects identify and bring in new developers.

Some participants
• Apache Software Foundation
• Debian
• Facebook
• Drupal
• Creative Commons
• DocBook project
• GCC
• Gnome
• Sakai Foundation
• Mozilla
• Inclusive Design Institute
• The Linux Foundation
• The GNU project
• Wikimedia Foundation
• WordPress
• ...

How does it work?
• Organizations apply to act as mentoring organizations.
• Each organization publishes a list of potential projects and mentors.
• Accepted organizations try to attract students' interest.
• Students write project proposals.
• Google funds a number of slots for each organization (5,000 + 500 USD per slot).
• The project community decides which students are assigned to the slots.
• Work takes place between the end of May and the end of August.

GSoC'11 statistics

• $7.2M budget

• 1115 students accepted from 68 countries

• 2096 mentors and co-mentors from 55 countries

• 175 Open Source organizations

• 18.1% of students have participated in previous years

• 97 countries with student applicants

• 88% overall success rate

Accepted Students GSoC'11

Why participate with Apertium?
• Strategically:
– Apertium is a strategic agent inside UOC.
– Developing Apertium means further developing internationalization aids for UOC.
– Attract and onboard new developers for Apertium.
– Collaboration with Google's Open Source initiatives.
• Functionally:
– Opportunity to address specific UOC needs with external funding.
– Capitalize on specific user feedback on translation quality.

The Apertium case
• 20 proposed tasks
• 17 tasks got interest from students [1-9]
– The pre and post-editing environment attracts 11 interested students.
• The Apertium community ranks the 17 tasks
– The pre and post-editing environment ranks 4th.
• Google assigns 9 slots to Apertium (49,500 USD)
– Our task goes through and Camille Mougey, from the Grenoble Institute of Technology, is selected.

Pre and post-editing, why?
• Many of the errors you get when translating a document are due to deficiencies in the original.
• Integrating existing resources can help ease this burden:
– Digital knowledge sources (digital dictionaries...)
– Automatic tools (spell checker, grammar checker, translation memory generation, search & replace...)
• These processes should be integrated naturally into the translation workflow → the need for an integrated web interface to Apertium.
• To improve the system we need access to the human post-editing process.

Pre and post-editing, features
• Pre and post-editing web interface integrated with the Apertium translation toolbox.
• Spell checking on source and target languages. Integration with Aspell.
• Grammar checking on source and target languages. Integration with LanguageTool.
• Integration with several external dictionaries.
• Search & replace functionalities on source and target languages.
• Ability to deal with formatted text.
• Logging system. All events are logged as they happen, i.e. at the very moment the user inserts or deletes text. This allows a further data mining process to be run on the logs to detect commonly modified structures or vocabulary.
• Translation memory generation. Integration of Maligna.
• PDF translation through pdftohtml.
• Image translation through tesseract.
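The logging feature listed above can be pictured with a minimal sketch. Everything in it is hypothetical: the `EditEvent` fields, the `EditLog` class and the JSON-lines output are assumptions made for illustration, not the format the actual environment uses.

```python
import json
import time
from dataclasses import dataclass, asdict, field


@dataclass
class EditEvent:
    """One user action in the editor (hypothetical schema)."""
    action: str      # "insert" or "delete"
    position: int    # character offset in the text before the action
    text: str        # the text that was inserted or deleted
    timestamp: float = field(default_factory=time.time)


class EditLog:
    """Records events at the moment they happen and can replay them."""

    def __init__(self):
        self.events = []

    def record(self, action, position, text):
        if action not in ("insert", "delete"):
            raise ValueError(f"unknown action: {action}")
        self.events.append(EditEvent(action, position, text))

    def to_json_lines(self):
        # One JSON object per line, convenient for a later mining step.
        return "\n".join(
            json.dumps(asdict(e), ensure_ascii=False) for e in self.events
        )

    def replay(self, original):
        """Apply the logged events to `original` and return the result."""
        text = original
        for e in self.events:
            if e.action == "insert":
                text = text[:e.position] + e.text + text[e.position:]
            else:  # delete
                text = text[:e.position] + text[e.position + len(e.text):]
        return text
```

For example, recording `("delete", 4, "casa")` followed by `("insert", 4, "house")` and replaying the log on `"the casa is big"` yields `"the house is big"`. Storing fine-grained events rather than only the final text is what makes the later mining step possible.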

Final report 2010
Final report 2011

Results & lessons learned
• Fully functional environment, goals accomplished.
• Feedback on human post-editing behaviour is available automatically.
• Jointly defined task (flexible framework provided).
• Effort to build strong empathy with the student.
• Motivated and proactive student.
• Student engagement.
• Very frequent feedback.
• Mentoring team with access to ABSOLUTELY ALL the information regarding the project.

Further work
• Proof of concept accomplished.
• Base platform developed, so further work can be added easily.
• Integration of other resources (more external dictionaries).
• Extension of currently used resources (additional grammar rules, dictionary improvements, wider range of formats).
• Mining of the logged information to gain deeper knowledge of the human post-editing process.
• Use of this mining process to improve the Apertium translation engine.
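One possible shape for the log mining step, as a rough sketch only: compare each machine translation with its post-edited version and count which word spans the human editor replaced most often. The function names and the word-level diff via Python's standard `difflib` are assumptions for illustration; the slides do not say how the mining is to be done.

```python
import difflib
from collections import Counter


def replaced_spans(mt_output, post_edited):
    """Yield (machine, human) word-span pairs that the post-editor replaced."""
    a, b = mt_output.split(), post_edited.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":
            yield " ".join(a[i1:i2]), " ".join(b[j1:j2])


def most_common_corrections(pairs, n=10):
    """Count replacements over many (mt_output, post_edited) sentence pairs."""
    counts = Counter()
    for mt_output, post_edited in pairs:
        counts.update(replaced_spans(mt_output, post_edited))
    return counts.most_common(n)


# Hypothetical sample data: two sentences with the same correction.
sample = [
    ("I have hunger", "I am hungry"),
    ("now I have hunger", "now I am hungry"),
]
print(most_common_corrections(sample, n=1))
# [(('have hunger', 'am hungry'), 2)]
```

Corrections that recur across many documents are candidates for new dictionary entries or transfer rules in the translation engine.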

GSoC 2012

• Mining of the logged information to gain deeper knowledge of the human post-editing process.

• Use of this mining process to improve the Apertium translation engine.

• Post-editing of formatted text.

Thanks
Questions & answers
