transLectures / machine translation 4 education
Languages & the Media, Berlin, November 23rd 2012
Davor [email protected] for All Foundation Ltd
AGENDA
VideoLectures.NET: Content, Statistics, Licenses, Partners
Education: MOOCs, OpenCourseWare Consortium, Opencast Matterhorn
The idea behind: Who, What, Why, When?
transLectures: Pillars, Current status, Results and Demo
VIDEOLECTURES.NET
WHAT IS IT?
VideoLectures.NET is the largest free and open access OER digital library of academic talks. The lectures are given by distinguished scholars and scientists at conferences, summer schools and workshops.
WHAT IS THE CONTENT?
Content built up via European research projects in Computer Science fields. Other content from OCW partners.
WHAT ARE THE STATS?
732 events, 10,512 authors, 13,726 lectures, 15,965 videos
Visits: 9,626,639
Page views: 26,011,939
Signed-in users: 23,560
Licenses: CC-NC-ND
VIDEOLECTURES.NET STATS
THINK OF MOOCs
I enrolled in the MOOC “Intro to Databases”, winter 2011, at Coursera:
108,000 accounts
475,000 assignment submissions
3,150,000 video views (heavy use of video)
Wouldn't it be awesome if all such content and future options were multilingual?
Language personalisation for millions of students
Video, audio, papers, coursework: all multilingual
transLectures THE IDEA BEHIND
WHAT WAS THE REASONING?
Huge set of HigherEd users (undergrads, MA, MSc, PhD)
Huge collection of videos
Videos are made of audio and video
Audio and video are data
Data can be harvested, changed and remixed
WHAT IF?
We capture the audio
Transform it into text
WHAT THEN?
We can have subtitles, transcriptions, translations, personalisation, contextualisation, descriptions, time alignment, fragmentation and recommendations for 15,965 academic talks
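As an illustration of the step from time-aligned text to subtitles, here is a minimal Python sketch that writes transcript segments in the common SRT subtitle format. The `Segment` structure, the function names and the example sentences are assumptions made up for this sketch; they are not taken from the transLectures tools.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One time-aligned piece of a transcript (hypothetical structure)."""
    start: float  # seconds from the start of the lecture
    end: float    # seconds from the start of the lecture
    text: str


def format_timestamp(seconds: float) -> str:
    """Render seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments: list[Segment]) -> str:
    """Turn time-aligned segments into the text of an SRT subtitle file."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{format_timestamp(seg.start)} --> {format_timestamp(seg.end)}\n"
            f"{seg.text}\n"
        )
    # A blank line separates consecutive subtitle blocks.
    return "\n".join(blocks)


# Made-up example data, not taken from any real lecture.
example = [
    Segment(0.0, 2.5, "Welcome to the lecture."),
    Segment(2.5, 6.04, "Today we talk about databases."),
]
srt_text = to_srt(example)
```

Once the text carries timestamps, the same segments could in principle feed translation, search or fragmentation as well; the subtitle file is only the simplest output.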
STATE OF LANGUAGE TECHNOLOGY - MT
Same for: Speech Processing, Text Analysis, Speech and Text Resources
Most of Europe's languages are apparently unlikely to survive in the digital age. (META-NET white paper)
transLectures PRE-TEXT
LEARNERS PREFER VIDEO?
YouTube (78 hours uploaded per minute)
MOOCs (3 million accounts)
INITIATIVES AROUND VIDEO?
Open content: OCW (20,000 courses)
Massive lecture capture system: Opencast Matterhorn project (700 universities)
Massive portals specialized in video lectures: VLN, Polimedia (25,000 academic videos)
transLectures CV
SPECS?
Cost: 4.5 million EUR
Project ref no.: ICT-287755
Project acronym: transLectures
Project full title: Transcription and Translation of Video Lectures
Instrument: STREP
Thematic Priority: ICT-2011.4.2 Language Technologies
Start date / duration: 01 November 2011 / 36 months
WHO?
Universidad Politecnica De Valencia, Xerox, Knowledge 4 All Foundation Ltd., RWTH, European Media Laboratory GmbH, Deluxe Digital Studios Ltd
Opencast Matterhorn, VideoLectures.NET, Polimedia
transLectures IN A NUTSHELL
WHAT IS THE AIM?
To develop innovative, cost-effective solutions to produce accurate transcriptions and translations in VideoLectures.
To deploy those tools across other Matterhorn-related repositories.
For translation, we consider the language pairs en⇆es, en⇆sl, en→fr and en→de.
WHAT IS THE IMPACT?
A big step in making educational repositories truly accessible both to speakers of different languages and to people with disabilities.
ADDITIONAL VALUE?
Imagine having 16,000 lectures in most of the world's languages.
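To make the language pairs from the aim concrete, the following Python sketch expands them into individual translation directions. It assumes that Spanish and Slovenian are paired with English in both directions and that French and German are targets from English only; the string notation (`<->`, `->`) and the function name are made up for this sketch.

```python
# Language pairs as read from the slide; "<->" marks both directions,
# "->" a single direction. The string notation itself is an assumption.
PAIRS = ["en<->es", "en<->sl", "en->fr", "en->de"]


def expand_directions(pairs):
    """Return the list of (source, target) translation directions."""
    directions = []
    for pair in pairs:
        if "<->" in pair:
            a, b = pair.split("<->")
            directions.extend([(a, b), (b, a)])
        else:
            src, tgt = pair.split("->")
            directions.append((src, tgt))
    return directions


directions = expand_directions(PAIRS)
```

Under that reading, four pairs amount to six translation systems to build and evaluate.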
transLectures WHAT, WHY
KEYWORDS?
language technologies, machine translation, automatic speech recognition, massive adaptation, intelligent interaction, education, video lectures, multilingualism, accessibility
WHY TRANSCRIPTION & TRANSLATION?
There are accessibility issues that can be solved by transcription
Non-native speakers understand better by reading than by hearing
At least 1,300 different languages with more than 100,000 native speakers
No language with more than 20% of the world population
transLectures STATUS
TRANSCRIPTION (EML)
the complete transcription of English lectures took 45,000 hours (2 months running in parallel)
TRANSLATION (XRCE, UPV, RWTH)
different segmentation strategies for transcription and translation being considered
INTELLIGENT INTERACTION WITH USERS
experimental protocol to evaluate intelligent interactive approaches for users
INTEGRATION
first steps on integrating the software into VL, Polimedia, Matterhorn
EVALUATION
human evaluations for the second round of evaluation
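The transcription figure above can be read as a rough statement about parallelism. The short Python calculation below assumes that the 45,000 hours are total processing time, that the 2 months are elapsed time, and that a month is about 30 days of round-the-clock processing; none of these readings is confirmed by the slide.

```python
# Figures from the slide.
total_hours = 45_000        # reported transcription time
wall_clock_months = 2       # reported duration when run in parallel

# Assumption: a month is about 30 days of round-the-clock processing.
wall_clock_hours = wall_clock_months * 30 * 24

# Implied average number of jobs running at the same time.
implied_parallel_jobs = total_hours / wall_clock_hours
print(f"about {implied_parallel_jobs:.0f} jobs in parallel")
```

Under these assumptions the run corresponds to roughly 30 jobs working side by side.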
CONCLUSION and FUTURE
Technology is good enough for transcription & translation
We are going to develop open tools for transcription and translation
Deploy the tools in the Opencast Matterhorn system
Think of a business plan and ideas on a spin-off
Provide optimisations for existing languages
Ideally extend the language set to Chinese, Hindi and other languages
Is intelligent interaction a realistic concept?
More focus on English into Slovenian translations to improve them
Work on building a community of students for evaluation
Thank you.
WEBSITES:
http://www.translectures.eu/
http://videolectures.net/
http://polimedia.upv.es/catalogo/
http://www.k4all.org/
ADD. FEATURES
Accuracy estimation for each transcription and translation.
Adjustable computational behaviour.
Output constrained to user preferences and corrections.
Fast learning from user corrections.
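One way accuracy estimates and user corrections could work together is to send only the least reliable segments to a human. The Python sketch below ranks segments by their mean per-word confidence and picks the lowest ones within a review budget. It is an illustrative assumption, not the method used in transLectures; the function names, the confidence values and the segment ids are all made up.

```python
def mean_confidence(word_confidences):
    """Average per-word confidence of one segment (values in [0, 1]).

    Segments are assumed to contain at least one word.
    """
    return sum(word_confidences) / len(word_confidences)


def pick_for_review(segments, budget):
    """Return ids of the `budget` segments with the lowest mean confidence.

    `segments` maps a segment id to a list of per-word confidences.
    """
    ranked = sorted(segments, key=lambda seg_id: mean_confidence(segments[seg_id]))
    return ranked[:budget]


# Made-up confidences for three segments of a hypothetical lecture.
example = {
    "seg-1": [0.95, 0.90, 0.99],
    "seg-2": [0.40, 0.65, 0.55],
    "seg-3": [0.80, 0.70, 0.85],
}
to_review = pick_for_review(example, budget=2)
```

The budget stands in for the limited effort a user is willing to spend, so corrections go where the system is least sure of its output.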
KNOWLEDGE FOR ALL (K4ALL)
WHAT IS IT?
K4ALL is a foundation based in London (2010) whose goal is to carry on the legacy of the PASCAL2 Network of Excellence (machine learning). Part of this legacy is the VideoLectures.NET website, together with strong connections to the Opencast Foundation (which creates the Matterhorn software) and the OpenCourseWare Consortium.
WHAT DOES IT DO?
I4All: Provision and distribution of infrastructure that supports the K4All mission
S4All: Online Science video journals and conference special issues
E4All: Organization and access to educational material
R4All: Research that facilitates the mission of K4All
A4All: Ensuring accessibility for as wide an audience as possible