Date post: | 13-Apr-2017 |
Category: |
Science |
Upload: | icarus-international-centre-for-archival-research |
Recognition and Enrichment of Archival Documents
Facts and Figures
• READ
  • Recognition and Enrichment of Archival Documents
  • 13 partners, coordinated by the University of Innsbruck
  • 10 institutions as associated partners via a Memorandum of Understanding
  • Duration: 1.1.2016 to 30.6.2019
  • Grant: 8.2 million EUR
• Objectives
  • Applied research in pattern recognition and human language technology
  • Services for archives, humanities scholars, volunteers and computer scientists
  • Network building among those user groups
READ Consortium
READ Partners
• University of Innsbruck (co-ordinator)
• University of London
• Technical University Valencia
• Technical University Lausanne
• University College London
• University of Rostock
• National Centre for Scientific Research - Demokritos
• XEROX – European Research Centre
• Technical University Vienna
• University of Leipzig
• National Archive Finland
• Diocesan Archive Passau
READ MoU Partners
• Australian National Library
• Gottfried Wilhelm Leibniz Bibliothek
• National Library of Spain
• Centre virtuel de la connaissance sur l'Europe, Digital Humanities Lab (Luxembourg)
• The Linnean Society of London
• The Hessian State Archive Marburg (Germany)
• The Munch Museum (Norway)
• The Civic Archives of Bozen-Bolzano (Italy)
• Music and Instrument Museum Leipzig
• The University and Research Library Erfurt/Gotha (Germany)
• Friedrich-Alexander-Universität Erlangen/Nürnberg
• PLANET GmbH (Germany)
What will remain once the project has finished its work in June 2019?
Publications
• H2020 Grant Agreement
  • Article 29.2 Open access to scientific publications
    • Each beneficiary must ensure open access (free of charge online access for any user) to all peer-reviewed scientific publications relating to its results.
• Open Access
  • Golden way
    • E.g. FrontiersIn from EPFL (Technical University Lausanne)
  • Green way
• Key Performance Indicator
  • 15-25 scientific publications per year
Research Data
• H2020 Grant Agreement
  • Article 29.3 Open access to research data
    • Regarding the digital research data generated in the action ('data'), the beneficiaries must:
      • (a) deposit in a research data repository and take measures to make it possible for third parties to access, mine, exploit, reproduce and disseminate, free of charge for any user, the following:
        • (i) the data, including associated metadata, needed to validate the results presented in scientific publications as soon as possible;
What are Research Data in READ?
• Images and corresponding reference data (= ground truth)
  • Images = raw material
  • Reference data or ground truth = the expected, perfect output
  • Data = what is actually produced by an algorithm/tool
• Example
  • Image of a page
  • Correct text of a page = reference data
  • Data = the text produced by an HTR (Handwritten Text Recognition) engine
  • Difference between expected result and actual result = the result of a scientific experiment, e.g. measured as Word Error Rate
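To make the Word Error Rate concrete, here is a minimal sketch in Python. It assumes the common definition (word-level edit distance between reference text and recognised text, divided by the number of reference words) and plain whitespace tokenisation. The function name and the sample strings are illustrative only and are not taken from the READ tools.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are separated by whitespace; no normalisation of
    case or punctuation is applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing in output
            insertion = curr[j - 1] + 1   # extra word in output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Reference data (expected result) versus engine output (actual result):
# one substituted word and one missing word, i.e. 2 errors / 4 words.
print(word_error_rate("the quick brown fox", "the quack brown"))
```

A lower value means the output is closer to the reference data; 0.0 means the recognised text matches the reference word for word.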
Research Data are used for…
• Evaluation
  • Difference between expected and actual result
• Problem description / requirements specification
  • What do we actually expect from an algorithm or tool?
  • Simple with HTR, but becomes much more complicated with Layout Analysis
  • E.g. do we need the whole text of a page, or maybe just person names within one column of a table? Such questions need to be defined and need to be reflected in the design of the reference data
• Machine learning (training data)
  • Machine learning tools need training data
  • Reference data are the basis for this training process
Research Data in READ
• Key Performance Indicator
  • 3 million images with at least 50,000 pages of reference data at the end of the project
• Why such a large amount?
  • Our objective is that the READ dataset is reasonably representative of many document types in archives, and of writing and layout styles across several centuries and languages
  • We are therefore very much interested in any kind of digitised document collection
  • Progress in computer science is strongly connected to the availability of large data sets
Research Data for Competitions
• Key Performance Indicator
  • READ will organise several research competitions at various conferences
• Competitions
  • Nowadays a popular way to measure the progress of research in a specific field, e.g. line detection, text recognition, or writer retrieval
• Evaluation of competition results
  • Depends on the availability of reference data
• Attractiveness of competitions
  • Depends on the challenge itself, but also on the size of the dataset and the quality of the reference data
  • 160,000 EUR are foreseen as sub-contracts for the production of reference data
Research Data in READ
• Images will be connected with reference data such as:
  • Correct text (e.g. on page or line level)
  • Correct writer attribution (e.g. letters with names of writers)
  • Correct person names on page level
  • Correct layout elements, e.g. text lines, text blocks, tables, or forms
  • Detailed descriptions of tables or forms
  • Everything which is interesting for archives, scholars, the public!
• Data will be made available e.g. via ZENODO or other Research Data Platforms
• Archives are encouraged to provide their collections!
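As an illustration of how an image could be connected with several kinds of reference data, here is a minimal Python sketch. All class and field names (`TextLine`, `PageReference`, `image_file`, and so on) are hypothetical and chosen only for this example. The sketch does not describe the actual file format used by READ or Transkribus.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class TextLine:
    """One text line: its outline on the image plus the correct text."""
    polygon: List[Tuple[int, int]]  # (x, y) pixel coordinates, hypothetical
    text: str


@dataclass
class PageReference:
    """Reference data (ground truth) attached to one page image."""
    image_file: str
    lines: List[TextLine] = field(default_factory=list)
    writer: Optional[str] = None                    # writer attribution, if known
    person_names: List[str] = field(default_factory=list)

    def page_text(self) -> str:
        """Correct text on page level, assembled from the line level."""
        return "\n".join(line.text for line in self.lines)


# Example with made-up values.
page = PageReference(
    image_file="page_001.jpg",
    lines=[
        TextLine([(0, 0), (100, 0), (100, 20), (0, 20)], "first line"),
        TextLine([(0, 25), (100, 25), (100, 45), (0, 45)], "second line"),
    ],
    person_names=["Anna Muster"],
)
print(page.page_text())
```

Keeping text at line level and deriving the page level from it means the same reference data can serve both line-based and page-based evaluation.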
Open Source Software
• Release as open source (OS)
  • Not an obligation of the Grant Agreement, but of the specific e-Infrastructure call of the EU
  • Foreseen for (nearly) all software tools in the project
  • During 2016 we will take the first steps and move parts of the software to GitHub or a similar platform
• Advantage
  • Many tools are research tools and therefore “not easy” to implement
  • The implementation in Transkribus will allow users to try out the tools beforehand
Interim summary
• Open Access to publications
  • E.g. via Open Access publishers
• Open Research Data (images and reference data)
  • E.g. via repositories, such as ZENODO (run by CERN Data service)
• Open Source for the software tools
  • E.g. via open software repositories, such as GitHub
An (expert) user will have “everything together” to dive deeper into the results of the project
Open Platform
Build a platform which provides recognition, transcription and enrichment of historical documents as a general infrastructure for archives, libraries, humanities scholars, volunteers, the public – and computer scientists.
Why a Platform? (1)
• Software as a Service (SaaS)
  • Implementing the full range of tools from READ requires a lot of work and know-how
  • The entry barrier for archives and humanities scholars is much lower since the services can be accessed and used via the Internet
  • E.g. users are free to upload their documents, to run tests and then to decide which services they want to use
• Machine Learning
  • Most tools require large amounts of training data
  • The more data are available in the platform, the higher the chance to improve accuracy
  • E.g. if a user in Greifswald transcribes a German text from 1700, these data may also be used to train the HTR engine for a user in Bavaria. Or in the US.
Why a Platform? (2)
• Cooperation
  • Successful digitisation projects need collaboration between content holders, scholars, computer scientists and volunteers
  • Platform serves as a mediation tool between these stakeholder groups
  • E.g. they can define requirements, produce reference data, implement new services, edit and correct results in a shared manner
• Standardisation
  • Full benefits of technology can only be enjoyed if a large variety of standards is obeyed
  • De-facto standardisation by using the same platform and tools
  • E.g. the real benefit of digital editions will be enjoyed once they are centrally accessible
Service Platform
• READ Service Platform = Transkribus
  • We are obliged to run the service platform from the very first day of the project
  • We are also obliged to provide a business plan in month 12
  • And to implement this business plan after month 12
• Final objective
  • To run and maintain the service platform also after the end of the project
  • A business model needs to be developed
• General approach
  • Service levels
  • To provide free services for everyone; service fees will be applied only if certain limits are exceeded
Overview of tools and services
• Handwritten Text Recognition
  • HTR based on Hidden Markov Models (HMM) and on neural networks (NN)
• Keyword Spotting
  • Query by Example
  • Query by String
• Image Preprocessing
  • Binarisation, Enhancement
• Layout Analysis
  • Basic analysis of words, lines, region types (text, graphical, …)
• Table and Forms Recognition
  • Generic and template-based recognition
Overview of tools and services
• Document Understanding
  • Columns, marginalia, date, etc.
• Automatic Writer Identification and Retrieval
  • Training and retrieval of specific writers/writing styles
• Language Toolkit
  • Adaptation of language resources to support HTR
• Text2Image matching
  • Matching existing text with images
• E-Learning module
  • Online training tool for students and volunteers to practise deciphering of handwritten documents
• ScanApp
  • The mobile phone as document scanner with direct connection to the Transkribus service platform
• And many more…
READ Platform: http://transkribus.eu/
READ Website: http://read.transkribus.eu/ (coming soon)
User’s guide: http://transkribus.eu/wiki/
Thank you very much for your attention!