
D5.1.2: Description and evaluation of software for integrating interactive DIA, HTR, and KWS

Enrique Vidal, Luis A. Leiva, Basilis Gatos, Iannis Pratikakis, Philip Kahle

Distribution: Public

tranScriptorium
ICT Project 600707 Deliverable D5.1.2

February 28, 2014

Project funded by the European Community under the Seventh Framework Programme for Research and Technological Development.


Project ref no.: ICT-600707
Project acronym: tranScriptorium
Project full title: tranScriptorium
Instrument: STREP
Thematic Priority: ICT-2011.8.2 ICT for access to cultural resources
Start date / duration: 01 January 2013 / 36 months

Distribution: Public
Contractual date of delivery: February 28, 2014
Actual date of delivery: February 28, 2014
Date of last update: February 28, 2014
Deliverable number: D5.1.2
Deliverable title: Description and evaluation of software for integrating interactive DIA, HTR, and KWS
Type: Report
Status & version: Final
Number of pages: 28
Contributing WP(s): 5
WP / Task responsible: Enrique Vidal
Other contributors:
Internal reviewer: Katrien Depuydt, Jesse de Does
Author(s): Enrique Vidal, Luis A. Leiva, Basilis Gatos, Iannis Pratikakis, Philip Kahle

EC project officer: José María del Águila
Keywords:

The partners in tranScriptorium are:

Universitat Politècnica de València - UPVLC (Spain)
University of Innsbruck - UIBK (Austria)
National Center for Scientific Research “Demokritos” - NCSR (Greece)
University College London - UCL (UK)
Institute for Dutch Lexicology - INL (Netherlands)
University London Computer Centre - ULCC (UK)

For copies of reports, updates on project activities and other tranScriptorium related information, contact:

The tranScriptorium Project Co-ordinator
Joan Andreu Sánchez, Universitat Politècnica de València
Camí de Vera s/n, 46022 Valencia
[email protected]
(34) 96 387 7358 - (34) 699 348 523

Copies of reports and other material can also be accessed via the project's homepage: http://www.transcriptorium.eu/

© 2014, The Individual Authors
No part of this document may be reproduced or transmitted in any form, or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission from the copyright owner.


Executive Summary

This document describes the work being carried out on the development of interactive approaches for HTR and related technologies (DIA and KWS). It also contains details about the design of the tranScriptorium workbenches.

Contents

1 Introduction 4

2 Architectures for Platform Integration 5
  2.1 Design Considerations 5
  2.2 Plugin Architecture 6
  2.3 Content Provider Platform 6
  2.4 Crowdsourcing Platform 7
      2.4.1 UPVLC Real-time Application Needs 8
      2.4.2 System Architecture Overview 9
      2.4.3 Communication API 9

3 Interactive techniques for DIA 10
  3.1 Interactive Techniques for DIA 11
  3.2 Related Work 11
      3.2.1 Interactive Pre-processing 11
      3.2.2 Interactive Segmentation 12
  3.3 Initial Work and Plans for Next Year 12
      3.3.1 Interactive Pre-processing 12
      3.3.2 Interactive Segmentation 16

4 Interactive techniques for HTR 17
  4.1 Review of state of the art 17
  4.2 Preliminary CATTI results on transcriptorium text images 19
  4.3 Implementation of a CATTI-based transcription platform 20
  4.4 Plans for next period 20

5 Interactive techniques for KWS 20
  5.1 Introduction 20
  5.2 Previous Work 21
  5.3 Initial Work and Plans for Next Year 22

A Full Communication, Low-level API to Connect HTR Servers and Web-based Clients 25
  A.1 catclient.js 25
  A.2 catclient.predictive.js 27


1 Introduction

This report is the first version of one of the two main deliverables of WP5, “Integration and Interaction”. The workpackage encompasses two main kinds of activities. One (T5.1) is the development of interactive processing methods associated with all the techniques developed in WP3 (fundamental HTR research). The other activity (T5.2–T5.4) deals with the integration of the non-interactive and interactive HTR tools developed in WP3 and WP5 into adequate platforms to perform actual transcription work on handwritten text image collections. Finally, there is an overall evaluation activity (T5.5) which encompasses all the testing and assessment work carried out in tasks T5.1–T5.4.

Basic facts about this workpackage are as follows:

Effort: UPVLC: 24pm, NCSR: 20pm, UIBK: 6pm, INL: 5pm, ULCC: 1pm

Tasks

T5.1 DIA, HTR, KWS tools for web platform (UPVLC, UIBK, NCSR, ULCC)

T5.2 Interactive techniques for DIA (NCSR, UPVLC)

T5.3 Interactive techniques for HTR (UPVLC, INL)

T5.4 Interactive techniques for KWS (NCSR, INL)

T5.5 Evaluation (UPVLC, NCSR)

T5.1 has been active since month 1, while T5.2–T5.5 started in month 11.

Deliverables (due in month {14,34}):

There are two deliverables and two versions of each deliverable; the first version is due in month 14 and the second in month 34.

D5.{1,2}.1 Software for integrating interactive DIA, HTR, and KWS

D5.{1,2}.2 Description and evaluation of D5.{1,2}.1

Milestones:

MS3 1st. version of public DIA, HTR and KWS platforms (month 14)

MS6 2nd. version of public DIA, HTR and KWS platforms (month 36)

Evaluation in T5.5 refers to assessing:

• the correctness and efficiency of the implemented systems (through standard software checking and debugging, which is being done systematically in T5.1)

• the effectiveness of different interactive processing approaches (by means of objective laboratory evaluation experiments)

• the usability of the implemented platforms and GUIs (by testing the systems with real users in real transcription tasks)

So far, this task has been active for only four months, and therefore only some preliminary laboratory results on interactive HTR are reported in this (first version) report.


2 Architectures for Platform Integration

In this section, the basic details of the client-server APIs are described. Since the goal consists in connecting different modules, the following sections describe the basic communication APIs. It is worth pointing out that the following pages are mainly intended to describe the functions as such, but not necessarily how the modules will be deployed in the final system architecture.

Two scenarios are being taken into consideration, namely one aimed at targeting content providers and another aimed at targeting crowdsourcing providers. The Content Provider scenario can be summarized as follows:

• Focus on individual and institutional users that are willing to use the TRP system for transcription and training of the HTR engine.

• Rich local client with many features that are important for expert users.

• User management with several roles, mainly designed to allow content providers full control over the documents.

• Upload of documents is possible; the HTR engine needs to be trained.

• Strong integration of TR platforms and components, on both the server and the client level, in order to allow full performance for all use cases.

• Database, JAVA, SWT and standard calls of integrated components as core technology.

On the other hand, the Crowdsourcing Provider scenario can be summarized as follows:

• Focus on an individual collection that is foreseen for transcription.

• Web-based client for transcription. Easy to use, specifically designed for external workers.

• No specific user management for the transcription process.

• Pre-filled collection with the HTR engine specifically tuned to this collection.

• Calls to the TR server and other components via web services, keeping the system easy and simple.

• MediaWiki, HTML5 and web-service technology.

The main differences between these two scenarios (and the corresponding platforms) are summarized in Table 1.

2.1 Design Considerations

A very important design concern at the beginning of the development of new software is to agree on the system's architecture, because the architecture influences wide parts of the software and is very hard to change afterwards. It also has an impact on the API specification and is most likely more important than choosing a language for the implementation.

Before presenting the general system architecture, some basic clarifications of the architecture for integration are in order. We explain these below, together with the requirements inherent in the tasks that will be carried out in the tranScriptorium project. All design considerations can be summarized as follows:


Table 1: Comparison of features of the two tranScriptorium transcription platforms.

                  Content Provider Platform           Crowdsourcing Platform
  User group      Professional transcribers           Volunteers
  Content         Document upload & export            Precompiled collections
  User interface  Feature-rich local client;          Web-based client;
                  transcription & segmentation        transcription editor;
                  editors; installation required      low access barrier
  Access          Role-based user management;         Public, open access
                  document-specific access control

• A plugin interface that ensures

– Minimal interface and interdependence between core components.

– Binary compatibility of components across versions.

• Enable real-time responsiveness for interactive HTR.

• Hide the intricacies of the core transcription process and the communication layer from the developer of the transcription GUI.

2.2 Plugin Architecture

A plugin architecture has been designed to facilitate the deployment of the different engines resulting from WP3, WP4 and WP6 research. The plugin architecture should use dynamic libraries to load the engines at run time. The pimpl idiom [27] or the bridge pattern [26] should be used to provide binary compatibility [13].

This way, the engines and the server can be compiled separately, depending only on a couple of header files (without linking dependencies). Furthermore, as these headers are expected to provide a minimalistic interface which should stabilize in the early steps of development, binary compatibility with older engine versions will probably remain valid for long periods of time. In addition, the pimpl idiom will not force the developers to change their data structures, but just require them to write a handful of methods for data conversion. UPVLC has experience with a similar architecture involving several systems (e.g. CATTI, a wordgraph server, or even a suite of statistical machine translation tools), with satisfactory results [1, 17].

2.3 Content Provider Platform

This platform (see Figure 1) is a stand-alone system aimed at targeting content providers. Three partners will work on it: UIBK, NCSR, and UPVLC. This system will feature interactive image preprocessing and DIA, as well as basic HTR transcription capabilities. The content provider system will serve two distinct user groups: archives and (historical) researchers. The main features of this system include the following:

Website of the Transcription Platform (TRP) The TRP website will serve as the main hub for any information on the system. Promotion and learning material (e.g. screen videos) may be used to attract the attention of users and to encourage them to use the TRP system. Users will be able to register on the website and to download the specific TRP Client, with which they can get involved and connect to the central TRP server.


TRP Server The TRP server integrates most of the components developed in the tranScriptorium project. It will contain the HTR, DIA and KWS platforms and be able to call these components according to the actions carried out by the user. In order to allow training of the HTR engine, a connection with an HTR server will also be implemented. It is important to understand that all documents will ultimately be managed by the TRP Server and that all images will reside on a centralized system.

TRP Client The TRP Client is a full-stack JAVA-based software with which a researcher can perform all the actions necessary for transcribing a handwritten document, either with the support of an HTR engine in interactive mode, or purely manually, e.g. for training purposes. The TRP Client allows editing and correcting the block and line segmentation, which are prerequisites for good HTR results. Individual users of the TRP Client will be able to upload documents (see below) as well as to download transcribed documents, in TEI as well as PDF format. A user management system will support content providers (individual as well as institutional) in organizing their work and distributing it, e.g. among their working group.

Upload and download of documents We have foreseen individual as well as institutional upload of documents. The institutional upload will be done via a specific module, called DEALOG. It mainly provides specific web services, rules and a user-management system to upload large amounts of digitized documents on a regular basis. If an archive decides to digitize, e.g., a collection of some thousands of documents, it would use this channel. As part of the digitization workflow we would prefer to get not only the images, but also the segmented images, meaning that text blocks and base lines (including line regions) are processed beforehand and uploaded together with the page images in PAGE format. As already indicated, the individual upload of documents will be possible via the TRP Client. Download of documents will be organized in the same way: individual users will carry it out via the TRP Client, whereas institutional users will receive their files via DEALOG.

2.4 Crowdsourcing Platform

The second platform is a fully web-based system aimed at targeting crowdsourcing providers. Two partners will work on it: ULCC and UPVLC. This system will provide HTR web services, such as interactive HTR. This system requires preprocessed images and DIA as input, for which a common notification protocol will be defined in order to better integrate other modules.

As shown in Figure 2, the web-based crowdsourcing platform consists of three modular components. First, a web server will hold a simplified image collection in a custom XML format, since the usual PAGE format is somewhat over-engineered for the goals of this platform. This image collection will be queryable, so that, given an entry-point URL, a JSON object will be returned with the images that are available for transcription.
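The JSON schema of this listing is not fixed in this deliverable; purely as an illustration, a client-side helper for such a response might look as follows (the field names `images`, `id`, `url` and `status` are assumptions, not the project's actual schema):

```javascript
// Illustrative only: parse the JSON listing returned by the entry-point URL
// and keep the images that are still open for transcription. All field
// names (images, id, url, status) are assumptions.
function pendingImages(jsonText) {
  var data = JSON.parse(jsonText);
  return data.images.filter(function (img) {
    return img.status === "pending";
  });
}

// A response the web server might emit for the entry-point URL:
var sample = JSON.stringify({
  images: [
    { id: "p001_l03", url: "img/p001_l03.png", status: "pending" },
    { id: "p001_l04", url: "img/p001_l04.png", status: "done" }
  ]
});
```

A browser client would fetch the entry-point URL, apply such a filter, and present the remaining line images in the transcription editor.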

Second, the web-based client (typically a web application that runs in a browser) will comprise two parts: the graphical user interface (GUI) and the communication API. The graphical user interface will be built using HTML5 and modern web technologies. The jQuery framework will be used to connect the GUI with the HTR servers through a communication API. This communication API will be delivered as a jQuery plugin that abstracts connection routines (e.g. core functions and the transport layer), so that incorporating it is immediate.

Third, the HTR server will comprise two parts: the server itself and a transcription engine. The server will manage GUI connection requests, so that the transcription engine can be completely decoupled from the system architecture. The engine is actually a decoder, and so it is unaware of session data or transcription status. This way, the engine can easily be integrated into other pieces of software.


[Figure 1 here: the TrpClient (SWT GUI on top of a TRP Core), used by the content provider/researcher, exchanges images and PAGE XML with the TrpServer (TRP Core, DB Manager, REST service, HTR/DIA and KWS APIs), which keeps images in a file image store and metadata in a database, exchanges XML/JSON transcripts with an HTR server (training & production), and supports METS, PAGE, TEI and PDF formats.]

Figure 1: Proposal of system architecture for content provider platforms.

[Figure 2 here: a web server holds the PAGE collection, a simplified XML collection and the image collection; a web browser runs the graphical interface and the communication API (catClient jQuery plugin over socket.io), fetching content over http:// and talking over ws:// to the tS server, which hosts the HTR server and the HTR engine.]

Figure 2: Proposal of system architecture for crowdsourcing platforms.

2.4.1 UPVLC Real-time Application Needs

Interactive HTR systems impose some restrictions on the communication protocol, especially since requests must be served in as close to real time as possible and interactions are performed on a character/keypress basis. For that matter, we will make extensive use of the WebSockets protocol.1

A large number of open-source libraries can be found for the most popular programming languages.
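As a sketch of what this per-keypress traffic could look like, the helper below builds one message per keystroke (the JSON wire format and the endpoint below are assumptions; only the `setPrefix` action name mirrors the API of Section 2.4.3):

```javascript
// Build the message sent on every keypress: the text left of the caret is
// the user-validated prefix the HTR engine must respect. The wire format
// here is an assumption, not the project's actual protocol.
function prefixMessage(lineId, text, caretPos) {
  return JSON.stringify({
    action: "setPrefix",
    lineId: lineId,
    prefix: text.slice(0, caretPos)
  });
}

// Wiring it to a WebSocket (illustrative endpoint and element names):
// var ws = new WebSocket("ws://htr.example.org/itp");
// editor.addEventListener("keyup", function () {
//   ws.send(prefixMessage("p001_l03", editor.value, editor.selectionStart));
// });
```

Keeping the message tiny is what makes per-keypress round trips over a persistent WebSocket connection feasible in practice.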

2.4.2 System Architecture Overview

After extensive discussion between the project partners, a specific system architecture was agreed upon. This system architecture tries to satisfy the set of special needs explained above. According to Figure 1 and Figure 2, there are four major components:

• Client: Provides the transcription/editor interface.

• Transcription server: provides access to transcription services, manages socket connections, keeps track of user sessions, and abstracts the access to the HTR engine.

• Communication layer: Connects the GUI with the transcription server.

• HTR engine: handles transcription capabilities, both classical HTR and interactive HTR.

In the proposed architecture, the client, the transcription server, the communication layer, and the HTR engine constitute separate entities. First, the transcription server should be decoupled to allow different transcription clients, e.g. a batch command-line client for experimentation. Second, a physical separation will provide a more robust environment (the transcription server and the HTR engine are both complex pieces of software) and will facilitate distributed computing. Furthermore, the HTR engine has been separated from the transcription server since a client might not implement all HTR features; e.g., a command-line client might not be interested in interactive HTR capabilities. Finally, certain functions, specifically the interactive HTR functionality, impose performance and real-time constraints. For that reason, WebSockets will be used to connect clients and HTR engines. It is worth noting that current WebSocket proxies for popular web servers are not yet mature. If a proper technology matures during the development of the project, we will consider putting the transcription server behind a web server.
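The decoupling argued for above can be sketched as follows: session state lives in the transcription server, while the engine is a stateless decoder that only ever sees a prefix. All names and message shapes here are illustrative, not the project's actual interfaces:

```javascript
// Minimal sketch: the server owns sessions; the engine is stateless.
function makeTranscriptionServer(engine) {
  var sessions = {};
  return {
    dispatch: function (sessionId, msg) {
      switch (msg.action) {
        case "startSession":
          sessions[sessionId] = { prefix: "" };
          return "ok";
        case "setPrefix":
          sessions[sessionId].prefix = msg.prefix;
          // The engine never sees session data, only the validated prefix.
          return engine.complete(msg.prefix);
        case "endSession":
          delete sessions[sessionId];
          return "ok";
      }
    }
  };
}

// A stand-in engine: completes any prefix with a fixed marker. A real
// engine would run the HTR decoder constrained by the prefix.
var dummyEngine = { complete: function (prefix) { return prefix + " <rest>"; } };
```

Because the engine exposes a single pure function, it can be swapped for a batch decoder or reused in other software, which is exactly the motivation given above for keeping it outside the transcription server.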

2.4.3 Communication API

The following is a succinct description of the available API calls for web clients that make use of the jQuery editableItp plugin. This plugin was developed by UPVLC in the context of the European project “Cognitive Analysis and Statistical Methods for Advanced Computer Aided Translation”, and has proved to be very flexible for deploying real-time web-based applications. In these examples, $target is a jQuery object. A low-level JavaScript communication API that does not depend on jQuery is provided at the end of this document in Appendix A.

$target.editableItp({

sourceSelector: "DOM element ID",

itpServerUrl: "socket.io resource URL"

});

Figure 3: This function is the entry point, or initialization code.

Optionally, the web client can decide to incorporate interactive text prediction capabilities. These capabilities are governed by the following functions.

1https://tools.ietf.org/html/rfc6455


$target.editableItp('decode');

Figure 4: This function returns an automatic transcription of the sourceSelector element specified in Figure 3.

$target.editableItp('validate');

Figure 5: This function signals to the HTR server that the transcription is completely supervised by the transcriber.

$target.editableItp('startSession');

Figure 6: This function indicates that the user will start an interactive HTR session.

$target.editableItp('setPrefix', caretPos);

Figure 7: This function submits an error-free prefix, according to the current caret position in the editable text area. As a result, the system returns the most suitable continuation of the validated prefix.

$target.editableItp('endSession');

Figure 8: This function indicates that the user will finish the interactive HTR session.

3 Interactive techniques for DIA

Interactive approaches are being studied in WP5 for all the techniques developed in WP3 (fundamental HTR research). More specifically, the following issues are being explored in tasks T5.2–T5.4:

• T5.2. DIA user feedback used to:

– Tune binarization parameters

– Provide an initial image markup for image enhancement

– Spot missed text boundaries to improve segmentation performance

• T5.3. HTR interactive-predictive processing:

– Use parts of a transcript which have been validated or amended by the user to automatically correct other possible errors in the line image being transcribed

• T5.4 KWS with Relevance Feedback (RF):

– Iterative process borrowed from classical Image Retrieval with RF

– Similarity measures specifically designed for handwritten text images

These techniques are intended to be integrated into adequate standalone software tools and packages, some of which will themselves be integrated into the Content Provider platform (DIA, HTR and KWS) and the Crowdsourcing platform (HTR only).


3.1 Interactive Techniques for DIA

Interactive DIA techniques exploit user assistance to provide visual information that is difficult to derive automatically. In many cases, the user is asked to perform meaningful interaction with the document image in order to tune binarization parameters, to provide an initial image markup that is used for document image enhancement, or to spot missed word boundaries in order to facilitate word segmentation. In tranScriptorium, the goal is to investigate the involvement of the user in several DIA stages for pre-processing and segmentation of historical handwritten documents. The user feedback will be used to automatically improve the results of image enhancement, foreground/background separation, layout analysis, and text line and word segmentation. To this end, we will study and adapt existing DIA techniques to take the user interaction into account.

3.2 Related Work

3.2.1 Interactive Pre-processing

Interactive approaches for the pre-processing of document images can be classified in two main categories. The first category concerns the approaches that enhance the original grayscale or color image [14, 11], while the second category concerns the approaches that enhance the binarization result [20, 25, 12].

In the first category, the approach of [14] aims at enhancing regions of faint text. The user selects the region of interest and the corresponding faint text emerges through the difference between the original image and an estimated background, leveraged by a multi-resolution Gaussian filter. As the user repeats the aforementioned procedure, the faint text is gradually restored. On the other hand, the approach of [11] focuses on removing bleed-through. For this approach, both the front and the back images are required. Initially, the front and back images are aligned, and then the user supplies markup on the front image. In more detail, the user draws either strokes or points to specify (a) the text, (b) the background and (c) the bleed-through (Figure 9). A confidence map is created (using features from the markup areas and a KNN classifier) and the regions of low confidence are overlaid onto the original image. In those regions, the user supplies markup as necessary. This procedure is repeated until a satisfactory result is reached.

Concerning the interactive binarization approaches, in [20] the user is assisted by a graphical environment to select a parametric binarization method and to specify the corresponding parameters (Figure 10), e.g. the window size. After each binarization process, feedback is collected from among predefined options and a new set of parameters is suggested. The user can alter the suggested parameters if necessary.

Another interactive binarization approach is the one proposed in [25]. This approach is based on interactive evolutionary computing. A set of operations is defined, including both parametric operations (e.g. a global threshold [T = 1-254], window-based background enhancement and blurring) and non-parametric operations (e.g. Otsu thresholding [15], fixed 3×3 sharpening, fixed 3×3 smoothing, etc.). Thus, a large number of combinations is available. In each iteration, three alternatives are presented to the user, of which one or none can be chosen. The user's feedback is processed by the evolutionary algorithm, and choices that are scarcely made are excluded from the alternatives. This procedure is followed until a satisfactory result is reached. The average time required for a single image was approximately five minutes.
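Several of the operations above build on Otsu's global threshold [15]. For reference, a minimal sketch of the method on a 256-bin grayscale histogram (plain JavaScript, not project code):

```javascript
// Otsu's method: choose the threshold that maximizes the between-class
// variance of the grayscale histogram. hist is an array of 256 pixel counts.
function otsuThreshold(hist) {
  var total = 0, sumAll = 0;
  for (var i = 0; i < 256; i++) {
    total += hist[i];
    sumAll += i * hist[i];
  }
  var wBg = 0, sumBg = 0, best = 0, bestVar = -1;
  for (var t = 0; t < 256; t++) {
    wBg += hist[t];                  // weight of the class <= t
    if (wBg === 0) continue;
    var wFg = total - wBg;           // weight of the class > t
    if (wFg === 0) break;
    sumBg += t * hist[t];
    var meanBg = sumBg / wBg;
    var meanFg = (sumAll - sumBg) / wFg;
    var between = wBg * wFg * (meanBg - meanFg) * (meanBg - meanFg);
    if (between > bestVar) { bestVar = between; best = t; }
  }
  return best;  // pixels <= best form one class, pixels > best the other
}
```

On a bimodal histogram (dark ink versus light background) the maximum of the between-class variance falls between the two modes, which is why the method works well as a default but fails on faint text or bleed-through, motivating the interactive variants above.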

Furthermore, in the interactive binarization approach of [12], the user roughly marks the undetected text region on the original document image. In a next step, the problematic region is better estimated using a region growing technique (Figure 11). If the region does not occupy the whole problematic area, more markup must be given by the user. Finally, this region is binarized using a different thresholding formula, based on the standard deviation of the corresponding background (the background is detected by applying Otsu [15] locally to this region).


Figure 9: The markup procedure of the interactive method [11] that removes the bleed-through.

3.2.2 Interactive Segmentation

Only a few works in the literature deal with the task of interactive document segmentation. Most of them focus on the layout analysis step, whereas one method focuses on interactive word segmentation. Layout analysis refers to the detection of the main page elements as well as the discrimination between text and non-text zones. Word segmentation concerns the process of detecting word boundaries starting from a text line image.

Interactive methods for layout analysis include the work of Breuel [2], in which GUI controls (e.g. mouse clicks, text box editing) are used to modify layout parameters and instantly observe the response. In this way, the method permits interactive layout analysis within a few seconds per page. Another interesting approach is described in [16]. In this method, an automatic result is first produced. In a second step (the interactive phase), the user defines a scenario, which corresponds to a set of well-defined rules; according to these rules, the image representation evolves progressively, leading to the best possible characterization of its contents (Figure 12).

Hadjar et al. [8] use the idea of incremental learning in an interactive environment. The interactive scenario is presented in Figure 13.

Interactive word segmentation methods include the work of Fischer et al. [5]. In this work, an automatic word segmentation result is produced for each text line. In a next step, the user interacts with the proposed result in order to correct it, by changing the word limits defined by the automatic method (Figure 14).

3.3 Initial Work and Plans for Next Year

3.3.1 Interactive Pre-processing

The interactive approaches for the enhancement of the original image concern either cases of faint characters or cases of bleed-through. The work of [11] seems promising, but it requires both the front and the back images. Moreover, as can be seen in Figure 9, bleed-through is removed while at the same time most faint characters are also removed. Initial work on interactive enhancement includes a variety of different image enhancement techniques (Wiener filter, Gaussian blurring, unsharp filter and background smoothing via background estimation) that are provided to the user. After visual inspection, the most appropriate technique is selected. Each of the aforementioned techniques is accompanied by a set of predefined parameters that can be adjusted by the user. Representative enhancement filters are demonstrated in Figure 15.

Figure 10: Graphical environment of the interactive binarization approach of [20].

Figure 11: 11a the rough marking provided by the user; 11b the problematic area as detected by the region growing technique [12].


Figure 12: Illustration of the notion of scenario used in the user-driven approach [16].

Figure 13: Interactive scenario for the class “title” in a layout analysis environment [8].

Figure 14: GUI for word segmentation correction [5].

(a) (b) (c) (d)

Figure 15: 15a Original image, 15b background smoothing via background estimation, 15c 3x3 Gaussian filter and 15d 9x9 unsharp filter.

Concerning the interactive binarization approach, a parametric version was proposed during the first year of the project. A default case handles the majority of the cases effectively (Figure 16). Additionally, for the cases of severe faint characters and severe bleed-through, the user can select the corresponding parameters (Figure 17, Figure 18). As future work, we plan to further develop the proposed interactive binarization approach by providing more options.

In addition, the binarization result can be enhanced using several techniques, such as contour


(a) (b)

Figure 16: A binarization example using the default version of the proposed interactive binarization approach.

(a) (b) (c)

Figure 17: 17a Original image, 17b the output using the default version of the proposed interactive binarization approach; faint text is undetected and a different selection is required by the user, 17c the output using the version for faint character detection.

(a) (b) (c)

Figure 18: 18a Original image, 18b the output using the default version of the proposed interactive binarization approach; bleed-through remains and a different selection is required by the user, 18c the output using the version for bleed-through removal.

enhancement, morphological closing or despeckle filtering. The contour enhancement technique effectively smooths the binary contour, removing the "teeth effect". The morphological closing reduces inner irregularities of the binarized characters. The despeckle filter is parametric, and the corresponding parameter denotes the size of the connected components to be removed (Figure 19). Predefined options as well as user adjustments will be available for the morphological closing and the despeckle operations.
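The despeckle operation can be sketched as follows (a simplified illustration of the idea, not the project's code; it assumes a binary image stored as a 2-D array of 0/1 and 8-connected foreground components):

```javascript
// Remove every 8-connected component of foreground (1) pixels whose
// size is below minSize -- the parameter mentioned in the text.
function despeckle(bin, minSize) {
  const h = bin.length, w = bin[0].length;
  const out = bin.map(row => row.slice());
  const seen = bin.map(row => row.map(() => false));
  for (let y = 0; y < h; y++) {
    for (let x = 0; x < w; x++) {
      if (out[y][x] !== 1 || seen[y][x]) continue;
      // Collect one connected component with a stack-based flood fill.
      const comp = [];
      const stack = [[y, x]];
      seen[y][x] = true;
      while (stack.length) {
        const [cy, cx] = stack.pop();
        comp.push([cy, cx]);
        for (let dy = -1; dy <= 1; dy++) {
          for (let dx = -1; dx <= 1; dx++) {
            const ny = cy + dy, nx = cx + dx;
            if (ny >= 0 && ny < h && nx >= 0 && nx < w &&
                out[ny][nx] === 1 && !seen[ny][nx]) {
              seen[ny][nx] = true;
              stack.push([ny, nx]);
            }
          }
        }
      }
      // Erase the component if it is smaller than the user-chosen size.
      if (comp.length < minSize) comp.forEach(([cy, cx]) => { out[cy][cx] = 0; });
    }
  }
  return out;
}
```

A larger `minSize` (e.g. 7x7 vs. 15x15 in Figure 19) removes correspondingly larger specks.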


(a) (b) (c)

Figure 19: 19a Initial binary image, 19b-19c 7x7 and 15x15 despeckle filtering, respectively.

3.3.2 Interactive Segmentation

In the second year of the project, we plan to investigate the effectiveness of state-of-the-art works on interactive layout analysis and also examine user interaction for the proposed layout analysis methods (Figure 20). We will also work on the scenario of classification of a region by the user in an incremental learning interactive environment (e.g. defining an area as a side text/note and using this information to better adapt the corresponding layout analysis procedure).

Figure 20: Examining interactivity in the developed layout analysis method.

For the word segmentation task, we plan to incorporate user interaction into the developed method [10]. The automatic procedure produces a word segmentation hypothesis based on a calculated distance threshold (either at line or page level). The user will verify the correctness of the hypothesis. In the case of an error, the user will be able to increase/decrease the distance threshold in order to produce a new word segmentation hypothesis (Figure 21).
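The threshold-driven interaction can be sketched as follows (a hypothetical simplification of the method in [10]: `gaps` stands for the computed distances between consecutive components on a line, and the function name is ours):

```javascript
// Split the components of a text line into words: a new word starts
// wherever the gap to the previous component exceeds the threshold.
function segmentWords(numComponents, gaps, threshold) {
  const words = [[0]];
  for (let i = 1; i < numComponents; i++) {
    if (gaps[i - 1] > threshold) words.push([i]);  // wide gap: new word
    else words[words.length - 1].push(i);          // narrow gap: same word
  }
  return words;  // array of component-index groups, one per word
}
```

Decreasing the threshold produces more (shorter) words, increasing it merges them, which is exactly the correction loop of Figure 21.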


Figure 21: Interactive word segmentation scenario. The system produces an automatic segmentation result. At a final step, the user interacts with the system (by changing the distance threshold) in order to produce the correct hypothesis.

4 Interactive techniques for HTR

Work carried out in T5.3 Interactive techniques for HTR is described in this section. The goal of this task is to study interactive techniques and different interactive protocols for HTR.

4.1 Review of state of the art

Interactive HTR techniques have been proposed recently for transcribing handwritten documents. In this approach, the user and the system work jointly in tight mutual collaboration to obtain perfect transcripts of the text images. The interactive handwritten text transcription system used here was recently introduced by the UPVLC team and presented in [23, 18]. It is referred to as "Computer Assisted Transcription of Text Images" (CATTI). In the CATTI framework, the human transcriber is directly involved in the transcription process, since he/she is responsible for validating and/or correcting the HTR output.

Before the interactive transcription starts, usual Document Image Analysis techniques are applied to each page image. After this step, the lines of the page image are detected. In the training stage, the HMMs and the N-gram models of the HTR system are trained with the text line images of the training set and their corresponding correct transcripts. Finally, in the recognition stage, new text line images are transcribed and a word graph is obtained for each line (see Deliverable D.3.1 [24] for details). During the CATTI process, the system makes use of these word graphs in order to complete the prefixes accepted by the human transcriber.

The interactive transcription process starts when the HTR system proposes a full transcript (w) of a given text line image (x). In each interaction step, the user validates a prefix (p′) of the transcript which is error free and keys in new information (κ; i.e., a word or a part thereof to correct the erroneous text that follows the validated prefix, or a single keystroke to reject the erroneous word). At this point, the system, taking into account the feedback of the user (consolidated as a validated prefix, p), suggests a suitable continuation (s). This process is repeated until a complete and correct transcript (T) of the input signal is reached. A key point of this interactive process is that, at each user-system interaction, the system can take advantage of the prefix validated so far to attempt to improve its prediction. Figure 22 illustrates a CATTI interaction session to transcribe an image containing handwritten text in Spanish.
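The protocol can be simulated with the following toy sketch (not the UPVLC engine; `predict` stands in for the word-graph based suffix search, and the simulated user performs whole-word corrections against a reference transcript):

```javascript
// Simulate one CATTI session and count the word-level interactions
// (the numerator of the Word Stroke Ratio defined later in this section).
function cattiSession(reference, predict) {
  let prefix = [];
  let strokes = 0;
  while (prefix.length < reference.length) {
    const suffix = predict(prefix);       // system suggests a continuation s
    const hyp = prefix.concat(suffix);
    // The user validates the longest error-free prefix p'...
    let k = prefix.length;
    while (k < reference.length && hyp[k] === reference[k]) k++;
    if (k === reference.length) break;    // full correct transcript T reached
    // ...and keys in the first wrong word (one interaction, kappa).
    prefix = reference.slice(0, k + 1);
    strokes++;
  }
  return strokes;
}
```

With a predictor that never benefits from the prefix, every error costs one stroke; a perfect predictor needs zero, and the word-graph search of CATTI sits in between.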

Interactive-predictive technology behind CATTI

CATTI is based on general principles of Interactive Pattern Recognition [22]. The underlying technology is illustrated in Figure 23 and summarized as follows:

Given an input image x and a correct transcript prefix p, find a most likely suffix ŝ:

ŝ = arg max_s P(s | x, p) = arg max_s P(x | p, s) · P(s | p)
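A toy numeric illustration of this maximisation (with made-up candidate suffixes and a toy bigram language model, standing in for the HMM likelihoods and the word-graph search actually used):

```javascript
// Score each candidate suffix by P(x | p, s) * P(s | p) and return the best.
// `candidates` carry a precomputed image likelihood; the language model term
// is a chain of bigram probabilities conditioned on the last prefix word.
function bestSuffix(candidates, bigram, lastPrefixWord) {
  let best = null, bestScore = -Infinity;
  for (const c of candidates) {
    let lm = 1, prev = lastPrefixWord;
    for (const w of c.words) {
      lm *= (bigram[prev] && bigram[prev][w]) || 1e-6; // tiny floor for unseen pairs
      prev = w;
    }
    const score = c.likelihood * lm;  // P(x | p, s) * P(s | p)
    if (score > bestScore) { bestScore = score; best = c.words; }
  }
  return best;
}
```

The point of the prefix-conditioning is visible here: changing the last validated prefix word changes the language model term and can change which suffix wins.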


[Figure 22 content, recovered from the embedded slide: input line image x; initial system hypothesis s ≡ w: "antiguas cuidadelas que en el Castillo sus llamadas". STEP-1: the user keys in κ = "antiguos" and the system predicts "antiguos ciudadanos que en el Castillo sus llamadas". STEP-2: the user keys in κ = "Castilla" and the system predicts "antiguos ciudadanos que en Castilla se llamaban". FINAL: the user accepts (#), giving T: "antiguos ciudadanos que en Castilla se llamaban". Post-editing Word Error Rate (WER): 6/7 (86%); CATTI Word Stroke Ratio (WSR): 2/7 (29%), assuming whole-word corrections; Estimated Effort Reduction (EFR): 1 − 29/86 (66%).]

Figure 22: Interactive-predictive CATTI processing. Direct user corrections in red; corrections automatically made thanks to user feedback in blue. Without CATTI, a user would have to amend 5 system word errors out of 7 words (WER=5/7=71%), while using CATTI only two word amendments are needed (WSR=2/7=29%).

Solving this problem requires: a) adequate models for the two probability distributions involved, and b) a search method to actually solve the underlying optimization problem very efficiently (as required by the real-time constraints of interactive operation). As discussed in [18], the following modeling and search choices have proven adequate for CATTI:

• P (x | p, s): concatenated character morphological HMMs (same as in HTR)

• P (s | p): prefix-conditioned N-gram Language Model (similar to HTR)

• Search: using Word-Graphs obtained as a byproduct of HTR

[Figure 23 diagram: training pairs (x1, s1), (x2, s2), ... feed an off-line training stage that produces the HTR models; in the interactive-predictive CATTI loop, the input text image x and the user feedback (correct transcript prefix p) are used to output the predicted transcript suffix s.]

Figure 23: The CATTI interactive-predictive framework.

Assessment measures

Different evaluation measures are adopted to assess the CATTI system. On the one hand, the quality of non-interactive transcription can be properly assessed with the well known Word


Error Rate (WER) (see Deliverable D.3.1 [24] for a detailed description). The WER is a good estimate of the user effort required to directly post-edit the output of the HTR system.

On the other hand, the effort needed by a human transcriber to produce correct transcriptions using the CATTI system is estimated by the Word Stroke Ratio (WSR), which is defined as the number of (word-level) user interactions that are necessary to achieve the reference transcriptions of the text images considered, divided by the total number of reference words.

This definition makes WER and WSR comparable. The relative effort (in %) that a transcriber using plain HTR or CATTI can save with respect to fully manual transcription is just 100−WER or 100−WSR, respectively. These figures are referred to as Estimated Effort Reduction (EFR). In addition, the relative difference between WER and WSR gives us a good estimate of the reduction in human effort that can be achieved by using CATTI with respect to using a conventional HTR system followed by human post-editing. This estimated effort reduction will be denoted as "EFR*".
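These measures are a direct transcription of the definitions above (the function names are ours):

```javascript
// All quantities in percent, from word-level counts.
function wer(postEditErrors, referenceWords) { return 100 * postEditErrors / referenceWords; }
function wsr(interactions, referenceWords) { return 100 * interactions / referenceWords; }
// Effort reduction with respect to fully manual transcription.
function efr(rate) { return 100 - rate; }
// Effort reduction of CATTI with respect to post-editing plain HTR output.
function efrStar(werVal, wsrVal) { return 100 * (werVal - wsrVal) / werVal; }
```

For the Bentham figures reported in Section 4.2 (WER 31.5, WSR 27.7), these formulas give EFR ≈ 72.3 and EFR* ≈ 12.1, matching Table 2.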

4.2 Preliminary CATTI results on tranScriptorium text images

In this subsection, preliminary experiments on the Bentham corpus with the CATTI system are presented.

As explained in [24], to carry out fast HTR experiments to test different techniques with the Bentham database, 53 pages from the first batch of 433 pages were chosen. Here we have carried out CATTI experiments using these 53 pages.

In these experiments, we have used the same values of the parameters employed to obtain the baseline, non-interactive results presented in Deliverable D3.1 [24]. Table 2 shows the estimated interactive human effort (WSR) required in comparison with the corresponding estimated post-editing effort (WER). It also shows the estimated effort reductions. For comparison purposes, previous results obtained on other data sets [18] under similar experimental conditions are also shown.

Table 2: Performance of plain, non-interactive HTR (WER) and CATTI (WSR), along with the corresponding Estimated user Effort Reduction with respect to manual transcription (EFR) and with respect to post-editing the raw output of plain HTR (EFR*). These preliminary results were obtained with a small dataset of 53 pages from the Bentham collection. Previous results on other data sets [18] are also shown for comparison purposes.

Dataset                   WER    WSR    EFR    EFR*

Bentham (preliminary)     31.5   27.7   72.3   12.1
IAMDB (modern English)    25.3   18.6   81.4   26.5
CS (XIX Cent. Spanish)    33.5   28.4   71.6   15.2

According to these results, to produce 100 words of a correct transcription in the Bentham task, a CATTI user would have to type fewer than 30 words; the remaining 70 are automatically predicted by CATTI. That is to say, the CATTI user would save about 70% of the (typing and, in part, thinking) effort needed to produce all the text manually. On the other hand, when interactive transcription is compared with post-editing, out of every 100 (non-interactive) word errors, the CATTI user would have to interactively correct only 88. The remaining 12 errors would be automatically corrected by CATTI, thanks to the feedback information derived from other interactive corrections.


4.3 Implementation of a CATTI-based transcription platform

A first working version of the CATTI engine, along with the corresponding APIs, web server and proof-of-concept client (GUI), has been developed and implemented:

• Interactive-predictive HTR server supporting CATTI API calls: startSession(obj), setPrefix(obj), rejectSuffix(obj), ...

Preliminary API proposal available at: transcriptorium.eu/demots/js/lib/catclient.predictive.js

This API complements the basic HTR server calls discussed in T5.2

• Proof of concept GUI supports CATTI functionality: transcriptorium.eu/demots/poc/

• Advanced user interface being developed by ULCC in WP6
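As an illustration of how a client might drive the predictive API, the following sketch wires the event-based calls against a stub server. The event and field names follow the Appendix A.2 proposal; the stub itself, the fixed n-best answer, and the image name are our inventions for the example:

```javascript
// Stub standing in for the real CATTI client/server connection.
function makeStubClient(nbestTarget) {
  const handlers = {};
  return {
    on(event, cb) { handlers[event] = cb; },
    startSession(obj) { handlers.startSessionResult({ errors: [], data: {} }); },
    setPrefix(obj) {
      handlers.setPrefixResult({
        errors: [],
        data: { nbest: [{ target: nbestTarget }] },
      });
    },
  };
}

let suggestion = null;
const client = makeStubClient("antiguos ciudadanos que en Castilla se llamaban");
client.on("startSessionResult", res => { /* session is ready */ });
client.on("setPrefixResult", res => { suggestion = res.data.nbest[0].target; });
client.startSession({ source: "line-17.png" });          // hypothetical line image id
client.setPrefix({ target: "antiguos ", caretPos: 9 });  // user-validated prefix
```

Against the real server, `setPrefix` would return a suffix prediction conditioned on the validated prefix, exactly as in the CATTI loop of Section 4.1.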

4.4 Plans for next period

The work on this task has just started but, thanks to the use of existing background, progress has been fast. Plans for the following period include:

• Fully develop interactive HTR (CATTI) technology,

• Implement stable versions of the CATTI engine, along with the corresponding APIs and web server,

• Thoroughly test and debug the remote connection of a CATTI server with ULCC user interfaces.

5 Interactive techniques for KWS

5.1 Introduction

Today, many digitized ancient manuscripts are not exploited due to a lack of proper browsing and indexing tools. These documents are characterized by complicated layout variations and various forms of writing, while many of them are degraded. The strategies employed for modern printed documents, such as OCR systems, do not work here, as they are sensitive to noise, character variation and text layout. As a result, there is currently no system that analyzes and searches words residing in ancient handwritten manuscripts in a satisfactory way. A valid strategy to deal with this kind of unindexed documents is a word matching procedure relying on low-level pattern matching, called word spotting. It can be defined as the task of identifying locations on a document image which have a high probability of containing an instance of a queried word, without explicitly recognizing it. It is accomplished by extracting feature vectors from the word shape and texture. Unfortunately, sometimes there is a semantic gap between these vectors and the semantic meaning of the word images. Relevance Feedback is an online procedure which incorporates user knowledge in order to provide more precise retrieval results. Figure 24 presents the architecture of a typical Relevance Feedback (RF) process, which comprises the following steps:

1. the system presents the initial retrieval results to the user

2. the user labels some items as relevant (correct) or irrelevant (wrong)

3. the system incorporates the user feedback and provides new retrieval results

4. the above steps are repeated until the user is satisfied
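The four steps above can be sketched schematically (the function names and the stopping criterion are illustrative, not a particular system's):

```javascript
// `rank` stands for the system's retrieval function: given the relevant
// items labeled so far, it returns a new ranked result list.
// `getLabels` stands for the user: it labels results and signals satisfaction.
function relevanceFeedback(rank, getLabels, maxRounds) {
  let results = rank([]);                 // step 1: initial retrieval
  for (let round = 0; round < maxRounds; round++) {
    const labels = getLabels(results);    // step 2: user marks relevant/irrelevant
    if (labels.satisfied) break;          // step 4: stop when the user is happy
    results = rank(labels.relevant);      // step 3: re-rank using the feedback
  }
  return results;
}
```

The `maxRounds` cap is a practical guard; real sessions stop when the user is satisfied.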


[Figure 24 diagram: an initial query word image is submitted to the learning system, which retrieves results from the database; the user labels words as relevant/irrelevant, and this feedback re-enters the learning system in a relevance feedback loop until the final retrieval results are returned.]

Figure 24: Typical Relevance Feedback architecture

5.2 Previous Work

There are two main learning paradigms for relevance feedback: the planned [4, 21] and the greedy [28, 3] paradigms. In the former, the system first presents the most informative images, so that the underlying learning algorithm can better analyze the relevance distribution of the CBIR results. Finally, after a number of interactions, the most relevant images are presented to the user. In the greedy paradigm, the system straight away presents the most relevant images for the user to grade.

Relevance feedback schemes can also be categorized into those that aim to modify the initial query and those that mainly intend to alter the similarity measure handling the ranking of the results.

The query vector modification (QVM) approach repeatedly reformulates the query vector through user feedback, so as to move the query toward relevant images and away from non-relevant ones.
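The classic Rocchio update is one standard instance of this idea (shown here for illustration; the cited systems do not necessarily use exactly this formula): q′ = α·q + β·mean(relevant) − γ·mean(non-relevant).

```javascript
// Rocchio-style query vector modification over feature vectors.
function rocchio(q, relevant, nonRelevant, alpha, beta, gamma) {
  // Component-wise mean of a list of vectors (zero vector if the list is empty).
  const mean = vs => q.map((_, i) =>
    vs.length ? vs.reduce((s, v) => s + v[i], 0) / vs.length : 0);
  const r = mean(relevant), n = mean(nonRelevant);
  return q.map((x, i) => alpha * x + beta * r[i] - gamma * n[i]);
}
```

Each round pulls the query toward the centroid of the words the user marked relevant and pushes it away from the non-relevant ones.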

Usually, the similarity measure handling is implemented through a machine learning algorithm.

Howe [9] compares different strategies using AdaBoost. Guo et al. [7] performed a comparison between AdaBoost and SVM and found that SVM gives superior retrieval results. Unfortunately, the problem of using machine learning algorithms lies in the small size of the training set.

Zagoris et al. [28] developed a relevance feedback mechanism for a handwritten word spotting system based on Support Vector Machines (SVM). The algorithm uses the user-supplied information as training data for the SVM and a normalized decision function value as relevance score.

Chen et al. [3] described a one-class SVM method for updating the feedback space which produces substantially improved results. Da Silva et al. [4] present a framework based on the optimum-path forest classifier, using normalized distances to special positive and negative examples (called prototypes).

Finally, Rusiñol and Lladós [19] described some relevance feedback strategies, including query modification and query expansion. The best results in their setup were obtained with the similarity measure modification algorithm of Giacinto and Roli [6].


5.3 Initial Work and Plans for Next Year

Figure 25 presents the proposed relevance feedback technique. It is based on the query modification paradigm, as it expands the query vector. In detail, the steps of the proposed algorithm are:

1. the system presents the initial retrieval results to the user

2. the user selects the correct results

3. for each positive word, a retrieval operation will be performed

4. all the ranked lists will be merged and presented to the user

[Figure 25 diagram: the user selects similar word images; for each positive word, the similarity measure is evaluated against the database, producing a ranked list; the per-word ranked lists are then combined into a single merged ranked list.]

Figure 25: Proposed Relevance Feedback architecture

In the next year, different merged-ranked-list strategies will be explored, based on an investigation of the score distribution, including:

• Similarity Value models, such as:

– CombMIN: choose the minimum of the similarity values

– CombMAX: choose the maximum of the similarity values

– CombMED: take the median of the similarity values

– CombSUM: take the sum of the similarity values

• Probabilistic Models

• Rank-based Models
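The similarity-value fusion rules can be stated compactly (an illustrative sketch; `values` holds one word image's similarity scores across the per-query ranked lists):

```javascript
// Fuse one candidate's similarity values from several ranked lists
// according to one of the Comb* rules listed above.
function fuse(values, rule) {
  const sorted = values.slice().sort((a, b) => a - b);
  switch (rule) {
    case "CombMIN": return sorted[0];
    case "CombMAX": return sorted[sorted.length - 1];
    case "CombMED": {
      const m = sorted.length >> 1;
      return sorted.length % 2 ? sorted[m] : (sorted[m - 1] + sorted[m]) / 2;
    }
    case "CombSUM": return values.reduce((s, v) => s + v, 0);
    default: throw new Error("unknown rule: " + rule);
  }
}
```

CombSUM rewards words that score well in many lists, while CombMAX lets a single strong match dominate; which behaviour is preferable is precisely what the planned score-distribution study should reveal.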

For similarity-based modifications, we will try to introduce machine learning techniques and include more active participation of the local document-specific points in the proposed User Feedback Model. Finally, as the query-by-example word spotting framework is considered a recommender system for a transcription operation, we will explore speedup relevance feedback techniques.


References

[1] V. Alabau, D. Ortiz-Martínez, V. Romero, and J. Ocampo. A multimodal predictive-interactive application for computer assisted transcription and translation. In ICMI-MLMI '09: Proceedings of the 2009 International Conference on Multimodal Interfaces, pages 227–228. ACM, 2009.

[2] Thomas M. Breuel. High performance document layout analysis. In Proc. Symp. Document Image Understanding Technology, 2003.

[3] Yunqiang Chen, Xiang Sean Zhou, and Thomas S. Huang. One-class SVM for learning in image retrieval. In Proceedings of the 2001 International Conference on Image Processing, volume 1, pages 34–37. IEEE, 2001.

[4] André Tavares da Silva, Alexandre Xavier Falcão, and Léo Pini Magalhães. Active learning paradigms for CBIR systems based on optimum-path forest classification. Pattern Recognition, 44(12):2971–2978, 2011.

[5] Andreas Fischer, Volkmar Frinken, Alicia Fornés, and Horst Bunke. Transcription alignment of Latin manuscripts using hidden Markov models. In Proceedings of the 2011 Workshop on Historical Document Imaging and Processing, HIP '11, pages 29–36, New York, NY, USA, 2011. ACM.

[6] Giorgio Giacinto and Fabio Roli. Instance-based relevance feedback for image retrieval. In NIPS, 2004.

[7] Guodong Guo, HongJiang Zhang, Stan Z. Li, et al. Boosting for content-based audio classification and retrieval: An evaluation. In ICME, 2001.

[8] Karim Hadjar, Oliver Hitz, Lyse Robadey, and Rolf Ingold. Configuration recognition model for complex reverse engineering methods: 2(CREM). In Daniel Lopresti, Jianying Hu, and Ramanujan Kashi, editors, Document Analysis Systems V, volume 2423 of Lecture Notes in Computer Science, pages 469–479. Springer Berlin Heidelberg, 2002.

[9] Nicholas R. Howe. A closer look at boosted image retrieval. In Image and Video Retrieval, pages 61–70. Springer, 2003.

[10] Georgios Louloudis, Basilios Gatos, Ioannis Pratikakis, and Constantin Halatsis. Text line and word segmentation of handwritten documents. Pattern Recognition, 42(12):3169–3183, 2009.

[11] Zheng Lu, Zheng Wu, and Michael S. Brown. Directed assistance for ink-bleed reduction in old documents. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2009), pages 88–95. IEEE, 2009.

[12] Zheng Lu, Zheng Wu, and Michael S. Brown. Interactive degraded document binarization: An example (and case) for interactive computer vision. In Workshop on Applications of Computer Vision (WACV 2009), pages 1–8. IEEE, 2009.

[13] Murray. ABI Stability of C++ Libraries. http://www.murrayc.com/blog/permalink/2007/03/12/abi-stability-of-c-libraries/.

[14] Oliver A. Nina. Interactive enhancement of handwritten text through multi-resolution Gaussian. In 2012 International Conference on Frontiers in Handwriting Recognition (ICFHR), pages 769–773. IEEE, 2012.

[15] N. Otsu. A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics, 9(1):62–66, 1979.

[16] Jean-Yves Ramel, Nicolas Sidère, and Frédéric Rayar. Interactive layout analysis, content extraction, and transcription of historical printed books using pattern redundancy analysis. Literary and Linguistic Computing, 28(2):301–314, 2013.

[17] V. Romero, L. A. Leiva, V. Alabau, A. H. Toselli, and E. Vidal. A web-based demo to interactive multimodal transcription of historic text images. In Proceedings of the 13th European Conference on Digital Libraries (ECDL), pages 459–460. Springer Berlin/Heidelberg, 2009.

[18] V. Romero, A. H. Toselli, and E. Vidal. Multimodal Interactive Handwritten Text Transcription. Series in Machine Perception and Artificial Intelligence (MPAI). World Scientific Publishing, 1st edition, 2012.

[19] Marçal Rusiñol and Josep Lladós. The role of the users in handwritten word spotting applications: Query fusion and relevance feedback. In 2012 International Conference on Frontiers in Handwriting Recognition (ICFHR), pages 55–60. IEEE, 2012.

[20] Vavilis Sokratis and Ergina Kavallieratou. A tool for tuning binarization techniques. In ICDAR, volume 11, pages 1–5, 2011.

[21] Simon Tong and Edward Chang. Support vector machine active learning for image retrieval. In Proceedings of the Ninth ACM International Conference on Multimedia, pages 107–118. ACM, 2001.

[22] A. H. Toselli, E. Vidal, and F. Casacuberta, editors. Multimodal-Interactive Pattern Recognition and Applications. Springer, 2011.

[23] A. H. Toselli, V. Romero, M. Pastor, and E. Vidal. Multimodal interactive transcription of text images. Pattern Recognition, 43(5):1824–1825, 2010.

[24] tranScriptorium. D.3.1.2: Description and evaluation of tools for DIA, HTR and KWS. 2013.

[25] Tijn van der Zant, Lambert Schomaker, and Axel Brink. Interactive evolutionary computing for the binarization of degenerated handwritten images. In Electronic Imaging 2008, pages 681507–681507. International Society for Optics and Photonics, 2008.

[26] Wikipedia. Bridge pattern. http://en.wikipedia.org/wiki/Bridge_pattern.

[27] Wikipedia. Opaque pointer. http://en.wikipedia.org/wiki/Opaque_pointer.

[28] Konstantinos Zagoris, Kavallieratou Ergina, and Nikos Papamarkos. Image retrieval systems based on compact shape descriptor and relevance feedback information. Journal of Visual Communication and Image Representation, 22(5):378–390, 2011.


A Full Communication, Low-level API to Connect HTR Servers and Web-based Clients

A.1 catclient.js

/**
 * Server connection method.
 * @param url {String} Server URL to connect to
 */
function connect(url);

/**
 * Event handler.
 * @param event {Mixed} String or Array of strings name of trigger
 * @param callback {Function} Callback
 */
function on(event, callback);

/**
 * Event triggering.
 */
function trigger(...arguments);

/**
 * Check connection status.
 */
function isConnected();

/**
 * Tries to reconnect if connection drops.
 */
function checkConnection();

/**
 * Pings the server to measure round-trip latency.
 * @param {Object}
 * @setup obj
 *   ms {Number}
 * @trigger pingResult
 * @return {Object}
 *   errors {Array} List of error messages
 *   data {Object}
 * @setup data
 *   ms {Number} Original ms
 *   elapsedTime {Number} ms, 0 by definition
 */
function ping(obj);

/**
 * Configures server as specified by the client.
 * @param {Object} Server-specific configuration
 * @trigger configureResult
 * @return {Object}
 *   errors {Array} List of error messages
 *   data {Object} Configuration after setting the server
 * @setup data
 *   config {Object} Server-specific configuration
 *   elapsedTime {Number} ms
 */
function configure(obj);

/**
 * Validates source-target pair.
 * @param {Object}
 * @setup obj
 *   source {String}
 *   target {String}
 * @trigger validateResult
 * @return {Object}
 *   errors {Array} List of error messages
 *   data {Object}
 * @setup data
 *   elapsedTime {Number} ms
 */
function validate(obj);

/**
 * Resets server.
 * @trigger resetResult
 * @return {Object}
 *   errors {Array} List of error messages
 *   data {Object} Response data
 * @setup data
 *   elapsedTime {Number} ms
 */
function reset();

/**
 * Retrieves server configuration.
 * @trigger getServerConfigResult
 * @return {Object}
 *   errors {Array} List of error messages
 *   data {Object} Response data
 * @setup data
 *   config {Object} Server-specific configuration
 *   elapsedTime {Number} ms
 */
function getServerConfig();

/**
 * Retrieves decoding results for the current segment.
 * @param {Object}
 * @setup obj
 *   source {String}
 *   numResults {Number} How many results should be retrieved
 * @trigger decodeResult
 * @return {Object}
 *   errors {Array} List of error messages
 *   data {Object}
 * @setup data
 *   source {String}
 *   sourceSegmentation {Array} Verified source segmentation
 *   elapsedTime {Number} ms
 *   nbest {Array} List of objects
 * @setup nbest
 *   target {String} Result
 *   targetSegmentation {Array} Segmentation of result
 *   elapsedTime {Number} ms
 *   [author] {String} Technique or person that generated the result
 *   [alignments] {Array} Dimensions: source * target
 *   [confidences] {Array} List of floats for each token
 *   [quality] {Number} Quality measure of overall hypothesis
 */
function decode(obj);

/**
 * Retrieves tokenization results for the current segment.
 * @param {Object}
 * @setup obj
 *   source {String}
 *   target {String}
 * @trigger getTokensResult
 * @return {Object}
 *   errors {Array} List of error messages
 *   data {Object}
 * @setup data
 *   source {String}
 *   sourceSegmentation {Array} Verified source segmentation
 *   target {String} Result
 *   targetSegmentation {Array} Segmentation of result
 *   elapsedTime {Number} ms
 */
function getTokens(obj);

A.2 catclient.predictive.js

/**
 * Start predictive session.
 * @param {Object}
 * @setup obj
 *   source {String}
 * @trigger startSessionResult
 * @return {Object}
 *   errors {Array} List of error messages
 *   data {Object}
 * @setup data
 *   elapsedTime {Number} ms
 */
function startSession(obj);

/**
 * Sets the validated prefix and retrieves suffix predictions for the current segment.
 * @param {Object}
 * @setup obj
 *   target {String} Segment text
 *   caretPos {Number} Index position of caret cursor
 *   [numResults] {Number} How many results should be retrieved (default: 1)
 * @trigger setPrefixResult
 * @return {Object}
 *   errors {Array} List of error messages
 *   data {Object}
 * @setup data
 *   source {String} Verified source
 *   sourceSegmentation {Array} Verified source segmentation
 *   elapsedTime {Number} ms
 *   nbest {Array} List of objects
 * @setup nbest
 *   target {String} Result
 *   targetSegmentation {Array} Segmentation of result
 *   elapsedTime {Number} ms Time to process each result
 *   [author] {String} Technique or person that generated the result
 *   [alignments] {Array} Dimensions: source * target
 *   [confidences] {Array} List of floats for each token
 *   [quality] {Number} Quality measure of overall hypothesis
 *   [priorities] {Array} List of integers, indicates token positions
 */
function setPrefix(obj);

/**
 * End predictive session for the current segment.
 * @trigger endSessionResult
 * @return {Object}
 *   errors {Array} List of error messages
 *   data {Object}
 * @setup data
 *   elapsedTime {Number} ms
 */
function endSession();
