Petr Sojka (editor)

DML 2010

Towards a Digital Mathematics Library

Paris, France
July 7–8th, 2010

Proceedings

Masaryk University, Brno, 2010

Proceedings Editor

Petr Sojka
Faculty of Informatics, Masaryk University
Department of Computer Graphics and Design
Botanická 68a
CZ-602 00 Brno, Czech Republic
Email: [email protected]

CATALOGUING-IN-PUBLICATION – NATIONAL LIBRARY OF THE CZECH REPUBLIC

DML 2010 (Paris, France)

DML 2010 : Towards a Digital Mathematics Library : Paris, France, July 7-8th, 2010 : proceedings / Petr Sojka (editor). – 1st ed. – Brno : Masaryk University, 2010. – VIII+135 p.

ISBN 978-80-210-5242-0

025:004.08 * 930.25:004.08 * 51:81’42’373.46 * 002.2:004 * 004.91 * 004.352.242 * 004.93’1 * 004.832.2

- digital libraries
- digital archives
- mathematical texts
- digitization of documents
- data processing
- OCR technology
- pattern recognition
- fulltext search
- proceedings of conferences

006 - Special computer methods [23]

004.9 - Special computer methods. Computer graphics [23]

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the Czech Copyright Law, in its current version, and permission for use must always be obtained from Masaryk University. Violations are liable for prosecution under the Czech Copyright Law.

© Masaryk University, 2010

ISBN 978-80-210-5242-0

Organization

DML 2010 was organized by the Faculty of Informatics, Masaryk University, Brno, Czech Republic, with the help of CNAM (Conservatoire National des Arts et Métiers), Paris, France. The web page of the workshop is http://www.fi.muni.cz/~sojka/dml-2010.html.

Program Committee

José Borbinha (Technical University of Lisbon, IST, PT)
Thierry Bouche (University Grenoble I, Cellule Mathdoc, FR)
Michael Doob (University of Manitoba, Winnipeg, CA)
Thomas Fischer (Goettingen University, Digitization Center, DE)
Yannis Haralambous (Télécom Bretagne, FR)
Václav Hlaváč (Czech Technical University, Faculty of Engineering, Prague, CZ)
Michael Kohlhase (Jacobs University Bremen, DE)
Janka Chlebíková (Comenius University, MFF, Bratislava, SK)
Enrique Macías-Virgós (University of Santiago de Compostela, ES)
Jiří Rákosník (Academy of Sciences, Institute of Mathematics, Prague, CZ)
Eugénio Rocha (University of Aveiro, Dept. of Mathematics, PT)
David Ruddy (Cornell University, Library, US)
Petr Sojka (Masaryk University, Faculty of Informatics, Brno, CZ) [chair]
Volker Sorge (University of Birmingham, UK)
Masakazu Suzuki (Kyushu University, Faculty of Mathematics, JP)

Organizing Committee

Michal Růžička (technical support and administrative contact), Renaud Rioboo, Laurence Rideau (local organization), and Petr Sojka (chair, Proceedings)

Sponsors and Support

The DML workshop and preparation of the Proceedings were supported by Masaryk University, Brno, and by the EU project EuDML, project number 250,503.





Table of Contents

I Towards a Digital Mathematics Library

Towards a Digital Mathematical Library: On the Road ... 3
Petr Sojka (Masaryk University, Brno, Czech Republic)

Mathematical Formulae Recognition and Logical Structure Analysis of Mathematical Papers ... 7
Masakazu Suzuki (Kyushu University, Kyushu, Japan)

II Digitization Technologies and Platforms

EuDML—Towards the European Digital Mathematics Library ... 11
Wojtek Sylwestrzak (University of Warsaw, Warsaw, Poland), José Borbinha (Instituto Superior Técnico, Lisbon, Portugal), Thierry Bouche (Université Joseph-Fourier, Grenoble, France), Aleksander Nowiński (University of Warsaw, Warsaw, Poland), Petr Sojka (Masaryk University, Brno, Czech Republic)

Developing a Metadata Exchange Format for Mathematical Literature ... 27
David Ruddy (Cornell University Library, Ithaca, New York, USA)

Designing a Semantic Ground Truth for Mathematical Formulae ... 37
Alan Sexton, Volker Sorge (University of Birmingham, Birmingham, UK), Masakazu Suzuki (Kyushu University, Kyushu, Japan)

III DML Building Experience

PDF Enhancements Tools for a Digital Library ... 45
Radim Hatlapatka, Petr Sojka (Masaryk University, Brno, Czech Republic)

Metadata Editing and Validation for a Digital Mathematics Library ... 57
Miha Filej (University of Ljubljana, Ljubljana, Slovenia), Michal Růžička, Martin Šárfy, Petr Sojka (Masaryk University, Brno, Czech Republic)

Implementing Dynamic Visualization as an Alternative Interface to a Digital Mathematics Library ... 63
Zuzana Nevěřilová (Masaryk University, Brno, Czech Republic)

Data Enhancements in a Digital Mathematical Library ... 69
Michal Růžička, Petr Sojka (Masaryk University, Brno, Czech Republic)

IV Digitization Reports

bdim: the Italian Digital Mathematical Library ... 79
Vittorio Coti Zelati (Università degli Studi di Napoli “Federico II”, Napoli, Italy)

INSPIRE: Realizing the Dream of a Global Digital Library in High-Energy Physics ... 83
Annette Holtkamp, Salvatore Mele, Tibor Šimko, Tim Smith (CERN, Geneve, Switzerland)

V Tools and Techniques

Mathematical Communication and Representation in a Virtual Learning Environment: A Case Study ... 95
César Córcoles, Antonia Huertas (Open University of Catalonia, Barcelona, Spain)

Producing MathML with Tralics ... 105
José Grimm (Institut National de Recherche en Informatique et en Automatique, Sophia Antipolis, France)

Symbol Declarations in Mathematical Writing: A Corpus Study ... 119
Magdalena Wolska (Universität des Saarlandes, Saarbrücken, Germany), Mihai Grigore (Jacobs University, Bremen, Germany)

Subject Index ... 129

Name Index ... 133

Author Index ... 134

Part I

Towards a Digital Mathematics Library

Towards a Digital Mathematical Library
On the Road

Petr Sojka

Masaryk University, Faculty of Informatics
Botanická 68a, 60200 Brno, Czech Republic

[email protected]

Abstract. The workshop’s objectives were to formulate the strategy and goals of a global mathematical digital library and to summarize the current successes and failures of ongoing technologies and related projects.
There is already some experience with building smaller DMLs and/or building big thematic scientific digital libraries. Why are there already big full-text digital libraries in some domains, such as PubMed in biomedicine, but none in others? We pose these and other questions and try to find some answers in the papers of these proceedings.

There is no royal road to mathematics. (Menaechmus, 380–320 BC)

1 The Dream

Mathematicians dream of a digital archive containing all peer-reviewed mathematical literature ever published, properly linked and validated/verified. It is estimated that the entire corpus of mathematical knowledge published over the centuries does not exceed 100,000,000 pages, an amount easily manageable by current information technologies.

“There is no royal road to mathematics”, Alexander the Great was told thousands of years ago. It seems that there is no royal road to DML, either. To fulfill the dream, a concerted action of Digital specialists and computer scientists, Mathematicians and topical experts, and Librarians, curators and information specialists is needed.

Our workshop’s objective is to pave the road for DML, built bottom-up from smaller repositories. It is to formulate the strategy and goals of a global mathematical digital library and to summarize the current successes and failures of ongoing technologies and related projects, asking such questions as:

* What technologies, standards, algorithms and formats should be used and what metadata should be shared?

* What business models are suitable for publishers of mathematical literature, authors and funders of their projects and institutions?

* Is there a model of a sustainable, interoperable, and extensible mathematical library that mathematicians can use in their everyday work?

Petr Sojka (editor): DML 2010, Towards a Digital Mathematics Library, pp. 3–6.
© Masaryk University, 2010 ISBN 978-80-210-5242-0



* What is the best practice for
  • retrodigitized mathematics (from images via OCR to MathML and/or TeX);
  • retro-born-digital mathematics (from existing electronic copy in DVI, PS or PDF to MathML and/or TeX);
  • born-digital mathematics (how to make needed metadata and file formats available as a side effect of publishing workflow [CEDRAM model, Euclid])?

The intention was to have the workshop as a forum for presentation and discussion of the latest developments in the field of digitization of mathematics, building on previous bilateral discussions and the successful workshops DML 2008, held as part of the CICM multiconference in Birmingham, UK, and DML 2009, held in Grand Bend, Ontario, Canada, last year.

Topics of the Workshop included

* search, indexing and retrieval of mathematical documents;
* ranking of mathematical papers, similarity of mathematical documents;
* math OCR with MathML/TeX output;
* document conversions from/to MathML, OpenMath, LaTeX, PostScript and [tagged] PDF;
* mathematical document compression;
* processing of scanned images;
* algorithms for crosslinking of bibliographical items, intext citations search;
* mathematical document classification, MSC 2010;
* mathematical text mining;
* mathematical documents metadata exchange via OAI-PMH and/or OAI-ORE;
* long term archiving, data migration;
* reports and experience from math digitization projects;
* math publishing with long term archival goal;
* software engineering aspects of creating, handling MathML, OMDoc, OpenMath documents, and displaying them in web browsers.

The four branches of arithmetic – ambition, distraction, uglification and derision. (Lewis Carroll: Alice in Wonderland)

2 On the Road

This year we concentrate on core technologies for building DML, and it is the first year of realizing EuDML, the European Digital Mathematics Library, designed as a virtual library over existing smaller repositories. EuDML is presented on page 11 in a paper that covers the history of European activities towards DML, the project plans, and the steps to reach its goals.

The invited talk by Masakazu Suzuki on Infty and his tools to automate math OCR brings Japanese know-how of digitizing big volumes of mathematical papers, including structural information and mathematical formulae. This is the only verified way of getting texts with mathematics for fulltext search, math retrieval and trustworthy paper similarity computations on a larger scale today. Handling fulltexts with math together with semantic handling of math is the hard way towards semantic-aware math retrieval; for more about semantic ground truth see pages 37–42.

David Ruddy’s paper on page 27 gives timely information on a possible solution for a mathematical metadata exchange format, based on the experience with running the Euclid DML.

The DML Building Experience block of papers shares experience gained while building the DML-CZ repository. The pdfJbIm and pdfsign tools (pages 45–55) are general-purpose tools with impressive results. Other papers describe the Metadata editor and validation tools for shaping metadata into a usable shared database, and the Visual Browser, which offers an alternative graphical interface for DML (meta)data.

The Digitization Reports block starting on page 79 not only brings information about the new Italian baby in the family of DML repositories, but also gives insight into how the dream of high-energy physicists about their digital library is becoming a reality.

Finally, the respected reader can read a case study about mathematical communication and representation in virtual learning environments on page 95. Tralics, a tool for bridging the authors’ world of TeX and the applications’ world of XML/MathML, is presented on page 105. The tough arena of mathematical formulae semantics may be entered on page 119 with a paper about symbol declarations in mathematical writing.

Ring the bells that still can ring
Forget your perfect offering
There is a crack in everything
That’s how the light gets in.
(Leonard Cohen: Anthem)

3 Summary

This volume contains the Proceedings of the Workshop Towards a Digital Mathematics Library (DML 2010), organized by the Faculty of Informatics, Masaryk University with the help of CNAM, Paris and held on July 7–8th, 2010 in Paris, France, as a satellite event of CICM 2010 (Conference on Intelligent Computer Mathematics). The Proceedings is divided into five parts:

1. Towards a Digital Mathematics Library,
2. Digitization Technologies and Platforms,
3. DML Building Experience,
4. Digitization Reports, and
5. Tools and Techniques.

My very special thanks go to the Program Committee members for their hard work during the review periods. Most of the submitted papers were reviewed by three members of the Program Committee, some even by four. We employed a rebuttal phase, in which authors were given the possibility to comment on the preliminary review reports and to answer the anonymous reviewers’ questions. It helped to increase the quality of the final paper versions considerably.

I would also like to express my appreciation to the members of the Organizing Committee for their efforts in organizing the Workshop and ensuring its smooth running, and to CICM general chairs Renaud Rioboo and Laurence Rideau.

Last but not least, the cooperation of Masaryk University as a publisher of these Proceedings is gratefully acknowledged.

DML 2010 offered a rich program of presentations, short talks/posters, technical papers and [panel] discussions. I hope that another step on the road towards fulfilling the dream of the world-wide Digital Mathematics Library has been successfully completed.

Mathematical Formulae Recognition and Logical Structure Analysis of Mathematical Papers

Extended Abstract

Masakazu Suzuki

Kyushu University, Kyushu, Japan
[email protected]

Abstract. In most cases, current on-line journals in mathematics are supplied as PDF files with print images of papers in the front and hidden OCR’ed text behind, to provide a keyword search facility. The embedded hidden text usually does not include good information about the mathematical formulae in the papers.
We can say that, for the future development of DML, it is desirable to include in the digitised journals more structured information about the content of mathematical papers, e.g. tag information indicating the logical structure of papers, such as headings of sections, definitions, theorems, lemmas, etc., together with the structures of the mathematical formulae included.
In the talk, I will present the current stage of our technology for extracting such information from the scanned images of retro-digitised mathematical papers. Mechanically prepared new journals in the form of PDF are also a target of our research, since it is not an easy task to get a uniform structure description of mathematical formulae, for example from the original LaTeX source with various styles and macro commands depending on authors.
Although many methods for recognizing mathematical formulae have been presented in the literature, very few practical applications have appeared. One of the major problems in the development of math OCR is avoiding fatal effects caused by mis-recognition and mis-segmentation of characters and symbols. In the talk, I will first explain the method we took to overcome this difficulty. A demonstration of our software InftyReader for recognizing mathematical documents will also be given in the lecture. Secondly, as a better approach to recognizing a large number of pages, as in the case of DML, our adaptive method for improving the recognition rates of characters/symbols, mathematical formulae structures and logical structures of articles will also be presented.





Part II

Digitization Technologies and Platforms

EuDML—Towards the European Digital Mathematics Library

Architecture and Design

Wojtek Sylwestrzak¹, José Borbinha², Thierry Bouche³, Aleksander Nowiński¹, and Petr Sojka⁴

1 Interdisciplinary Center for Mathematical and Computational Modelling, University of Warsaw, ul. Pawińskiego 5A, 02-106 Warsaw, Poland
[email protected], [email protected]

2 Instituto Superior Técnico: Computer Science Department, Lisbon, Portugal
[email protected]

3 Institut Fourier (UMR 5582) & Cellule Mathdoc (UMS 5638), Université Joseph-Fourier (Grenoble 1), B.P. 74, 38402 Saint-Martin d’Hères, France
[email protected]

4 Masaryk University, Faculty of Informatics, Botanická 68a, 602 00 Brno, Czech Republic
[email protected]

Abstract. The paper describes the background, the expected functionalities, and the architecture design goals of the European Digital Mathematics Library (EuDML), an infrastructure system aimed at integrating the mathematical content available online throughout Europe, allowing for both extensive and specialized mathematics resource discovery. The three-year project to build the EuDML, partially funded by the European Commission, started in February 2010.
Key words: bibliography crosslinking, document ranking, digital libraries, DML, EuDML, information systems, information retrieval, citations discovery and extraction, mathematical content search, mathematical metadata, mathematics indexing, REPOX, similarity analysis, text mining, YADDA, Web 2.0

1 Introduction

Mathematics is a specific discipline in many respects. One often hears that all a mathematician needs to work is a pencil and a piece of paper. This is not entirely true. One of the unique qualities of mathematics is its intrinsic dependence on previous works, so one of the basic tools of a mathematician is a library. Similarly, many other disciplines of science and research depend on mathematical knowledge, and for them, too, access to a mathematical library is a requisite. In modern times, this translates into the requirement of online availability of the mathematical content. This is why numerous local initiatives, first to digitize and then to make local mathematical content available, started spontaneously and are still active in many countries. However, for a number of reasons, it is necessary to provide integrated access to all the accumulated material, especially because in mathematics, unlike in many other disciplines, the language of a publication is to a lesser extent a barrier. A mathematician looking for a proof of a certain theorem is still concerned about the place, date or language of the publication, but to a lesser extent than researchers in other, less universal disciplines. Therefore it is essential that all the worldwide mathematical content, although distributed and heterogeneous in nature, is presented in a seamless way, through a unified interface suitable for mathematics search purposes, as well as for future reliable reference.

This need has already prompted a large array of international activities since 2001, when the DML concept emerged from pioneering work by John Ewing, Philippe Tondeur and Keith Dennis, who initiated a meeting in San Diego (California) in January 2002, where mathematical societies, digitisation projects and academic publishers recognized the need for this infrastructure [22,18]. Since then, however, no actual progress towards integration of heterogeneous collections has been achieved. A one-year (2002–2003) planning project [13] coordinated by Cornell University Library was funded by the U.S. National Science Foundation (NSF) “toward the establishment of a comprehensive, international, distributed collection of digital information and published knowledge in mathematics”, with a steering committee that happened to be mostly European. Most of the conclusions from that group are still valid today: the need for standardization and coordination, and the identification of intellectual property rights and conflicts of interest among stakeholders as principal inhibitors. In 2005, the Moore Foundation was approached by the American Mathematical Society (AMS) and considered funding a gigantic World DML, but faced the same inhibiting factors. However, while no consensus could be reached in the areas of ownership of the mathematical heritage or governance of the DML, and no global funding scheme seemed realistic, numerous digitisation projects were running or launched, most of which were based in Europe. In 2006 the International Mathematical Union endorsed a generous text [23] written by its Committee on Electronic Information and Communication (CEIC). Although this probably served as an incentive to launch new local DML projects, nothing happened in the area of integration, even though the EMANI project was a pioneering attempt in this direction [32].

The European Mathematical Society (EMS) was a driving force in many of these efforts: it wrote an Expression of Interest to the European Union to support an application for the European chapter of the DML as an integrated infrastructure in the 5th Framework programme. In 2009, it contacted the European Science Foundation, which considered supporting a foresight study on a European DML; this showed that, almost a decade after the first attempts, we were more or less back to the situation of the NSF planning project.

One of the reasons for this situation is that commercial publishers had for a while considered using public funding to get their backfiles digitised, and were open to new business models (like JSTOR’s moving wall, or even NUMDAM’s moving wall) if that happened. But no funding source of the size needed ever surfaced, and each publisher found its own resources. Some of these were public (usually at a national level, often associated with some mandate for eventual open access along NUMDAM’s lines), but many publishers invested their own money, thus expected some return, and started to view DML projects as competitors.

We believe that this situation will not resolve itself, and therefore that scientists will still lack the necessary infrastructure for handling the reference mathematical corpus, unless a core group of stakeholders takes up the challenge of setting up an actual system that goes beyond the boundaries of the current individual projects. This core group must have a critical mass in content and know-how, and sufficient organisational diversity. They must take a pro-active approach and set the networking infrastructure, standards and policies so that we can build further on the current state of the art and aim at comprehensiveness in content while expanding geographically. We formed a consortium, and claim that, together with its associated partners (EMS and Göttingen University Library), it does form such a core group, in a position to make the first paradigmatic shift in this area, thanks to the support it obtained from the European Commission.

2 Overview

Despite the lack of success in building a global Digital Mathematics Library (DML) infrastructure [5,7,6,8], many local initiatives continue their development, and many new DML activities have started [2,4,30]. It is estimated that over 200 thousand items are already openly available online in European national projects, of which Germany provides around 85,000, France 60,000, Czech Republic 27,000, Russia 18,000, Poland 13,000, Spain 6,000, Switzerland 5,000, Serbia 4,500, and Portugal 2,000. A lot of mathematical content is also available in digital form from commercial publishers such as Springer or Elsevier; alas, it is not yet freely available, mostly due to the restricted access policies adopted by the publishers, who in most cases are also the copyright owners. Finally, a still unidentified number of publications may be available in electronic form in institutional repositories, archived there by their authors (with unknown scientific validation status).
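As a quick consistency check on the figures above, the per-country estimates can be summed. The snippet is only a sketch: the dictionary restates the numbers quoted in the text, and the variable names are illustrative.

```python
# Approximate numbers of openly available items per country, as quoted above.
items_by_country = {
    "Germany": 85_000,
    "France": 60_000,
    "Czech Republic": 27_000,
    "Russia": 18_000,
    "Poland": 13_000,
    "Spain": 6_000,
    "Switzerland": 5_000,
    "Serbia": 4_500,
    "Portugal": 2_000,
}

total = sum(items_by_country.values())
print(f"{total:,} items")  # 220,500 items
```

The listed countries alone add up to 220,500 items, which agrees with the estimate of over 200 thousand.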

Despite the impressive volume, the currently available digital mathematics content is often difficult to access: it is spread over a number of isolated interfaces, not registered, difficult to find or virtually inaccessible. Often not adequately curated, some of the content may be available only in volatile collections and is at risk of perishing.

A group of European stakeholders in DML joined their efforts to draft a project to build a common DML infrastructure for mathematical knowledge access in Europe. The three-year project, named EuDML, officially started in February 2010 [9]. EuDML will establish a pilot implementation of an integrated mathematics digital library system allowing seamless access to the otherwise dispersed digital material of the partners and associates. A comprehensive list of partner institutions is available at the project’s web site, http://www.eudml.eu/. The partners include national DML operators and mathematical content providers (the contributed collections include NUMDAM [21], CEDRAM [11], Zentralblatt MATH [20], EMIS’ ELibM [12], DML-CZ [1], DML-PL [37], DML-E [19], DML-PT [3], EDP Sciences’ math journals, and the Bulgarian and Hellenic DMLs [10,15]), digital library technology developers (IST⁵ developed REPOX [28], ICM⁶ YADDA [36] and MU⁷ several other tools [27,1]), a scientific publisher (EDP Sciences), and experienced service providers (ICM, which has provided large-scale content services for years, as well as FIZ⁸ and CMD⁹ in the area of mathematical literature). A professional company (MML¹⁰) will design the user interface. The European Mathematical Society will assess the usefulness of the system, and the University Library of Göttingen will contribute ERAM and RusDML books and journals [14,33,29].

The fact that the available digital maths corpus is already considerable provides an opportunity to reach the critical mass needed for wide acceptance of the EuDML infrastructure. The quality and scientific relevance of the freely available content, especially in Europe, is a strong asset. Many EuDML partners have been active for years, making digitally available a substantial portion of the mathematical treasures produced or published mainly in Europe since the XVIIth century. This is of course only one of the necessary conditions, the others being the overall quality of the proposed tools and solutions, and their acceptance by users.

3 Design Goals

The primary goal of the system is to provide a “one-stop-shop” unified access gateway to the distributed mathematical content, with innovative services tailored for a wide range of users. At the same time, the system is expected to play the role of an infrastructure solution, forming the basis of the future mathematical knowledge management and provisioning platform. To this end, the system design has to allow for future seamless integration with relevant mathematics tools as well as with existing and planned knowledge infrastructures. While it is envisaged that the future of knowledge management will be based on open access principles, and the initial versions of the system will focus on open content handling, it should also be capable of dealing with restricted licenses, and cater for selected intermediate and hybrid solutions, including the moving wall model. EuDML’s basic policy is to handle only content for which some kind of moving wall license has been obtained: this secures eventual open access to the content that is curated in the project, which appears to be a rather popular and efficient model for mathematical literature, given its life span. Besides the metadata, the system will also be able to store local fulltext copies, both for speedy text mining and for fulltext index rebuilds. This might also help in setting up eventual archiving of the curated corpus.

5 Instituto Superior Técnico, Lisbon, Portugal.
6 Interdisciplinary Center for Mathematical and Computational Modelling, Warsaw, Poland.
7 Masaryk University, Brno, Czech Republic.
8 The Berlin editorial office of Fachinformationszentrum Karlsruhe (FIZ) produces the Zentralblatt MATH Database and the digital library of journals ELibM.
9 Cellule Mathdoc, Grenoble, France.
10 Made Media Ltd. is a digital media agency based in Birmingham, UK.
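The moving wall policy described in this section can be illustrated with a small sketch. It is not EuDML code: the function name, the date-based rule and the five-year wall used in the usage lines are assumptions chosen for illustration, and real licenses may count by volume or calendar year instead.

```python
from datetime import date


def is_openly_accessible(published: date, today: date, wall_years: int) -> bool:
    """Return True once an item has passed the moving wall.

    Hypothetical rule: an item becomes open on the same calendar day
    `wall_years` years after its publication date.
    """
    try:
        opens_on = published.replace(year=published.year + wall_years)
    except ValueError:
        # Published on 29 February and the target year is not a leap year:
        # fall back to 1 March of the target year.
        opens_on = date(published.year + wall_years, 3, 1)
    return today >= opens_on


# Usage with an assumed five-year wall, evaluated on the workshop date.
print(is_openly_accessible(date(2004, 6, 1), date(2010, 7, 7), wall_years=5))  # True
print(is_openly_accessible(date(2008, 6, 1), date(2010, 7, 7), wall_years=5))  # False
```

Because the wall moves with the current date, every item handled under such a license eventually becomes open without any further negotiation, which is the property the policy relies on.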

The key features of the EuDML platform are its extensibility, allowing easy addition of new services (and content), and its scalability in many dimensions, including the content’s volume and structure, the number of services, the number of concurrent users, etc., without performance or reliability degradation. To this end, the system will be designed as a modular, distributed architecture, allowing modules to be replaced, or alternative modules realizing the same or similar functions to be provided, in the future.

The EuDML system will enable seamless access to the DML resources distributed through the heterogeneous repositories of the partners and other potential content contributors. A number of specific functional requirements are being defined, including mathematics-specific content support for consistent mathematical data presentation (of various provenance), or mathematical formulae search.

While the initial version of the system may be centralized, the requirement for reliability results in a need for a subsequent migration towards a fully distributed architecture that would allow for redundancy, limit single points of failure, and at the same time provide better overall scalability. For this reason, the architecture design has to assume a distributed operations model from the very beginning, even if the first prototypes are expected to be deployed in a simplified environment. Nevertheless, it is absolutely essential that the system in all its versions is able to work in an environment of distributed document and metadata sources, supporting heterogeneous content providers. OAI-PMH is the primary means of content harvesting, but other methods will be considered and implemented throughout the system’s lifetime.
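To make the harvesting step more concrete, the following minimal sketch shows how an OAI-PMH ListRecords request can be built and how a response can be parsed with the Python standard library. It does not describe the EuDML harvester: the function names, the endpoint http://repository.example.org/oai and the sample response are hypothetical, and only the protocol elements (the ListRecords verb, the metadataPrefix and resumptionToken arguments, and the OAI-PMH and Dublin Core namespaces) come from the OAI-PMH 2.0 specification.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespaces defined by the OAI-PMH 2.0 and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    if resumption_token:
        # resumptionToken is an exclusive argument: it replaces metadataPrefix.
        query = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        query = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(query)}"


def parse_list_records(xml_text):
    """Return ([(identifier, title), ...], resumption_token_or_None)."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        title = record.findtext(".//dc:title", namespaces=NS)
        records.append((identifier, title))
    token = root.findtext(".//oai:resumptionToken", namespaces=NS)
    return records, (token or None)


# Hypothetical response fragment, embedded so the sketch runs offline.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A sample article</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

records, token = parse_list_records(SAMPLE)
print(records, token)
print(list_records_url("http://repository.example.org/oai", resumption_token=token))
```

A harvester would repeat the request with each returned resumption token until none is returned, so that large repositories can be ingested in batches.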

Besides the import capabilities, EuDML will provide access to its stored contents to external services, through specified access interfaces. A secure integration with commercial content providers’ services, and possibly also with selected federated authentication systems, is also envisaged. While security may not seem to be the critical feature of a DML system, certain aspects require careful consideration. Like all distributed systems relying on the open Internet for inter-process communication, EuDML by its nature may be prone to service impersonation and denial of service (DoS) attacks. While service-to-service authentication and authorisation mechanisms will have to be implemented to guarantee the system’s integrity, service and data backup-restore and redundant load balancing functions will guarantee its adequate availability. In order to be able to handle non-free content, special care will be paid to access control and access accounting functions. Finally, user authentication will allow end users to customize the work environments attached to their accounts, annotate content, and use the system’s community cooperation and other Web 2.0 functions. Registered authors should be able to claim their works and create their own bibliography portfolios.

Fig. 1. EuDML functional vision

The end user features include an efficient way of content presentation through a custom designed web user interface integrated with Web 2.0 content enrichment features. Besides the common search/browse/display scenarios, EuDML will offer the user a number of personalisation features, community and collaboration support services, and content enrichment tools (annotations, personal keywords, personal ranking etc.).

The initial functional expectations of the system, presented in the diagram in Figure 1, conceptually consist of a metadata repository, a search engine, a metadata enhancer, an association analyser, annotation and accessibility functions and of course the interfaces. Each of these abstract concepts will actually materialize as a number of functional components in the functional specification, and may eventually consist of a range of different tools and services that will be improved and extended over time, and that will be able to handle different aspects of the expected functions. The metadata repository will provide the central point of reference for all the managed contents. It will work directly with an OAI-PMH harvester to ingest repositories’ content descriptions, will be able to map the metadata into the internal EuDML schema, will provide item identifier resolution facilities, and will store the metadata and, when appropriate, a copy of the actual fulltext content. The performance and the quality of responses of the search service will directly influence user experience. Therefore, this service in particular has to be reliable, scalable and customized to fulfil user expectations. Apart from the common search functionality, innovative solutions for searching in mathematical contents will be sought. The metadata enhancer function will consist of a collection of tools, each of which will contribute to expanding or completing the existing items’ metadata, depending on the improvements needed. These will range from applying OCR over full texts, through adding key words or multilingual metadata by merging information from different databases when an item happens to have such non-redundant descriptions, to generating MathML for mathematical expressions. The association analyser will be able to detect, analyse and record relations between individual items. The annotation component will provide mechanisms to attach new material to individual items in the repositories and maintain this new material. The accessibility component will provide support for enhanced accessibility of items, if required, before presentation to end users. Finally, the user and system interfaces will provide access to the collected resources on different levels to human and machine users, respectively. They will also provide interfaces for integration with other knowledge infrastructure and third party services.

The authors encourage the community to discuss the presented system’s features and to suggest additional functionalities considered vital for the usability of this broadly defined platform.

4 System Architecture

The EuDML system’s architecture is going to be designed on the basis of a detailed functional requirements specification. The extensibility, scalability and reliability requirements lead to an eventually fully distributed, platform independent (Java based) solution. While the general architecture design will follow the Service Oriented Architecture (SOA) paradigm, the communication layer will remain abstract, so that individual services will be able to communicate through different adequate means, including possible direct connection when deployed in a single location (host). On the other hand, universal SOAP communication will be maintained as the default for flexibility and compatibility reasons, and more lightweight remote communication (e.g. REST or other content-driven) will also be possible where required. For this purpose a layered service structure will be adopted, presented in more detail at the end of this section.

The performance requirement results in the need for a careful selection of critical processing components, and where possible, mature and proven technologies will be used, with which the partners have adequate experience. An additional benefit of using partner-developed technologies will be the system’s sustainability after the end of the project, when the partners responsible for individual services will be able to continue curating them by further developing the relevant software code, while keeping EuDML’s compatibility and other specific requirements in mind.

4.1 Modular Design

The modular design principle will pertain not only to the backend services but to the user interface as well, allowing for the existence of alternative user interfaces, or for embedding EuDML portlets in third party services, where required.

Each of the functional modules presented in the previous section shall be realized by means of a separate service or a group of services acting together. At the design stage each service will be characterized by its formal contract definition, which will be subsequently used for testing purposes (and particularly for the regression tests in a continuous integration environment). At the same time, a service’s contract will define its behaviour and ascertain the fulfilment of the desired functional requirements.

[Figure: diagram of the authoritative services (Metadata (+annotations) Store, ContentStorageService) and the indexing and enriching services (SearchService, StructuredBrowseService, CitationService, other services), each with a remoting interface, together with process management services (indexing, analysing, enriching etc.), data harvesting and import, metadata enriching, annotations, and the Web User Interface, OAI-PMH access and other interfaces.]

Fig. 2. EuDML core and extension services
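
How a service contract might double as a regression test can be sketched as follows. This is only an illustration under stated assumptions: the contract format, the case descriptions, the item identifier and the toy search function are all hypothetical and do not describe the actual EuDML contract definitions.

```python
from typing import Callable, List, NamedTuple


class ContractCase(NamedTuple):
    description: str
    request: dict
    expected: dict


# Hypothetical contract for a search service: each case pairs a request
# with the response that any conforming implementation must return.
SEARCH_CONTRACT = [
    ContractCase("empty query yields no hits", {"query": ""}, {"hits": []}),
    ContractCase("known term is found", {"query": "lattice"}, {"hits": ["item-7"]}),
]


def check_contract(service: Callable[[dict], dict],
                   contract: List[ContractCase]) -> List[str]:
    """Return the descriptions of all contract cases the service fails."""
    failures = []
    for case in contract:
        if service(case.request) != case.expected:
            failures.append(case.description)
    return failures


def toy_search(request: dict) -> dict:
    """Stand-in implementation backed by a one-entry index."""
    index = {"lattice": ["item-7"]}
    return {"hits": index.get(request["query"], [])}


failures = check_contract(toy_search, SEARCH_CONTRACT)
```

In a continuous integration setting the same list of cases would be run against every new build of the service, so an empty list of failures becomes the regression criterion.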

Many technologies and solutions that are required in EuDML already exist, developed either by the project partners or elsewhere, and in many areas new developments will not be required. However, a careful design and evaluation of the alternatives will always have to be carried out. Also, in order to assure a proper follow-up, the evolving environment will have to be taken into account, and EuDML will not be limited to its initial specific functional requirements only, but will also accommodate other requirements for wider interoperability. Where possible, the system backend modules will be based on existing partner-developed code of proven and deployed services, in order to economize the development effort, capitalize on partners’ experience and secure future sustainable development.

The system, based on Service Oriented Architecture, will consist of a set of core services and a number of extension or enriching services. The core services are defined as the set of services necessary and sufficient for the basic system operation. They include the publication metadata store, the indexing and the search services, the content storage system and structured publication browsing services. The architecture concept outline is presented in Figure 2. The core services realize the functions of the metadata repository and the search engine described in the previous section.

4.2 The Metadata Store

The Metadata Store will be composed of several separate services acting together: a Metadata Repository Manager, a Storage Manager, and a Metadata Registry. REPOX has been chosen as the Metadata Repository Manager. REPOX [28] is a framework developed in the scope of the TELplus project and already deployed in the TEL central service and in several TEL data provider libraries (it is also being redesigned in order to be used by the Europeana project). The YADDA Storage Service is our metadata storage tool of choice. YADDA [35] is a service oriented distributed digital content management and provisioning platform, originally developed for the Polish national Virtual Library of Science project, with its core components successfully deployed in several large European production content infrastructures (e.g. DRIVER [25], or OpenAIRE [24]). REPOX and YADDA are capable of managing a large number of data sources and storing large quantities of heterogeneous records, with additional version control support, tagging and other required control features. This will allow storing not only bibliographical records in various schemas but also user created content referring to the custom work. An important requirement will also be the assignment of a persistent identifier to each entity (metadata record or document) and the related resolution service able to point to a local copy and back to the original resource’s locations.

Despite the fact that EuDML will be using its internal common metadata schema, it is anticipated that multiple different metadata patterns will be used by different content providers and data sources, and these will have to be reversibly mapped onto the internal data structure. That implies the Metadata Store will have to support adapting the different forms of metadata that each provider has about their items to the common format required by EuDML. For that purpose, the Metadata Repository will also include, in its internal architecture, a Metadata Registry (MDR). Besides the traditional references for this work [16,17], the results of the XMDR project [34] will also be considered. For the purpose of the EuDML Metadata Registry, the MDR technology already developed in the TELplus project and in deployment in TEL will be evaluated; it is being made more generic and robust in the scope of the Europeana project. For the Search Service and Structured Browse Service, relevant YADDA services will be considered, which have proved to be mature, stable and performant, and are already successfully used in a number of large scale European content infrastructures. The YADDA Search Service, based on the Apache Lucene search system, is well prepared to be installed and integrated within the whole system. Subsequently, a Solr [31] based YADDA Search Service version will be evaluated, and solutions that enhance the indexing system with support for math formulae will be sought.
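
The requirement that provider metadata be reversibly mapped onto the internal structure can be illustrated with a small sketch. The field names on both sides are hypothetical, since the internal EuDML schema is not defined in this paper; the point is only that a one-to-one field mapping, plus a place to keep unmapped fields, lets a record make the round trip without loss.

```python
# Hypothetical provider-field to internal-field mapping; none of these
# names are taken from the EuDML schema.
PROVIDER_TO_INTERNAL = {
    "dc:title": "title",
    "dc:creator": "authors",
    "dc:date": "publication_date",
}


def invert(mapping):
    """Invert a mapping, refusing to proceed if it is not one-to-one."""
    inverse = {target: source for source, target in mapping.items()}
    if len(inverse) != len(mapping):
        raise ValueError("mapping is not reversible: two fields share a target")
    return inverse


def to_internal(record, mapping=PROVIDER_TO_INTERNAL):
    """Map known fields; keep unmapped ones under 'extra' so nothing is lost."""
    internal = {"extra": {}}
    for field, value in record.items():
        if field in mapping:
            internal[mapping[field]] = value
        else:
            internal["extra"][field] = value
    return internal


def to_provider(internal, mapping=PROVIDER_TO_INTERNAL):
    """Rebuild the provider record from its internal form."""
    inverse = invert(mapping)
    record = dict(internal.get("extra", {}))
    for field, value in internal.items():
        if field != "extra":
            record[inverse[field]] = value
    return record


original = {"dc:title": "On sets", "dc:creator": ["A. Author"], "dc:rights": "open"}
round_trip = to_provider(to_internal(original))
```

A mapping in which two provider fields collapse into one internal field would be rejected by `invert`, which is the situation a metadata registry would have to resolve with richer rules.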

4.3 Extensions

The other expected functions, including annotation, accessibility, metadata enhancer or association analyser, will be realised as separate enriching services that, while following the same service design principles, will be considered extension services. Examples of such services include the Citation Service, responsible for citation resolving and indexing, or the Similarity Service [26,27], which would be able to return similar objects based on predefined metrics and criteria. Similarly, it is hoped that additional extension services will be developed in the future by third parties or by the involved partners.
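
What returning similar objects under a predefined metric could mean is shown in the toy sketch below, which ranks documents by cosine similarity of their word counts. This is a deliberately simple stand-in chosen for illustration; it is not the method of [26,27], and the document identifiers and texts are invented.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def most_similar(query_id, documents, top_n=1):
    """Return the ids of the top_n documents closest to the query document."""
    vectors = {doc_id: Counter(text.lower().split())
               for doc_id, text in documents.items()}
    scores = [
        (cosine(vectors[query_id], vector), doc_id)
        for doc_id, vector in vectors.items()
        if doc_id != query_id
    ]
    return [doc_id for _, doc_id in sorted(scores, reverse=True)[:top_n]]


# Invented sample collection.
DOCUMENTS = {
    "a": "banach space operator",
    "b": "banach space norm",
    "c": "graph colouring",
}

nearest = most_similar("a", DOCUMENTS)
```

Swapping the word-count vectors for another representation would leave the ranking function unchanged, which is one reason to keep the metric behind a separate service.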

4.4 Interfaces

One or more Web User Interface Services will be developed, based on user requirements criteria. Functional interfaces and widgets may also be prepared to make it possible to include an “EuDML Search Box” in other local systems and portals. A widget configurator may be developed, making it easy for users to create tailored search interfaces for their own websites. Other functional interfaces will also be designed and implemented for services related to interoperability, based on common standards such as OAI-PMH, or on Web services, following the SOAP or REST paradigms where reasonable. All the system’s operations will be managed through a process management service, which will be responsible for operation scheduling, synchronisation and timely execution, and for the overall system level integrity of the services.

On a different level, the authorization and authentication services will ensure secure service-to-service operations, and the user authentication required to personalize users’ account environments and to access any restricted contents. The YADDA AAS2 service, which plays a similar role in the European repository infrastructure DRIVER, will be considered for this purpose.

4.5 Layered Services

An important feature of the EuDML system architecture is its layered service structure, as seen in Figure 3. Each service has a pluggable layer for remote service access, which avoids the need to select a single service access method and then to support only that method throughout the system’s lifetime. In EuDML, each service will be defined using a common pattern of a separate API for the client interface (service facade) operating on a common service interface suitable for remote access. A proper definition of the service interface layer, with error encapsulation and a strict request-response pattern, makes it ideally suited for enabling remote access with any of the existing technologies, including SOAP Web Services, RESTful services, RMI, HTTP invoke or no remoting (direct invocation) at all. At the same time the facade interface can be easily extended with any method required. Using well established enterprise frameworks, like Spring, will make it possible to define the remote operation method independently of the source code. In a centralized, high performance scenario, local services will be able to talk to each other directly through their service interfaces, while in a rich, distributed environment, well documented and stable Service Facade APIs shall be exposed for other services to use. This added level of flexibility will allow the infrastructure or subsets thereof to be installed and used in various deployment scenarios, from a high performance centralised system based on a single SMP machine, through a deployment on multiple low-end machines, to eventually a global distributed infrastructure.

[Figure: layers of a service: software accessing the service, Service Facade API, service read and write interfaces, custom remoting protocol, service read and write interfaces, service logic.]

Fig. 3. EuDML service layers
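
The layering described in this section can be sketched in a few lines. The sketch is written in Python for brevity although the planned system is Java based, and every class and function name in it is a hypothetical illustration rather than an EuDML or YADDA API. It shows a service interface with a strict request-response pattern and encapsulated errors, a facade for clients, and two interchangeable transports: direct invocation and a JSON round trip standing in for a remoting protocol.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable, Optional


@dataclass
class Response:
    """Every call returns a Response; errors never escape as exceptions."""
    ok: bool
    value: Optional[dict] = None
    error: Optional[str] = None


class MetadataReadService:
    """Service interface and logic (hypothetical read-only metadata lookup)."""

    def __init__(self, records):
        self._records = records

    def handle(self, request: dict) -> Response:
        try:
            return Response(ok=True, value=self._records[request["id"]])
        except KeyError as exc:
            return Response(ok=False, error=f"not found: {exc}")


Transport = Callable[[dict], Response]


def direct_transport(service) -> Transport:
    """No remoting: the facade calls the service interface directly."""
    return service.handle


def json_transport(service) -> Transport:
    """Stands in for a remoting layer: data crosses the boundary as JSON text."""
    def call(request: dict) -> Response:
        wire_request = json.dumps(request)
        response = service.handle(json.loads(wire_request))
        wire_response = json.dumps(asdict(response))
        return Response(**json.loads(wire_response))
    return call


class MetadataFacade:
    """Client-facing facade; unaware of which transport is plugged in."""

    def __init__(self, transport: Transport):
        self._transport = transport

    def get_title(self, item_id: str) -> str:
        response = self._transport({"id": item_id})
        if not response.ok:
            raise LookupError(response.error)
        return response.value["title"]


service = MetadataReadService({"item-1": {"title": "On sets"}})
local_client = MetadataFacade(direct_transport(service))
remote_client = MetadataFacade(json_transport(service))
```

Because requests and responses are plain data and failures travel inside the response, the same facade code works unchanged whichever transport is chosen at deployment time.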

It is expected that after completion of the initial design phases, where the detailed functional specification and system architecture, internal metadata format and metadata mappings will be defined, a prototype system will be implemented. The fully functional system is expected to be deployed at the end of 2012.


5 Main Challenges

One of the typical challenges of any large scale digital library system dealing with heterogeneous, distributed content is optimal metadata harmonisation. While all the typical issues related to content multilingualism, different data provenance and quality, versioning, duplicate content, or the need to merge controlled vocabularies are present in EuDML, it is hoped that through the use of adequately flexible tools it will be possible to iterate on and optimize the chosen solution. Additionally, exploiting the fact that all the content in question is either mathematical by its nature or closely related to mathematics should make it possible to apply mathematical knowledge management techniques to overcome the barriers and find the mathematics-specific relations between individual objects and their groups. To this end, provisions for extensive text mining will be supported, and adequate data structures and appropriate analytic tools will be designed.

Besides the metadata harmonisation issues, it is apparent that in order to be adequately comprehensive, a mathematical knowledge environment has to be able to manage both open access and restricted licensed content. On the technical level, this will be handled by the AA service developed at ICM and successfully used in the DRIVER and YADDA systems (where open content has to seamlessly co-exist with millions of restricted licensed articles and books from commercial publishers). However, the challenge lies in the careful design of license usage scenarios, so that the system would be considered trustworthy by the commercial mathematical content providers.

An obvious challenge will be not only implementing the mathematics specific user interface functions (such as formulae presentation, the parsing of which is still often a performance bottleneck) but also supporting mathematics specific functions by the backend services (such as formulae search). In the future, close integration with existing mathematics tools and environments is planned.

An important challenge, probably fundamental for EuDML’s success although not of a technical nature, will be to position the service as a recognized authoritative information source not only in the mathematicians’ environment, but also for the services and users of all other disciplines searching for mathematical knowledge. One of the prerequisites for this will be to select and accumulate a critical mass of quality content, trusted and relevant to our users (this is assured by the content contributed by the partners), and services (which is the reachable goal described in this paper), taking advantage of the potential and the resources of the project partners. Another issue, not to be neglected, is maintaining a high level of true interoperability with other subject specific and general content infrastructures not only on the European but also on the international level.

6 Future Directions

One of the important issues not being addressed directly at the present stage is the long-term preservation of the mathematical content. It is feared that without proper preservation and curation activities some of the publications eventually will be, and some probably already are, lost to the community. While the proposed DML system currently does not encompass complete long-term preservation procedures, it will provide the infrastructural support for their future implementation. Moreover, by providing a centralized registry view and control mechanisms, it will make it possible to detect endangered items in advance. Besides, EuDML may keep electronic copies not only of the metadata but also of the original content fulltexts for its internal processing purposes, which can also be used for emergency recovery. It is essential that these opportunities are exploited and proper long-term preservation strategies are designed and implemented as soon as possible.

Another issue to be considered is the eventual full distribution of the service. While a complete distribution may not be viable for performance reasons, it should be considered as an additional service reliability measure. The system is being designed with a distributed model of operation in mind from the very beginning. However, it is assumed that for the first several years it will operate from ICM’s redundant servers in Warsaw. As ICM operates three separate datacentres, it is planned that servers in two distinct locations in Warsaw will be used. Only after the system’s fault tolerance and failover mechanisms are implemented and verified would a larger scale distribution be reasonable.

A larger long-term organizational and technological challenge will be encompassing all the mathematical contents beyond those of our first day partners, first by cooperating with more European collections, publicly or privately owned, freely or non-freely accessible, then by extending geographically beyond the continent’s boundaries. One of the important future issues is to establish a model for cooperation with the publishers. Finally, it is planned to expand the EuDML infrastructure to integrate external tools and environments, both mathematics related and other relevant ones. Particularly valuable will be the integration of EuDML content and services with other international knowledge and content management infrastructures.

7 Conclusions

After many years of effort, the dream of many European mathematicians to have a common digital library of mathematical contents is finally coming true. This is the first breakthrough towards a universal DML since the advent of isolated digitisation programmes around the world at the turn of the 21st century. We thus hope it will help shape and drive the forthcoming efforts towards a more comprehensive worldwide digital mathematics library. Special attention has to be paid in order not to waste this chance.

While the current consortium, building the EuDML system, forms a strong group of reliable partners, the system’s sustainability and further development after the project ends will be a challenge. It is identified as such and will be specifically addressed as one of the activities scheduled in the project. However, some of the provisions allowing us to trust in true EuDML sustainability include the design of gradual development through modular upgrades and extensions of the technologies already developed by individual partners, relatively wide community support from the very beginning of the project, with many stakeholders directly involved and the EMS chairing the advisory committee, and the fact that the infrastructure will be directly maintained and operated by the partners with many years of experience in quality service provisioning.

The open questions remaining include the subsequent funding scheme, the need to design and implement archival policies, the way to cooperate most efficiently with commercial publishers, and the way towards eventually opening access to all mathematical contents.

Acknowledgement

The EuDML project is partly financed by the European Union through its Competitiveness and Innovation Programme (Information and Communications Technologies Policy Support Programme, “Open access to scientific information”, Grant Agreement no. 250,503).

References

1. Miroslav Bartošek, Martin Lhoták, Jiří Rákosník, Petr Sojka, and Martin Šárfy. DML-CZ: The Objectives and the First Steps. In: Borwein et al. [4], pages 69–79.

2. Hans Becker, Kari Stange, and Bernd Wegner, editors. New Developments in Electronic Publishing. FIZ Karlsruhe, 2004. http://www.emis.de/proceedings/Stockholm2004/.

3. José Borbinha. Digital Libraries and the Rebirth of Printed Journals. In: Borwein et al. [4], pages 97–110.

4. Jonathan Borwein, Eugénio M. Rocha, and José Francisco Rodrigues, editors. Communicating Mathematics in the Digital Era, MA, USA, 2008. A. K. Peters.

5. Thierry Bouche. Introducing the mini-DML Project. In: Becker et al. [2], pages 19–29. http://www.emis.de/proceedings/Stockholm2004/bouche.pdf.

6. Thierry Bouche. Some Thoughts on the Near-Future Digital Mathematics Library. In: Sojka [30], pages 3–15. http://dml.cz/dmlcz/702540.

7. Thierry Bouche. Towards a Digital Mathematics Library? In: Borwein et al. [4], pages 43–68.

8. Thierry Bouche. Digital Mathematics Libraries: The Good, the Bad, the Ugly. Mathematics in Computer Science, 3(3):227–241, May 2010. http://dx.doi.org/10.1007/s11786-010-0029-2.

9. Thierry Bouche. Introducing EuDML: The European Digital Mathematics Library. EMS newsletter, 76:11–16, June 2010. http://www.ems-ph.org/journals/journal.php?jrn=news.

10. BulDML. Bulgarian DML at the Institute of Mathematics and Informatics at the Bulgarian Academy of Sciences. http://sci-gems.math.bas.bg/.

11. CEDRAM. Centre de diffusion de revues académiques mathématiques. http://www.cedram.org/.

12. ELibM. The Electronic Library of Mathematics of the European Mathematical Information Service. http://www.emis.de/journals/.


13. Sarah E. Thomas et al. Digital Mathematics Library. Final Report. Technical Report NSF Award # DUE-02066-40, Cornell University, 2004. http://www.library.cornell.edu/dmlib/DMLreport_final.pdf.

14. Thomas Fischer. The Digitization Registry at SUB Göttingen. A Step towards a DML Registry. In: Becker et al. [2], pages 55–63. http://www.emis.de/proceedings/Stockholm2004/fischer.pdf.

15. HDML. Hellenic DML at Ionian University. http://dspace.eap.gr/dspace/handle/123456789/46.

16. ISO/IEC JTC1 SC32 WG2. ISO 11179: Metadata Registries (MDR), 2009. http://metadata-stds.org/11179.

17. ISO/IEC JTC1 SC32 WG2. ISO/IEC 20944: Metadata Registry Interoperability and Bindings (MDRIB), 2009. http://metadata-stds.org/20944.

18. Allyn Jackson. The Digital Mathematics Library. Notices of Am. Math. Soc., 50(4):918–923, 2003. http://www.ams.org/notices/200308/comm-jackson.pdf.

19. Enrique Macias-Virgós. Some Digitization Initiatives in Spain. In: Becker et al. [2], pages 137–142. http://www.emis.de/proceedings/Stockholm2004/virgos.pdf.

20. Zentralblatt MATH. A reviewing database. http://www.zentralblatt-math.org/zmath/.

21. NUMDAM. Numérisation de documents anciens de mathématiques. http://www.numdam.org/.

22. Committee on Electronic Information Communication of the International Mathematical Union. Best current practices: Recommendations on electronic information communication. Notices Am. Math. Soc., 49(8):922–925, 2002. http://www.ams.org/notices/200208/comm-practices.pdf.

23. Committee on Electronic Information Communication of the International Mathematical Union. Digital Mathematics Library: A Vision for the Future, 2006. Endorsed on August 20, 2006 by the General Assembly of the International Mathematical Union. http://www.mathunion.org/ceic/Publications/dml_vision.pdf.

24. OpenAIRE. Open Access Infrastructure for Research in Europe. http://www.openaire.eu/.

25. DRIVER project. Networking European Scientific Repositories. http://www.driver-repository.eu/.

26. Radim Řehůřek and Petr Sojka. Automated Classification and Categorization of Mathematical Knowledge. In: Serge Autexier, John Campbell, Julio Rubio, Volker Sorge, Masakazu Suzuki, and Freek Wiedijk, editors, Intelligent Computer Mathematics: Proceedings of 7th International Conference on Mathematical Knowledge Management MKM 2008, volume 5144 of Lecture Notes in Computer Science LNCS/LNAI, pages 543–557, Berlin, Heidelberg, July 2008. Springer-Verlag.

27. Radim Řehůřek and Petr Sojka. Software Framework for Topic Modelling with Large Corpora. In: Proceedings of LREC 2010 workshop New Challenges for NLP Frameworks. pp. 45–50. Valletta, Malta (2010), software available at http://nlp.fi.muni.cz/projekty/gensim.

28. REPOX. A Metadata Repository Manager. http://repox.ist.utl.pt/.

29. RusDML. Russian DML. http://www.rusdml.de/.

30. Petr Sojka, editor. Towards a Digital Mathematics Library, Birmingham, UK, July 2008. Masaryk University. http://www.fi.muni.cz/~sojka/dml-2008-program.xhtml.

31. Solr. An open source search platform. http://lucene.apache.org/solr/.

32. Bernd Wegner. EMANI: Leader and Follower for the WDML. In: Becker et al. [2], pages 161–169. http://www.emis.de/proceedings/Stockholm2004/wegner.pdf.



33. Bernd Wegner. RusDML 2008: Current Facilities of the Core Archive of Digitized Russian Publications in Mathematics. In: Sojka [30], pages 83–86. http://dml.cz/dmlcz/702547.

34. XMDR. eXtended MetaData Registry. https://xmdr.org/.

35. YADDA. A digital content management and provisioning platform. http://yaddainfo.icm.edu.pl/.

36. Katarzyna Zamłyńska, Łukasz Bolikowski, and Tomasz Rosiek. Migration of the Mathematical Collection of Polish Virtual Library of Science to the YADDA platform. In: Sojka [30], pages 127–130. http://dml.cz/dmlcz/702538.

37. Katarzyna Zamłyńska, Alek Tarkowski, and Tomasz Rosiek. Evolution of the Mathematical Collection of the Polish Virtual Library of Science. Mathematics in Computer Science, 3(3):265–278, May 2010. http://dx.doi.org/10.1007/s11786-010-0022-9.




Developing a Metadata Exchange Format for Mathematical Literature

David Ruddy

Project Euclid, Cornell University Library, 107 D Olin Library, Ithaca, New York, 14853, USA

[email protected]

Abstract. This paper describes an effort to develop a metadata element set for the exchange of descriptive metadata about mathematical literature. The approach taken uses the Dublin Core Application Profile (DCAP) framework, based on the DC Abstract Model. A fully developed DCAP for mathematical literature would be valuable, as both a guide and constraint in the creation of metadata records suitable for harvesting via OAI or sharing through other means. Adhering to the DCAP model would also enhance global interoperability with other metadata schemes. The successful development of a DCAP for mathematical literature, however, will require broader DML community input to resolve open issues and gain acceptance.

Key words: metadata standards, metadata exchange, Dublin Core Application Profile

1 Introduction

In order for repositories to share rich metadata about mathematical publications, the Digital Mathematics Library (DML) community will need to reach agreement on a metadata exchange standard. Currently, the only exchange format used in common across multiple repositories is simple, unqualified Dublin Core (DC), the 15 descriptive metadata terms originally designed in 1995 [1]. One reason for this is that simple DC is the default, minimal record format required by OAI-PMH, a frequently used mechanism for sharing metadata [2].
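
To make the shape of a simple DC record concrete, the following sketch serialises a record in the oai_dc container used with OAI-PMH. The two namespace URIs and the list of 15 element names are those of simple Dublin Core; the function name and the sample values are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"

# The 15 elements of simple, unqualified Dublin Core.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def simple_dc_record(fields):
    """Serialise {element name: [values]} as an oai_dc record."""
    ET.register_namespace("dc", DC)
    ET.register_namespace("oai_dc", OAI_DC)
    root = ET.Element(f"{{{OAI_DC}}}dc")
    for name, values in fields.items():
        if name not in DC_ELEMENTS:
            raise ValueError(f"not a simple DC element: {name}")
        for value in values:
            # Every element is optional and repeatable in simple DC.
            ET.SubElement(root, f"{{{DC}}}{name}").text = value
    return ET.tostring(root, encoding="unicode")


# Invented sample values.
record_xml = simple_dc_record({
    "title": ["On sets"],
    "creator": ["A. Author", "B. Author"],
    "subject": ["03E20"],
})
```

The sample already hints at the constraint: a classification code can only be placed in the generic `subject` element as an untyped string, with no standard way to say which scheme it belongs to.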

Simple DC appears easy to use, and it is almost universally recognized. It has, however, a number of disadvantages, most of which are related to its perceived strength: simplicity. As a carrier of descriptive information, it is very constrained. The usefulness of aggregating even high-quality simple DC records is therefore debatable, as the range of functionality that can be supported is so limited. And yet a more pressing problem may be the difficulty of obtaining high-quality, conformant, and consistent metadata when harvesting simple DC records from numerous independent repositories. Since it was designed to be applicable to such a wide range of materials, it is not well-suited for most particular content types. Those using it for specific purposes are therefore tempted to embed qualifications in ways that simple DC was not designed for, or to otherwise use elements in ways that strain the original element definitions. The resulting impact on metadata quality and consistency and the consequent challenges in building reliable services on top of such aggregated data have been described [3,4,5].

While obtaining quality metadata from independent repositories will always present challenges, we argue that a positive step forward would be the creation of an element set that was both richer and more rigorous than simple DC and that was designed specifically for mathematical literature. Such a metadata set would give content repositories a full set of well-defined elements, so that they would not need to overburden terms or guess where to put descriptive data. This would likely improve metadata quality and consistency. At the same time, a richer element set would support greater functionality once metadata was harvested and aggregated.

2 Dublin Core Application Profiles

The Dublin Core Metadata Initiative has provided a framework for the design and documentation of metadata applications. This effort recognizes both the unique needs of particular communities and the benefits of a shared approach. A Dublin Core Application Profile (DCAP) is a combination of precise element definitions and usage guidelines. A DCAP is not limited to DC elements; it can use terms defined in other namespaces. The major constraint on the design of a DCAP is that it adhere to the Dublin Core Abstract Model [6]. The semantics of this model are built on the Resource Description Framework (RDF). Among other things, this requires that referenced properties (terms, elements), syntax schemes, and vocabularies all be properly declared in an RDF schema and thus identifiable with URIs. While this is not an insignificant constraint, the potential benefits of using globally defined properties and vocabularies are precision and semantic interoperability. This has important consequences for the usefulness of metadata in Semantic Web or Linked Data applications [7].

If an acceptable DCAP can be developed, it would likely provide for the widest usefulness in a global context. But even if the DML community eventually decides to take another approach, a thorough exploration of the DCAP framework and its requirements will be valuable. The requirements of a DCAP are useful to consider for any community metadata effort. There are also other DCAP projects that can serve as models and provide useful properties and encoding schemes, such as the Scholarly Works Application Profile (formerly, Eprints Application Profile), and the DC Collections Application Profile [8,9]. Both of these profiles have been used in the present effort.

Compliance with the DCAP framework is well-defined [10,11,12]. The following section briefly describes the necessary components of a DCAP and proposes a response for how a Mathematical Literature Application Profile (MLAP) could meet these requirements. This work is built on an earlier effort to create recommendations for using simple DC for mathematical literature,

Developing a Metadata Exchange Format for Mathematical Literature 29

begun in 2005 [13]. Participants in that effort were Thierry Bouche (Institut Fourier & Cellule MathDoc), Thomas Fischer (Staats- und Universitätsbibliothek, Göttingen), Claude Goutorbe (Cellule MathDoc), and David Ruddy (Project Euclid). In particular, many of the usage recommendations concerning content values are derived from that work.

3 Mathematical Literature Application Profile

3.1 Functional Requirements

Metadata does not exist for its own sake but to support desired functionality. It is important, therefore, both initially and as the profile develops, to understand clearly how we intend to use the metadata governed by this application profile. Establishing use cases and functional requirements is a process of community negotiation and agreement. It is through these discussions that a shared sense of functional scope is established, which will then provide rationale and guidance for the design of particular metadata constructs. For these reasons, the DCMI requires that a DCAP include functional requirements.

Two broad functional objectives of the proposed MLAP can be described. One is to provide a mechanism for the exchange of richer and more consistent metadata among repositories of mathematical literature than is currently possible with simple Dublin Core. This will contribute to the development of a “world digital mathematics library” by providing the means by which repositories can share more complete and uniform metadata about their holdings, and service providers can build more reliable services on top of that aggregated metadata. Another objective, achieved by using the DCAP approach, is to position MLAP metadata so that it can participate in a global semantic environment, envisioned by the Semantic Web and Linked Data movements.

More specific functionality that the MLAP should support includes:

• the discovery of publications:
  ◦ by means of fielded searching on various attributes, including titles, author names, subjects, and abstracts.
  ◦ by means of browsing, beginning at a journal, book, or other high-level publication title.
  ◦ by means of filtering search and/or browse results based on attributes such as publication type, date of publication, language, access restrictions, parent publication, etc.

• the identification of publications of interest, from among many, by allowing for the collection and display of identifying attributes such as a DOI or other unique identifier, title, author, and publication details (date of publication, publication name, publisher, etc.).

• the selection of publications of interest, from among many, by allowing for the collection and display of attributes such as subject, format, publication type, language, and restrictions on access.


• the acquisition of a copy of the publication by providing a DOI or other network resolvable URI, together with information about access restrictions.

• the capture, display, and indexing of titles and abstracts in multiple languages or transliterations.

• potential additional capabilities or services, such as links to name authority resources, citation analysis, OpenURL linking, and rich subject analysis.

As currently proposed, the MLAP is for the description of network-accessible, published literature in mathematics and statistics. Although this profile could be used for author copies and pre-prints, it is optimized for formally published literature. Adequate description of pre-prints would require additional properties to describe document versions and to record a greater range of date attributes. Other functionality that is currently out of scope includes:

• the description of publications that do not have copies available online.

• the identification and description of distinct FRBR entities [14] (such as handled by the SWAP application profile [8]).

• the capture of structured author and contributor descriptions, so as to include role, affiliation, email address, etc.

• the capture of machine-processable descriptions of access embargo periods.

3.2 Domain Model

A DCAP domain model is a representation of the distinct entities that will be described by the metadata application and the relationships among those entities. It defines the overall scope of the application profile. Either graphic depictions or text descriptions can be used. The entity model for the proposed MLAP is relatively simple. There are only two entities: publication and publicationContainer, with a single relationship:

publication may be part of a single publicationContainer

Defining a publicationContainer allows us to capture an unambiguous and easily accessible description of the parent publication, such as a journal issue or monograph. A potential additional entity is “author” or “creator,” but we feel that the MLAP is not the appropriate place to maintain rich author descriptions, such as affiliation, email address, etc. The MLAP allows for the use of a URI in the creator property, and we anticipate linking to more detailed author descriptions (or better, using an authoritative name identifier), rather than capturing that information internally.
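The two-entity model can be pictured as a pair of record types. The following is a minimal sketch, not part of the MLAP itself: the Python class and attribute names are hypothetical renderings of the profile's property names, and the sample values are invented. Creator is a plain value (name string plus optional URI), not a third domain entity.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PublicationContainer:
    """Parent publication, e.g. a journal issue or monograph."""
    publication_name: Optional[str] = None
    volume: Optional[str] = None
    number: Optional[str] = None


@dataclass
class Creator:
    """A name string plus an optional URI pointing to a richer description."""
    name: str
    uri: Optional[str] = None


@dataclass
class Publication:
    title: str
    creators: List[Creator] = field(default_factory=list)
    # "publication may be part of a single publicationContainer"
    is_part_of: Optional[PublicationContainer] = None


# Invented example values, for illustration only.
article = Publication(
    title="An example article",
    creators=[Creator("Doe, Jane", uri="http://example.org/authors/jane-doe")],
    is_part_of=PublicationContainer("Example Journal of Mathematics", "12", "3"),
)
```

Keeping the container as an optional single reference mirrors the "may be part of a single" wording of the relationship.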

3.3 Description Set Profile

The DCAP Description Set Profile (DSP) provides a detailed definition of the application’s metadata record. The DSP is based on the DCMI Description Set Model, which is part of the DC Abstract Model. The DSP is expressed by means of templates and constraints, the use of which is defined by a DSP constraint language [15]. The repeatability of properties and the restrictions on allowed property values are all explicitly defined by the DSP. Adherence to the constraints defined in the DSP determines the validity of all metadata records of a particular application profile. In essence, the DSP is the definition of the DCAP.
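To make the idea of repeatability constraints concrete, here is a minimal sketch of an occurrence check. It is not the DSP constraint language and not an official validator: the dictionary layout and the function name occurrence_errors are assumptions, and the (min, max) pairs are copied from a few rows of the property table in the Appendix.

```python
import math

# (min, max) occurrences per property, taken from a few rows of the Appendix.
OCCURRENCE = {
    "dcterms:title": (1, 1),
    "dcterms:creator": (0, math.inf),
    "dcterms:abstract": (0, 1),
    "dcterms:issued": (1, 1),
    "dcterms:bibliographicCitation": (1, 1),
    "prism:doi": (0, 1),
    "prism:url": (1, 1),
}


def occurrence_errors(record):
    """Return human-readable violations of the (min, max) constraints.

    `record` maps a property name to the list of its statement values.
    """
    errors = []
    for prop, (lowest, highest) in OCCURRENCE.items():
        count = len(record.get(prop, []))
        if count < lowest:
            errors.append(f"{prop}: expected at least {lowest}, found {count}")
        elif count > highest:
            errors.append(f"{prop}: expected at most {highest}, found {count}")
    return errors
```

A record is acceptable in this limited sense when the returned list is empty; value restrictions (URIs, encoding schemes) would need further checks.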

An XML expression of the complete DSP for a proposed MLAP is maintained online [16]. The root level DescriptionSetTemplate contains two child DescriptionTemplate elements (publication and publicationContainer), which represent the two entities of the domain model. Each of these in turn contains a number of StatementTemplate elements, which make property=value assertions about the entities. The various constraints upon the value of a particular property are expressed within the StatementTemplate.

A much simplified presentation of information contained in the DSP is provided in tabular form in the Appendix. The namespaces used for properties and encoding schemes in the MLAP are found in Table 1.

Table 1. MLAP Namespaces and Namespace URIs

Properties
  DCMI Metadata Terms                                              http://purl.org/dc/terms/
  DC Collections Metadata Terms                                    http://purl.org/cld/terms/
  PRISM: Publishing Requirements for Industry Standard Metadata    http://prismstandard.org/namespaces/basic/2.0/

Syntax encoding schemes
  DCMI Metadata Terms                                              http://purl.org/dc/terms/
  NISO OpenURL Framework Registry                                  info:ofi/

Vocabulary encoding schemes
  DCMI Metadata Terms                                              http://purl.org/dc/terms/
  Eprints Terms                                                    http://purl.org/eprint/terms/

3.4 Usage Guidelines

While usage guidelines are not explicitly required by the DCAP framework, they are rather critical for the successful use of the application profile. Guidelines translate the DSP into a human-readable format, as well as provide rules that apply to content values. For example, guidelines would include an original property definition (e.g., “An entity primarily responsible for making the resource”), any local use that refines that definition (e.g., “An author of the


publication”), whether the element is optional and repeatable, whether and how the element values are restricted or datatyped, and any “cataloging rules” that should be applied to value strings (e.g., “family name, followed by comma, then space, followed by given name”). Other less prescriptive recommendations may also be included.
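A cataloging rule such as the name-order example quoted above is simple enough to express as code. The sketch below is only an illustration: the function names format_creator and looks_inverted are hypothetical, and the check is deliberately naive (it does not handle suffixes, particles, or corporate names).

```python
def format_creator(family, given):
    """Build a value string as 'family name, comma, space, given name'."""
    return f"{family.strip()}, {given.strip()}"


def looks_inverted(value):
    """Naive check that a value string follows the 'Family, Given' pattern."""
    family, separator, given = value.partition(", ")
    return bool(separator) and bool(family.strip()) and bool(given.strip())


print(format_creator("Ruddy", "David"))
print(looks_inverted("David Ruddy"))
```

A data provider could run such a check before export, while a harvester could use it to flag records that need review.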

A complete usage guideline for the currently proposed MLAP is maintained online [17].

3.5 Syntax Guidelines

The DCAP framework is neutral regarding the encoding syntax used to express and transmit metadata records. DCAP conformant metadata will by definition adhere to the DC Description Set Model defined in the DC Abstract Model. DCMI has provided several publications that specify how to serialize a DC metadata description set in plain text, XML, RDF, and HTML/XHTML [18,19,20,21].

At this point, no recommendations are made regarding the syntactic expression of MLAP metadata. Examples found within the usage guidelines are expressed in plain text. In time, these will be linked to other potential serializations, which can serve as encoding models.

4 Design Considerations, Open Issues, Next Steps

Developing a metadata scheme requires balancing richness and complexity against simplicity and ease of application. If it is too simple, the resulting description may not support desired functionality, but if it is too complex, few will apply it accurately or use it at all. Attempting to achieve an optimal balance has influenced several design considerations in the present effort. For example, the proposed MLAP has relatively few required elements. Valid metadata records can include only a title, a publication date, a bibliographic citation, and a URL to the online resource. Of these, only the publication date has an enforced encoding syntax. At one extreme, therefore, the profile provides a relatively low-barrier means of sharing metadata. At the other end of the spectrum, data providers can construct very rich metadata records by including multilingual values, MathML in titles and abstracts, complete reference lists, and OpenURL Context Objects containing machine-processable bibliographic data (describing the primary resource as well as references).
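The low-barrier end of this spectrum can be sketched as a check on the four required elements plus the one enforced syntax. This is an illustration under assumptions, not a normative validator: the regular expression reflects one reading of the W3CDTF date formats (year, year-month, full date, and date with time and mandatory time zone designator), time fields are not range-checked, and the names is_w3cdtf and minimal_record_problems are hypothetical.

```python
import re
from datetime import date

# Year, year-month, full date, or full date with time and time zone designator.
_W3CDTF = re.compile(
    r"^(\d{4})(?:-(\d{2})(?:-(\d{2})"
    r"(?:T\d{2}:\d{2}(?::\d{2}(?:\.\d+)?)?(?:Z|[+-]\d{2}:\d{2}))?)?)?$"
)

REQUIRED = (
    "dcterms:title",
    "dcterms:issued",
    "dcterms:bibliographicCitation",
    "prism:url",
)


def is_w3cdtf(value):
    """Check the shape of a date string and that year/month/day form a real date."""
    match = _W3CDTF.match(value)
    if not match:
        return False
    year, month, day = match.groups()
    try:
        date(int(year), int(month or 1), int(day or 1))
    except ValueError:
        return False
    return True


def minimal_record_problems(record):
    """List problems in a record given as a dict of property name -> string."""
    problems = [f"missing {prop}" for prop in REQUIRED if not record.get(prop)]
    issued = record.get("dcterms:issued")
    if issued and not is_w3cdtf(issued):
        problems.append("dcterms:issued is not W3CDTF")
    return problems
```

A record carrying only these four values passes, which is the sense in which the profile keeps the barrier to participation low.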

Another design choice was to use several distinct and dedicated identifier properties rather than a single multi-use one. The DC identifier element could have been used to capture the identifiers prism:url, prism:issn, prism:eIssn, and prism:isbn. (It could also be used to capture an HTTP addressable version of prism:doi.) It was felt, however, that providing dedicated elements will reduce uncertainty and ambiguity in the preparation and interpretation of MLAP metadata. Following the same reasoning, it seemed advantageous to create a distinct and easily interpretable entity, publicationContainer, to hold a description of the parent publication. The same descriptive data could be packaged within an OpenURL Context Object in the dcterms:bibliographicCitation element. Such a construct, however, is fairly complex, and we believe that to require this approach would place an unnecessary and in some cases insurmountable burden on data providers and harvesters.
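As an aside on the "HTTP addressable version" of a DOI mentioned above, the conversion can be sketched in a few lines. The resolver prefix http://dx.doi.org/ and the function name doi_to_http_uri are assumptions made for this example; the MLAP usage guidelines, not this sketch, define what a data provider should actually record.

```python
from urllib.parse import quote

# Assumed resolver prefix; not prescribed by the profile.
DOI_RESOLVER = "http://dx.doi.org/"


def doi_to_http_uri(doi):
    """Turn a bare or 'doi:'-prefixed DOI into an HTTP URI."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:]
    if not doi.startswith("10."):
        raise ValueError(f"not a DOI: {doi!r}")
    # Percent-encode characters that are unsafe in a URI path, keeping '/'.
    return DOI_RESOLVER + quote(doi, safe="/")


print(doi_to_http_uri("doi:10.1108/07378830310479866"))
```

With dedicated properties, the bare DOI would go in prism:doi, and a URI of this kind would only be needed where an identifier must be a URI.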

An acknowledged weakness of the proposed application profile is the handling of publications at the monographic level. At several points, the MLAP is currently optimized for serial literature. There are a number of solutions to this problem, if in fact it is perceived as a problem, but they are all in the direction of increased complexity. For example, as constructed, it is not possible to capture a role attribute with the contributor element (such as “editor,” or “translator”). Allowing for this would require that contributor become a distinct entity in the data model so that properties could be associated directly with it. This leads to an additional level of complexity, and whether it is desirable to go in this direction is an open question.

There are a number of other open issues in the proposed MLAP. These are noted in the complete usage guidelines. Next steps include obtaining input and discussion from the broader DML community regarding proposals made here. We hope that such feedback will help resolve open issues and allow for refinement of the MLAP. Once an acceptable profile can be agreed upon, working implementations can test the MLAP further.

References

1. Dublin Core Metadata Element Set, Version 1.1. http://www.dublincore.org/documents/dces/

2. Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). http://www.openarchives.org/pmh

3. Arms, W., Dushay, N., Fulker, D., Lagoze, C.: A case study in metadata harvesting: the NSDL. Library Hi Tech 21, no. 2, 228–237 (2003). doi:10.1108/07378830310479866

4. Lagoze, C., Krafft, D., Cornwell, T., Dushay, N., Eckstrom, D., Saylor, J.: Metadata aggregation and “automated digital libraries”: a retrospective on the NSDL experience. Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital libraries 230–239 (2006). doi:10.1145/1141753.1141804

5. Bruce, T., Hillmann, D.: The Continuum of Metadata Quality: Defining, Expressing, Exploiting. In: Hillmann, D., Westbrooks, E., (eds.), Metadata in Practice, pp. 238–256. ALA, Chicago (2004).

6. Dublin Core Metadata Initiative Abstract Model. http://www.dublincore.org/documents/abstract-model/

7. W3C Semantic Web. http://www.w3.org/standards/semanticweb/

8. SWAP: Scholarly Works Application Profile. http://www.ukoln.ac.uk/repositories/digirep/index/SWAP

9. Dublin Core Collections Application Profile. http://dublincore.org/groups/collections/collection-application-profile/

10. The Singapore Framework for Dublin Core Application Profiles. http://www.dublincore.org/documents/singapore-framework/


11. Guidelines for Dublin Core Application Profiles. http://www.dublincore.org/documents/profile-guidelines/

12. Criteria for the Review of Application Profiles. http://www.dublincore.org/documents/profile-review-criteria/

13. Digital Math Library Dublin Core (dml_dc): A Recommended Best Practice for Unqualified Dublin Core Metadata Records. http://projecteuclid.org/documents/metadata/dml_dc/

14. Functional Requirements for Bibliographic Records: Final Report. IFLA Study Group on the Functional Requirements for Bibliographic Records. (UBCIM Publications, New Series; v. 19). München: K.G. Saur, (1998). http://www.ifla.org/VII/s13/frbr/frbr.htm

15. Description Set Profiles: A constraint language for Dublin Core Application Profiles (currently a Working Draft). http://www.dublincore.org/documents/dc-dsp/

16. Mathematical Literature Application Profile: Description Set Profile. http://projecteuclid.org/documents/metadata/mlap/mlap_dsp.xml

17. Mathematical Literature Application Profile: Property Definitions and Guidelines. http://projecteuclid.org/documents/metadata/mlap/

18. Expressing Dublin Core metadata using the DC-Text format. http://www.dublincore.org/documents/dc-text/

19. Expressing Dublin Core Description Sets using XML (DC-DS-XML). http://www.dublincore.org/documents/dc-ds-xml/

20. Expressing Dublin Core metadata using the Resource Description Framework (RDF). http://www.dublincore.org/documents/dc-rdf/

21. Expressing Dublin Core metadata using HTML/XHTML meta and link elements. http://www.dublincore.org/documents/dc-html/

Appendix: Properties of the MLAP

The following table lists the properties of the proposed Mathematical Literature Application Profile (MLAP). A complete specification for the MLAP is provided online in an XML expression of the Description Set Profile (DSP) [16].

URIs for the namespace abbreviations included in the table are as follows (see Table 1 for more information):

dcterms  http://purl.org/dc/terms/
cld      http://purl.org/cld/terms/
prism    http://prismstandard.org/namespaces/basic/2.0/

publication properties

Property               Namespace  Min  Max  Value Constraints

type                   dcterms    0    1    Value must be a URI; recommended practice is to use a value from the Eprints Type Vocabulary Encoding Scheme.


title                  dcterms    1    1    A single, primary title is required. Language attribute may be provided. Value may include XML content.
alternative            dcterms    0    1    Additional titles for the same publication (variants, translations, transliterations). Multiple value strings (titles) may be included; language attributes are required on each. Value strings may include XML content.
creator                dcterms    0    ∞    If used, a value string is required; a value URI may be provided.
contributor            dcterms    0    ∞    If used, a value string is required; a value URI may be provided.
abstract               dcterms    0    1    Multiple value strings (abstracts) may be included; language attributes are required on each. Value strings may include XML content.
subject                dcterms    0    ∞    If used, a value string is required; it may be from a controlled vocabulary. A value URI may also be provided.
issued                 dcterms    1    1    Date of publication is required; value must adhere to W3CDTF syntax.
language               dcterms    0    ∞    Language or languages of the publication; values must be taken from RFC4646.
format                 dcterms    0    ∞    Format (Internet media type) of electronic file; values must be from the IMT vocabulary.
bibliographicCitation  dcterms    1    1    A description of the bibliographic source of the publication is required. Value may be a text string, an OpenURL Context Object, or both.
startingPage           prism      0    1    The first page of the publication.
endingPage             prism      0    1    The last page of the publication.
doi                    prism      0    1    A DOI for the publication.
url                    prism      1    1    A URI that resolves to a publication record page is required.
identifier             dcterms    0    ∞    Additional identifiers for the publication may be provided; all identifiers must be URIs.



publisher              dcterms    0    1    The publisher of the publication.
rights                 dcterms    0    ∞    Zero or more statements, or value URIs, concerning copyright ownership or the permitted uses of the publication.
accessRights           dcterms    0    1    Must be one of two possible values: restricted or unrestricted.
references             dcterms    0    ∞    A work referenced by the publication. Each element value may be a text string, an OpenURL Context Object, or both.
isAccessedVia          cld        0    1    The service that provides access to the publication. A value URI is required. A string value may also be provided.
isPartOf               dcterms    0    1    The value of this property is the publicationContainer description.

publicationContainer properties

Property               Namespace  Min  Max  Value

publicationName        prism      0    1    The title of the parent publication; for example, a journal, book, or proceedings title.
contributor            dcterms    0    ∞    A contributor to the parent publication, such as an editor of a book or proceedings. If used, a value string is required; a value URI may be provided.
issn                   prism      0    1    A journal ISSN number.
eIssn                  prism      0    1    A journal e-ISSN number.
isbn                   prism      0    1    A book ISBN number.
doi                    prism      0    1    A DOI for the parent publication.
identifier             dcterms    0    1    Additional identifiers for the parent publication may be provided; all identifiers must be URIs.
volume                 prism      0    1    A journal volume number or other alphanumeric volume identifier.
number                 prism      0    1    A journal issue number or other alphanumeric issue identifier.

Designing a Semantic Ground Truth for Mathematical Formulae

Alan Sexton¹*, Volker Sorge¹*, and Masakazu Suzuki²*

¹ School of Computer Science, University of Birmingham, [email protected], [email protected]
http://www.cs.bham.ac.uk/~aps http://www.cs.bham.ac.uk/~vxs

² Faculty of Mathematics, Kyushu University, Japan
[email protected], http://www.math.kyushu-u.ac.jp/~suzuki/

Abstract. We report on a new project to design a semantic ground truth set for mathematical document analysis. The ground truth set will be generated by annotating recognised mathematical symbols with respect to both their global meaning in the context of the considered documents and their local function within the particular mathematical formula in which they occur. The goal of our work is to make a reliable database available for semantic classification during the formula recognition process, enabling correct interpretations of mathematical formulae and the generation of semantic markup such as Content MathML.

1 Introduction

Ground Truth sets are manually annotated or validated sets of training data that are important tools for many recognition tasks. In document image analysis research, ground truth data is crucial for the design, training and testing of algorithms for data identification and extraction. A ground truth set for an optical character recognition (OCR) system generally consists of images of single characters together with their correct syntactic interpretation, e.g. in the form of ASCII code. Bespoke ground truth sets have to be developed to cater for types of target documents and recognition methods. In spite of the availability of automated tools for the development of ground truth sets for certain special cases [8], in the majority of cases ground truth sets can only be assembled semi-automatically and their manual correction is generally a very laborious task.

For mathematical OCR and formula recognition, assembling a ground truth set is an even more daunting task as it not only needs to contain a large number of, often very similar, symbols but also has to cope with the two dimensional layout of mathematical formulae and therefore contain spatial information. There is currently only one ground truth set for mathematical documents available [6]. It has been constructed from 30 different articles on

* The authors’ work was supported by Royal Society International Joint Project (2008/R3)







38 Alan Sexton, Volker Sorge, Masakazu Suzuki

pure mathematics. It is a database of over 680,000 characters occurring both in text and mathematical formulae in the articles and can be used as a collection of statistical information about the relative occurrence of, and relationships between, neighbouring characters. It does not contain information about the structural nature of the expression as a whole that the symbols are contained in. Since most of the characters appear many times in this database, there is a large amount of information that can be mined from the database. For instance, the ground truth set has been used to compile statistics on spatial relationships between mathematical symbols [1] that are exploited to resolve sub- and superscript relationships within the mathematical formula recognition of the Infty system [5].

While [6] can greatly improve the robustness of algorithms for correct syntactic recognition, it is currently still of limited use for extracting semantic meaning of formulae. However, often mathematical formulae can only be correctly recognised if the underlying semantics of the formula is clear. Enriching a ground truth set for mathematical OCR with semantic information could therefore be desirable, and if a semantic ground truth set can be constructed successfully, it should not only enhance current formula recognition techniques, but also enable direct translation of expressions from mathematical documents into semantic markup such as Content MathML or OpenMath. This would aid accessibility tools in interpreting functions and their components correctly and also make content of mathematical documents amenable to mathematical software such as Computer Algebra systems.

Part of the problem in building high quality mathematical formula recognisers is the ambiguity caused by similar constructs used for different mathematical concepts. To take just one example, is $\binom{n}{k}$ a binomial coefficient (n choose k) or a vector? Without context we cannot be confident of our answer. A semantic ground truth for mathematical expressions would help us to base our decisions on well-founded scientific data, rather than the programmer’s mathematical intuition.

2 Semantic Ground Truth

The aim of our work is to compile a semantic ground truth set for mathematical formula recognition. We propose to implement this in two parts. The first part is a semantic ground truth for mathematical characters and symbols; here we are concerned with associating low level information of individual symbols and their local neighbours (e.g. font information, spacing and relative baseline positions) with mathematical objects and constructs from the mathematical domain that the document in question belongs to. The second part is a semantic ground truth for mathematical expressions as a whole.

Designing a Semantic Ground Truth for Mathematical Formulae 39

2.1 Semantic Symbol Ground Truth Set

Our approach to constructing the semantic symbol ground truth set is to annotate the mathematical symbols occurring in a syntactic ground truth set similar to [6]. Annotations will be based on the following three levels:

1. Subject area
2. Usage of a symbol
3. Definition within a given context.

The three different types of annotations enable the description of a symbol’s semantics on three different levels of granularity.

Subject Area. Each symbol will have an annotation attribute for its origin in some mathematical field, which will correspond to the first two digits of the AMS Mathematics Subject Classification of 2000 [7]. This refers to the general mathematical field the document belongs to from which the particular symbol was extracted. Symbols, as well as documents, can quite correctly have multiple different classifications, and classifications for individual symbols can differ from the classification of the document as a whole. We intend to record these different classifications so that we can mine the information to obtain probabilistic heuristics for identifying the classification of symbols in various contexts.
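A minimal sketch of how such classifications might be recorded and mined is given below. It is an assumption-based illustration, not the project's implementation: the names msc_top_level and SubjectAreaStats are hypothetical, the "probabilistic heuristic" is reduced to relative frequencies, and the sample codes are only example inputs.

```python
from collections import Counter, defaultdict


def msc_top_level(code):
    """Return the first two digits of an MSC code such as '20B30'."""
    code = code.strip()
    if len(code) < 2 or not code[:2].isdigit():
        raise ValueError(f"not an MSC code: {code!r}")
    return code[:2]


class SubjectAreaStats:
    """Counts how often each symbol was seen under each top-level subject area."""

    def __init__(self):
        self._counts = defaultdict(Counter)

    def record(self, symbol, msc_code):
        self._counts[symbol][msc_top_level(msc_code)] += 1

    def probabilities(self, symbol):
        """Relative frequency of each subject area for the given symbol."""
        counts = self._counts.get(symbol, Counter())
        total = sum(counts.values())
        return {area: n / total for area, n in counts.items()} if total else {}


stats = SubjectAreaStats()
for code in ("20B30", "20D05", "26A15"):
    stats.record("g", code)
print(stats.probabilities("g"))
```

Because a symbol can carry several classifications, the counts are kept per symbol rather than per document.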

Usage of Symbol. A common problem with the correct interpretation of mathematical symbols is that they often have different meanings depending on the overall mathematical area or the local context in which a formula occurs. Therefore, one of the semantic annotations will record the exact mathematical usage of each symbol, e.g., is it a function symbol, an operator, a relation etc., in the formula from which it was extracted. For example the following two formulae give the symbol g two distinct meanings:

$g \in G$ (1)
$g \in B^A$ (2)

In (1) g is declared as an element of a group G, whereas in (2) g represents a function with domain A and co-domain B. Consequently the former would be annotated as an ordinary symbol while the latter would be annotated as a function symbol. The usage of g can then be interpreted differently in other expressions. For instance, in the expression

$g(hk)$

according to the semantic usage in (1), it would be part of a multiplication within a group, while it has to be interpreted as a function application according to the semantics given to it in (2).
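The dependence of the reading of g(hk) on the usage annotation can be sketched as follows. This is a toy illustration: the function name interpret_juxtaposition and the tuple-based output are hypothetical, and the usage labels "ordinary" and "function" are taken from the annotations named above.

```python
def interpret_juxtaposition(head, argument, usage):
    """Interpret 'head(argument)' according to the usage annotation of head.

    `usage` maps a symbol to its usage annotation, e.g. "ordinary" or "function".
    """
    kind = usage.get(head)
    if kind == "function":
        return ("apply", head, argument)   # function application
    if kind == "ordinary":
        return ("times", head, argument)   # product, e.g. within a group
    raise ValueError(f"no usage annotation for {head!r}")


# Usage as in formula (1): g is an ordinary symbol (a group element).
print(interpret_juxtaposition("g", "hk", {"g": "ordinary"}))
# Usage as in formula (2): g is a function symbol.
print(interpret_juxtaposition("g", "hk", {"g": "function"}))
```

The same glyph sequence thus yields two different content-level readings, which is the information a purely syntactic ground truth cannot supply.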


Definition. The most fine-grained semantic annotation will be based on the mathematical definition of a particular symbol in the context of the particular document it has been extracted from. We will use, as far as possible, the definition given in OpenMath content dictionaries [4] as annotations.

2.2 Semantic Expression Ground Truth Set

While we can attach semantics to individual symbols, and, to a certain extent, relationships between neighbouring symbols, this does not extend to whole expressions where the relationships are between neighbouring sub-expressions, rather than simply symbols. Before we can hope to attach such expression semantics, therefore, we need to identify sub-expressions for semantics to be attached to. We therefore propose to build abstract syntax trees (ASTs) for the expressions in our set and attach semantics to the nodes in these.

Since the leaves of the ASTs would consist of the single characters and symbols in the expression, they would automatically have the annotations of the symbol ground truth. The inner nodes of an AST would then inherit the subject area annotation. While an inner node would not have a usage annotation, it will be annotated with the definition corresponding to the semantics of the sub-expression rooted in this node.

For example, an expression of the form (1 2 3) in group theory has as symbol annotations open fence, three ordinaries, and closed fence, while the three ordinaries in turn have definition annotations as integers. The AST representing the entire expression then will have a separate definition annotation, which will be permutation.
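The annotated AST for this example can be sketched with a small node type. The class and helper names (Node, leaf, inner) are hypothetical, the subject area value "20" is an assumed MSC top-level code for group theory, and the rule that an inner node inherits a single shared subject area from its children is a simplifying assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    subject_area: str
    definition: Optional[str] = None
    usage: Optional[str] = None       # set on leaves only
    symbol: Optional[str] = None      # set on leaves only
    children: List["Node"] = field(default_factory=list)


def leaf(symbol, usage, subject_area, definition=None):
    """A leaf carries the annotations of the symbol ground truth."""
    return Node(subject_area, definition, usage, symbol)


def inner(children, definition):
    """An inner node inherits the subject area and gets its own definition."""
    areas = {child.subject_area for child in children}
    if len(areas) != 1:
        raise ValueError("children disagree on subject area")
    return Node(areas.pop(), definition, children=list(children))


AREA = "20"  # assumed top-level code for group theory
root = inner(
    [
        leaf("(", "open fence", AREA),
        leaf("1", "ordinary", AREA, "integer"),
        leaf("2", "ordinary", AREA, "integer"),
        leaf("3", "ordinary", AREA, "integer"),
        leaf(")", "closed fence", AREA),
    ],
    definition="permutation",
)
```

The root has no usage annotation, only the definition of the sub-expression it spans, matching the description of inner nodes above.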

3 Automated Generation

We intend to assemble the semantic ground truth set based on the machinery designed and implemented for the syntactic ground truth set presented in [6]. This facilitates the automatic recognition of mathematical symbols using the Infty system with subsequent manual correction. We are extending these tools to enable the handling and storage of semantic annotations.

While the subject classification will be entered globally for all symbols from an article, the other two semantic annotations will be entered using a semi-automated approach, via a hangman-style completion mechanism. That is, for each article in the ground truth set, symbols occurring in mathematical expressions will be annotated manually. If the same symbol occurs elsewhere in the same article, it will automatically be given the same annotation. Thus the number of symbols that need to be annotated should gradually decrease, with the majority of the work spent on manually checking and correcting annotations where the completion yields incorrect results (e.g., if one symbol is used with different semantic meanings in different contexts).
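The completion step can be illustrated with a few lines of code. This is a minimal sketch assuming that symbols are plain strings and that manual annotations are keyed by position; the function name `complete` is hypothetical and the real tools may store annotations quite differently.

```python
def complete(symbols, manual):
    """Propagate manual annotations to repeated symbols within one article.

    symbols: symbol strings in reading order.
    manual:  dict mapping a position to the annotation entered by a human.
    Returns one annotation (or None) per position.
    """
    defaults = {}
    for pos in sorted(manual):
        # The first manual annotation of a symbol becomes its default.
        defaults.setdefault(symbols[pos], manual[pos])
    # A manual entry at a position always overrides the propagated default.
    return [manual.get(pos, defaults.get(sym)) for pos, sym in enumerate(symbols)]


symbols = ["x", "R", "y", "x", "R", "x"]
completed = complete(symbols, {0: "variable", 1: "relation"})
corrected = complete(symbols, {0: "variable", 1: "relation", 5: "group element"})
```

In `completed` only `y` is still unannotated, and in `corrected` the manual entry at the last position replaces the automatically propagated annotation, which corresponds to the correction case mentioned above.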

Moreover, we intend, as much as possible, to exploit automatic classification of symbols with respect to the second semantic annotation for basic mathematical usage of symbols. We have recently developed a mechanism to categorise symbols based on spatial relations between symbols occurring in a formula [3]. The spatial analysis is based on re-engineering the basic layout rules which are traditional in mathematical typesetting and which are also employed by the LaTeX system. The mechanism is currently employed for formula recognition from PDF documents [2]; however, it will be extended to work for formula recognition in a more general context, such as from scanned documents.

As a simple example of the working of this algorithm consider the following two formulae:

xRy → yRx          (3)
x R y → y R x      (4)

Here the spacing between the symbols in (3) would not distinguish the three occurring letters, which would therefore all be classified as ordinary symbols. On the other hand, the increased spacing between the R and x, y in (4) would automatically identify R as a relation symbol, which is indeed its intended meaning in the given formula.
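The effect of spacing on classification can be sketched in code. This is not the algorithm of [3]; it is a deliberately simplified illustration in which the glyph boxes, the threshold `wide_gap`, and the function name are all assumptions of ours.

```python
def classify_by_spacing(glyphs, wide_gap=2.0):
    """Label glyphs as 'relation' or 'ordinary' from horizontal gaps alone.

    glyphs: list of (char, left, right) boxes in reading order.
    A glyph with a gap of at least `wide_gap` on both sides is labelled
    'relation'; every other glyph is labelled 'ordinary'.
    """
    labels = []
    for i, (char, left, right) in enumerate(glyphs):
        gap_before = left - glyphs[i - 1][2] if i > 0 else 0.0
        gap_after = glyphs[i + 1][1] - right if i + 1 < len(glyphs) else 0.0
        wide = gap_before >= wide_gap and gap_after >= wide_gap
        labels.append((char, "relation" if wide else "ordinary"))
    return labels


# Hypothetical boxes for "xRy" set tightly, as in (3), and spaced, as in (4).
tight = [("x", 0.0, 5.0), ("R", 5.3, 11.3), ("y", 11.6, 16.6)]
spaced = [("x", 0.0, 5.0), ("R", 7.8, 13.8), ("y", 16.6, 21.6)]

tight_labels = classify_by_spacing(tight)
spaced_labels = classify_by_spacing(spaced)
```

With these made-up coordinates all three glyphs of the tight variant come out as ordinary, while the R of the spaced variant is labelled as a relation.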

While these techniques can assist in assigning semantics to symbols, they are not sufficient for annotating at the level of expressions in ASTs. Normally, ASTs are the output of a parser using a set of grammar rules. Here, we want to use the resulting ground truth set to deduce the appropriate grammar rules. Therefore, we must construct the ASTs manually, to correspond to how a human mathematician understands the expression. With such ASTs, we can then attach semantic interpretations to the nodes to complete our semantic expression ground truth set.

Constructing ASTs manually is a particularly arduous task. We are implementing a tool that uses previous work on formula recognition on PDF documents [2,3] to construct an initial proposed AST, and provide a convenient user interface for manipulating this tree into one closer to what the human mathematician understands. At this point, semantic annotations for nodes can be added so that, in the binomial coefficient/vector problem mentioned in the introduction, the AST would be the same in both cases but the semantic annotation would be different. Since our previous recognition tool can already produce some Content MathML, we hope to exploit this to aid the manual annotation.

4 Conclusion

We have outlined our current project of creating a semantic ground truth set that should help to extend mathematical formula recognition techniques to produce semantically marked-up results. We are currently preparing the basic machinery to assemble the semantic ground truth set. One obstacle is that we cannot simply base it on the existing syntactic ground truth set [6] due to copyright issues. While this implies that we will have to start from scratch, it will also give us the opportunity for a new, careful selection of documents to include that will obviate any future copyright or non-open access issues.


We currently envisage that the annotation with respect to subject area and general usage of symbols is fairly straightforward and can be mostly automated, while entering the exact mathematical definitions will need to be done manually and might be more difficult to complete for the entire set.

Other potential problems that could occur in the construction of a semantic ground truth data set are:

– Ambiguities in the meaning of mathematical notation cannot be resolved by considering a single article of the ground truth set, but will need background knowledge of the mathematical literature in the field.

– The current semantic formalisations given in the OpenMath content dictionaries are not sufficient for annotating the given data.

– The OpenMath formalisations are not at the right level to give semantic meaning to “human oriented” mathematics.

Since, to the best of our knowledge, no work has been done to create semantic ground truth in the context of document analysis, we currently have no comparison to assess these factors. However, we strongly believe that if our attempt is successful the work could potentially be of great impact for mathematical formula recognition.

References

1. W. Aly, S. Uchida, A. Fujiyoshi, and M. Suzuki. Statistical classification of spatial relationships among mathematical symbols. In: Proceedings of ICDAR 2009, pages 1350–1354. IEEE Society Press, 2009.

2. J. Baker, A. Sexton, and V. Sorge. A linear grammar approach to mathematical formula recognition from PDF. In: Proceedings of Intelligent Computer Mathematics, LNAI. Springer Verlag, Germany, 2009.

3. J. Baker, A. Sexton, and V. Sorge. Faithful mathematical formula recognition from PDF documents. In: Proceedings of DAS 2010, 2010. Forthcoming.

4. S. Buswell, O. Caprotti, D. P. Carlisle, M. C. Dewar, M. Gaëtano, and M. Kohlhase. The OpenMath Standard. The OpenMath Society, June 2004.

5. M. Suzuki, F. Tamari, R. Fukuda, S. Uchida, and T. Kanahori. Infty: an integrated OCR system for mathematical documents. In: Proceedings of ACM Symposium on Document Engineering, pages 95–104. ACM Press, 2003.

6. M. Suzuki, S. Uchida, and A. Nomura. A ground-truthed mathematical character and symbol image database. In: Proceedings of ICDAR 2005, pages 675–679. IEEE Society Press, 2005.

7. The American Mathematical Society. 2000 Mathematics Subject Classification, 2000. http://www.ams.org/msc/.

8. J. van Beusekom, F. Shafait, and T. M. Breuel. Automated OCR ground truth generation. In: Proceedings of DAS 2008, Sep 2008.


Part III

DML Building Experience

PDF Enhancements Tools for a Digital Library
pdfJbIm and pdfsign

Radim Hatlapatka and Petr Sojka

Masaryk University, Faculty of Informatics
Botanická 68a, 602 00 Brno, Czech Republic
[email protected], [email protected]

Abstract. This paper describes several innovative PDF document enhancements and tools that can be used when building a digital library. The main result presented in this paper is pdfJbIm, a PDF re-compression tool developed using the jbig2enc encoder. This re-compression tool reduces the size of the original bitonal PDFs by one third on average. Some modifications to the jbig2enc encoder that increase the compression ratio even further are also described here. Together with another program, pdfsizeopt.py by Péter Szabó, we have managed to decrease PDF storage size to such an extent that the transmission needs of a digital library were significantly reduced. We report the storage savings that we have achieved on the Czech Digital Mathematics Library DML-CZ: we have downsized the PDF corpus to 43% of its original size. We also describe the pdfsign tool for batch digital signature stamping of PDF documents.
Key words: jbig2enc, JBIG2, PDF size optimization, compression, DML, digital signature, JB2, DjVu, pdfJbIm, pdfsign, DML-CZ, EuDML

Smaller is faster and safer too. (Stephen Adams, Google)

1 Motivation

PDF is the most frequently used format for digital libraries today. Although storage size is not the most important aspect when storing data, it must be taken into account when delivering hundreds of thousands of documents to users. PDF allows each digital object in a document to be compressed by a different method, and the set of supported methods has grown as the PDF format has evolved.

The standard applications used to generate PDFs are usually inadequate for the task of generating optimally sized files. Global optimizations, such as choosing the most appropriate compression and object decomposition, can be performed only during PDF postprocessing. Postprocessing is the best way of optimizing the size and the page-at-a-time rendering for in-browser PDF rendering and for delivering PDFs from a digital library.

During the course of the Czech Digital Mathematical Library project [1] (http://dml.cz), we started to experiment with different ways of PDF optimization and delivery. As most of the files are two-layer PDFs with a scanned bitonal bitmap in front and the OCR-recognized text behind, the recent JBIG2 standard, developed for one-bit-depth scanned images, emerged as the most suitable for storing this kind of digitized data. As we also wanted our data to be distinguishable from other versions claiming the same content, we decided to digitally sign every PDF that resulted from our project.

The JBIG2 standard and its formats are described in Section 2. Section 4 presents our application pdfJbIm, which enhances the compression ratio achieved by the improved leading-edge software; it is then compared with the pdfsizeopt.py program in Section 5. The main results are discussed in Section 6. An approach to using digital signature support in PDFs, and the pdfsign tool which supports this, are described in Section 7. We conclude with a summary and ideas for future work in the final Section 8.

Not only are PDF files that contain JBIG2 compressed information easier to send and share, but they are easier to store, they display rapidly online, and they are OCR ready. (Franc Gagnon, 2010)

2 Introduction to JBIG2

JBIG2 is a standard for the compression of bitonal images developed by the Joint Bi-level Image Experts Group. Bitonal images consist of two colors only (usually black and white), and their main use is scanned text. JBIG2 was published in 2000 as the international standard ITU T.88 [17] and one year later as ISO/IEC 14492 [6]. It typically generates files that are three to five times smaller than Fax Group 4 and two to four times smaller than JBIG1, the previous standard released by the Joint Bi-level Image Experts Group [12]. JBIG2 also supports lossy compression, which increases the compression ratio several times without any noticeable visual difference when compared with lossless mode. Lossy compression without noticeable lossiness is called perceptually lossless coding. Scanned text often contains flyspecks (tiny pieces of dirt); perceptually lossless coding can help remove them and thus increase the quality of the output image.

2.1 Basic Principles of JBIG2

The content of each page may be segmented into three kinds of regions: text, halftone and generic. Text regions contain text, halftone regions contain halftone images¹ and generic regions contain everything else. In some situations it is better to use text regions rather than generic ones, and in other situations the converse is true. A JBIG2 encoder segments text regions into their constituent symbols, which must then be encoded. JBIG2 uses modified versions of arithmetic and Huffman coding. Huffman coding is used mostly by faxes because of its lower computational demands, even though arithmetic coding gives slightly better results.

1 More about halftone can be found at http://en.wikipedia.org/wiki/Halftone


    2 0 obj
    << /DecodeParms << /JBIG2Globals 1 0 R >>
       /Width 3265
       /BitsPerComponent 1
       /Height 4911
       /Filter /JBIG2Decode
       /Subtype /Image
       /Length 4582
       /ColorSpace /DeviceGray
       /Type /XObject
    >>
    stream
    ...
    endstream

Fig. 1. Example of storing JBIG2 image in PDF document

JBIG2 also supports multi-page compression, which is used by symbol coding (the coding of text regions). Any symbol that is frequently used on more than one page is stored in a global dictionary. Such symbols need to be stored only once, thereby reducing the space needed to store documents.
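The saving from a global dictionary can be illustrated with a toy model. The sketch below assumes that symbols have already been identified and are represented by plain identifiers; the function name `split_dictionaries` and the rule "appears on more than one page" are our simplification, not the selection logic of any particular encoder.

```python
from collections import Counter


def split_dictionaries(pages):
    """Split symbols into one global dictionary and per-page local ones.

    pages: list of lists of symbol identifiers, one list per page.
    Returns (global_symbols, list_of_local_symbol_sets).
    """
    page_count = Counter()
    for page in pages:
        page_count.update(set(page))      # count pages, not occurrences
    global_syms = {s for s, n in page_count.items() if n > 1}
    local_syms = [set(page) - global_syms for page in pages]
    return global_syms, local_syms


pages = [["a", "b", "a", "int"], ["a", "c", "b"], ["a", "d"]]
global_syms, local_syms = split_dictionaries(pages)

# Number of symbol bitmaps stored with and without a shared dictionary.
stored_separately = sum(len(set(page)) for page in pages)
stored_shared = len(global_syms) + sum(len(local) for local in local_syms)
```

For these three toy pages, eight bitmaps would be stored with independent per-page dictionaries but only five with a shared one.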

2.2 JBIG2 in PDF

Since PDF version 1.4 (2001, Acrobat 5, see the 3rd edition of the PDF Reference book), support for the JBIG2Decode filter [15] has been built in. This allows the use of images compressed according to the JBIG2 standard, and it has allowed JBIG2 to spread quickly and widely without placing any burden on the end user.

PDF discards headers and some other data from JBIG2 images and sends this information to the PDF dictionary associated with the image object stream, as can be seen in Figure 1.

2.3 JB2 in DjVu

DjVu [4] is an open digital document format with advanced compression technology and high performance value. It was developed at AT&T Labs to solve the problem of transporting documents containing high-resolution scanned images via the Internet, and it became an alternative format to PDF. DjVu was initially considered to be vastly superior to PDF, but since PDF version 1.4 (which added support for JBIG2) this is no longer the case.

DjVu-encoded images consist of three parts: a foreground image, a background image and a mask image. The first two are low-resolution colored images and the last is a high-resolution bi-level image. The mask image shows whether a color from the background or from the foreground image should be used. For compressing the foreground and the background image, a compression algorithm called IW44 is used. For compressing the mask image, a method called JB2 is used.
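How the three layers combine can be shown on a tiny example. This is only a sketch: it assumes that a mask value of 1 selects the foreground, and it ignores the resolution difference between the layers, which a real decoder would handle by upsampling the two color images.

```python
def compose(mask, foreground, background):
    """Pick the foreground colour where the mask is 1, else the background.

    All three arguments are row-major lists of equal size (an assumption
    made for brevity; real DjVu colour layers have a lower resolution).
    """
    return [
        [fg if m else bg for m, fg, bg in zip(mrow, frow, brow)]
        for mrow, frow, brow in zip(mask, foreground, background)
    ]


mask = [[0, 1, 0],
        [1, 1, 1]]
foreground = [["black"] * 3, ["black"] * 3]
background = [["cream"] * 3, ["cream"] * 3]

page = compose(mask, foreground, background)
```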


The JB2 algorithm is a variation of AT&T’s proposal to the then upcoming JBIG2 standard. The basic ideas behind JB2 are very similar to JBIG2. The basic image is first segmented into individual marks (connected components of black pixels). The marks are clustered hierarchically based on similarity using an appropriate distance measure. When the image is coded, an identifying index and a position relative to that of the previous mark are specified for each mark (symbol). Marks are coded using a statistical model and arithmetic coding. Some are coded directly and some are coded indirectly based on previously coded marks. If it is the first occurrence of the mark, it is coded directly, the bitmap also being coded; if not, it is coded indirectly, with only its position and reference being coded.
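The distinction between directly and indirectly coded marks can be sketched as follows. The sketch replaces similarity clustering by exact bitmap equality and leaves out the statistical model and arithmetic coding entirely; the record layout and the name `encode_marks` are hypothetical.

```python
def encode_marks(marks):
    """Turn a sequence of marks into direct and indirect records.

    marks: list of (bitmap, x, y) in reading order, with hashable bitmaps.
    Positions in the records are relative to the previous mark.
    """
    index_of = {}                 # bitmap -> index in the dictionary
    records = []
    prev_x, prev_y = 0, 0
    for bitmap, x, y in marks:
        dx, dy = x - prev_x, y - prev_y
        if bitmap not in index_of:
            # First occurrence: the bitmap itself has to be coded.
            index_of[bitmap] = len(index_of)
            records.append(("direct", index_of[bitmap], bitmap, dx, dy))
        else:
            # Repeat: only the reference and the position are coded.
            records.append(("indirect", index_of[bitmap], dx, dy))
        prev_x, prev_y = x, y
    return records


E = ("111", "100", "111")
L = ("100", "100", "111")
records = encode_marks([(E, 10, 50), (L, 18, 50), (E, 26, 50)])
```

The third record carries no bitmap at all, which is where the compression gain on repeated symbols comes from.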

For the clustering and the conditional encoding of marks, an algorithm called “soft pattern matching” [11] is currently used.

Unlike OCR, JB2 coding avoids the problem of substitution errors. An imperfectly scanned symbol (e.g. due to noise) can be improperly matched and treated as a totally different symbol; this is one of the reasons why no OCR engine has one hundred percent accuracy. OCR has to substitute some symbol, whereas JB2 and JBIG2 can simply place the uncertain shape in the dictionary as a new symbol instead of risking an incorrect substitution.

The most simple methods are used that do not significantly sacrifice performance. (Adam Langley)

3 Jbig2enc and its Improvement

Jbig2enc [13] is an open-source encoder developed with Google support by Adam Langley under an Apache License. It uses the Leptonica library [2] for image manipulation. Leptonica takes care of rendering text, comparison of symbol components, aligning, etc.

Jbig2enc supports only arithmetic coding, and instead of halftone regions it uses generic regions. It has embedded support for creating output in a format suitable for PDF documents.

If we use symbol coding, some kind of lossy coding that aims to be perceptually lossless is always possible. This is because almost all symbols in a document vary slightly, since they are not identical at the pixel level. For this purpose a thresholding value is used in jbig2enc; it says how similar two symbols must be to be considered equivalent. The default value used is 0.85, meaning that two symbols must be at least 85% the same to be considered equivalent. The maximum value allowed by the jbig2enc encoder is 0.9.²
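The role of the threshold can be illustrated with a toy similarity measure. Note the assumption: the sketch scores similarity as the fraction of matching pixels, which is a stand-in of ours and not the classifier that Leptonica actually uses; only the two threshold values come from the text.

```python
def similarity(a, b):
    """Fraction of pixel positions at which two equal-sized bitmaps agree."""
    total = sum(len(row) for row in a)
    same = sum(pa == pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return same / total


def equivalent(a, b, threshold=0.85):
    """True if the two bitmaps may share one dictionary entry."""
    return similarity(a, b) >= threshold


clean = ["01110", "10001", "11111", "10001", "10001"]
noisy = ["01110", "10001", "11011", "10011", "00001"]  # 3 of 25 pixels differ

score = similarity(clean, noisy)
merged_default = equivalent(clean, noisy)                 # threshold 0.85
merged_strict = equivalent(clean, noisy, threshold=0.9)   # threshold 0.9
```

With 22 of 25 pixels in agreement the two bitmaps are merged at the default threshold but kept apart at 0.9, which shows how a higher threshold trades compression for fidelity.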

3.1 Modification of Jbig2enc

We are improving the perceptually lossless coding of the encoder jbig2enc, which is a component of pdfJbIm (see Section 4).

2 A value of 0.85 is usually more than sufficient, but for very poor quality documents it might not be optimal.

http://www.apache.org/licenses/LICENSE-2.0


At this point in its development we are trying to find accumulations of differences between pairs of symbols³. To obtain the differences between two symbols we apply the XOR operation and get their XORed bitmap.

To find local differences we divide the XORed bitmap (represented as a matrix) into submatrices⁴ and count the foreground pixels of four adjacent submatrices to see whether they contain enough differing pixels to form a line or a point. Four adjacent submatrices are counted because an accumulation could lie exactly on the border of a submatrix. We use a different segmentation (different sizes of subimages) for each type of shape. This method enables us to find accumulations of differences shaped as points or as vertical, diagonal, or horizontal lines.

If we find an accumulation bigger than a computed threshold value (computed differently for each symbol size), we consider the two symbols to be different. If we do not find such an accumulation of differences, we consider the symbols equivalent: all references to these templates are set to point to only one template and the second one is deleted.
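The idea of looking for local accumulations in the XORed bitmap can be sketched as follows. This is a simplified illustration and not the code of the modified jbig2enc: it uses a single square cell size and a fixed limit, whereas the text describes shape-specific segmentations and size-dependent thresholds. The parameters `cell` and `limit` and all function names are assumptions.

```python
def xor_bitmap(a, b):
    """Pixel-wise XOR of two equal-sized bitmaps given as lists of 0/1 rows."""
    return [[pa ^ pb for pa, pb in zip(ra, rb)] for ra, rb in zip(a, b)]


def cell_counts(diff, cell):
    """Count differing pixels in each cell x cell submatrix."""
    rows = (len(diff) + cell - 1) // cell
    cols = (len(diff[0]) + cell - 1) // cell
    counts = [[0] * cols for _ in range(rows)]
    for y, row in enumerate(diff):
        for x, bit in enumerate(row):
            counts[y // cell][x // cell] += bit
    return counts


def has_accumulation(a, b, cell=2, limit=3):
    """True if some 2 x 2 group of adjacent cells holds more than `limit`
    differing pixels (bitmaps must span at least two cells each way)."""
    counts = cell_counts(xor_bitmap(a, b), cell)
    for r in range(len(counts) - 1):
        for c in range(len(counts[0]) - 1):
            window = (counts[r][c] + counts[r][c + 1]
                      + counts[r + 1][c] + counts[r + 1][c + 1])
            if window > limit:
                return True
    return False


def with_pixels(pixels, size=8):
    """Build a size x size bitmap with the given (row, column) pixels set."""
    bitmap = [[0] * size for _ in range(size)]
    for y, x in pixels:
        bitmap[y][x] = 1
    return bitmap


base = with_pixels([])
clustered = with_pixels([(3, 3), (3, 4), (4, 3), (4, 4)])  # straddles cell borders
scattered = with_pixels([(0, 0), (0, 7), (7, 0), (7, 7)])

clustered_differs = has_accumulation(base, clustered)
scattered_differs = has_accumulation(base, scattered)
```

Both test bitmaps differ from the base in exactly four pixels, yet only the clustered one is reported as different; the cluster is found even though it lies on the border of four cells, which is the reason for summing adjacent submatrices.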

This method for comparing symbols is run as an additional method after the Leptonica classifier. With this new algorithm we are able to increase the compression ratio of the encoder jbig2enc by about eight percent, even for images with relatively bad quality.

We are now working on embedding OCR tools and techniques that will enhance the comparison process of two symbols. It should allow us to bring the size of the output image as close as possible to the size of born-digital text.

As Figure 2 shows, the image before and after compression look the same at first sight, but the size of the compressed image is less than 70% of the original one. They are not exactly the same, however: there are slight differences, which are shown in the third image in Figure 2. Removing the slight differences between instances of the same symbol would improve the quality of the output image.

Running jbig2enc removes some of the differences between instances of the same symbol, which can make the output quality appear either better or worse. It is crucial that the right representative, the one that will stay in the text, is chosen. To guarantee an improvement in quality we always need to choose the best symbol from the equivalent ones. At this stage of development the first symbol is used as the representative symbol; this gives the same result as a random choice (the quality remains mostly the same).

Embedding OCR tools could also help us to choose which symbol would be better as the representative one.

3 Working only with templates (representative symbols) returned by Leptonica
4 Subimages, bitmap segments


Fig. 2. Example of part of the page before (upper frame) and after (middle frame) compression by jbig2enc: the lower frame shows the differences

I don’t paint things. I only paint the difference between things. (Henri Matisse)

4 PdfJbIm

PdfJbIm [9,10] is a tool written in Java for re-compressing bitonal images placed in a PDF using the symbol coding of the jbig2enc encoder. It has been developed for DML-CZ [8] and is still under development, with the goal of being used in EuDML [14]. Its main purpose is to decrease the size of PDF documents containing scanned text (mostly mathematical) and to make it easier to transfer such documents via the Internet, so that download time and costs can be reduced.

PdfJbIm replaces images with their re-compressed versions. It uses the jbig2enc encoder and two libraries for this purpose: PDFBox [7] and iText [5]. PDFBox is used to extract raw image data and convert them to a suitable image format. iText is used for decrypting the PDF if necessary and for replacing images with their re-compressed versions. Information about the images (their position and dimensions) is remembered during the extraction process.

Jbig2enc allows the use of symbol coding, which is suitable for scanned text. If symbol coding is engaged, it segments an image containing text into components (in most cases one component contains exactly one symbol) and compares them. All identical components are put into a dictionary, after which only references to the dictionary are used.

PdfJbIm can use the modified version of jbig2enc to achieve better results; this option is enabled by a command-line argument.

Premature optimization is the root of all evil (or at least most of it) in programming. (Donald Ervin Knuth)

5 Pdfsizeopt.py

Pdfsizeopt.py [16] is a script written in Python (under the GNU General Public License). It combines different Unix tools and scripts to optimize the size of PDF files without causing any damage. To optimize content streams, the recommended procedure is to use the commercial optimizer PDF Enhancer or Adobe Acrobat first, and then to run pdfsizeopt.py to optimize mainly images and Type1 fonts. Pdfsizeopt.py also uses the Multivalent tool.pdf.Compress to do most of the remaining work. If Multivalent is installed, pdfsizeopt.py will run it automatically.

Pdfsizeopt.py also removes duplicate and unused data, serializes strings more effectively, compresses streams by high-effort ZIP, and removes page thumbnails, since they can be created whenever they are needed. For these purposes it uses many different tools, e.g. ghostscript, pdftk, jbig2enc, sam2p, pngout, Multivalent and png22pnm. For example, it uses ghostscript to convert fonts to CFF (Compact Font Format; Type 2, Type 1C). Then it unifies subsets of the same fonts.

Optimized PDF files created by pdfsizeopt.py use some features that have been part of PDF only since version 1.5, which means that Acrobat 6 or newer is required for viewing them.

We must develop knowledge optimization initiatives to leverage our key learnings. (Scott Adams)

6 Combining Pdfsizeopt.py and PdfJbIm

To present the results of these two optimizers we have applied them to data from the DML-CZ project. We used PDF documents from the journal Applications of Mathematics, which contains 19,690 pages in 1,799 papers. These documents are two-layer documents with OCR text for indexing and search (with possible errors) hidden behind the scanned bitmap.

In most documents the threshold value 0.85 is a sufficient guarantee of no visible loss, but for documents of very poor quality this does not suffice: we found some visible losses for some letters of the Cyrillic alphabet. For this reason we use a threshold value of 0.9, which seems to be safe.

In Table 1, we present the results of running the optimizers pdfJbIm and pdfsizeopt.py. To retrieve statistical data we use pdfsizeopt.py with the option --stats before and after running the optimizer pipeline. The table shows how much is stored in each part of the PDF document, as well as the size of the whole document, as average values obtained from the optimized PDFs.

The most significant items for comparison are the rows image objects and other objects, because pdfJbIm only tries to reduce the size of images and does not optimize other parts of PDF documents. The row image objects counts only the size of objects specified in the PDF as image objects, without the size of the objects they merely reference. The size of such referenced objects is counted in the row other objects (the size of a global dictionary is counted there as well).

Table 1. Average sizes (in bytes) of each type of PDF object stored in multi-page, two-layered (bitmap + OCR below) PDF documents created by pdftk. PdfJbIm uses the modified jbig2enc with symbol coding enabled. The threshold value is set to 0.9. Pdfsizeopt.py uses Multivalent and the generic coding of jbig2enc, and does not use pngout (which has minimal effect either way, because JBIG2, not ZIP, is the most common compression method used)

                            Original   After using   After using     After using
                            PDF        pdfJbIm       pdfsizeopt.py   both

Total size (in kB)          1,424      1,128         733             618
Content objects (in kB)     55         55            55              55
Font data objects (in kB)   464        464           77              77
Header                      15         15            15              15
Image objects (in kB)       770        415           584             411
Linearized Xref table       0          0             0               0
Other objects (in kB)       127        185           17              75
Separator data              25         23            23              23
Trailer                     120        121           107             102
Wasted between objects      0          0             0               0
Xref table                  7,937      7,957         497             523

Table 2 shows how much the size of PDF documents and the sizes of image data were reduced using pdfJbIm and pdfsizeopt.py in comparison with the original PDF. The results are given as the size of the optimized PDFs as a percentage of the original PDF file. The same PDF corpus is used as in the previous table.

Image data in this case are counted together with the data of other objects because the global dictionary is stored as a separate object, whose size is counted in the row other objects. Unfortunately, the global dictionary is not the only thing stored as an other object. To at least partly distinguish which part of the data stored in other objects is the global dictionary, we use the size of other objects after running both optimizers for the summary. For pdfsizeopt.py there is nothing more to optimize in the global dictionary data. By combining these two values, the results adequately indicate how effective each optimizer is at reducing the size of the image data.

Table 2. New size of PDF document in comparison with the original one (generated by FineReader and TeX).

                                         Original   After using   After using     After using
                                         PDF        pdfJbIm       pdfsizeopt.py   both

Size of whole PDF (in %)                 100        79.21         51.49           43.41
Size of image and other objects (in %)   62.96      34.44         42.21           34.12
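The percentages of Table 2 can be approximately re-derived from the rounded averages in Table 1. The short calculation below reflects our reading of the paragraph above, which is an assumption: for the columns in which pdfJbIm was run, the other objects value obtained after running both optimizers (75 kB) is used in place of the column's own value. Because Table 1 is rounded to whole kilobytes, the results agree with Table 2 only to within a few hundredths of a percentage point.

```python
ORIGINAL_TOTAL = 1424   # kB, "Total size" of the original PDFs in Table 1
OTHER_AFTER_BOTH = 75   # kB, "Other objects" after running both optimizers

# (total, image objects, other objects) in kB, copied from Table 1
table1 = {
    "original":      (1424, 770, 127),
    "pdfJbIm":       (1128, 415, 185),
    "pdfsizeopt.py": (733, 584, 17),
    "both":          (618, 411, 75),
}


def percent(part, whole=ORIGINAL_TOTAL):
    """Size as a percentage of the original total, rounded to two places."""
    return round(100.0 * part / whole, 2)


whole_pdf = {name: percent(total) for name, (total, _, _) in table1.items()}

image_and_other = {}
for name, (_, image, other) in table1.items():
    # Assumed reading: columns produced with pdfJbIm use the 75 kB figure.
    other_used = OTHER_AFTER_BOTH if name in ("pdfJbIm", "both") else other
    image_and_other[name] = percent(image + other_used)
```

For example, the whole-PDF figure after pdfJbIm is 1,128 / 1,424, which gives 79.21%, and the image figure after pdfsizeopt.py is (584 + 17) / 1,424, which gives 42.21%.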

As Table 2 shows, pdfJbIm gives significantly better results than pdfsizeopt.py for images stored in multi-page PDF documents. The best solution is to run pdfJbIm first and pdfsizeopt.py on the result. Running pdfJbIm first is better for performance: in most cases the images are then already compressed by JBIG2, and pdfsizeopt.py does not have to try other types of compression methods.

Instead of using optimizers, it could seem more beneficial to use the DjVu format instead of PDF. DjVu gives us smaller documents than unoptimized PDFs, but not as small as the optimized ones.

For comparison, we present the results given by running pdf2djvu on the same corpus as in the tables above. By transforming PDF documents to DjVu using pdf2djvu, the size of each document was reduced on average by 28.05%. This is about half the size reduction achieved by running both PDF optimizers (pdfJbIm and pdfsizeopt.py).

You can’t trust code that you did not totally create yourself. (Ken Thompson’s address on receipt of the 1983 Turing Award)

7 Digital Signature of PDF Documents

PDF allows documents to be signed digitally using a public key infrastructure (PKI), an architecture to verify the identities of people, web sites and computer programs on the Internet. Even sites delivering scientific content may want to prove, for several reasons, that the delivered documents originated in a trustworthy chain from author via publisher to the digital library. Digital signatures guarantee the identity of the data provider, confirming data integrity and making the authorship undeniable. This is important in cases of plagiarism, and it may also increase the credit of a project site; it may even increase the ranking of the site by some search engines. All PDF documents of the DML-CZ project are digitally signed with the DML-CZ certificate, and score high in Google searches.

It is hardly economical to sign thousands of PDFs manually, so a batch digital stamper program, pdfsign, has been developed [3]. It is implemented in Java using the iText library [5]. A hash function of the SHA-2 family (SHA-512) is used when signing. A Timestamp Authority is also used, as we expect the documents to be valid for a long time.
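The hashing step of such a batch workflow can be sketched with standard library calls only. This sketch is not pdfsign: it computes SHA-512 digests for every PDF in a directory but performs no PKI signing and no timestamping, and the function names and the `*.pdf` file pattern are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha512_of(path, chunk_size=1 << 20):
    """SHA-512 digest of a file, read in chunks to keep memory use low."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def digest_batch(directory):
    """Map each PDF file name in `directory` to its SHA-512 digest."""
    return {p.name: sha512_of(p) for p in sorted(Path(directory).glob("*.pdf"))}
```

A real signing tool would encrypt such a digest with the private key of the certificate and embed the result, together with a timestamp, in the PDF itself.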

http://www.abbyy.com/ocr-software/



pdfsign is applied at the very end of the PDF generation workflow, after all OCR, cover page typesetting and page merging, optimization and compression steps have been performed. Signing a document usually increases the PDF size by several bytes only, and does not depend on previously applied compression. For safety reasons, digital signatures are added on a different site from the one that houses the digital library.

A similar program, JsignPdf, has recently been posted at http://sourceforge.net/projects/jsignpdf/. It uses the older hash function SHA-1, which NIST does not consider safe. Unlike pdfsign, JsignPdf does not provide support for batch processing of large volumes of PDFs.

8 Conclusion and Future Work

We have shown how to significantly compress PDFs with bitonal scanned images: when running both optimizers (pdfJbIm and pdfsizeopt.py) we were able to reduce the size of PDF documents to less than half of the original size. We anticipate even better results once all the planned improvements to pdfJbIm are made. The improvements in PDF compression are orthogonal to the digital signature, which is added by the pdfsign batch PDF stamper software to increase the trustworthiness of the content of a digital library.

We are now developing the integration of OCR tools into jbig2enc to improve the comparison process in symbol coding. This approach should help us to decide which symbol to choose as a representative one, to further improve page rendering and to increase readability.

Acknowledgement

We acknowledge the partial support of the European Commission, under the Grant Agreement #250,503 (Project EuDML), and the partial support of Masaryk University by a grant for student research assistants #22,525/2010.

References

1. Bartošek, M., Lhoták, M., Rákosník, J., Sojka, P., Šárfy, M.: DML-CZ: The Objectives and the First Steps. In: Borwein, J., Rocha, E.M., Rodrigues, J.F. (eds.) CMDE 2006: Communicating Mathematics in the Digital Era, pp. 69–79. A. K. Peters, MA, USA (2008)

2. Bloomberg, D.: Leptonica. [online] (2010), [cit. 2010-04-25], http://www.leptonica.com/jbig2.html

3. Bocák, P.: Digitáne podpisované PDF dokumenty (Bachelor thesis written in Czech, Digital signatures of PDF documents). Masaryk University, Faculty of Informatics (advisor Petr Sojka), Brno, Czech Republic (2008)

4. Bottou, L., Haffner, P., Howard, P.G., Simard, P., Bengio, Y., Le Cun, Y.: High Quality Document Image Compression with DjVu. Journal of Electronic Imaging 7(3), 410–425 (1998), http://leon.bottou.org/papers/bottou-98

http://project.dml.cz/docs/dmlcz-workflow-en.pdf



5. Bruno, L.: IText PDF. [online] (2009), http://www.itextpdf.com/

6. Committee, J.: 14492 FCD. ISO/IEC JTC 1/SC 29/WG 1 (1999), http://www.jpeg.org/public/fcd14492.pdf

7. Foundation, T.A.S.: Apache PDFBox – Java PDF Library. [online] (2010), http://pdfbox.apache.org/

8. Hatlapatka, R.: JBIG2 komprese (Bachelor thesis written in Czech, JBIG2 compression). Masaryk University, Faculty of Informatics (advisor Petr Sojka), Brno, Czech Republic (2010)

9. Hatlapatka, R.: PDF Recompression using JBIG2. [online] (2010), http://nlp.fi.muni.cz/projekty/eudml/pdfRecompression/

10. Hatlapatka, R.: Source codes of pdfJbIm. [online] (2010), http://code.google.com/p/pdfrecompressor/

11. Howard, P.: Text image compression using soft pattern matching. Computer Journal 40(2/3), 146–156 (1997)

12. ISO/IEC JTC1/SC29/WG1: JBIG Maui Meeting Press Release (December 1999), http://www.jpeg.org/public/mauijbig.pdf

13. Langley, A.: Homepage of jbig2enc encoder. [online], http://github.com/agl/jbig2enc

14. Sylwestrzak, W., Borbinha, J., Bouche, T., Nowinski, A., Sojka, P.: EuDML: Towards the European Digital Mathematics Library. In: Sojka, P. (ed.) Proceedings of DML 2010. Masaryk University Press, Paris, France (Jul 2010)

15. Adobe Systems Incorporated: PDF Reference, pp. 90–100. Adobe Systems Incorporated, sixth edn. (2006), http://www.adobe.com/devnet/acrobat/pdfs/pdf_reference_1-7.pdf

16. Szabó, P.: Optimizing PDF output size of TeX documents. TUGboat 30(3), 112–130 (2009), [cit. 2010-04-26], http://code.google.com/p/pdfsizeopt/

17. Union, I.T.: ITU-T Recommendation T.88. ITU-T Recommendation T.88 (2000), http://www.itu.int/rec/T-REC-T.88-200002-I/en


Metadata Editing and Validation for a Digital Mathematics Library

Miha Filej1, Michal Ružicka2, Martin Šárfy3, and Petr Sojka2

1 University of Ljubljana, Faculty of Computer and Information Science, Tržaška 25, 1000 Ljubljana, Slovenia
[email protected]

2 Masaryk University, Faculty of Informatics, Botanická 68a, 602 00 Brno, Czech Republic
[email protected], [email protected]

3 Masaryk University, Institute of Computer Science, Botanická 68a, 602 00 Brno, Czech Republic
[email protected]

Abstract. For preparing and validating metadata for the Digital Mathematics Library DML-CZ, a new tool, the Metadata Editor, has been developed. This paper outlines the procedures for linguistic and geographical localization of its components. Also mentioned are such aspects as dynamic generation of editing forms based on the XML Schema, the validation procedures, and support for semiautomatic quality assurance procedures.

Key words: DML-CZ, Metadata Editor, internationalization, translation, localization, validation, XML, forms, Ruby, Perl, JavaScript

1 Introduction

Since 2005, the Czech Digital Mathematics Library project (DML-CZ) [3] has been under development in the Czech Republic. An important part of the project has been the development of the Metadata Editor [4,11], a client–server web application designed to manage, edit, and validate each article's metadata and full texts prior to their integration into the digital library.

The Metadata Editor is open-source software (http://dme.sourceforge.net/) and, having proven its efficiency, is now in use in a variety of other environments. These include the Faculty of Arts of Masaryk University and the Kramerius project of the Moravian Library [13]; the Editor may also be used in the EuDML project [5]. In this article we present some recent developments of the Metadata Editor.

2 On-line Submissions and Validation

The viability of a digital library rests with new acquisitions emerging mainly in the form of born-digital publications. The born-digital inputs to the Metadata Editor come from different sources, primarily from editors of various journals.







To assure a smooth integration of a new publication into the Metadata Editor database, it has to satisfy a particular data format specification available to all the contributors. For this reason, it was necessary to set up a safe and comfortable interface between the contributors and the Metadata Editor. Because the Metadata Editor is a web application, it is easy to provide the users with direct on-line access based on a private user account in the Editor. After logging in, the user can upload a new delivery directly, and it is automatically assigned to the appropriate journal. The new entries are automatically validated so that the user gets warnings about inappropriate formats of data, while flawed submissions are completely rejected. This obviates later corrections and helps the users to prepare data in the required format.

3 Dynamic Generation of Editing Forms

One of the most important functions of the Metadata Editor is to facilitate interactive modification of metadata. The operators are allowed to browse the contents of the Metadata Editor database and make necessary adjustments through the web-based interface of the relevant forms.

Since the metadata language is formally defined by an XML Schema, it is possible to generate the forms dynamically based on the XML Schema definition. The mechanism consists of server-side and client-side scripting. The XML Schema is processed on the server by a Perl script that generates JavaScript code, which is included in the web page and subsequently sent to the client. This JavaScript code runs in the web browser of the end user and generates a form matching the language defined by the source XML Schema.

Not all features of the XML Schema are supported, but the mechanism is powerful enough to satisfy the requirements. In addition to being a part of the Metadata Editor, a generalized version of the forms generator is available as a standalone open-source project [9].
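The idea of deriving a form from a schema can be illustrated with a minimal sketch. It is not the Editor's Perl/JavaScript implementation: the schema fragment, the function name `form_fields`, and the mapping from schema types to HTML input types are all assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical schema fragment; the real DML-CZ schema is not reproduced here.
SCHEMA = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="article">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="title" type="xs:string"/>
        <xs:element name="year" type="xs:gYear"/>
        <xs:element name="pages" type="xs:positiveInteger" minOccurs="0"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""

# Assumed mapping from schema types to HTML input types.
WIDGETS = {"xs:string": "text", "xs:gYear": "number", "xs:positiveInteger": "number"}


def form_fields(schema_text):
    """Return one HTML <input> line per typed element declared in the schema."""
    root = ET.fromstring(schema_text)
    fields = []
    for element in root.iter(XS + "element"):
        xsd_type = element.get("type")
        if xsd_type is None:  # container element without a direct input
            continue
        required = element.get("minOccurs", "1") != "0"
        fields.append('<input name="{}" type="{}"{}>'.format(
            element.get("name"),
            WIDGETS.get(xsd_type, "text"),
            " required" if required else ""))
    return fields


for line in form_fields(SCHEMA):
    print(line)
```

Only a small subset of XML Schema is handled here (typed elements and `minOccurs`), which mirrors the remark that not all schema features need to be supported.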

4 Internationalizing the Metadata Editor

4.1 Internationalization, Translation, Localization

In a nutshell, adapting the user interface of an existing application to new languages involves changing the output in a way that will please the current user. While translation could easily be considered the most important part of this process, it is not enough by itself: both translation and localization are required.

When dealing with source and target regions that are not similar, a complete localization of an application is difficult to achieve. Common parts of an application that have to be localized are time and date formats. The way time or date is displayed (the number of digits used, the separators, the order of date components, whether the 24- or 12-hour format is used) can vary from region to region. In addition, time zones in which users reside may differ. The process of localization has to ensure that every date output of the application is displayed relative to the corresponding time zone.
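A small sketch shows how one stored instant can be rendered differently per locale. The locale table, the format patterns, and the fixed UTC offsets are illustrative assumptions; a real application would rely on a time zone database rather than fixed offsets.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical per-locale rules: a strftime pattern and a fixed UTC offset.
LOCALE_RULES = {
    "en-US": {"pattern": "%m/%d/%Y %I:%M %p", "utc_offset_hours": -5},
    "cs-CZ": {"pattern": "%d.%m.%Y %H:%M", "utc_offset_hours": 1},
}


def localize(moment_utc, locale):
    """Shift a UTC timestamp to the locale's zone and format it accordingly."""
    rules = LOCALE_RULES[locale]
    zone = timezone(timedelta(hours=rules["utc_offset_hours"]))
    return moment_utc.astimezone(zone).strftime(rules["pattern"])


stamp = datetime(2010, 7, 7, 14, 30, tzinfo=timezone.utc)
print(localize(stamp, "en-US"))  # month first, 12-hour clock
print(localize(stamp, "cs-CZ"))  # day first, 24-hour clock
```

Storing timestamps in UTC and converting only at output time keeps the conversion in one place.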

Depending on the degree of internationalization that needs to be performed and the locales that need to be supported, more specific issues may be encountered: pluralization, unit conversion (metric vs. imperial, currencies, etc.), and right-to-left text orientation. Particular attention has to be paid to words or phrases that have different meanings due to cultural differences and may even be offensive.

4.2 Implementation

The Metadata Editor is built using a variety of technologies and programming languages. The part that interacts with the user is mostly handled by Ruby [8], which requires support from end libraries. In the past, there were various (incompatible) internationalization solutions in the Ruby ecosystem, each solving its own set of problems. In 2007 an effort to provide a generalized library emerged, resulting in I18n [10], the library that is now the de facto standard for the internationalization of Ruby applications. Being a general solution, it does not provide complex internationalization facilities; instead it defines an interface for other libraries to extend its functionality and remain compatible with each other at the same time.

I18n provides two basic methods, I18n.translate and I18n.localize (due to frequent use abbreviated to I18n.t and I18n.l, respectively). I18n.t handles translation by mapping an explicitly defined namespaced key to a string in a natural language. The approach differs from the popular GNU gettext [6], which maps a string in a natural language to a string in another natural language (although gettext's .mo and .po files can still be used with I18n to store the translations). Having explicit programmer-defined keys should result in greater maintainability by simplifying the way translations are reused throughout the application, and it avoids the issue where two sentences in different contexts in one language translate to the same sentence in another language, and vice versa. I18n.l takes various objects, such as times and dates, and localizes them according to the defined localization rules.
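The key-based approach can be sketched in a few lines. This is a language-neutral illustration written in Python, not the I18n library's code: the translation tables, the key names, and the function `translate` are hypothetical.

```python
# Hypothetical translation tables keyed by locale, then by nested namespaces.
TRANSLATIONS = {
    "en": {"article": {"save": "Save", "title": "Title"}},
    "cs": {"article": {"save": "Uložit", "title": "Název"}},
}


def translate(key, locale, default_locale="en"):
    """Resolve a dotted key such as 'article.save'; fall back to the default locale."""
    for candidate in (locale, default_locale):
        node = TRANSLATIONS.get(candidate, {})
        for part in key.split("."):
            node = node.get(part) if isinstance(node, dict) else None
            if node is None:
                break
        if isinstance(node, str):
            return node
    return "translation missing: " + key


print(translate("article.save", "cs"))
print(translate("article.title", "sl"))  # no Slovenian table, so English is used
```

Because the key, not the source sentence, identifies a message, two identical English strings used in different contexts can carry different translations.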

I18n's pluggable back-ends allow internationalized data to be stored in different ways. In addition to the gettext format mentioned above, YAML files, various relational databases and key-value stores are available as storage options. By defining the interface for implementing a back-end, the I18n library enables programmers to build a custom storage solution that suits their needs.

4.3 Choosing a Locale

Apart from altering the code to replace hardcoded strings with calls to methods that translate and localize them, logic that handles switching between the locales needs to be introduced to the application. Since the Metadata Editor is a web application, the locale has to be set per request. With the help of sessions and cookies it is possible to persist a given locale between requests of the same user, so the question remains: which locale is to be used the first time (for a new user with no cookies)? There is no reliable way to determine a locale for a user's first request, but a web application is able to take a guess based on a few hints.

The HTTP/1.1 protocol defines the Accept-Language header [7], and at first it may be tempting to use the information provided by the user agent to set the default locale for the user, but there are several things to take into account [12]. Many users are unaware of the setting, which was probably set when the user agent was installed and might not conform to their preferences. The user agent may send a request that only defines the language without specifying the region (e.g. instead of de-DE, de-CH or de-AT indicating German as spoken in Germany, Switzerland or Austria, respectively, only de may be requested). If the user does not access the application from his own machine, the inferred locale may be inappropriate, especially when one is in a foreign country. Last but not least, the header may not be set at all.
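The header-based guess, including the cases of a bare language tag and a missing header, might be handled as in the following sketch. The function names are hypothetical and the parsing is deliberately simplified (for instance, the `*` wildcard is not treated specially).

```python
def parse_accept_language(header):
    """Return language tags from an Accept-Language value, best first."""
    weighted = []
    for position, item in enumerate(header.split(",")):
        parts = item.strip().split(";")
        tag = parts[0].strip()
        if not tag:
            continue
        quality = 1.0  # default weight when no q parameter is given
        for parameter in parts[1:]:
            name, _, value = parameter.strip().partition("=")
            if name == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        if quality > 0:
            weighted.append((-quality, position, tag))
    return [tag for _, _, tag in sorted(weighted)]


def match_locale(header, supported):
    """Pick the first supported locale; a bare language such as 'de' matches 'de-CH'."""
    for tag in parse_accept_language(header or ""):
        for locale in supported:
            if locale.lower() == tag.lower():
                return locale
        for locale in supported:
            if locale.lower().split("-")[0] == tag.lower().split("-")[0]:
                return locale
    return None  # header missing or nothing usable: the caller falls back
```

When only `de` is requested, the sketch returns whichever German locale is listed first among the supported ones, which is exactly the kind of arbitrary guess the text warns about.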

Another clue from which the locale can be inferred is the user's IP address. With the help of a database or an external geolocation service it is possible to determine the user's geographical origin, but the approach shares a lot of the shortcomings described above. It is important that the application is not bound to its guess but allows the user to set his own preference at any point of the interaction. Whenever a locale is explicitly chosen, it is safe to assume it as a default for future requests from the same user.

To sum up, the logic for setting a locale has to consider (from highest to lowest priority): the previously set preference, the locale guessed from the HTTP headers, the locale guessed from the IP of the source, and the default locale. Ideally the logic would set the locale as soon as the request is received, that is, at the beginning of the interaction with the user (the programme input). Then, when computing the output, dedicated functions would perform translation and localization depending on the set locale.
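The priority order above can be written as a short fallback chain. The function name and its parameters are hypothetical; each guess is assumed to have been computed elsewhere and to be `None` when unavailable.

```python
def choose_locale(stored_preference, header_guess, ip_guess, supported, default="en"):
    """Return the first usable candidate, from highest to lowest priority."""
    for candidate in (stored_preference, header_guess, ip_guess):
        if candidate in supported:
            return candidate
    return default


# A new user without a cookie: the header guess wins over the IP guess.
print(choose_locale(None, "de-DE", "cs-CZ", ["cs-CZ", "de-DE"]))
```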

4.4 Refactoring, Dangers, Precautions

The effort of adapting an application to another language depends on the difference between the source and the target language. Given that adaptation to a broader set of languages is preferable, the codebase requires major alteration, a process that is prone to mistakes. Since the Metadata Editor is a relatively complex codebase, taking precautions against introducing bugs is all the more important. The desired result of internationalization is that, at least when rendering in the original locale, the output matches the output of the programme before it was adapted. To assure this, a set of specifications is needed.
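The requirement that output in the original locale stays unchanged can be expressed as a simple regression check. The sketch below is an assumption-laden toy: `render_before` stands for a code path with a hardcoded string, `render_after` for its internationalized replacement, and the message table is invented for the example.

```python
def render_before(article_count):
    # Original code path with a hardcoded English string.
    return "Articles: %d" % article_count


# Hypothetical message table introduced during internationalization.
MESSAGES = {
    "en": {"articles.count": "Articles: %d"},
    "cs": {"articles.count": "Články: %d"},
}


def render_after(article_count, locale="en"):
    # Internationalized code path: the string is looked up by key.
    return MESSAGES[locale]["articles.count"] % article_count


def outputs_match(samples):
    """True when the adapted code reproduces the original output in the original locale."""
    return all(render_before(n) == render_after(n, "en") for n in samples)


print(outputs_match(range(5)))
```

Such checks capture the behaviour before refactoring starts, so that any divergence in the original locale is caught immediately.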

Automated software testing is strongly encouraged in the Ruby community, and the past few years have seen an evolution of tools and practices for unit, functional and integration testing. In 2008 a tool called Cucumber [2] was introduced. It differs from other solutions in that specifications (called features) are not written in Ruby but in a language called Gherkin [1]. This domain-specific language serves two purposes: documentation and automated tests. It allows describing software behaviour irrespective of how that behaviour is implemented. Gherkin's grammar has only a few simple rules and reads like spoken language. This allows feature specifications to be written and understood not only by programmers but by domain experts as well, thus increasing the value of the specifications. While Cucumber itself is written in Ruby, it can be used to test code written in other languages, which makes it suitable to cover the non-Ruby parts of the Metadata Editor.

In Figure 1 one can see that Cucumber communicates with the application at the framework level, offering better control over the request parameters than direct communication at the application server or web server level would provide.

[Diagram labels: client request/response, web server, application server, framework, application code; cucumber, test runner, feature specifications, interface]

Fig. 1. Cucumber integration diagram

5 Conclusions

The Metadata Editor is a live, continuously developing project. New features are added as needed. The on-line input and validation service was added to provide users with a comfortable and safe interface for data inclusion; the user interface is dynamically generated based on the formal definition of the metadata; and the localization of the Metadata Editor is in progress.

The Metadata Editor is used in several projects and will possibly be used in the EuDML project as well.

References

1. aslakhellesoy / cucumber. [online], http://wiki.github.com/aslakhellesoy/cucumber/gherkin, Last edited by zwyan2009, 2 days ago [cit. 2010-04-28].

2. Cucumber : Behaviour driven development with elegance and joy. [online], http://cukes.info/, [cit. 2010-04-28].

3. Czech Digital Mathematics Library. [online], http://dml.cz/, [cit. 2010-04-24].

4. Digitization Metadata Editor. [online], http://dme.sourceforge.net/, [cit. 2010-04-28].

5. EuDML: The European Digital Mathematics Library. [online], http://www.eudml.eu/, This page was last modified on 20 January 2010, at 08:09. [cit. 2010-04-25].

6. gettext. [online], http://www.gnu.org/software/gettext/, Updated: $Date: 2010/01/31 14:51:43 $ [cit. 2010-04-28].

7. HTTP/1.1: Header Field Definitions : Accept-Language. [online], http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.4, [cit. 2010-04-28].

8. Ruby Programming Language. [online], http://www.ruby-lang.org/en/, [cit. 2010-04-28].

9. SchemaForms. [online], http://sforms.sourceforge.net/, [cit. 2010-05-30].

10. svenfuchs / i18n. [online], http://github.com/svenfuchs/i18n, [cit. 2010-04-28].

11. Bartošek, M., Kovár, P., Šárfy, M.: DML-CZ Metadata Editor : Content Creation System for Digital Libraries. In: Sojka, P. (ed.) DML 2008 – Towards Digital Mathematics Library. pp. 139–151 (2008), Birmingham, UK, July 27th, 2008.

12. Honomichl, L.: Accept-Language used for locale setting. [online], http://www.w3.org/International/questions/qa-accept-lang-locales, Last substantive update 2003-09-17 12:15 GMT. This version 2006-11-25 16:35 GMT [cit. 2010-04-28].

13. Šárfy, M.: Metadatový editor pro digitální knihovny. In: Knihovny soucasnosti 2009. pp. 140–154. Brno (2009), http://www.sdruk.cz/sec/2009/sbornik/2009-6-140.pdf, Sec u Chrudimi, CZ, June 23rd, 2009. ISBN 978-80-86249-54-4


Implementing Dynamic Visualization as an Alternative Interface to a Digital Mathematics Library

Zuzana Neverilová

Masaryk University, Faculty of Informatics, Botanická 68a, 602 00 Brno, Czech Republic

[email protected]

Abstract. This paper presents an alternative interface for browsing in the Czech Digital Mathematics Library (DML-CZ) using our Visual Browser web browsing tool. Using dynamic visualization, we have created a tool for browsing the library graphically. Visualization can help users orient themselves in complex data and at the same time reveal sometimes unexpected relationships among units; it at least speeds up browsing. This work follows the metadata processing undertaken on DML-CZ and visualizes all reasonable and useful relationships among journals, issues, articles, authors, classification, keywords, references and similar articles. We converted metadata to RDF and use a Visual Browser Java Applet that runs in a web browser. We briefly describe the nature of the metadata, then the server and client sides of the visualization, including data formats and conversions. There follows a description of the interaction between visual and textual interfaces.

Key words: visualization, RDF, visual interface, Visual Browser, DML-CZ, EuDML

1 Introduction

This paper presents a dynamic visual interface for browsing the Czech Digital Mathematics Library (DML-CZ) as an alternative to a textual listing. We are offering the interface to the ongoing EuDML project¹ [13]. The DML-CZ [1] currently contains more than 28,000 articles in 11 journals, 5 proceedings series and 28 monographs [6]. Users usually do not browse within such a vast amount of data; rather, they search for titles or authors.

On the search results page users can see the number of search results and the list of articles. When clicking on an article, the information listed below is shown:

– bibliographic information about the article (author, title, serial, year, Mathematics Subject Classification (MSC) [7], . . . );
– preview of the article and link to the PDF;
– link to similar articles;
– references with links to articles where possible.

1 The European Digital Mathematics Library – http://www.eudml.eu/




The particular advantage of the DML-CZ interface is that it finds similar articles in search results. Three methods for calculating similarities are used [10] and the percentages are expressed graphically. This is so far the only information that is visualized. Nevertheless, according to [14], a good visualization helps accelerate the cognitive process, since the eyes can pick up details of the visualization and keep a holistic overview at the same time. Visualization is most suitable for complex and relatively sparse data, and this is precisely the case of library data.

Google has started to offer a graphical interface for search results in addition to the standard view: their so-called Google Wonder Wheel² has both plain text and timelines. Information seekers who tend to use it are likely to appreciate such an interface not only for Google searches.

The structure of the paper is as follows. In Section 2 we describe the server side, including data formats provided by the server. Section 3 briefly describes the Visual Browser and shows the interaction between the Visual Browser and the textual listing on the web page. Section 4 contains both the conclusion and the future development that the dynamic visual interface may undergo.

2 Server Side

Since the amount of data in DML-CZ is very large, a client-server architecture is the most suitable. The server has to store the data, provide a method for its retrieval and quickly return a small amount of the data requested.

2.1 Data Formats

Because the client side uses RDF [2], the server also has to provide this format. We had to convert the existing XML format of metadata to RDF. This conversion required the following steps:

– selecting only the appropriate data for visualization (some information is omitted);
– assigning IDs for articles, issues, journals and authors;
– adding short titles for the visualization;
– conversion of the lang attribute according to RFC 3066 sec. 2.3 [12];
– adding information about similar articles;
– adding MSC labels.
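A few of the steps above (selecting fields, assigning an ID, converting the language attribute) can be sketched as follows. The input record, the element names, the `example.org` base URI, the language mapping table, and the choice of Dublin Core properties are all assumptions for illustration; the actual DML-CZ vocabulary is not shown in this paper.

```python
import xml.etree.ElementTree as ET

# Hypothetical input record; element names are illustrative only.
RECORD = """
<article lang="cze">
  <title>On a class of operators</title>
  <author>Novák, J.</author>
  <msc>47A10</msc>
</article>
"""

BASE = "http://example.org/dml/"          # placeholder namespace
DC = "http://purl.org/dc/elements/1.1/"   # Dublin Core element set
LANG_TAGS = {"cze": "cs", "eng": "en"}    # assumed three-letter to two-letter mapping


def literal(text, lang=None):
    """Format an RDF literal, escaping backslashes and double quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return '"{}"'.format(escaped) + ("@" + lang if lang else "")


def to_ntriples(xml_text, article_id):
    """Convert one metadata record into N-Triples lines."""
    root = ET.fromstring(xml_text)
    lang = LANG_TAGS.get(root.get("lang"))
    subject = "<{}article/{}>".format(BASE, article_id)
    triples = [(subject, "<{}title>".format(DC), literal(root.findtext("title"), lang))]
    for author in root.findall("author"):
        triples.append((subject, "<{}creator>".format(DC), literal(author.text)))
    for code in root.findall("msc"):
        triples.append((subject, "<{}subject>".format(DC), literal(code.text)))
    return ["{} {} {} .".format(*triple) for triple in triples]


for line in to_ntriples(RECORD, 42):
    print(line)
```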

2.2 RDF Server

The Joseki RDF Server³ was used. It offers SPARQL [9] as a query language. Joseki was selected because of the Jena Framework⁴ used in the client. Nevertheless, the server side can be replaced by any other RDF server if needed. The data is stored in a relational database.

2 http://www.googlewonderwheel.com/
3 http://joseki.sourceforge.net/
4 http://jena.sourceforge.net/


3 Client Side

On the client side two interfaces are used: a traditional textual interface (a list of authors and articles) and the Visual Browser [8]. The latter is a tool for the dynamic (animated) visualization of RDF graphs. It provides flexible visualization thanks to its two-layer architecture:

– first layer: the data stored in RDF (whether in RDF/XML, N3 [3] or Turtle [5]);
– second layer: the perspective of view, an XML description of the graphic representation of nodes and edges of the graph.

The visualization of different types of data is described below.

The Visual Browser exists either as a standalone Java application or as a Java applet. The applet can communicate with textual parts of the search results page. The interaction between the Java applet and the web page is implemented with AJAX⁵ plus JavaScript to communicate with the applet.

Submitting the search field or browsing data in one of the interfaces results in a SPARQL query. The server evaluates the query and returns an RDF graph. An XSLT [4] conversion is then applied and the result is returned as a list of authors and titles. The communication between the applet and the web page is bidirectional: clicking on a name or title in the list renders a set of nodes and edges in the visual interface, and a set of nodes and edges can be displayed as a list of authors and articles.
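How a search term might be turned into a SPARQL query that returns an RDF graph is sketched below. The query shape, the use of the Dublin Core `dc:title` property, and the function names are assumptions; the queries actually issued by the DML-CZ prototype are not given in this paper.

```python
def escape_literal(text):
    """Escape backslashes, double quotes and newlines for a SPARQL string literal."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n"))


def title_search_query(term, limit=20):
    """Build a CONSTRUCT query returning title triples of matching articles."""
    return (
        "PREFIX dc: <http://purl.org/dc/elements/1.1/>\n"
        "CONSTRUCT { ?article dc:title ?title }\n"
        "WHERE {\n"
        "  ?article dc:title ?title .\n"
        '  FILTER regex(str(?title), "' + escape_literal(term) + '", "i")\n'
        "}\n"
        "LIMIT " + str(int(limit))
    )


print(title_search_query("operator"))
```

A CONSTRUCT query is used because its result is itself an RDF graph, matching the flow described above. Note that the term is interpreted as a regular expression by the FILTER, so a production version would also need to escape regex metacharacters.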

We expect users to type (part of) a name or title in the search box. Then users can browse either the more familiar textual interface or the visual one. Conversely, when viewing a particular subgraph in the Visual Browser, users can click to have it appear in the textual interface, as seen in Figure 1.

Fig. 1. Visual and textual interfaces to search results within DML-CZ. It allows users to choose how they browse the results. For this purpose, it is necessary that the interfaces are able to communicate.

5 Asynchronous JavaScript And XML


3.1 Visualizing Metadata

In this visualization, nodes represent units such as authors, articles, issues, journals, keywords and MSC. Different classes of units are represented by different colours and shapes. Mapping from logical entities to their visual attributes is fully configurable in Visual Browser.

Edges represent authors and their articles, articles in issues, issues in journals, as well as links between similar articles. Some of these relations are structural (e.g. articles in issues), some are semantic (e.g. classification of articles), and some have both aspects (authors of articles). We have to evaluate users' behaviour to decide what types of relations are useful for browsing. Even though we expect semantic relations to be more important than the structural ones, we nevertheless display both types of relations. As with the visualization of nodes, the appearance of an edge (its colour, shape and length) distinguishes different classes of edges.

Pointing the cursor at a node opens a small pop-up window with a short text. This can be helpful when displaying titles or even abstracts, as seen in Figure 2.

Fig. 2. Visualization of texts: when the cursor points at a node, more information pops up in a small window

3.2 Visualizing Similarities

The current interface for DML-CZ provides information about semantically similar articles. Similarities have been pre-calculated [11] using three different methods [10]. Similar articles are connected with edges of different lengths; the shorter the edge, the more similar two articles are (see Figure 3).
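One simple way to make edge length carry similarity is a linear mapping, sketched below. The function name, the length bounds, and the assumption that similarities are normalized to the interval [0, 1] are illustrative; the mapping used by the Visual Browser is not specified here.

```python
def edge_length(similarity, shortest=40.0, longest=200.0):
    """Map a similarity in [0, 1] to a length: 1.0 gives the shortest edge."""
    clamped = max(0.0, min(1.0, similarity))
    return longest - clamped * (longest - shortest)


for value in (0.3, 0.9):
    print(value, edge_length(value))
```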


Fig. 3. Visualization of similarities: the length of the edge is also a bearer of meaning; with edge labels displayed, one can also see similarities expressed by numbers

3.3 Visualizing References

Scientific articles usually cite other sources. These citations (references) are related to a topic mentioned in that article and can therefore help users who have already read the article and are looking for further reading. The desired state is that users can browse references to articles regardless of the repository these articles come from. Achieving a high coverage of at least articles' metadata is one of the major goals of the EuDML project.

4 Conclusion and Future Work

We have presented an alternative to the current DML-CZ interface. Visual interfaces are more attractive and can help orientation in complex data such as library records. So far we are experimenting with it, aiming at the possibility of including it in the official DML-CZ site and offering it to the EuDML project.

Future work comprises monitoring users' preferences on interfaces and their possible feedback. It will probably take some time before users get accustomed to using the visual interface, since it is far from the traditional way of browsing. But we hope that users will appreciate the holistic overview of complex information.

Our immediate plans include working on the design of the search result interfaces. For this, users' feedback will be necessary. We also have to test the RDF Server on the significant loads that are expected within DML-CZ and EuDML. These conditions seem necessary for usability within any real-world DML. A working prototype can be seen at http://dmlsearch.dml.cz/.




Acknowledgments

This research has been partially supported by the grant reg. no. 1ET200190513 of the Academy of Sciences of the Czech Republic (DML-CZ), and by EU project # 250,503 in CIP-ICT-PSP.2009.2.4 (EuDML).

References

1. Bartošek, M., Lhoták, M., Rákosník, J., Sojka, P., Šárfy, M.: DML-CZ: The Objectives and the First Steps. In: Borwein, J., Rocha, E.M., Rodrigues, J.F. (eds.) CMDE 2006: Communicating Mathematics in the Digital Era, pp. 69–79. A. K. Peters, MA, USA (2008)

2. Beckett, D., McBride, B.: RDF/XML syntax specification (February 2004), http://www.w3.org/TR/rdf-syntax-grammar/

3. Berners-Lee, T.: Notation 3 (2008), http://www.w3.org/DesignIssues/Notation3

4. Clark, J.: XSL transformations (XSLT) (1999), http://www.w3.org/TR/xslt

5. David Beckett, T.B.L.: Turtle – terse RDF triple language (2008), http://www.w3.org/TeamSubmission/turtle/

6. DML: The Czech Digital Mathematics Library – news (2010), http://dml.cz/news, retrieved April 20, 2010

7. Ion, P., Eilbeck, C.: Mathematics Subject Classification 2010 (2010), http://msc2010.org/

8. Neverilová, Z.: Visual Browser: A tool for visualising ontologies. In: Proceedings of I-KNOW'05. pp. 453–461. Know-Center in coop. with Graz Uni, Joanneum Research and Springer Pub. Co., Graz, Austria (2005)

9. Prud'hommeaux, E., Seaborne, A.: SPARQL query language for RDF (2008), http://www.w3.org/TR/rdf-sparql-query/

10. Rehurek, R., Sojka, P.: Automated Classification and Categorization of Mathematical Knowledge. In: Autexier, S., Campbell, J., Rubio, J., Sorge, V., Suzuki, M., Wiedijk, F. (eds.) Intelligent Computer Mathematics – Proceedings of 7th International Conference on Mathematical Knowledge Management MKM 2008. Lecture Notes in Computer Science LNCS/LNAI, vol. 5144, pp. 543–557. Springer-Verlag, Berlin, Heidelberg (Jul 2008)

11. Rehurek, R., Sojka, P.: Software Framework for Topic Modelling with Large Corpora. In: Proceedings of LREC 2010 workshop New Challenges for NLP Frameworks. pp. 45–50. Valletta, Malta (2010), software available at http://nlp.fi.muni.cz/projekty/gensim

12. RFC3066: Tags for the identification of languages (January 2001), http://potaroo.net/ietf/idref/rfc3066/

13. Sylwestrzak, W., Borbinha, J., Bouche, T., Nowinski, A., Sojka, P.: EuDML – Towards the European Digital Mathematics Library. In: Sojka, P. (ed.) Proceedings of DML 2010, pp. 11–26. Masaryk University Press, Paris, France (Jul 2010)

14. Tufte, E.: Envisioning Information. Graphics Press (1990)


Data Enhancements in a Digital Mathematical Library

Michal Ružicka and Petr Sojka

Masaryk University, Faculty of Informatics, Botanická 68a, 602 00 Brno, Czech Republic, [email protected], [email protected]

Abstract. The quality of a digital mathematical library depends on the formats and quality of the data it offers. We show several enhancements of (meta)data of the Czech Digital Mathematics Library DML-CZ. We discuss a possible minimalist modification of regular LaTeX documents that would simplify generating basic metadata that describes the article in an XML/MathML format. We also show a proof of concept of a method that enables us to include LaTeX source code of mathematical expressions into pdfTeX-generated PDFs in such a way that the reader can Copy & Paste the code from his PDF viewer. This code, hidden in the PDF file, can also be used for LaTeX math indexing.

Key words: metadata generation, XML, MathML, PDF, copy-math

1 Introduction

Since 2005, a digital mathematics library has been under development in the Czech Republic. The goal of the Czech Digital Mathematics Library project (DML-CZ) [3] is the preservation in digital form of the contents of the major part of mathematical literature ever published in the Czech lands, and to provide free and public access to the digital content and bibliographical data.

The DML-CZ development was officially completed at the end of 2009. The aim of this article is to give a short summary of some of the techniques that facilitated the success of this project.

A LaTeX document workflow consists of several steps, some of which can be reworked to enhance the final versions of documents that are stored in a digital repository. Besides postprocessing final PDF files [5], we can modify the processing of the document that a journal editor typically does (see Section 2) and enrich the document source code itself (Section 3).

In this article we intend to show how a slight modification to regular LaTeX documents and classes enabled us to prepare DML-CZ metadata with only a slight modification to the current workflow of the editors of the mathematical journals involved.

The EuDML project [4] has already been launched and it is hoped thatthe DML-CZ results can be applied to it. Despite being officially finished,the result of—the Czech Digital Mathematics Library—project is here and







70 Michal Ružicka, Petr Sojka

we intend to continue developing it further. One possible contribution could be our method of including the LaTeX source code of mathematical expressions in pdfTeX-generated PDFs in such a way that readers can Copy & Paste it directly from their PDF viewer. A PDF file of this kind could also be used for LaTeX math indexing.

A proof of concept of this technique is shown in Section 3.

2 Minimalist XML Metadata Extraction

Although the greater part of the DML-CZ project was retro-digitization (scanning, OCR and finally processing the paper-only documents into digital form), future development of the library depends on how new issues of the mathematical journals are processed. With this in mind, it has been necessary to prepare appropriate software support for the mathematical journals involved that enables editors to prepare DML-CZ data easily.

The first approach was a complex system inspired by the French CEDRAM project [8,2]. It automates many of the standard procedures of journal issue preparation [10]. Although this system is used by the editors of Archivum Mathematicum [1], not everyone was willing to adopt such a complex system, which seriously disrupted their current workflow.

We therefore prepared a minimalist set of LaTeX macros in the form of a LaTeX macro package. This package can easily be customized to meet the needs of a particular journal document class or style file. The package itself does not transform the LaTeX source code to XML. Rather, it exports selected parts of the LaTeX document verbatim to an external file in such a way that the file forms a simple LaTeX document. This occurs without any expansion of the LaTeX code: TeX token registers (toks) hold the material, and it is written out with the standard output commands (\newwrite, \openout, \write, \closeout). This file is subsequently processed by a journal-independent Tralics-based procedure, which is described below.

The Tralics program [9,7], a LaTeX to XML translator, has proved itself an adequate translator of LaTeX code to XML.¹ Tralics is the indispensable part of the system. Its engine is able to process regular LaTeX code, so we neither have to convert the LaTeX code to plain text directly nor have to deal with LaTeX macro expansion or the complexity of LaTeX syntax. Tralics outputs a UTF-8 encoded XML file.

This output is finally processed by an XSLT processor, which furnishes the DML-CZ metadata in its final form. A schema of the process can be seen in Figure 1.
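The final transformation step can be illustrated with a small sketch. It uses Python's xml.etree.ElementTree in place of an XSLT stylesheet, and both the input layout and the output element names are hypothetical: they stand in for the real Tralics output and the real DML-CZ metadata schema, neither of which is reproduced here.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the XML that Tralics produces from the exported
# article.dml.tex; the real element names may differ. The data is invented.
SAMPLE = """<article>
  <title>An Example Title</title>
  <author>A. Author</author>
  <author>B. Author</author>
  <abstract>We study an example.</abstract>
  <keyword>graph</keyword>
  <keyword>tree</keyword>
</article>"""


def to_metadata(intermediate_xml: str) -> ET.Element:
    """Map the (assumed) intermediate XML onto a flat metadata record."""
    src = ET.fromstring(intermediate_xml)
    meta = ET.Element("meta")
    ET.SubElement(meta, "title").text = src.findtext("title", default="").strip()
    authors = ET.SubElement(meta, "authors")
    for author in src.findall("author"):
        ET.SubElement(authors, "author").text = (author.text or "").strip()
    ET.SubElement(meta, "abstract").text = src.findtext("abstract", default="").strip()
    # Keywords are joined into one field; this is an arbitrary choice here.
    ET.SubElement(meta, "keywords").text = "; ".join(
        (k.text or "").strip() for k in src.findall("keyword")
    )
    return meta


if __name__ == "__main__":
    print(ET.tostring(to_metadata(SAMPLE), encoding="unicode"))
```

Because the record is derived mechanically from the same exported source, rerunning the step after every edit keeps the metadata in line with the article.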

The metadata is generated automatically at the same time as the final PDF document is created, and from the same source code. Thus, we can be sure the metadata is correct and up to date, unlike the situation in which the editors prepare metadata ‘by hand’ or generate it asynchronously.

Even if the editor used another incarnation of TeX instead of LaTeX, it should still be possible to export the necessary data in such a way that the result

1 Tralics is also used in the complex system of the Archivum Mathematicum journal.


Data Enhancements in a Digital Mathematical Library 71

[Figure 1 (diagram). Node labels: article.tex, pdflatex, article.pdf, article.dml.tex, tralics, article.dml.xml, XSLT, meta.xml, references.xml. The two stages are labelled "Article processing" and "Metadata extraction".]

Fig. 1. Schema of work of the minimalist metadata extraction system

would be acceptable as Tralics input. (This data is a small subset of the whole document and includes just the title, author names, abstract, keywords etc.)

This solution is also platform independent: both TeX and Tralics are multi-platform programs.² XSLT processors are also available for all standard operating systems.

The minimalist extension of the current editorial workflow does not use BibTeX for processing references, since not all the editors are willing to use it, even though BibTeX is supported directly by Tralics. Instead, a special set of macros is used to mark up the structure of each bibliography record, which gives the editors more flexibility in bibliography formatting.

Since Tralics supports MathML, we are able to translate mathematical expressions from the input LaTeX notation to this XML language. Because we decided to support just a controlled subset of the ‘well known’ LaTeX macros in the DML-CZ metadata, it is easier to achieve a correct MathML translation.

2 The Tralics program uses dynamic libraries of the Cygwin project (http://www.cygwin.com/) to run on the MS Windows operating system.




3 Copy Math—a Proof of Concept

The DML-CZ project stores the full texts of articles as PDF files, as do many other digital libraries. PDF is widely adopted and very often used for electronic publications. Thanks to pdfTeX, PDF is also the de facto standard output format of modern TeX distributions.

Being capable of high-quality mathematical typesetting, TeX is widely used. LaTeX mathematical notation is well known, effective, and used not only in LaTeX documents but also in a variety of other projects, such as Wikipedia.

Thus, LaTeX source code is usually a good choice for a plain text representation of mathematical expressions. Users and maintainers of repositories of digital documents themselves demand plain text for the content of PDF documents; in Japan, for example, regular PDF documents are processed using OCR (optical character recognition) techniques to obtain a plain text representation of math from PDFs [6,11].

Unfortunately, pdfTeX-produced PDF documents do not provide their readers with this kind of output when they use the Copy & Paste functions of their preferred PDF reader.

For a LaTeX document with the following body, pdfTeX generates the PDF shown in Figure 2.

\begin{document}
Text
$\Pi(x) = \pi(x) + \frac{1}{2}\pi(x^{1/2}) +
\frac{1}{3}\pi(x^{1/3}) + \cdots$
text.
\end{document}

The content of the document is selected properly, but the result of the Copy operation is a malformed mixture of Unicode characters. To address this

Fig. 2. CopyMath disabled PDF document



Fig. 3. CopyMath enabled PDF document

inconvenience, we decided to use the ActualText entry of the PDF language to mark the region of the mathematical expression inside the PDF document and allow PDF readers to provide their users with the LaTeX source code of the expression. Figure 3 shows the PDF file that resulted from the same document with our experimental CopyMath macro package switched on.

Mathematical expressions are not selected visually; the result of the Copyoperation is the original LATEX source code itself:

Text $\Pi (x) = \pi (x) + \frac {1}{2}\pi (x^{1/2}) +\frac {1}{3}\pi (x^{1/3}) + \cdots $ text.
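To make the marking concrete, the following sketch builds the pair of PDF content-stream operators that bracket a formula, mirroring the \pdfliteral and \pdfescapehex calls in the macro listing further on. The function name is hypothetical, and the sketch assumes ASCII-only LaTeX source; non-ASCII text would need a different string encoding inside the PDF.

```python
def actualtext_span(latex_source: str) -> tuple[str, str]:
    """Return the (opening, closing) operators that wrap one formula."""
    # Hex-encode the source bytes, which is the job \pdfescapehex does in
    # pdfTeX. Assumption: the source is plain ASCII.
    hex_string = latex_source.encode("ascii").hex().upper()
    opening = f"/Span << /ActualText <{hex_string}> >> BDC"
    closing = "EMC"
    return opening, closing


if __name__ == "__main__":
    start, end = actualtext_span(r"$\Pi(x) = \pi(x)$")
    print(start)
    print(end)
```

Everything drawn between the two operators is the typeset formula; a viewer that honours ActualText returns the hex-decoded source instead of the glyphs when the region is copied.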

The implementation is not easy because we want the package to be as user friendly as possible: users should not be forced to modify their mathematical expressions in any way, and \usepackage{copymath} should cater for all their needs. However, this requires nonstandard modifications of the LaTeX mathematical environments.

To implement CopyMath we need to add \pdfliteral at the beginning and end of every mathematical environment. The dollar sign ($) is made active (\catcode`$=13) and redefined. It is necessary to keep track of nested mathematical environments (e.g. $a\mbox{$b$}c$), and the double-dollar display-math syntax ($$a + b$$) adds another layer of complication.

To redefine LaTeX mathematical environments (\begin{math}...\end{math}, \begin{eqnarray}...\end{eqnarray} etc.) we keep the original definitions of their opening (\let\normalequation\equation) and closing commands. The environment is then redefined using our auxiliary macros. The opening command is replaced by a macro that scans tokens until the closing command of the mathematical environment is reached, always taking nested environments into account. The scanned content of the mathematical environment is used to prepare the \pdfliteral code. The \pdfliteral code and the original content of the mathematical environment are then used by another auxiliary macro that takes the place of the closing command of the original mathematical environment.
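The scanning step can be sketched outside TeX. The function below is an illustration in Python, not the authors' macro: it collects the body of an environment up to its matching \end, counting nested environments of the same name, which is the bookkeeping the text asks for. The function name and the depth-counting approach are assumptions made for this sketch.

```python
import re

# Matches \begin{name} and \end{name} and captures the kind and the name.
_TOKEN = re.compile(r"\\(begin|end)\{([^}]*)\}")


def environment_body(source: str, name: str, start: int = 0) -> str:
    """Return the text between \\begin{name} and its matching \\end{name}."""
    opening = "\\begin{" + name + "}"
    begin = source.index(opening, start)
    body_start = begin + len(opening)
    depth = 1
    for match in _TOKEN.finditer(source, body_start):
        kind, env = match.groups()
        if env != name:
            continue  # other environments are copied through unchanged
        depth += 1 if kind == "begin" else -1
        if depth == 0:
            return source[body_start:match.start()]
    raise ValueError("unbalanced environment: " + name)


if __name__ == "__main__":
    text = r"\begin{equation} a \begin{array}{c} b \end{array} \end{equation}"
    print(environment_body(text, "equation"))
```

The returned body is what would be hex-encoded into the ActualText entry and then typeset inside the original environment.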

Here is an example of CopyMath macro definitions:

%% Auxiliary macros.
\newcounter{nestedmath} \setcounter{nestedmath}{0}%
\newtoks\copymath@envgetbuffera
\newtoks\copymath@envgetbufferb%
\long\def\copymath@envget#1#2\end #3{%
  \copymath@envgetbuffera=\expandafter{\copymathenvput}%
  \def\copymath@envtempa{#3}\def\copymath@envtempb{#1}%
  \ifx\copymath@envtempa\copymath@envtempb%
    \copymath@envgetbufferb={#2}%
    \def\copymath@envgetnext{\end{#1}}%
  \else%
    \copymath@envgetbufferb={#2\end{#3}}%
    \def\copymath@envgetnext{\copymath@envget{#1}}%
  \fi%
  \global\edef\copymathenvput{%
    \the\copymath@envgetbuffera \the\copymath@envgetbufferb}%
  \copymath@envgetnext}
%
\long\def\copymathenvget#1{%
  \gdef\copymathenvput{}\copymath@envget{#1}}%

%% $
\let\@origensuredmath=\@ensuredmath%
\def\normalinlinemath#1{%
  \ifnum\value{nestedmath}>0 \@origensuredmath{#1}%
  \else%
    \addtocounter{nestedmath}{1}%
    \pdfliteral{/Span << /ActualText<\pdfescapehex{\detokenize{$#1$}%
      }> >> BDC}%
    $#1$%
    \addtocounter{nestedmath}{-1}%
    \pdfliteral{EMC}%
  \fi}%
\let\@ensuredmath\normalinlinemath%
\catcode`$=13

%% \begin{equation}...\end{equation}
\let\normalequation\equation
\let\normalendequation\endequation
\renewenvironment{equation}%
  {\copymathenvget{equation}}%
  {\ifnum\value{nestedmath}>0 \message{You cannot nest equation}%
   \else%
     \normalequation%
     \addtocounter{nestedmath}{1}%
     \pdfliteral{/Spanx << /ActualText<\pdfescapehex{%
       \detokenize{\begin{equation}}\copymathenvput\detokenize{%
       \end{equation}}}> >> BDC}%
     \copymathenvput%
     \addtocounter{nestedmath}{-1}%
     \pdfliteral{EMC}%
     \normalendequation%
   \fi}

Unfortunately, it seems that this approach is not as universal as expected. For example, this kind of macro redefinition cannot be used directly for AMS-LaTeX mathematical environments, which call for a more complex macro redefinition. Another possible solution could be preprocessing the source code with an external tool. This approach, however, would need to deal with the complexity of LaTeX syntax.

4 Conclusions

Minimalist modifications of the current editorial workflow proved to be an easy way of moving mathematical journal editors to a digital-library-friendly state. Tralics provides us with sufficient functionality to do this easily and with platform independence.

The CopyMath macro package shows an alternative route to improving pdfTeX-generated PDFs, but the proper redefinition of all possible mathematical environments cannot be expected to be easy.

References

1. Archivum Mathematicum. [online], http://www.emis.de/journals/AM/, Masaryk University, Brno, Czech Republic. Last modified December 18, 2009. [cit. 2010-04-25].

2. Centre de diffusion de revues académiques mathématiques. [online], http://www.cedram.org/, [Center for diffusion of mathematic journals]. [cit. 2008-05-25].

3. Czech Digital Mathematics Library. [online], http://dml.cz/, [cit. 2010-04-24].

4. EuDML: The European Digital Mathematics Library. [online], http://www.eudml.eu/, This page was last modified on 20 January 2010, at 08:09. [cit. 2010-04-25].

5. Hatlapatka, R., Sojka, P.: PDF Enhancements Tools for a Digital Library. In: Sojka, P. (ed.) Proceedings of DML 2010, pp. 69–76. Masaryk University Press, Paris, France (Jul 2010).

6. Infty Project: Research Project on Mathematical Information Processing. [online], http://www.inftyproject.org/en/, [cit. 2010-06-02].

7. Tralics: a LaTeX to XML translator. [online], http://www-sop.inria.fr/apics/tralics/, Last modified $Date: 2009/11/24 17:17:03 $ [cit. 2010-04-24].

8. Bouche, T.: A pdfLaTeX-based automated journal production system. TUGboat 27(1), 45–50 (2006), In Proceedings of EuroTeX 2006.

9. Grimm, J.: Tralics, a LaTeX to XML Translator. TUGboat 24(3), 377–388 (2003), In Proceedings of EuroTeX.

10. Růžička, M.: Automated Processing of TeX-Typeset Articles for a Digital Library. In: Sojka, P. (ed.) DML 2008 – Towards Digital Mathematics Library. pp. 167–176 (2008), Birmingham, UK, July 27th, 2008.

11. Suzuki, M., Kanahori, T., Ohtake, N., Yamaguchi, K.: An Integrated OCR Software for Mathematical Documents and Its Output with Accessibility. In: Computers Helping People with Special Needs. Lecture Notes in Computer Science, vol. 3119, pp. 648–655. Springer (2004), 9th International Conference ICCHP 2004, Paris, July 2004.

Part IV

Digitization Reports

bdim: the Italian Digital Mathematical Library

Vittorio Coti Zelati

Dipartimento di Matematica e Applicazioni
Università degli Studi di Napoli “Federico II”
via Cintia, M.S. Angelo
80126 Napoli, Italy

Abstract. We present bdim (Biblioteca Digitale Italiana di Matematica), the Italian project for the digitization of mathematics. The project was started by SIMAI (Società Italiana di Matematica Applicata e Industriale) and UMI (Unione Matematica Italiana) with initial support from the Biblioteca Digitale Italiana and the Italian Ministry of Cultural Heritage and Activities (Beni e Attività Culturali), and with the help of Numdam. At the moment bdim consists of approximately 1,300 articles and 11,000 pages (articles from the Bollettino Unione Matematica Italiana, 1946–1967).

1 Italian Math Journals

In Italy there are many math journals published by mathematics departments or by scholarly societies. In the last few years many of them have started distributing their full text online, and some of them have decided to switch to a commercial publisher. A non-exhaustive list of journals and of their present situation follows:

– Journals that have decided to go with a commercial publisher: Annali di Matematica Pura ed Applicata (Springer since 2001, online on SpringerLink), Rendiconti Lincei – Matematica e Applicazioni (EMS Publishing House since 2005, online with the publisher starting from the 2005 volume), Ricerche di Matematica (Springer since 2006, online on SpringerLink starting from the 2006 volume), Annali dell’Università di Ferrara (Springer since 2006, online on SpringerLink), Rendiconti del Circolo Matematico di Palermo (Springer since 2008, online on SpringerLink).

– Journals that have been digitized by Numdam: notably Annali della Scuola Normale Superiore di Pisa. Classe di Scienze and Rendiconti del Seminario Matematico della Università di Padova.

– Journals which have not been (or have been only partially) digitized: Bollettino dell’Unione Matematica Italiana, Note di Matematica (Lecce, online in Lecce), Rendiconti di Trieste (online in Trieste), Rendiconti di Matematica e delle sue Applicazioni (Roma, partly online in Rome), Atti del Seminario Matematico e Fisico dell’Università di Modena (Modena), Istituto Lombardo. Accademia di Scienze e Lettere. Rendiconti. Scienze Matematiche e Applicazioni (Milano), Le Matematiche (Catania), Rivista di Matematica della Università di Parma (Parma),





Università e Politecnico di Torino. Seminario Matematico. Rendiconti (Torino, partly online).

There are also journals that are no longer published, such as the Giornale di Matematiche di Battaglini.

We do not have precise figures, but certainly more than 300,000 pages are still to be digitized (about 100,000 for the Bollettino dell’Unione Matematica Italiana).

Of the different categories, only the journals which have decided to go with Numdam are fully integrated with DML. Two of the journals now with a commercial publisher (Rendiconti Lincei and Ricerche di Matematica) have not been fully digitized, and should join our project.

2 The bdim Project

The bdim project was started by SIMAI (Società Italiana di Matematica Applicata e Industriale) and UMI (Unione Matematica Italiana) in connection with the international effort to create DML, the Digital Mathematical Library, and in collaboration with the BDI (Biblioteca Digitale Italiana, the Italian Digital Library).

Our goal is to search for funding, to digitize, and to provide the many Italian math journals with a common repository, better international visibility, and an interface to DML and BDI.

The initiative has received a grant of 15,000 Euro from the Italian Government to define the standards, to digitize some material, and to implement the repository and the web interface needed for the dissemination of the material.

At the moment, we have digitized the Bollettino dell’Unione Matematica Italiana Serie III (published in the period 1946–1967, consisting of 22 volumes, 83 issues, 1,358 articles, 11,390 pages) and we are testing the repository. The test repository is accessible at http://bdim.dma.unina.it/ (this address will change). We have also implemented an OAI-PMH server, compatible with the Mini-DML standard, accessible at http://bdim.dma.unina.it:8080/oaiprovider/?verb=Identify (this address will also change).
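As an illustration of how a harvester addresses such a server, the sketch below builds OAI-PMH request URLs from a verb and its arguments. The base URL is the one quoted above, which the text says will change; the helper name is hypothetical. Identify and ListRecords are standard OAI-PMH verbs, and oai_dc is the metadata format every OAI-PMH repository is required to support. The sketch only constructs URLs and performs no network access.

```python
from urllib.parse import urlencode

# Base URL quoted in the text; the authors note that it will change.
BASE_URL = "http://bdim.dma.unina.it:8080/oaiprovider/"


def oai_request(verb: str, **arguments: str) -> str:
    """Build an OAI-PMH request URL for the given verb and arguments."""
    query = {"verb": verb, **arguments}
    return BASE_URL + "?" + urlencode(query)


if __name__ == "__main__":
    print(oai_request("Identify"))
    print(oai_request("ListRecords", metadataPrefix="oai_dc"))
```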

We are now also acquiring the Bollettino dell’Unione Matematica Italiana Serie VIII, period 1998–2007. Meanwhile we are extending the initiative to other Italian journals and we are seeking the necessary financial support.

3 The Implementation

In implementing our project we have tried to follow the example of Numdam, from whose staff we have had a lot of help and advice. We have also adhered to the standards suggested by the DML project and to those required by the Biblioteca Digitale Italiana, in particular the MAG standards (Metadati Amministrativi e Gestionali, http://www.iccu.sbn.it/genera.jsp?id=267) required by the


Istituto Centrale per il Catalogo Unico delle Biblioteche Italiane e per le Informazioni Bibliografiche (http://www.iccu.sbn.it/).

Each page has been digitized at a resolution of 600 dpi (300 dpi for pictures, photos, etc.). Each page has been associated with the text obtained by OCR. Each issue has then been segmented into articles, and a PDF and a DjVu file have been produced for each article.

We have decided to use a Fedora repository (http://www.fedora-commons.org/) to manage, organize and preserve our material. Fedora is a flexible, well-tested, platform-independent project that is well suited to our needs. From the overview of Fedora:

In a Fedora repository, all content is managed as data objects, each of which is composed of components (“datastreams”) that contain either the content or metadata about it. Each datastream can be either managed directly by the repository or left in an external, web-accessible location to be delivered through the repository as needed. A data object can have any number of data and metadata components, mixing the managed and external datastreams in any pattern desired.

Each object can assert relationships to any number of other objects, providing a way to represent complex information as a web of significant meaningful entities without restricting the parts to a single context.

Our (Fedora) repository has been organized as follows: we have identified each journal, each volume, each issue and each article as a digital object. Each digital object has at least one datastream: an XML file which contains both the administrative and management information on the object and the bibliographical information on the object itself. Each digital object may have other datastreams: for example, article objects have the corresponding PDF, DjVu and OCR files, and issue objects have the TIFF files of all the pages making up the issue.
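The organization just described can be pictured with a small data model. This is a sketch only: the class, the identifier scheme and the datastream labels (METADATA, PDF, DJVU, OCR, TIFF) are hypothetical and do not reproduce Fedora's API or the identifiers used by bdim.

```python
from dataclasses import dataclass, field


@dataclass
class DigitalObject:
    """One repository object: a journal, volume, issue, or article."""
    pid: str    # hypothetical identifier scheme
    kind: str   # "journal", "volume", "issue" or "article"
    datastreams: dict[str, str] = field(default_factory=dict)
    children: list["DigitalObject"] = field(default_factory=list)

    def add(self, child: "DigitalObject") -> "DigitalObject":
        """Attach a child object and return it, so calls can be chained."""
        self.children.append(child)
        return child

    def walk(self):
        """Yield this object and all its descendants, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


def new_object(pid: str, kind: str, **extra: str) -> DigitalObject:
    # Every object carries at least the XML metadata datastream.
    streams = {"METADATA": pid + ".xml", **extra}
    return DigitalObject(pid, kind, streams)


if __name__ == "__main__":
    journal = new_object("j", "journal")
    volume = journal.add(new_object("j:v1", "volume"))
    issue = volume.add(new_object("j:v1:i1", "issue", TIFF="pages.tiff"))
    issue.add(new_object("j:v1:i1:a1", "article", PDF="a1.pdf",
                         DJVU="a1.djvu", OCR="a1.txt"))
    for obj in journal.walk():
        print(obj.kind, sorted(obj.datastreams))
```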

The Fedora repository (http://bdim.dma.unina.it:8080/fedora/, not accessible from the web) runs on the Tomcat web server. The web interface (http://bdim.dma.unina.it/) to the repository runs on a different web server (at the moment on the same machine), is written in PHP, obtains the data from the repository and (modulo an XSLT transformation) disseminates it. For example, to build the page for an article or an issue of a journal, the web server asks the repository for the relevant XML file and builds the web page from it. The search engine (not yet implemented in the test repository, where a simpler search engine is running) is based on a Fedora service (GSearch) which selectively harvests content and metadata from objects and indexes them. The OAI server http://bdim.dma.unina.it:8080/oaiprovider, another of Fedora’s services, also harvests the data from the Fedora server and disseminates it.


INSPIRE: Realizing the Dream of a Global Digital Library in High-Energy Physics

Annette Holtkamp*, Salvatore Mele, Tibor Šimko, and Tim Smith
on behalf of the INSPIRE Collaboration

CERN, 1211 Geneve 23
[email protected], [email protected], [email protected], [email protected]
http://cern.ch

Abstract. High-Energy Physics (HEP) has a long tradition of pioneering infrastructures for scholarly communication, and four leading laboratories are now rolling out the next-generation digital library for the field: INSPIRE. This is an evolution of the extraordinarily successful, 40-year-old SPIRES database. Based on the Invenio software, INSPIRE already provides seamless access to almost 1 million records, which will be expanded to cover multimedia, data, software and wikis. Services offered include citation analysis, fulltext search, extraction of figures from fulltext and search in figure captions, automatic keyword assignment, metadata harvesting, retrodigitization, ingestion and automatic display of LaTeX, and storage of supplementary materials like Mathematica notebooks. New services are in different phases of design or implementation, in strategic partnerships with all other information providers in the field and neighbouring disciplines, including: automatic author disambiguation, user tagging, crowdsourcing of metadata curation, automatic document classification, semantic analysis, innovative metrics, recommender systems, object aggregation with OAI-ORE definition, and integration of OAIS standards for long-term document preservation.

Key words: digital library, high-energy physics, INSPIRE, Invenio, metadata curation

1 Introduction

High-Energy Physics (HEP) takes pride in a long tradition of pioneering infrastructures for scholarly communication, with half a century of practice in preprint dissemination and two decades of expertise in running repositories [1]. It is rapidly evolving its scholarly communication platforms to realise the hopes of the e-science era. With the recent launch of the INSPIRE system [2], HEP scientists are seeing their dream come true of a digital library encompassing the complete corpus of their scientific output and providing state-of-the-art information tools to optimize their research workflow.

* on leave of absence from DESY, Notkestr. 85, D-22607 Hamburg, Germany







Global collaboration is needed to create a platform that satisfies the needs of scholars for easy and unrestricted access to comprehensive scientific information in their field and neighbouring disciplines and for powerful discovery tools. Thus the four leading HEP laboratories in Europe and the US have joined forces to develop the next-generation information platform, INSPIRE, tailored to the specific needs of the HEP community. CERN, DESY, Fermilab and SLAC are working in synergy with arXiv.org [3], publishers and other information providers in the field to build and operate INSPIRE as an evolution of the extraordinarily successful SPIRES database [4]. Based on the Invenio [5] Open Source digital library software developed at CERN, INSPIRE provides seamless access to almost 1 million records and will in the near future extend its scope to include supplementary material, multimedia, data, software and wikis. It will enable novel text- and data-mining applications and deploy new metrics to assess the impact of articles and authors.

This paper will outline the services currently offered by INSPIRE as well as new features presently being designed and implemented. Since the last decades have witnessed a growth of interdisciplinary ties between HEP and mathematics, the focus will be on describing strategies to solve current challenges common to HEP and mathematics.

2 The HEP Information Landscape

HEP scientists work in a relatively small, closely knit community consisting of 20,000–30,000 researchers. About 50% of them are theorists, who write 80% of all HEP papers in small global collaborations of fewer than 10 authors. The other half are experimental physicists, mostly working at big research centres in large global collaborations; recent papers published by the LHC collaborations at CERN carry more than 2,000 authors.

Particle physicists have always been driven by the need for rapid sharing of ideas and research results. This desire for speed, in combination with the global interconnectivity of the HEP research community, led to the early development of a preprint culture. Today, more than 90% of all HEP journal articles are submitted to arXiv.org. As early as the 1960s, however, it was common practice for HEP authors and institutes to distribute paper copies of articles worldwide before their publication in journals. In 1974, out of a library catalog of these preprints, the SPIRES-HEP database was born [6]. In December 1991, SPIRES-HEP became the first database on the web. Some months before, the first e-print archive, now known as arXiv.org, was set up. Since then, a symbiotic relationship has developed between these two community-driven information systems.

The SPIRES database, jointly run by SLAC, DESY and Fermilab, now contains more than 850k bibliographic records (preprints, journal articles, conference contributions) covering the entire HEP literature and many papers from related fields. Its human-curated metadata includes links to fulltext, author affiliations, citations, publication information, keywords from a HEP taxonomy and much more. Currently, about 100k searches are performed per day.


As a consequence of the decades' worth of trusted, curated content it contains and of its user-driven evolution, SPIRES enjoys overwhelming popularity within its worldwide user community. In a survey performed in the spring of 2007, 91.4% of the participants mentioned the community-based systems SPIRES and arXiv as their favourite information source [7]. The poll also highlighted the fact that SPIRES’ aging technological infrastructure presented a severe obstacle to fulfilling the future information needs of its user community. Therefore in May 2007, at the 1st HEP/PPA Information Resource Summit [8], the SPIRES collaboration joined forces with CERN to develop INSPIRE, the next-generation gateway to all HEP-relevant information. A public beta version has been accessible since April 2010 [2].

3 INSPIRE Overview

By migrating SPIRES to the Invenio platform, a modern open-source multimedia digital library software developed at CERN, cutting-edge information tools have been put at the disposal of particle physicists. Invenio’s strengths include speed, scalability to millions of records, a flexible metadata model supporting a variety of document types (articles, photos, videos), personalization and collaborative features, and a multilingual interface with support for 25 languages. Invenio uses a modular architecture and relies on acknowledged standards such as MARCXML [9] for storing bibliographic data and OAI-PMH for metadata exchange [10]. As part of the Open Source community, the software is available under the GNU General Public License and has over 25 production instances worldwide.
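To give a feel for the MARCXML format mentioned above, the sketch below reads two fields from a small hand-written record using Python's standard library. The record is invented for illustration and is not taken from INSPIRE; the tags used (100 for the main author, 245 for the title) are standard MARC 21 fields, but which fields INSPIRE actually populates is not stated in the text.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Namespace of the MARC 21 XML schema.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Minimal hand-written record for illustration; not taken from INSPIRE.
RECORD = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="100" ind1=" " ind2=" ">
    <subfield code="a">Author, A.</subfield>
  </datafield>
  <datafield tag="245" ind1=" " ind2=" ">
    <subfield code="a">An example title</subfield>
  </datafield>
</record>"""


def subfield(record: ET.Element, tag: str, code: str) -> Optional[str]:
    """Return the first subfield value for a MARC tag/code pair, if any."""
    path = f"marc:datafield[@tag='{tag}']/marc:subfield[@code='{code}']"
    node = record.find(path, MARC_NS)
    return node.text if node is not None else None


if __name__ == "__main__":
    rec = ET.fromstring(RECORD)
    print(subfield(rec, "245", "a"))
    print(subfield(rec, "100", "a"))
```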

Besides supporting the traditional SPIRES-specific search syntax, INSPIRE enables Google-like free keyword searches across metadata and fulltext. Invenio’s powerful search engine allows most queries to be executed in a fraction of a second, even for a repository with a million records.

Moving beyond SPIRES’ traditional role as a metadata store, INSPIRE will act as a fulltext repository hosting all freely accessible preprints, journal articles, conference contributions, and theses, enabling fulltext search and displaying snippets of text surrounding search terms on the results page, as shown in Fig. 1. Negotiations with publishers are under way to extend this functionality to access-restricted articles, especially with a view to articles predating arXiv. A first agreement was signed with Springer in April 2010.

For each article, a detailed page shows the abstract, keywords, publication information, links to different fulltext versions and a wealth of additional information. Work is in progress to extract figures from all arXiv papers recorded in INSPIRE and to display them as a film strip on the detailed record page, as exemplified in Fig. 2.

The figures are extracted from arXiv source tarballs and associated with paper records. The TeX sources of arXiv papers are parsed in order to extract the captions associated with each figure. The TeX formatting used in captions is stored as such in bibliographic records and is displayed in the user's browser



Fig. 1. Search results page with fulltext snippets

Fig. 2. Detailed article page with plot slider

via the jsMath library [11]. Storing captions in TeX makes them independently searchable, not only for words and phrases but for TeX symbols as well.

A further strength of INSPIRE is its citation analysis. The “co-cited with” network gathers information about papers which are frequently cited together with the paper of interest, opening new paths to find related articles. A citation history graph visualizes the citation counts of an article over time, enabling easy discovery of characteristic citation time patterns such as a “sleeping beauty”, one example being shown in Fig. 3.
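Both quantities can be computed from plain reference lists, as the following sketch shows. It is an illustration of the two notions, not INSPIRE's implementation; the function names and the input layout (a mapping from each citing paper to the set of papers it references) are assumptions.

```python
from collections import Counter


def cocited_with(references: dict[str, set[str]], paper: str) -> Counter:
    """Count how often other papers share a reference list with `paper`."""
    counts: Counter = Counter()
    for cited in references.values():
        if paper in cited:
            # Every other entry of this reference list is co-cited once.
            counts.update(cited - {paper})
    return counts


def citation_history(citing_years: list[int]) -> dict[int, int]:
    """Citations per year, from the publication years of the citing papers."""
    return dict(sorted(Counter(citing_years).items()))


if __name__ == "__main__":
    refs = {"A": {"P", "Q"}, "B": {"P", "Q", "R"}, "C": {"Q", "R"}}
    print(cocited_with(refs, "P").most_common())
    print(citation_history([2001, 1999, 2001]))
```

A “sleeping beauty” would show up in the second function's output as a long run of low yearly counts followed by a late rise.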

The metadata, fulltext, figure caption, citation, and other search indexes can be combined with one another, contributing to the unprecedented level of


Fig. 3. Citation page with co-citations and citation history graph

search flexibility INSPIRE will offer. For example, the search “author:Ellis caption:model cited:10→20 reference:astro” will return all papers by an author named “Ellis” that contain the word “model” in a figure caption, have been cited between 10 and 20 times, and reference some astrophysical arXiv paper.
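The structure of such a combined query can be made explicit with a toy parser. This is not INSPIRE's query parser: the function name is hypothetical, values containing spaces are not handled, and the range is assumed to include both end points.

```python
def parse_query(query: str) -> dict[str, object]:
    """Split 'field:value' terms; 'a→b' values become inclusive ranges."""
    parsed: dict[str, object] = {}
    for term in query.split():
        field, _, value = term.partition(":")
        if "→" in value:
            low, high = value.split("→", 1)
            # Assumption: both bounds are inclusive.
            parsed[field] = range(int(low), int(high) + 1)
        else:
            parsed[field] = value
    return parsed


if __name__ == "__main__":
    q = parse_query("author:Ellis caption:model cited:10→20 reference:astro")
    for field, value in q.items():
        print(field, value)
```

Each parsed term would then be evaluated against its own index (metadata, caption, citation count, reference list) and the partial results intersected.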

Another example of INSPIRE’s novel features is the author pages, which are built dynamically, as exemplified in Fig. 4. An author page provides a comprehensive profile of a scientist, containing information on affiliation history, research subjects, frequent coauthors, a breakdown of articles according to their type (journal article, conference contribution, lectures etc.) as well as a breakdown of articles with respect to their citation counts. The “citation


summary” format is suited to give some indication of the impact not only of a single scientist; it may also be applied to institutions, countries or the output of any query.

Fig. 4. Author page aggregating various information


As a clear response to a request from the community, whose experimental collaborations now have author lists of over 2,500 scientists, work is under way to uniquely identify authors and link them unambiguously to their scientific output. INSPIRE has developed its own author identification scheme and is a leading participant in the ORCID initiative [12] to establish interoperability between different author identification projects and resolve the problem of author ambiguity on a global scale. Based on its detailed knowledge of a scientist’s research topics, coauthor network, affiliation history, citation patterns and so on, INSPIRE is able to resolve author name ambiguities and to calculate the probability that an article was written by a certain author. To give some indication of the performance: for a set of 963 documents with the author name written as “Chen, G”, 21 distinct real authors have been identified. Only 22 out of the 963 documents were not associated with one of these authors, giving the algorithm in this case a success rate of 97.2%. As a next step, an interface is under development to allow registered authors to claim their papers, further improving the data quality for that author and, through the co-authorship network, for the whole database. Articles are categorized by probability of ownership and displayed to the presumed author, who is asked to confirm or reject these attributions. In addition, an option is offered to claim papers that have not been suggested or to submit papers not yet included in INSPIRE. Author names are represented internally in Unicode (UTF-8 encoding), enabling the association of translated author names with the names in their original languages, including ideograms.

4 Outlook

The beta version of INSPIRE is now operational, reproducing and improving the basic services which, with SPIRES, have powered the community for decades:

– central access to the complete HEP literature
– high-quality human-curated metadata
– very fast search engine enabling Google-like free keyword searches
– taxonomy-based classification
– comprehensive author pages
– extensive citation analysis

The next important step will be to roll out personal accounts. These will be activated within the next few months, enabling features like personal bookshelves, email notification alerts and RSS feeds, personalized display formats and tools for sharing information within a collaboration. Powerful incentives for the creation of personal accounts will be the claiming of articles, an improved system for reporting missing references, and tools for annotating and organising bibliographies. These services are known from SPIRES to have been on the “desiderata” list of the community for a long time.

Personal accounts will also open the door to porting tested Web 2.0 models of user-generated content into a large-scale digital library. In the 2007 user poll, 63% of the respondents expressed their willingness to spend at least half an hour per week on enriching the database content [7]. A first attempt to harness this potential will be to encourage users to tag content.

An important evolution of INSPIRE with respect to SPIRES is the possibility of hosting documents and other materials, rather than just linking to them. Users registered in INSPIRE will therefore have an opportunity to upload written material that they would not submit to the arXiv. A classic example is older material, from theses to unpublished documents, that they would like to see online but, since it is not of recent origin, do not want included in arXiv alerts. As an immediate extension of this possibility, INSPIRE can allow, within moderate storage limits, the upload of other kinds of documents, such as Mathematica notebooks, software source code, additional graphs and small data sets, not only as supplementary material directly attached to articles but, moving beyond the article-centric model, as independent citable objects. The centralisation of this material, away from personal web pages and in a clear format linked to publications, is another long-standing desire of the community. A corollary of this rich harvest of additional objects will be their aggregation (either by the curators or crowdsourced) into a single view of the same idea. The OAI-ORE standard definitions [13] are under consideration as a scheme to aggregate related objects.

Semantic techniques for information classification and retrieval are currently under development, based on a taxonomy of HEP concepts [14]. By exploiting synonyms, more comprehensive search results will be achieved. Another application currently being refined is the automatic categorization of material on ingestion, so that a paper is automatically recognized, e.g., as a conference talk on renormalization in perturbative quantum field theory or as a thesis on the electroweak model in noncommutative geometry. Other features to come are faceting of search results and a recommender system to suggest similar material based on combined citations, keywords, and usage pattern data.
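How synonyms from a taxonomy can widen a search may be sketched as follows. The taxonomy fragment, the function names and the naive substring matching are assumptions made for this illustration; they do not reflect the actual HEP taxonomy [14] or the INSPIRE search engine.

```python
# Hypothetical fragment of a concept taxonomy: preferred label -> synonyms.
TAXONOMY = {
    "quantum chromodynamics": {"qcd", "strong interaction theory"},
    "higgs boson": {"higgs particle"},
}


def build_index(taxonomy):
    """Map every label (preferred or synonym) to its whole synonym group."""
    index = {}
    for preferred, synonyms in taxonomy.items():
        group = {preferred} | set(synonyms)
        for label in group:
            index[label] = group
    return index


def expand(term, index):
    """Return the query term together with all its known synonyms, sorted."""
    key = term.strip().lower()
    return sorted(index.get(key, {key}))


def search(term, documents, index):
    """Ids of documents whose text contains the term or one of its synonyms.

    Matching is a plain case-insensitive substring test, which is far cruder
    than a real search engine but enough to show the effect of expansion.
    """
    labels = expand(term, index)
    return [doc_id for doc_id, text in documents.items()
            if any(label in text.lower() for label in labels)]


INDEX = build_index(TAXONOMY)
DOCUMENTS = {
    1: "Lattice QCD at finite temperature",
    2: "Quantum chromodynamics sum rules",
    3: "Dark matter searches",
}
```

With this toy data, `search("QCD", DOCUMENTS, INDEX)` finds both document 1 and document 2, although only the first one contains the literal query string.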

Thanks to its role as the central HEP information system, INSPIRE is ideally placed to become an essential agent in the digital preservation of particular classes of documents. On the grey literature side, a lot of effort has already been invested in retrodigitizing research papers and theses of the four laboratories running INSPIRE. These were inaccessible so far and are now archived in a persistent digital format. Preservation-on-demand services for users will be made possible for all the additional material discussed above, from small data files to Mathematica notebooks, from conference slides to multimedia. An additional incentive for preservation will be the fact that INSPIRE will make this material discoverable and citable. Another use case for preservation is the documentation that large experimental collaborations produce in support of their scientific analyses, which today is locked either in notes or in twikis. These are only as persistent as the organizations which created them, which are poised to move on to other scientific endeavours. An effort is under way to ingest this material into INSPIRE, linked to the original publication to which it refers, with correct provenance information and access rights reflecting the policies of the scientific groups which prepared this material. To this end, the necessary steps will be taken to make INSPIRE OAIS compliant [15]. As an aside, this process will also enable innovative metrics that take into account nontraditional forms of scientific results.

HEP as a field has long been quick to seize interdisciplinary opportunities. A notable example is the large overlap in literature and in scientists with astronomy, astrophysics and astroparticle physics. As a consequence, the digital libraries of these fields are moving closer together. Astronomy and astrophysics have long relied on the ADS (Astrophysics Data System) [16], run by the Smithsonian Astrophysical Observatory under a NASA grant. Starting with a rich metadata exchange, the collaboration between ADS and INSPIRE is evolving towards joint curation of records of common interest as well as joint development of full-text search and recommender systems. This will be facilitated by the move of part of the ADS operations to the Invenio platform as well.

It is easy to imagine that a similar, close collaboration could be initiated with an emerging digital library for mathematics. Theoretical HEP scientists use a large number of mathematical tools and could benefit from a more powerful set of discovery and retrieval opportunities through the interfacing of INSPIRE with such a digital library for mathematics. At the same time, cross-disciplinary records could be curated only once, information on author identification across the systems could be streamlined, and citations could be followed seamlessly.

In conclusion, several lessons have been learnt in the inception of INSPIRE, the transition from SPIRES to INSPIRE and the planning of future services. The most relevant are that a careful analysis of users’ needs and desires should be the driving force of all planning, and that a wide-ranging search for synergies and agreements across all information providers can accelerate development and deployment of new services. Both lessons may seem obvious, but there is always a risk in this kind of project that user-pull loses against technology-push, and collaboration against silo mentality.

The realisation of a large, federated, interoperable e-infrastructure for scholarly communication is coming closer and closer, and neighbouring fields have a unique opportunity to move together to deliver key services to their scientific communities.

References

1. R. Heuer, A. Holtkamp and S. Mele, Innovation in Scholarly Communication: Vision and Projects from High-Energy Physics, Info. Ser. and Use 28 (2008) pp. 83–96, arXiv:0805.2739v1, doi:10.3233/ISU-2008-0570
2. http://inspirebeta.net
3. http://arXiv.org
4. http://www.slac.stanford.edu/spires/
5. http://invenio-software.org/
6. L. Addis, http://www.slac.stanford.edu/spires/papers/history.html
7. A. Gentil-Beccot et al., Information Resources in High-Energy Physics: Surveying the Present Landscape and Charting the Future Course, J. Am. Soc. Inf. Sci. Technol. 60 (2009) pp. 150–160, arXiv:0804.2701v2, doi:10.1002/asi.20944
8. http://indico.cern.ch/event/11611
9. http://www.loc.gov/standards/marcxml
10. http://www.openarchives.org/OAI/openarchivesprotocol.html
11. http://www.math.union.edu/~dpvc/jsMath/
12. http://www.orcid.org
13. http://www.openarchives.org/ore/
14. http://www-library.desy.de/akw/HEPont.rdf
15. Reference Model for an Open Archival Information System (OAIS), http://public.ccsds.org/publications/archive/650x0b1.pdf
16. http://adswww.harvard.edu/


Part V

Tools and Techniques

Mathematical Communication and Representation in a Virtual Learning Environment

A Case Study

César Córcoles and Antonia Huertas

Computer Science Department, Open University of Catalonia (UOC)
Rambla Poblenou, 156, 08018 Barcelona, Spain

http://www.uoc.edu/

Abstract. At an exclusively online university such as the UOC, the need to communicate mathematics on the web is pressing. In an environment that does not allow for face-to-face communication, things implicitly communicated when using a blackboard, such as the canonical verbalization or handwriting of formulae, are lost, and their absence becomes a big obstacle. Also, the editorial process for the creation of learning/teaching resources is suited to a generalist approach and, consequently, needs such as those presented by formula typesetting, especially for web-based materials, are not deemed a priority. In the last two years a series of innovation projects and initiatives have been started at the UOC in order to improve the situation: the use of LaTeX and MathML standards in writing web resources, the use of LaTeX, MathML and other technologies in the verbalization and locution of formulae, and the study of current possibilities for mathematical handwriting recognition.

Key words: mathematical communication and representation, eLearning, mathematical markup standards, web publishing, automatic verbalization

1 Introduction

At a virtual learning institution such as the UOC (http://www.uoc.edu/), an exclusively online university, the need to communicate mathematics online is pressing: students enrolled in mathematics-dense courses, mainly in computer and telecommunication engineering degrees, need to learn mathematics, and in order to do so they have to overcome the barrier posed by mathematical formulae [1]. In an environment that does not allow for face-to-face communication, things implicitly communicated when using a blackboard (such as the canonical verbalization of formulae) and commonly considered a “non-problem” for teaching suddenly become a big obstacle. The detection of these difficulties, plus the perceived advantages digital resources could bring to the teaching/learning process, led us to start exploring the available means to improve digital communication of mathematics in our learning environment. Several aspects make this a hard hurdle to overcome.

– The virtual learning environment has been in place since the birth of the institution fifteen years ago, and only recently has an initiative to adhere to the OKI project reached a usable state, allowing for the use of tools widely available for other learning management systems.



– The institution has a strong, homogeneous pedagogical model, which has a number of clear advantages, but also means that subjects with special needs, both pedagogical and technological, such as mathematics, will not automatically gain full support. Students, even those enrolled in technological degrees, do not necessarily have an inclination towards technology, so any “different” solution they do not usually see in other classrooms is met with distrust or even rejection by a number of students. Additionally, a significant number took their last mathematics course in secondary school a long time ago and, as a result, have forgotten common mathematical language. This situation is especially problematic when students are adults with professional experience, little time and insufficient mathematical background who can only study at a distance [1], the typical student profile in an online distance university and, more and more, in the lifelong learning paradigm.

– The editorial process for the creation of learning/teaching resources is, again, suited to a generalist approach and, consequently, needs such as those presented by formula typesetting, especially for web-based materials, are not deemed a priority.

These factors lead to a search for off-the-shelf, easy solutions with a low cost and an even lower impact on the learning process, addressing the needs of users with widely varying degrees of technological and mathematical savvy, but especially those with a lower profile in both fields [2].

As the internet has become a medium for most activities, not only web-based teaching and learning [3], Web 2.0 [4] and the future expectations for the Semantic Web have an effect on almost all of them. Essentially, we consider the web as a platform where software applications, rather than documents, live; where these software applications are designed to harness “collective intelligence” and effectively move from a developer-centric point of view to a user-centered one. According to that user-centered model, the user’s data is the most important element in every transaction, and developers allow users total or almost total control of their assets. One very meaningful cause and effect of this is the sharp fall in the cost of access to the very sophisticated resources needed for a strong presence on the web. The advent of lightweight content management software allows the average web user or organization to publish quality information in a comfortable and efficient way. This leads naturally to a digitalization of contents and, in the case of mathematical e-learning, to the building of a digital mathematics library, containing resources of all kinds, both for research and educational purposes, and covering the whole range of mathematical sophistication.

For mathematical educational resources, though, the situation is not as advantageous [5]. Firstly, mathematical language requires a set of tools which, at the moment, are not sufficiently widespread. Thus, publishing web pages containing formulae marked up in the MathML presentation standard doesn’t automatically mean that most users will be able to read those web pages, and a number of alternative methods must be put in place. Secondly, authors and editors of educational resources (as we have already mentioned for users) are not necessarily tech proficient, so content creation tools not aimed at the baseline, in our experience, run a big risk of not being adopted, leading to a lack of digital resources in the library and creating a sort of digital divide for mathematics educational resources.

In our particular case, moreover, educational materials are produced through a process that has been developed for all subjects at the University and then delivered through the institutional Learning Management System (LMS). Neither of the two phases is particularly suited to mathematical content. This content, though, must be delivered in an adequate way to students without breaking either the technological environment or the pedagogical model. Thus, we have looked for a number of solutions that, while far from optimal in many aspects, are currently working and allow for acceptable academic performance.

We will now present the solutions we have developed and used, grouped in two main categories: those for mathematical content delivery on the web and those for the verbalization of mathematical formulae, that is, how to write, and then how to read, mathematics on the web.

2 Writing Mathematical Content on the Web

Mathematical content delivery has gone in the last thirty years from “chalk on blackboard” and manual typesetting to TeX and LaTeX typesetting (which, with the automation and precision it brought about, meant a revolution) and is currently moving to a web-based model [6]. In recent years, research has focused on the semantic aspects of content, as exemplified by the OMDoc markup format for mathematical documents [7] or the ActiveMath web-based learning environment [8]. Semantic information associated with digital content is a necessary step for added-value services. In the field of mathematics education, one added value is the verbalization of formulae, not only for visually disabled people but also for lifelong learners who have lost the connection between the visual representation of a formula and its verbalization. Other added-value services may include graphical representation of a mathematical formula or web searching in semantic-based databases, as exemplified by the NIST Digital Library of Mathematical Functions [9].

MathML, at present, seems to be the only viable alternative for content delivery if no means are available to develop a custom solution. However, it is not without its problems, especially in our context. Being a markup-dense language, it is not as easy to write or understand without the use of computer software as LaTeX, for example, can be. This means that most MathML content will not be authored directly in MathML: the author will either use some kind of WYSIWYG editor, provide LaTeX content to be translated, or provide the output of a computer algebra system as input. The first two cases can produce semantically erroneous but “visually correct” markup. Also, when dealing with an editorial process that is under a severe workload and for which mathematical content is a very small fraction of the work, asking editors to use new technologies can be a source of friction and costly errors in the process.

2.1 Web Based Mathematical Resources

In this section we present our experience creating interactive web-based learning material [10] designed for a calculus course in the computer engineering degree.

This material uses the Wiris software [11], an online computer algebra system which allows mathematical calculations to be performed online. This software was chosen because it offered some features that made it more suitable to our needs than other existing commercial software, in particular its availability in both online and local versions (so students are able to work from any computer connected to the Internet, which is one of the main features of the UOC). It is also a multilingual tool that allows mathematical computing to be done in different languages (Spanish or Catalan in our case), and it allows us to construct interactive learning exercises. The learning material we are going to present, although it relies heavily on Wiris, was designed and created so that it could be used with any other software with similar features.

When that material was designed in 2006, perhaps the most widespread tool to export mathematical notation to the web was the one used in Wikipedia, Texvc [12], which transforms standard mathematical LaTeX notation into PNG graphic files. Other available tools were TeX4ht [13], Hermes [14], TtM [15] or blahTeX [16]. They all work in approximately the same way: they take the standard text and convert it into (X)HTML, and mathematical formulae are converted into either graphic files or MathML markup. Thinking about “forward compatibility”, and worrying about the inevitable loss of information when converting mathematical markup into images, we chose the tex4moz LaTeX package to generate XHTML plus MathML. At the time Firefox was in its 1.5 release, the first one with native MathML support (depending on the operating system, some additional font downloads were necessary), and support for Internet Explorer was in the form of a freely available plug-in [17].

The conversion process was not trivial: tex4moz was still in development and, while it worked remarkably well, it had some trouble if non-standard packages were used or if the LaTeX markup was not correct. Errors in the generated XML document were hard to correct because of the way XHTML+MathML works: browsers will not render non-valid documents and provide no feedback about the nature of the error, so extensive use of the equation editor in Amaya [18] or other such tools was required.
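The missing feedback described above can be partly recovered before publication with any XML parser. The following is a minimal sketch using Python's standard library, not part of the workflow reported here; the function name and the sample documents are assumptions, and the check covers only XML well-formedness, not validity against the XHTML+MathML DTD.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xhtml_source):
    """Return None if the document parses, else (line, column, message)."""
    try:
        ET.fromstring(xhtml_source)
    except ET.ParseError as err:
        line, column = err.position  # location reported by the parser
        return line, column, str(err)
    return None


# A tiny XHTML+MathML page containing the formula x squared.
GOOD = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <math xmlns="http://www.w3.org/1998/Math/MathML">
      <msup><mi>x</mi><mn>2</mn></msup>
    </math>
  </body>
</html>"""

# The same page with the closing </msup> tag dropped, as a faulty
# conversion might produce it.
BAD = GOOD.replace("</msup>", "")

if __name__ == "__main__":
    for label, source in (("good", GOOD), ("bad", BAD)):
        problem = check_well_formed(source)
        print(label, "->", "well-formed" if problem is None else problem)
```

Running such a check on generated files points the author to the line where the parser gave up, which is exactly the information a browser's blank page withholds.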

From a developer’s point of view, Wiris takes the form of a Java applet that takes its parameters in the form of MathML commands. Neither the developer nor the students need to have any knowledge of MathML, as the applet provides an easy-to-use editor and menu-based tools.

As a result we obtained a simple interface, with acceptable mathematical notation and easy navigation. Furthermore, where in a traditional material we would have an exercise, we were able to provide a link to some interactive material developed with Wiris providing an infinite number of exercises. Where in a traditional book we would have a figure and a written explanation, here we were able to offer the possibility of a dynamic experiment to practice the concepts.


Fig. 1. Example of the UOC-Spanish language version of the web based material


2.2 Mathematical Content and Email

Of secondary interest for the creation of a digital mathematics library, but essential for mathematical e-learning, is the use of mathematical formulae in online communication. As we have mentioned, we are bound by the institutional LMS, with its own HTML-based mail solution, which doesn’t allow the use of MathML presentation markup.

That had severely limited the ability of students and faculty to use mathematical formulae in communication, forcing users to resort to verbalization of formulae or the attachment of scanned handwritten materials or documents produced by tools with WYSIWYG formula editors, mainly Word and OpenOffice Writer. These solutions were slow and awkward, and represented an obstacle to communication. In informal inquiries we found that those barriers were keeping students from raising their doubts in the Virtual Learning Environment.

Once the need was detected, we decided that the best solution would be a plug-in for the webmail platform, which would be made available progressively to those students enrolled in mathematical courses. Another design constraint that had to be taken into account was that it should be as transparent as possible to those users who didn’t want to adopt the solution. The chosen solution was a pseudo-LaTeX language which could be used with minimal training and was readable even without the plug-in rendering the formulae.

We observed from the very first pilots that the amount of communication in the classrooms involving formulae increased significantly.

3 Reading Formulae on the Web: Verbalization and Speech Tools

As previously stated, the reading of mathematical formulae, while trivial for students coming straight from secondary education and even for those who are not but are learning in a traditional blackboard environment, represents a problem for those who took their last mathematics course long ago and now don’t have an instructor reading the formulae aloud as she writes them. Communication, understanding and memorization efforts are thus noticeably hampered for a sizable number of students.

While we were especially worried about the impact on learning performance of not being able to verbalize formulae, when talking about a future digital mathematics library the focus would be on the accessibility of the contents in the library, both for people with sight disabilities and for those whose difficulty comes from the lack of an adequate and current education.

Two different approaches have been taken to solve the problem. On the one hand we have Rodolfo, a small “formula repository” containing formulae and human-read MP3 audio snippets for them and, on the other hand, a project developed by Maths for More to take MathML (presentation) and LaTeX formulae embedded in XHTML pages and automatically produce a transcription for each of them.

Fig. 2. Example of the pseudo-LaTeX language and visualization in the UOC mail platform

Rodolfo is a small and simple web-based database storing formulae, encoded either in LaTeX or in MathML presentation, plus their different possible readings in any language and MP3 files of those readings. The project was born of the need to embed those audio readings in web-based materials to improve academic performance, after detecting a number of students having trouble reading formulae. Thinking about reuse, a very simple framework was created so files could be retrieved with moderate ease. At the moment, only LaTeX search has been implemented, not semantic search.

Fig. 3. Information from server logs regarding MP3 use for the different formulae in a learning material. Spikes correspond to new terms appearing in the particular formula, showing the need students have for a verbalization tool.

Besides being a useful tool as it is, Rodolfo seems to be in line with efforts such as the “notation census” hosted by the Math-Bridge wiki. Also, it should not be extremely difficult to build it into a “formula repository” [19] which could also link to related definitions or content in a semantically meaningful way.
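A repository of this kind can be pictured with a very small data model. The sketch below is only an illustration under stated assumptions: the class names, fields, file paths and the whitespace-insensitive substring search are invented here and say nothing about how Rodolfo is actually implemented.

```python
from dataclasses import dataclass, field


@dataclass
class Reading:
    language: str   # e.g. "en", "es", "ca"
    text: str       # the spoken form, written out
    mp3_path: str   # where the recorded audio snippet is stored


@dataclass
class FormulaEntry:
    latex: str
    mathml: str = ""
    readings: list = field(default_factory=list)


class FormulaRepository:
    """In-memory stand-in for a formula database with LaTeX-only search."""

    def __init__(self):
        self._entries = []

    def add(self, entry):
        self._entries.append(entry)

    @staticmethod
    def _normalise(latex):
        # Crude normalisation: drop all whitespace so that "\frac{a} {b}"
        # and "\frac{a}{b}" compare equal. No semantic matching is attempted.
        return "".join(latex.split())

    def search_latex(self, fragment):
        """Entries whose LaTeX source contains the given fragment."""
        needle = self._normalise(fragment)
        return [e for e in self._entries if needle in self._normalise(e.latex)]

    def readings_for(self, fragment, language):
        """All readings in one language for the formulae matching a fragment."""
        return [r for e in self.search_latex(fragment)
                for r in e.readings if r.language == language]


repo = FormulaRepository()
repo.add(FormulaEntry(
    latex=r"\frac{a}{b}",
    readings=[Reading("en", "a over b", "audio/frac_a_b.en.mp3"),
              Reading("es", "a partido por b", "audio/frac_a_b.es.mp3")],
))
```

For instance, `repo.readings_for(r"\frac", "en")` returns the English reading and the path of its audio file, which a web page could then embed next to the formula.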

A completely different approach to the same problem is presented by the (currently nameless) verbalization tool developed by Maths for More: it is an automatic tool (server-based, developed in Java) that scans web pages for LaTeX and/or MathML presentation formulae and automatically generates a text transcription [20] (which could then be sent to a text-to-speech engine). Obviously, trying to extract meaning from presentational markup such as the one available is bound to produce ambiguous situations. Luckily, the scope of every single web page is limited, so we can resort to adding a meta tag to the page indicating the domain it covers, so that the tool can use that knowledge for disambiguation.
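The role of the domain hint can be shown with a deliberately tiny verbalizer for presentation MathML. This is a sketch under assumptions, not the Maths for More tool (which is written in Java): the meta tag name `math-domain`, the domain values, the English wording and the handful of supported elements are all hypothetical.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "{http://www.w3.org/1998/Math/MathML}"
XHTML_NS = "{http://www.w3.org/1999/xhtml}"
OPERATOR_WORDS = {"+": "plus", "-": "minus", "=": "equals"}


def verbalize(element, domain="algebra"):
    """Read one presentation MathML element aloud, as an English string."""
    tag = element.tag.replace(MATHML_NS, "")
    children = list(element)
    if tag in ("mi", "mn", "mo", "mtext"):
        text = (element.text or "").strip()
        return OPERATOR_WORDS.get(text, text)
    if tag == "mfrac" and len(children) == 2:
        return f"{verbalize(children[0], domain)} over {verbalize(children[1], domain)}"
    if tag == "msup" and len(children) == 2:
        base, sup = (verbalize(child, domain) for child in children)
        # The same markup means different things in different domains.
        if domain == "linear-algebra" and sup == "T":
            return f"{base} transpose"
        if sup == "2":
            return f"{base} squared"
        return f"{base} to the power {sup}"
    # math, mrow and anything unrecognised: read the children left to right.
    return " ".join(verbalize(child, domain) for child in children)


def page_domain(root, default="algebra"):
    """Look for the (hypothetical) <meta name="math-domain"> hint."""
    for meta in root.iter(XHTML_NS + "meta"):
        if meta.get("name") == "math-domain":
            return meta.get("content", default)
    return default


def transcribe_page(xhtml_source):
    """Return one transcription per <math> element found in the page."""
    root = ET.fromstring(xhtml_source)
    domain = page_domain(root)
    return [verbalize(m, domain) for m in root.iter(MATHML_NS + "math")]


PAGE = ('<html xmlns="http://www.w3.org/1999/xhtml"><head>'
        '<meta name="math-domain" content="linear-algebra"/><title>t</title>'
        '</head><body><p><math xmlns="http://www.w3.org/1998/Math/MathML">'
        '<msup><mi>A</mi><mi>T</mi></msup></math></p></body></html>')
```

On `PAGE` the superscript T is read as a transpose because of the declared domain; the same markup on a page without the hint would be read as a power. The resulting strings are what would be handed to a text-to-speech engine.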

4 Final Remarks and Future Work

Advances in web-based mathematics education depend heavily on the technologies to communicate the language of mathematics on the web and the specific contexts and audiences we are targeting. We have presented experiences that have taken place in an exclusively online university such as the UOC, where the difficulties in the development and deployment of these technologies have a deep effect on the technical degrees which need to use mathematical content.

A few years ago mathematics courses at the UOC were delivered as handbooks, taking no advantage of hyperlinking, multimedia and interactivity. While we cannot say that we are anywhere near building a digital library of mathematics, steps have been taken towards the digitalization of every mathematics-related course at the institution. Some of those steps have allowed the use of resources such as a simple computer algebra system, allowing us to focus more on meaning and understanding, and less on the memorization of algorithms by students. We have also seen how digitalization (and keeping or adding as much semantic information as possible) has allowed for verbalization, leading to better accessibility and communication among students and teachers, which was also helped by adding a simple, transparent tool to help students with formula writing. In the future, linking to resources such as the NIST Digital Library of Mathematical Functions should allow for a much richer and more productive learning experience for students.

There is still a lot of work to be done before we can say that the transmission and consumption of mathematical content is not a problem at an online university such as the UOC and that our online content can be called a real digital library. In particular, besides going further into each of these lines, we have recently started looking into available solutions for the input of formulae using handwriting, with digitizing tablets or tablet PCs as input devices and software to record that input and, if at all possible, convert it into semantic content.

As a final remark, we state our strong opinion that, much in the same way as widespread LaTeX adoption depended on the availability of easy, robust authoring tools, the same thing will happen with web-based mathematics. In our case, we are at a point where the web as a platform is finally reaching enough maturity to represent a viable publishing platform with the adoption of HTML5 (plus technologies such as MathML but also SVG or Canvas, for example). The mathematics community has always been at the forefront of web specification writing, with MathML being the first XML language recommended by the World Wide Web Consortium in 1998, with continuing work today to improve it and adapt it to the times. We therefore believe that, in order to see that standard take the current mathematical library effectively into the digital era, greater emphasis must be put on the development of tools that will allow both current LaTeX users and newcomers to use them efficiently. Also, a strong effort must be made so that potential users know about available tools and can locate them easily and find their corresponding documentation and tutorials.

Acknowledgments. This work has been partially supported by three UOC educational innovation projects in the years 2007, 2008 and 2010, respectively, and by the eLearn Center.




References

1. Juan, A., Huertas, M., Steegmann, C., Córcoles, C. and Serrat, C.: Mathematical E-Learning: state of the art and experiences at the Open University of Catalonia. International Journal of Mathematical Education in Science and Technology, vol. 39, no. 4, pp. 455–471 (2008).
2. Kahn, P. and Kyle, J. (eds.): Effective learning & teaching in Mathematics & its applications. Kogan Page (2002).
3. Alexander, B.: Web 2.0: A new wave of innovation for teaching and learning? EDUCAUSE Review, 41(2), 32–44 (2006). http://www.educause.edu/apps/er/erm06/erm0621.asp?bhcp=1
4. O’Reilly, T.: What is Web 2.0. O’Reilly’s [blog] (2005). http://www.oreillynet.com/pub/a/oreilly/tim/news/2005/09/30/what-is-web-20.html
5. Miner, T. and Topping, P.: Math on the Web: A Status Report – Focus: Distance Learning (2001). http://www.dessci.com/en/reference/webmath/status
6. Pajo, K. and Wallace, C.: Barriers to uptake of web based technology by university teachers, J. Dist. Edu. 16, pp. 70–84 (2001).
7. Kohlhase, M.: OMDoc: Towards an Internet Standard for the Administration, Distributions and Teaching of Mathematical Knowledge. Proceedings of Artificial Intelligence and Symbolic Computation. Springer LNAI (2000).
8. Melis, E., Andrès, E., Büdenbender, J., Frischauf, A., Goguadze, G., Libbrecht, P., Pollet, M. and Ullrich, C.: ActiveMath: A Generic and Adaptive Web-Based Learning Environment. International Journal of Artificial Intelligence in Education, no. 12, pp. 385–407 (2001).
9. NIST Digital Library of Mathematical Functions. http://dlmf.nist.gov
10. Bringslid, O.: Mathematical e-learning using interactive mathematics on the Web, Eur. J. Eng. Edu. 27, pp. 249–255 (2002).
11. Maths for More, http://www.mathsformore.com/
12. Texvc, http://en.wikipedia.org/wiki/Texvc
13. TeX4ht: LaTeX and TeX for Hypertext, http://www.cse.ohio-state.edu/~gurari/TeX4ht/
14. Hermes – a semantic XML e-publishing tool for LaTeX authored scientific articles, http://hermes.roua.org/
15. TtM, a TeX to MathML translator, http://hutchinson.belmont.ma.us/tth/mml/
16. Blahtex, http://www.blahtex.org/index.php?page=home
17. MathPlayer, http://www.dessci.com/en/products/mathplayer/
18. Amaya Home Page, http://www.w3.org/Amaya/
19. Minguillón, J., Huertas, M. A., Juan, A. A., Sancho, T., Cavaller, V. (2008). “Using learning object repositories for teaching Statistics”. Proceedings of the First Workshop on Methods and Cases in Computing Education. Salamanca: pp. 53–61. ISBN: 978-84-691-8558-2.
20. Sancho, T., Córcoles, C., Huertas, M. A., Pérez, A., Marquès, D., Villalonga, J. (2008). “Automatic Verbalization of Mathematical Formulae for web-Based Learning Resources”. In: Remenyi, D. The Proceedings of the 7th European Conference on e-Learning. Academic Publishing Limited. pp. 405–414. ISBN: 978-1-906638-23-8.


Producing MathML with Tralics

José Grimm

Institut National de Recherche en Informatique et en Automatique
2004 Route des Lucioles
06902 Sophia Antipolis (France)
[email protected]

Abstract. We describe how Tralics can be used to convert LaTeX documents into XML or HTML. It uses an ad-hoc DTD (a simplification of the TEI), but the translation of math formulas conforms to the Presentation MathML 2.0 recommendation. We explain how to run and parametrize the software. We give an overview of the various MathML constructs, and of how they are rendered by different browsers.

1 Introduction

Tralics is free software, written by the author, that converts LaTeX documents into XML; initially designed for Inria’s annual activity report, it is also used by CEDRAM [1] and Zentralblatt MATH. It can be downloaded from the Web¹, where you can also find documentation and examples².

The software uses the same rules as TeX for analyzing the source document, and converts it into a list of tokens; these tokens may be stored in token lists or macros, as in TeX, or converted into XML elements (which play the same role as boxes in TeX). The whole XML tree is dumped into a file at the end of the run (after producing an index and a bibliography, and checking that labels are correct). The structure of this tree is inspired by the TEI DTD; for instance, a footnote is translated as a <note> element with the attribute pair place='foot'. It is possible to change the names of almost all elements or attributes, either in a configuration file or in the TeX source, and Tralics makes no attempt to check that the XML tree conforms to the DTD declared in the header.

Tralics is one of the five converters examined in [6], which lists some strengths and weaknesses. The authors indicate a success rate of only about two percent, and an average of 50 undefined commands per document. There are indeed frequently used commands (for instance \author, \email) that are defined in class or style files not yet handled by Tralics. This is a minor inconvenience, since it is rather easy to define these commands so that they produce an <author> or <email> element. Some very useful packages like xypic, pgf, pstricks are unknown to Tralics. These are big and complicated packages, and all assume that there is a PostScript-like renderer that will draw the lines, arrows, etc. This is incompatible with the HTML model. One solution would be to use SVG for the graphics, but the W3C example mixing SVG and MathML does not render correctly on my machine.

1 http://www-sop.inria.fr/apics/tralics
2 see also http://www-sop.inria.fr/apics/gaia

On the other hand, Tralics is designed to run in batch mode. In case of errors, a message is printed in the transcript file, and translation continues (the job is aborted after 5,000 errors); to help debugging, each error produces an <error> element in the XML file. For instance, the translation of \foo\bar on line 13 will be

<error n='\foo' l='13' c='Undefined command'/>
<error n='\bar' l='13' c='Math only command'/>
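Since each error leaves an <error> element in the output, a batch run can be checked after the fact by scanning the XML file. The following is a minimal sketch in Python, not part of Tralics; the sample string and the function name list_errors are hypothetical, and only the attribute names n, l and c are taken from the example above.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment of a Tralics output file, wrapped in a root element
# so that it can be parsed on its own.
SAMPLE = (
    "<root>"
    "<error n='\\foo' l='13' c='Undefined command'/>"
    "<error n='\\bar' l='13' c='Math only command'/>"
    "</root>"
)

def list_errors(xml_text):
    """Return (line, command, message) for every <error> element found."""
    root = ET.fromstring(xml_text)
    return [(int(e.get("l")), e.get("n"), e.get("c")) for e in root.iter("error")]

for line, name, message in list_errors(SAMPLE):
    print(f"line {line}: {name}: {message}")
```

On a real file one would read the text of foo.xml instead of SAMPLE; the sketch assumes that the attributes are always present.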

2 Running Tralics

Normally, you start with a source file, say foo.tex, and run Tralics; this gives you a file foo.xml. There is also an interactive mode, where Tralics reads characters from the terminal and prints all MathML expressions on the screen. There are many ways to parameterize the output of Tralics. First of all, Tralics reads the file foo.ult; you can put here some LaTeX definitions (that are not used when you compile the file with LaTeX). If your document uses a class, say c1.cls, Tralics reads c1.clt instead, and if your document uses a package, say p2.sty, Tralics reads p2.plt instead; if these files are missing, they are silently ignored. Only standard classes and packages have an associated clt or plt file.
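The naming convention described above can be summarised by a small sketch in Python. It only illustrates which file names are derived from which; the function names are hypothetical, and the search path that Tralics actually uses is not modelled.

```python
from pathlib import Path

def companion_files(source, classes=(), packages=()):
    """Names of the files looked up for a source file, its classes and packages."""
    names = [Path(source).with_suffix(".ult").name]   # foo.tex -> foo.ult
    names += [f"{c}.clt" for c in classes]            # c1.cls  -> c1.clt
    names += [f"{p}.plt" for p in packages]           # p2.sty  -> p2.plt
    return names

def existing_only(names, directory="."):
    """Keep the files that exist; missing ones are silently ignored."""
    return [n for n in names if (Path(directory) / n).is_file()]

print(companion_files("foo.tex", classes=["c1"], packages=["p2"]))
```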

In [5], we explain how the XML documents produced by Tralics can be converted into a sequence of HTML pages, via an XSLT processor, lots of style sheets, and an external tool that converts non-HTML pieces (syntax trees and math formulas) into images (an example is Inria’s activity report³, where PassiveTeX converts XML to PDF [2]). Reasons for converting math formulas into images include the fact that few browsers render MathML, that they require specific DTDs or file names, and that the rendering is not always legible. In this paper, we assume on the contrary that the document will be interpreted by a MathML-aware browser.

Services such as CEDRAM or Zentralblatt use Tralics to convert the metadata (typically the summary) of a paper into XML, and insert this into an HTML page together with other content; in Figure 1 you can see a MathML formula in an article title (more complex formulas appear in abstracts or reviews; some are in display style).

A special feature of Tralics is that a math formula can be translated either as a LaTeX-like formula or as a MathML formula; the following defines a command \both with one argument that evaluates it in both modes (this feature is used by CEDRAM), and puts the results in elements named A and B respectively.

\def\Va#1{\xbox{A}{#1}}
\def\Vb#1{\xbox{B}{\@nomathml=-1 #1}}
\def\both{\@reevaluate\Va\Vb}

Fig. 1. Example of MathML in Zentralblatt

3 http://www.inria.fr/rapportsactivite/index.fr.html

The translation of

\def\square#1{(#1)^2}
\both{$\square{a+b}$}

will be

<A><formula type='inline'>
  <math xmlns='http://www.w3.org/1998/Math/MathML'>
    <msup>
      <mrow>
        <mo>(</mo><mi>a</mi><mo>+</mo><mi>b</mi><mo>)</mo>
      </mrow>
      <mn>2</mn>
    </msup>
  </math>
</formula></A>
<B><texmath textype='inline' type='inline'>(a+b)^2</texmath></B>

You can change the names of the elements or attributes used by Tralics, either on the command line, or in a configuration file, or in the source file; this applies to almost everything but the content of math formulas (the content of the <math> element always conforms to the MathML recommendation). For instance, if you say

\ChangeElementName*{mathmlns}{foo}
\ChangeElementName{math}{mml:math}
\ChangeElementName{formula}{F}
\ChangeElementName*{type}{T} ...

this will change the names of the elements <math> and <formula>, of the attribute type, etc. With these declarations, the translation of $x$ will be

<F T='I'>
  <mml:math xmlns='foo'>
    <mi>x</mi>
  </mml:math>
</F>

All math formulas are wrapped in a <formula> element that has various attributes: the name of the environment (if any), the equation number (value of \theequation), a unique ID for referencing, tags, etc. Note that at most one label can be attached to a formula (this limitation will be removed in the future). Using multiple tags is problematic, because of the bad rendering of the <mlabeledtr> element by most browsers. For this reason, the translation of the \tag command can be either a part of the math expression, or an attribute of the formula. The translation of

\begin{equation} x\end{equation}
\begin{equation*} x \tag{1-2}\end{equation*}
\begin{math} x\end{math}

will be

<F id-text='1' id='uid1' textype='equation' T='D'>
  <mml:math mode='display' xmlns='foo'> ...</mml:math>
</F>
<F textype='equation*' T='display' tag='(1-2)'>
  <mml:math mode='display' xmlns='foo'> ...</mml:math>
</F>
<F textype='math' T='I'><mml:math xmlns='foo'> ... </mml:math></F>

3 Examples

We show in this section some screen shots; they were obtained using different browsers, essentially Amaya and Firefox on Macintosh. We carefully followed the instructions for installing the right fonts in 2007: we used TeX’s Computer Modern fonts, set the user preferences (asking Firefox to use CM fonts) on Linux, and installed the Mathematica 4.1 fonts instead on the Mac. Our HTML example file (containing all formulas found in the TeXbook or the LaTeX Companion, [4]) was converted to PDF (via the Print button) and saved. Since then, the machines have been replaced by new ones, and the new snapshots were taken with default settings. The Firefox web page recommends using the STIX fonts (released May 28, 2010) and resetting the user preference item. This has no effect on Amaya or Opera, but corrects a few bugs in Firefox.

Fig. 2. Nested roots (Firefox 2007 & 2010)

Figure 2 shows the rendering of nested square roots. Default fonts were used on the right; installing the STIX fonts improves the situation on Mac (but not on Linux). Consider another example

\[ \left(\left[\left\lbrack a\left\{\left\lbrace\left\lfloor b...

\rbrack\right]\right)\]

Fig. 3. Various delimiters (Firefox Mac 2007, Mac 2010, Linux 2010)

The rendering is shown in Figure 3. There was a mistake in the source of the 2007 edition of the document (there was \langle instead of \rangle). The bugs have been corrected in the STIX fonts.

Fig. 4. Delimiters and Roots (Amaya)

Figure 4 shows the rendering by Amaya. One can notice that the math formula is very close to the surrounding text. The square root signs are beautiful, but the delimiters are unreadable: the parentheses, brackets and vertical lines all look the same; the forward slash, backward slash and double vertical rule are not vertically centered; braces are much too small. Magnifying the font does not help distinguish ( from [.

Fig. 5. Large operators (Firefox 2007, Firefox 2010, Amaya)

Figure 5 shows the placement of indices for large operators (a sum and an integral) in display style, in text style, and in display style with non-standard placement. Note that the new Firefox uses the same sum sign for all three occurrences (whether or not the STIX fonts are installed), while the sizes of the integrals change; the placement of the last π/2 is wrong in Amaya.

Fig. 6. Stacking (Firefox 2007, Firefox 2010, Amaya)

Figure 6 corresponds to the following example

$\displaystyle \sum\limits_{{\scriptstyle1\le i\le n\atop\scriptstyle 1\le j\le q}
  \atop \scriptstyle 1\le k\le r}a_{ij}b_{jk}c_{ki}$

The math formula consists of a <mstyle> element, so that the formula should be rendered in display style; this means that a large sum sign should be used. The sum is followed by three <msub> elements; the letters a, b and c should have normal size, while the letters i, j and k should be smaller. The index of the sum is special; Tralics chooses to use <munder> in display style. The index is formed of three lines, one atop the other, and we use an <mfrac> with zero line thickness. Each line has its own style, which says that i, j and k should have the same size everywhere. They are too small in Amaya, and sometimes too big in the old Firefox version. Note that the vertical spacing is irregular in Firefox, and too big in Amaya. We have also shown the bounding box in the Amaya case. The rendering of MathML in Firefox and Amaya seems quite good, unlike in Opera: Figure 7 shows that some parts are correctly rendered (the square roots), some are quite wrong (the middle part should be the same as in Figure 3), and some are partially wrong (the fraction rule should be invisible on the right).



Fig. 7. Examples with Opera (version 10.10)

4 The Characters

When Tralics sees a dollar sign, or any other command that indicates the start of a math formula, it reads all characters up to the end of the formula, expanding what can be expanded, evaluating what can be evaluated, and sometimes signaling errors. The result is a nested list of math tokens, and a two-pass algorithm is used to convert this into a MathML element. Consider for example

$ \{x,~x^{10} > \alpha\beta\} \cup \left[ xy\right]$

The first pass converts atoms into atoms, the second pass recursively combines these objects. To each of the three tokens \alpha, \{, and x is associated a class and a value. In TeX, the numbers associated with the first two tokens are 010B and 4266308 (in hexadecimal); the class is respectively 0 and 4, and the value is either 10B or, for \{, the pair (266, 308). A pair is required here since the glyphs associated with the brace are in two different fonts (typically cmsy and cmex); this is a strange design. Note that there are at most 2^12 = 4096 different possible values, which is a very small number. For us, the value will be any Unicode code point. The objective of the first pass is to replace these numeric values by atomic XML elements (called token elements), leaving the class unchanged. The second pass builds the tree looking only at the classes.
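The way TeX packs a class and a value into these numbers can be made explicit with a few bit operations. The sketch below, in Python, follows the standard TeX layout (one hexadecimal digit for the class, then a family digit and a two-digit position for each glyph); the function names are ours and nothing here is Tralics code.

```python
def decode_mathchar(code):
    """Split a 15-bit math character code into (class, family, position)."""
    return (code >> 12) & 0x7, (code >> 8) & 0xF, code & 0xFF

def decode_delimiter(code):
    """Split a 27-bit delimiter code into class, small glyph and large glyph."""
    cls = (code >> 24) & 0x7
    small = (code >> 12) & 0xFFF          # 12 bits: family and position
    large = code & 0xFFF                  # 12 bits: family and position
    split = lambda v: ((v >> 8) & 0xF, v & 0xFF)
    return cls, split(small), split(large)

print(decode_mathchar(0x010B))       # (0, 1, 11): class 0, family 1, position 0B
print(decode_delimiter(0x4266308))   # (4, (2, 102), (3, 8)): class 4, 266 and 308
```

Each glyph is described by 4 + 8 = 12 bits, which is where the limit of 4096 values comes from.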

The translation of an atom depends on its class: it is an <mi>, <mn>, <mspace> or <mo> if the atom is a letter, a digit, a space, or something else; in some cases a sequence of atoms is converted into a single token element. A simplified representation of the previous formula could be “liosiC(nn)oiiro(ii)”, where i, n, s and o stand for the four token elements <mi>, <mn>, <mspace> and <mo>, parentheses indicate nesting, C indicates a command interpreted later, and l and r indicate opening and closing delimiters.
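The classification rule can be illustrated by a toy version that works on plain characters. This is a deliberately simplified sketch in Python: it knows nothing about TeX commands, delimiters or nesting, and the function names classify and tokenize are hypothetical. It only shows the letter/digit/space/other distinction and the merging of consecutive digits into one token element.

```python
def classify(ch):
    """Name of the token element for a single character."""
    if ch.isalpha():
        return "mi"
    if ch.isdigit():
        return "mn"
    if ch.isspace():
        return "mspace"
    return "mo"

def tokenize(chars):
    """Merge consecutive digits into one <mn>; every other character stands alone."""
    atoms = []
    for ch in chars:
        kind = classify(ch)
        if kind == "mn" and atoms and atoms[-1][0] == "mn":
            atoms[-1] = ("mn", atoms[-1][1] + ch)
        else:
            atoms.append((kind, ch))
    return atoms

print(tokenize("x10>αβ"))
```

Letters are kept as separate identifiers, so a product such as xy gives two <mi> atoms, while the digits of 10 end up in a single <mn>.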

A Unicode code point (see [7]) is an integer less than 17 × 2^16. Each sequence of 2^16 characters is called a plane; the first plane contains all European characters and a lot of mathematical characters, and browsers like Firefox have glyphs for all characters in this plane. Any positive number that fits in 27 bits is considered by Tralics as a character; thus $\char"7FFFFFF$ is accepted, but produces an illegal Unicode character. If you specify UTF-8 as current input encoding, then the sequence of 3 bytes E2 82 AC will be read as the character U+20AC; it can also be entered as ^^^^20ac; it corresponds to the Euro sign. Any positive number that fits in 16 bits is a normal character (it has a category code and a UC code, can be part of a command name, etc.); it can be entered via the four-hat mechanism and is part of the first plane. Note that TeX extensions like LuaTeX consider any Unicode character as a normal character that can be entered via a five-hat or six-hat mechanism.
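The numbers quoted in this paragraph are easy to check. The following Python lines are a small verification sketch; the helper names plane and four_hat are ours, and four_hat merely formats a code point in the four-hat notation described above.

```python
MAX_CODE_POINTS = 17 * 2**16              # 0x110000, the size of the Unicode code space

def plane(code_point):
    """Index of the block of 2^16 code points containing code_point (0 = first plane)."""
    return code_point >> 16

def four_hat(code_point):
    """Four-hat notation; only defined for code points that fit in 16 bits."""
    if not 0 <= code_point < 2**16:
        raise ValueError("four-hat notation covers the first plane only")
    return "^^^^{:04x}".format(code_point)

euro = bytes([0xE2, 0x82, 0xAC]).decode("utf-8")
print(hex(ord(euro)), plane(ord(euro)), four_hat(ord(euro)))   # 0x20ac 0 ^^^^20ac
print(0x7FFFFFF < 2**27, 0x7FFFFFF < MAX_CODE_POINTS)          # True False
```

The last line shows why \char"7FFFFFF is accepted (it fits in 27 bits) and yet is not a legal Unicode character.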

Fig. 8. Examples of characters

Consider for instance the characters ^^^^301a and ^^^^27e6; both are named “left white square bracket” by the Unicode consortium, the latter being qualified as “mathematical”, and they look much the same in Firefox (see Figure 8). When used in math mode, the translation is an <mi> element; you can use the commands \mathmo or \mathci if you want an <mo> or a <ci> instead (there are six possibilities). One can note that the prime sign is positioned too high, especially for the letter g (this is because the prime sign is an active character in math mode and is converted into a superscript).

At positions 1Dxxx (in the second Unicode plane), you can find a lot of variants of mathematical characters (bold, Fraktur, bold Fraktur, double struck, etc.), but some of these characters appear in the first plane (see Figure 8). A complete set of fonts (like the STIX fonts) is required in order to see them all.
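The split between the two planes can be observed with the Unicode character database shipped with Python. The short sketch below is only an illustration: it looks up a bold A and a script A, which live at 1Dxxx, and the script B, which was encoded earlier in the first plane, leaving a reserved hole at U+1D49D.

```python
import unicodedata

for cp in (0x1D400, 0x1D49C, 0x212C):
    # cp >> 16 is 0 for the first plane and 1 for the second one.
    print(f"U+{cp:04X} plane index {cp >> 16}: {unicodedata.name(chr(cp))}")

# The slot where a script capital B would be expected at 1Dxxx is unassigned.
print(unicodedata.name(chr(0x1D49D), "unassigned"))
```

A converter that wants script letters therefore needs a table of exceptions, or it can fall back on the mathvariant attribute.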

Most useful characters can be used in Tralics via a name; for instance \texteuro produces the Euro sign. The command \llbracket can be used to produce ⟦. The translation is

<mo>&LeftDoubleBracket;</mo>

The entity used in the previous formula is defined in the file mmlalias.ent (distributed with the MathML DTD, and with Tralics). You can easily convince Tralics to use the character U+301A instead of the entity.

The translation of the formula given at the start of this section is

<mrow>
  <mrow>
    <mo>{</mo><mi>x</mi><mo>,</mo><mspace width='3.33333pt'/>
    <msup><mi>x</mi> <mn>10</mn> </msup>
    <mo>></mo><mi>α</mi><mi>β</mi><mo>}</mo>
  </mrow>
  <mo>∪</mo>
  <mfenced separators='' open='[' close=']'>
    <mi>x</mi><mi>y</mi>
  </mfenced>
</mrow>





You can notice in this example that Tralics uses a single <mn> for the whole number, and considers xy as two consecutive identifiers. Any math formula that contains at least two objects is enclosed in an <mrow> or an <mfenced> element (we make the assumption that <mrow> behaves the same as an <mfenced> without separators, with empty opening and closing delimiters). When the <mfenced> element has only one item, the separators attribute is omitted. A fence is added whenever you use \left and \right; these commands must be followed by a delimiter (the list of delimiters is currently built-in). An <mrow> has been inserted between the positions marked l and r in the symbolic representation. In Figure 9 you see a variant of the formula (where an exponent has been added to make the brackets taller), and the same formula where the <mrow>s have been removed.

Fig. 9. Effect of <mrow> on the size of stretchable operators

Some operators are of variable size; we have seen examples of sums and square roots previously. Some symbols are vertically extensible (for instance braces and brackets): their height is the height of the current sub-formula (the <mrow> or <mfenced>). In the example above, the height of the brackets is the height of the formula [xy]. If you say \big(, Tralics silently ignores the prefix before the parenthesis; in the case of \big(a^2\bigg), it is clever enough to understand that the opening parenthesis matches the closing one, and uses fences.

Consider the following input

$ \mathcal{AB1\alpha}\mathbf{AB1\alpha}\mathrm{AB1\alpha}$
\mathfontproperty\mathcal=1
\mathfontproperty\mathfrak=1
$ \mathcal{AB1\alpha}\mathfrak{AB1}\mathrm{A}$

The translation of the first math expression is

<mi>𝒜ℬ</mi><mn V='script'>1</mn><mi>α</mi>
<mi>𝐀𝐁</mi><mn V='bold'>1</mn><mi>α</mi>
<mi> AB </mi><mn>1</mn><mi>α</mi>

Here V stands for ‘mathvariant’. The following points are to be noted. First, the operators are unaffected by font changes (there is no slanted plus, no bold minus). Currently, Greek characters have a fixed translation, although Unicode provides upright or slanted, regular or bold, Greek letters. For digits, Tralics uses only the mathvariant attribute, even though variants of these characters exist in Unicode.





In many cases, the rendering is correct (see Figure 8 for an example of bold italic characters). The translation of \mathrm is tricky. It should produce an upright (non-italic) version of the object. We assume that this is the default for digits. We also assume that it is the default for an <mi> object that contains more than one letter. The translation of the second formula is

<mi V='script'>AB</mi><mn V='script'>1</mn><mi>α</mi>
<mi V='fraktur'>AB</mi><mn V='fraktur'>1</mn>
<mi V='normal'>A</mi>

You can see how \mathrm handles the case of a single letter: we use the mathvariant attribute with the ‘normal’ value. The two magic lines between the formulas tell Tralics to use the mathvariant attribute, rather than characters that are outside the first plane. This is not needed for Firefox when the STIX fonts are present.

5 Building Formulas

It is possible to obtain the following formula with Tralics

<apply>
  <power/>
  <apply><plus/><ci>a</ci><ci>b</ci></apply>
  <cn>2</cn>
</apply>

by typing

\newcommand\Apply[2]{\mathbox{apply}{\mathbox{#1}{}#2}}
$\Apply{power}{\Apply{plus}{\mathci{a}\mathci{b}} \mathcn{2}}$

Neither Amaya nor Firefox renders the “content” part of the MathML recommendation, so we consider here only “presentation” elements.

Details about the algorithms used by Tralics can be found in the report [3]. The first pass converts a list of characters into a nested list of XML elements, together with some commands. A recursive algorithm is used to combine the pieces and evaluate the commands. The formula of the previous section may be represented as “liosiC(nn)oiiro(ii)”, which contains a single command to be evaluated, the hat sign.

Each expression has a level (1 for the main expression, 2 for indices, 3 for subindices, etc.) that can be changed in TeX via commands like \scriptstyle (see Figure 6 for an example). The effective level is computed and <mstyle> elements may be inserted, while \mathchoice commands are reduced. Expressions of the form

$a\over {b \over c}$
$a^{^b}_{c_d}$

are considered. On the first line, there is a basic TeX expression, discouraged by amsmath: one should use a prefix version, namely \frac, instead. The expression on the second line is considered such good practice that there is no prefix equivalent. If there were one, it would look like:





\supsub{a} {\sup{}b} {\sub c d}

In fact, Tralics does this conversion. To each kernel, for instance a, one can associate at most one subscript and at most one superscript (note that Tralics currently has no support for tensors, i.e., objects that can have multiple scripts, on the left or right). It is legal to start an expression with an underscore or hat, but not to end it with one. The translation is

<msubsup>
  <mi>a</mi>
  <msub><mi>c</mi> <mi>d</mi> </msub>
  <msup><mrow/> <mi>b</mi> </msup>
</msubsup>

The kernel can be an operator like a sum. The placement of the index may depend on the mode (display or not); since Tralics knows the current mode, it uses <munder> or <msub> accordingly (some MathML operators have a movablelimits attribute, and in such cases the placement of the index could be wrong after all). We give here an example, where a, b and c are operators like ∑ in display style.

\def\X#1{\mathop #1\limits} $\X a ^{\X b ^2} _ {\X c _2}$

<munderover>
  <mi>a</mi>
  <munder><mi>c</mi> <mn>2</mn> </munder>
  <mover><mi>b</mi> <mn>2</mn> </mover>
</munderover>
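The two translations above follow the same pattern: a kernel with at most one subscript and one superscript is wrapped in one of six elements, chosen by which scripts are present and by whether limits are placed under and over the kernel. The following Python sketch reproduces this choice with the standard xml.etree.ElementTree module; it is an illustration of the rule, not the Tralics algorithm, and the names attach_scripts and leaf are hypothetical.

```python
import xml.etree.ElementTree as ET

def attach_scripts(kernel, sub=None, sup=None, limits=False):
    """Wrap kernel with its scripts; limits=True selects the under/over elements."""
    if sub is None and sup is None:
        return kernel
    if sub is not None and sup is not None:
        tag, children = ("munderover" if limits else "msubsup"), [kernel, sub, sup]
    elif sub is not None:
        tag, children = ("munder" if limits else "msub"), [kernel, sub]
    else:
        tag, children = ("mover" if limits else "msup"), [kernel, sup]
    node = ET.Element(tag)
    node.extend(children)
    return node

def leaf(tag, text):
    """Token element such as <mi>a</mi>."""
    node = ET.Element(tag)
    node.text = text
    return node

# $a^{^b}_{c_d}$: the inner hat has an empty kernel, rendered as an empty <mrow>.
c_d = attach_scripts(leaf("mi", "c"), sub=leaf("mi", "d"))
hat_b = attach_scripts(ET.Element("mrow"), sup=leaf("mi", "b"))
formula = attach_scripts(leaf("mi", "a"), sub=c_d, sup=hat_b)
print(ET.tostring(formula, encoding="unicode"))
```

Calling attach_scripts with limits=True on the operators of the second example gives the <munderover>, <munder> and <mover> elements instead.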

There are several ways to put two objects one above the other. The <munder> and <mover> elements have a base object and a secondary object, which is generally smaller; <munderover> has two secondary objects; these elements can be produced by attaching scripts to a kernel in some special cases. The preferred way is to use \underset or \overset. In LaTeX you can also use \xrightarrow. Some commands like \bar or \hat produce accents (in which case the element has an accent attribute). The effect of this attribute is often unclear. The recommendation says that an accent is drawn closer to the base, and has a normal size. Both objects stretch horizontally. For instance, if you put text over an arrow, the width of the arrow is at least the width of the text; if you put a bar or a brace over a formula, it covers the whole formula.

The <mfrac> element puts two objects one above the other, generally separated by a fraction rule; it can be obtained by the \frac command, or variants thereof (the two objects are generally centered and of the same size; Tralics accepts the \hfill command in math mode in order to change the alignment).

Arrays are handled by Tralics, as far as the result conforms to MathML. Horizontal alignment (rlc in the array preamble) is accepted, but column separators are ignored or rejected. This means that you cannot insert horizontal or vertical rules (there is a limited way of putting a frame around a table in MathML, but this is not yet implemented). Matrices and some math environments are converted by Tralics into tables.

It is possible to adjust horizontal or vertical spacing by adding explicit spaces (for instance \mskip9mu inserts 5pt of white space) or using phantoms.
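The figure of 5pt follows from the definition of the math unit: in TeX, 18mu make one em of the math symbol font. The Python sketch below does the conversion under the assumption of an em of 10pt (the usual value at a 10pt design size; the parameter em_pt is ours).

```python
def mu_to_pt(mu, em_pt=10.0):
    """Convert math units to points; 18mu = 1em, and em_pt is an assumed em size."""
    return mu * em_pt / 18.0

print(mu_to_pt(9))            # 5.0
print(round(mu_to_pt(6), 5))  # 3.33333
```

Under the same assumption, 6mu gives 3.33333pt, which coincides numerically with the width of the <mspace> element seen in the translation of Section 4.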

6 Mixing Math and Text

In TeX, you can insert any box into a math formula (for instance an image, a reference to the bibliography, an external link, another math formula); this is not possible in MathML. An example of code refused by Tralics is

\xymatrix{A\ar[rd]{^f}&B\\C&D}

Our hope is that, in the future, there will be more interaction between XML standards, and that a combination of SVG and MathML will render both the mathematical expressions and the arrows between them. A math expression like {x, such that x > 0} can be entered in TeX as

$\{x, \hbox { such that $x>0$} \} $

When Tralics sees non-math material inside a math formula, it signals an error, unless the expression can be interpreted as a combination of text and math. The algorithm is a bit tricky, and error messages may be confusing. We allow the commands \hbox, \text, \mbox, and font changes inside them (they will be partially honored); we also allow math formulas inside text, and they will be rendered as math outside the text. The translation of the preceding example is

<mo>{</mo><mi>x</mi><mo>,</mo>
<mspace width='4.pt'/><mtext>such</mtext>
<mspace width='4.pt'/><mtext>that</mtext><mspace width='4.pt'/>
<mrow><mi>x</mi><mo>></mo><mn>0</mn></mrow>
<mo>}</mo>

Consider now the following formula: →↗ + ↖← + ↙↘ = ╱╲ (the last two characters closed by a horizontal rule to form a triangle), entered as

$\lower .97ex \hbox{$\rightarrow$} \mskip-24mu \nearrow + \nwarrow
\mskip-24mu \lower .97ex \hbox{$\leftarrow$} + \swarrow \mskip-2mu
\searrow = \hbox{$\diagup\mskip-1mu\diagdown$}
\lower.48ex\hbox{$\mskip-31mu\hbox to 5.85mm{\strut\hrulefill\strut}$}$

The horizontal arrows are vertically shifted via the \lower command. This is currently impossible in MathML, as there is no equivalent of vertical row, vertical space, or vertical adjustment. It happens that correct placement can be achieved in Firefox by using a subscript. In Figure 10, you can see that the same formula renders awfully in Amaya.

The right-hand side of the equality contains a triangle, obtained by gluing two characters and using \hrulefill for the horizontal line. We have shown in the figure the rendering of the two characters by Firefox. As you can see, it is impossible to complete this into a triangle via a horizontal rule.





Fig. 10. A funny math formula, Firefox and Amaya

7 Conclusion

We have shown in this paper that Tralics is able to translate a great number of mathematical expressions into MathML, and that the result is often correctly rendered by browsers. There may be other uses of the translation (as input to computer algebra systems, or for indexing purposes, data mining, etc.). We do not know how well the Tralics output behaves in these domains. In some cases, Tralics fails to correctly translate a document; this may be due to the use of big packages like xypic, or to the use of non-math constructs in math formulas. Some useful packages should soon be adapted to Tralics, including packages that provide access to fonts, whether or not the characters are in the Unicode standard. We also plan to increase flexibility in different areas of the software.

References

1. Thierry Bouche. When CEDRAM meets Tralics. In: Towards Digital Mathematics Library, pages 153–165, Masaryk University, Brno, 2008. http://dml.cz/dmlcz/702544.

2. David Carlisle, Michel Goossens, and Sebastian Rahtz. De XML à PDF avec xmltex et PassiveTeX. In: Cahiers Gutenberg, number 35–36, pages 79–114, 2000.

3. José Grimm. Converting LaTeX to MathML: the Tralics algorithms. Research Report 6373, INRIA, 2007.

4. José Grimm. Producing MathML with Tralics. Rapport de Recherche 6181, Inria, 2007.

5. José Grimm. Convertir du LaTeX en HTML passant par XML: Deux exemples d’utilisation de Tralics. Cahiers Gutenberg, (51):25–55, 2009. October 2008, to appear.

6. Heinrich Stamerjohanns, Deyan Ginev, Catalin David, Dimitar Misev, Vladimir Zamdzhiev, and Michael Kohlhase. MathML-aware article conversion from LaTeX. In: Towards a Digital Mathematics Library, pages 109–120, Masaryk University, Brno, 2009. http://dml.cz/dmlcz/702561.

7. The Unicode Consortium. The Unicode Standard, version 4.0. Addison Wesley, 2003.







Symbol Declarations in Mathematical Writing: A Corpus Study

Magdalena Wolska¹ and Mihai Grigore²

¹ Fachrichtung 4.7 Allgemeine Linguistik, Universität des Saarlandes
D-66041 Saarbrücken, Germany
[email protected]
² Computer Science, Jacobs University Bremen, D-28759 Bremen, Germany
[email protected]

Abstract. We present three corpus-based studies on symbol declaration in mathematical writing. We focus on simple object denoting symbols which may be part of larger expressions. We look into whether the symbols are explicitly introduced into the discourse and whether the information on once interpreted symbols can be used to interpret structurally related symbols. Our goal is to support fine-grained semantic interpretation of simple and complex mathematical expressions. The results of our analysis empirically show the potential benefit of using larger discourse context in automated disambiguation of mathematical expressions.

Key words: mathematical discourse, disambiguation of mathematical expressions, corpus-based analysis

1 Motivation

Semantic search in mathematical documents, in order to account for their full mathematical content, must necessarily provide ways of searching through the symbolic expressions which are part of mathematical discourse. While dedicated approaches to formula search do exist (see, for instance, [11,10] and references therein), they typically depend on semantically-oriented mark-up in their internal representation of mathematical expressions, be it OpenMath [5] or Content MathML [4]. Recent years have therefore seen increasing efforts towards improving automatic creation of machine-readable semantics-enriched mathematical documents [15].

Automatically inferring the semantics of a mathematical expression, both as a whole and of its constituent parts, is, however, a non-trivial task because of the infinite nature of the mathematical alphabet: new symbols may be invented or constructed from existing symbols, existing symbols may be typographically enriched to form new symbols, etc. All this possibly in a single document. There are of course certain conventions as to the usage of mathematical notation, general and specific to mathematical sub-areas, as well as prescriptive rules on how to write mathematics (see, for instance, [9,12]), which authors of mathematics tend to follow; however, automated interpretation of arbitrary mathematical expressions remains a challenging task.

1 Correspondence to: Magdalena Wolska, Universität des Saarlandes, Fachrichtung 4.7 Allgemeine Linguistik, Building C 7.2, Postfach 15 11 50, 66041 Saarbrücken, Germany; tel.: +49 681 302 4344

We performed a quantitative analysis of a subset of the arXMLiv collection [3] processed using LaTeXML [15], the state-of-the-art mathematical document processing architecture, and found that approximately 41% of all the parsed mathematical symbols had not been interpreted by the LaTeXML grammar (2,842,813 out of a total of 6,872,419 mathematical symbols); by “not interpreted” we mean that the grammatical role attribute in the internal LaTeXML representation, the XMath role, has been set to unknown.
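A count of this kind can be sketched in a few lines of Python. The fragment below is hypothetical and simplified: real LaTeXML output is namespaced and richer, and the element name XMTok and the role values are used here as assumptions for illustration only. The last line merely recomputes the percentage from the two totals quoted above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified XMath-like fragment; not actual corpus data.
SAMPLE = """
<XMath>
  <XMTok role="UNKNOWN">x</XMTok>
  <XMTok role="RELOP">=</XMTok>
  <XMTok role="NUMBER">0</XMTok>
</XMath>
""".strip()

def unknown_share(xml_text):
    """Return (number of tokens whose role is unknown, total number of tokens)."""
    tokens = list(ET.fromstring(xml_text).iter("XMTok"))
    unknown = [t for t in tokens if (t.get("role") or "").lower() == "unknown"]
    return len(unknown), len(tokens)

print(unknown_share(SAMPLE))
print(round(100 * 2_842_813 / 6_872_419, 1))   # 41.4
```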

In our previous work [8], we showed that the local linguistic context, withinwhich mathematical expressions are embedded, provides a good source ofinformation for recognizing the denotation of mathematical expressions. Ourapproach, however, treats a mathematical term as a whole and attempts toidentify an object type to which the entire term refers.

In this paper, we present three corpus-based studies which are meant tocomplement our previous work and constitute a step towards compositionalsemantic analysis of symbolic expressions. We now focus on simple object de-noting symbols which may be part of larger expressions and ask, paraphrasingKnuth and colleagues, whether in actual mathematical papers “[a]ll variables[are] defined, at least informally, when they are first introduced” [9]. Certainlynot all of them are: certain notational conventions are taken for granted, espe-cially in academic scientific papers. They constitute part of what Clark callscommunal (in this case, professional) common ground [6]. Our question of inter-est is rather “how much” of the notation is left implicit. More specifically, inthe three studies described in this paper we were interested in the followingquestions:

1. To what extent are mathematical symbols systematically explicitly introduced into the discourse in mathematical scientific publications?

2. To what extent can symbol interpretation rely on larger local discourse context?

3. Can symbol interpretation be supported by an analysis of locally co-occurring symbolic expressions of similar structure?

Outline: The paper is organised as follows. In Section 2 we describe our corpus-based methodology: first we briefly describe the data set we use, followed by the descriptions of our three study setups. In Section 3 we present quantitative results of our studies. We conclude with a discussion of the results in Section 4 and discuss further work in Section 5.

2 Method

We performed three corpus studies in order to investigate symbol declaration practices in mathematical scientific papers. In all the experiments we used actual mathematical papers as they were originally published. The setup of our studies is outlined below.

2.1 Data and Preprocessing

The subsets of documents we used in the studies were randomly selected from a corpus of 1,000 mathematical publications from the arXMLiv collection, processed by the LaTeXML architecture [14,15]. arXMLiv is a subset of the arXiv, an archive of electronic preprints of scientific papers in the fields of, among others, mathematics, statistics, physics, and quantitative biology [2]. That is, the documents we analyzed were advanced scientific contributions written by professional mathematicians.

The documents have been word- and sentence-tokenized. For the analysis of symbolic expressions, we used two mathematical expression markup formats: the XMath format, a LaTeXML internal representation, and the Presentation MathML format, a widely used W3C standard for rendering mathematical content on the Web [4,13].

2.2 Experiments

In the experiments presented here, we were interested in object-denoting terms of “simple” high-level structure. More specifically, as “simple” symbols we consider atomic identifiers and super- or sub-scripted atomic identifiers; we do not, however, analyse the expression(s) in the super-/sub-scripts. In the following sections, we will use the term simple mathematical expression to refer to this class of symbols. We extracted the expressions of interest by parsing the XMath and MathML representations.
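The notion of a simple expression can be illustrated with a minimal sketch over namespace-free XMath fragments of the kind shown in Figure 1. This is not the extraction code used in the studies: the function names are hypothetical, the role name SUPERSCRIPTOP and the choice of UNKNOWN and ATOM as the roles of atomic identifiers are assumptions, and real LaTeXML output may carry an XML namespace that this sketch ignores.

```python
import xml.etree.ElementTree as ET

# Assumed role names; only SUBSCRIPTOP appears in Figure 1.
SCRIPT_ROLES = {"SUBSCRIPTOP", "SUPERSCRIPTOP"}
# Assumption: tokens with these roles are treated as atomic identifiers.
IDENTIFIER_ROLES = {"UNKNOWN", "ATOM"}


def is_atomic(node):
    """An XMTok that we treat as an atomic identifier."""
    return node.tag == "XMTok" and node.get("role") in IDENTIFIER_ROLES


def is_simple(node):
    """Atomic identifier, or a sub-/superscripted atomic identifier.

    The content of the script itself is deliberately not inspected.
    """
    if is_atomic(node):
        return True
    if node.tag == "XMApp":
        children = list(node)
        return (
            len(children) == 3
            and children[0].tag == "XMTok"
            and children[0].get("role") in SCRIPT_ROLES
            and is_atomic(children[1])
        )
    return False


def simple_expressions(node):
    """Collect the top-most simple expressions below (or at) node."""
    if is_simple(node):
        return [node]
    found = []
    for child in node:
        found.extend(simple_expressions(child))
    return found


# The first formula of Figure 1: C with subscript i.
c_i = ET.fromstring(
    '<XMath><XMApp>'
    '<XMTok role="SUBSCRIPTOP" scriptpos="post2"/>'
    '<XMTok role="UNKNOWN" font="italic">C</XMTok>'
    '<XMTok role="UNKNOWN" font="italic">i</XMTok>'
    '</XMApp></XMath>'
)
print([n.tag for n in simple_expressions(c_i)])
```

The subscripted identifier is returned as one unit (the enclosing XMApp), so the subscript i is not reported as a separate simple expression.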

The first study. The purpose of the first experiment was to investigate mathematicians’ practices as to explicitly declaring symbols in their scientific writing. We randomly selected 50 documents from the preprocessed collection and from each document we randomly extracted 10 simple mathematical expressions. Next, we manually checked whether the symbol is explicitly declared among the first 5 occurrences of these expressions in the paper; i.e. we inspected 500 simple expressions (2,500 occurrences).
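The sampling step can be sketched as follows. This is an illustration only: the corpus is assumed to be a mapping from a document identifier to the list of simple expressions in reading order, and the function name, parameters and seed are hypothetical. The manual check for declarations is not modelled.

```python
import random


def sample_occurrences(corpus, n_docs=50, n_exprs=10, n_occ=5, seed=0):
    """Pick documents, then expressions, then their first occurrences.

    corpus: dict mapping doc_id -> list of expression strings in reading order.
    Returns dict doc_id -> {expression: [positions of its first n_occ uses]}.
    """
    rng = random.Random(seed)
    doc_ids = rng.sample(sorted(corpus), min(n_docs, len(corpus)))
    sample = {}
    for doc_id in doc_ids:
        occurrences = corpus[doc_id]
        distinct = sorted(set(occurrences))
        chosen = rng.sample(distinct, min(n_exprs, len(distinct)))
        sample[doc_id] = {
            expr: [i for i, e in enumerate(occurrences) if e == expr][:n_occ]
            for expr in chosen
        }
    return sample


toy_corpus = {
    "doc1": ["x", "y", "x", "x", "x", "x", "x"],
    "doc2": ["a"],
}
print(sample_occurrences(toy_corpus, n_docs=2))
```

With 50 documents, 10 expressions per document and 5 occurrences per expression, the sample is bounded by the 2,500 occurrences mentioned above.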

In this and the following study, we considered two types of declarations: a symbol may be introduced in isolation, as in the fragment “Let F be a Hermitian vector bundle over W ...”, or embedded in a larger symbolic expression which additionally elaborates the properties of the object denoted by the symbol, as in “Consider the cylinder U = M × [−ε, 0) ...”, where U is further qualified to have a certain property. We will refer to the former type of declaration as unqualified and to the latter as elaborated (the declared symbol is further qualified by the sub-expression within which it appears). The point of this distinction is that the declarations of elaborated expressions require more sophistication in the process of their automated identification.³

Let
<XMath>
  <XMApp>
    <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
    <XMTok role="UNKNOWN" font="italic">C</XMTok>
    <XMTok role="UNKNOWN" font="italic">i</XMTok>
  </XMApp>
</XMath>
be the closed convex hull in
<XMath>
  <XMTok role="UNKNOWN" font="italic">Y</XMTok>
</XMath>
of the tail end of the sequence.

Fig. 1. A fragment of LaTeXML XMath markup with elements of unknown roles

The second study. The second study was a more focused variant of the first study. This time we were interested in simple expressions which are embedded in complex expressions, in particular those simple expressions whose grammatical role has not been recognized by the LaTeXML process.

The grammatical role, specified in the role attribute of the XMath markup, captures the syntactic nature of a symbol, the “grammatical role” that the object plays in surrounding expressions. The role attribute is used in generating the presentation markup and it can also help drive the derivation of the semantics of an expression.

Examples of role values which the LaTeXML parser does recognize include: ATOM (a general atomic subexpression), APPLYOP (an explicit infix application operator), RELOP (a relational operator), ADDOP (an addition operator), INTOP (an integral operator) [1]. Unrecognized symbols are assigned the role UNKNOWN by default, as illustrated in Figure 1.
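To make the role attribute concrete, the following minimal sketch counts the tokens of the Figure 1 fragment by role. It is not the script behind the corpus statistics: the wrapping `<p>` element and the function name are hypothetical, and the fragment is assumed to be namespace-free as printed in the figure.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# The markup of Figure 1, wrapped in a hypothetical <p> element so that
# the fragment has a single root and can be parsed on its own.
FRAGMENT = """
<p>Let <XMath>
  <XMApp>
    <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
    <XMTok role="UNKNOWN" font="italic">C</XMTok>
    <XMTok role="UNKNOWN" font="italic">i</XMTok>
  </XMApp>
</XMath> be the closed convex hull in <XMath>
  <XMTok role="UNKNOWN" font="italic">Y</XMTok>
</XMath> of the tail end of the sequence.</p>
""".strip()


def role_counts(xml_text):
    """Count XMTok elements by the value of their role attribute."""
    root = ET.fromstring(xml_text)
    return Counter(tok.get("role") for tok in root.iter("XMTok"))


counts = role_counts(FRAGMENT)
unknown_share = counts["UNKNOWN"] / sum(counts.values())
print(counts, unknown_share)
```

Whether script operators such as SUBSCRIPTOP should count towards the total is a design choice; the sketch counts every XMTok, which may differ from how the corpus figures were obtained.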

In this study, we randomly extracted a subset of 100 mathematical documents from the collection. From each document we randomly selected 3 mathematical expressions whose XMath representations contained at least 3 simple expressions as defined above. For each of the resulting 300 expressions, we extracted all the simple sub-expressions which were tagged as UNKNOWN in the XMath representation and which also occurred independently in the discourse. For instance, for the expression $\rho = \langle \omega_i, \lambda \rangle$, we would have extracted the following simple expressions, should they be tagged as UNKNOWN: $\rho$, $\omega_i$, and $\lambda$. We would then check which of the resulting simple expressions occurred in isolation in the document and select those for the analysis. We performed the same kind of manual analysis of the extracted instances as in the first study.

³ One could, for instance, assume that in the process of identifying declaration statements, it would be sufficient to consider as candidates only those sentences which contain unqualified simple object-denoting expressions. As we will show in the next section, this approach would miss a small percentage of instances.

Table 1. Results of the first study

Category                  Subcategory   Occurrence   n (%)         N (%)
Explicitly declared       unqualified   1st          290 (58%)     337 (67.4%)
                                        2nd          15 (3.0%)
                                        3rd          11 (2.2%)
                                        4th          6 (1.2%)
                                        5th          2 (0.4%)
                          elaborated    −            13 (2.6%)
Not explicitly declared                                            134 (26.7%)
Other                                                              30 (5.9%)

The third study. Now, assuming that new symbols are systematically properly declared, in the third study we were interested in finding out whether structurally related symbols, which might not be individually introduced, indeed tend to be semantically related (that is, denote the same concept or different instances of the same concept). More specifically, we looked at simple terms based on the same main identifiers, i.e. sharing the same root/top-level node in the expression tree, and which are structurally similar modulo the structure of the subscript and superscript terms. For instance, the following two expressions are structurally similar according to our criteria: $\omega_i$ and $\omega_{n-1}$. By contrast, $P_c^2$ and $A_n^k$ are not similar because they differ in the top-node operator.

For each pair of such expressions we verified whether the objects they denote are also semantically related if they occur in the same local discourse context. As discourse context we considered a section of a document; in the current study we ignored sub-section scopes. We randomly selected 25 mathematical documents and from each section of these documents we extracted all the pair-wise combinations of simple mathematical expressions which shared the same root symbol (same identifier) and either have the same surface structure or one expression is embedded in the other; 496 such pairs were extracted. Again, we analysed the extracted pairs manually as to whether each pair of expressions denotes the same concept in the context of the section scope.
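The pair extraction for one section can be sketched as below. The representation and the reading of the two criteria are assumptions made for illustration, not the procedure used in the study: a simple expression is modelled as a root identifier with optional subscript and superscript, “same surface structure” is read as “the same script slots are filled”, and “embedded” is read as “every script slot filled in one expression is also filled in the other”. All names are hypothetical.

```python
from itertools import combinations
from typing import NamedTuple, Optional


class Simple(NamedTuple):
    """A simple expression: root identifier plus optional scripts."""
    root: str
    sub: Optional[str] = None
    sup: Optional[str] = None


def slots(e):
    """Which script positions are filled: (has subscript, has superscript)."""
    return (e.sub is not None, e.sup is not None)


def similar(a, b):
    """Same root, and same script structure or one embedded in the other."""
    if a.root != b.root:
        return False
    sa, sb = slots(a), slots(b)
    same_structure = sa == sb
    a_in_b = all((not x) or y for x, y in zip(sa, sb))
    b_in_a = all((not y) or x for x, y in zip(sa, sb))
    return same_structure or a_in_b or b_in_a


def candidate_pairs(section_exprs):
    """All unordered pairs of distinct expressions in a section that qualify."""
    distinct = list(dict.fromkeys(section_exprs))  # dedupe, keep reading order
    return [(a, b) for a, b in combinations(distinct, 2) if similar(a, b)]


section = [
    Simple("ω", sub="i"),
    Simple("ω", sub="n-1"),
    Simple("P", sub="c", sup="2"),
    Simple("A", sub="n", sup="k"),
    Simple("c"),
    Simple("c", sub="1"),
]
for pair in candidate_pairs(section):
    print(pair)
```

In this toy section only the two ω expressions and the pair c, c with subscript 1 qualify; P and A are never paired because their roots differ.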

The point of this study was to empirically verify whether the local discourse scope is a good indicator of semantic relatedness of structurally similar terms. Identification of structurally similar pairs could be used in document processing to construct sets of mathematical expressions which denote the same mathematical concept: n symbolic expressions would form a set if each of the possible $C_n^2$ pair-wise combinations fulfilled the above-mentioned conditions. Consider, for instance, the (unordered) pairs of simple mathematical expressions $(c, c_1)$, $(c_2, c_1)$, and $(c_2, c)$, which fulfill the criteria. They form the set $\{c, c_1, c_2\}$. Assuming that the expression $c$ has been previously interpreted, for instance, as a constant, the expressions $c_1$ and $c_2$ are likely to have the same mathematical interpretation.
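The set construction can be sketched with a simple greedy grouping: an expression joins a group only if it forms a qualifying pair with every current member, so each resulting group of n expressions is backed by all n(n−1)/2 of its pairs. This is one possible realisation under our own assumptions (expressions given as plain strings, hypothetical function name); a greedy pass does not guarantee maximal groups in general.

```python
def group_expressions(pairs):
    """Group expressions so that every two members of a group were paired."""
    related = {frozenset(p) for p in pairs}
    groups = []
    # Visit expressions in order of first appearance in the pair list.
    for expr in dict.fromkeys(e for p in pairs for e in p):
        for group in groups:
            if all(frozenset((expr, member)) in related for member in group):
                group.append(expr)
                break
        else:
            groups.append([expr])
    # Singletons carry no relatedness information and are dropped.
    return [set(g) for g in groups if len(g) > 1]


pairs = [("c", "c1"), ("c2", "c1"), ("c2", "c")]
print(group_expressions(pairs))
```

For the three pairs from the example in the text, the three expressions end up in a single group.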

3 Results

The results of the analyses are presented in Tables 1 through 3.

Table 1 shows the results of the symbol declaration study. The first column contains the categories. We present the absolute and percentage counts for the two subcategories of explicit declarations and for the location of the declaration (the number of the occurrence, counted from the beginning of the document, at which the symbol is declared).

About 67% of simple mathematical expressions were explicitly introduced in the discourse. In most cases the first occurrence of a symbol is within a declaration; however, as can be seen from the fourth column, ‘n (%)’, in some rare cases the declaration does not come until the fourth or fifth occurrence. It appears that for this study, extracting the first five occurrences was a good choice, with only two out of 336 instances being declared as far as the fifth occurrence from the first mention. Moreover, in most cases symbols in declarations do not appear as part of a larger expression (only about 3% of occurrences were elaborated by means of a symbolic expression). In 6% of the cases we encountered processing errors or were not able to distinguish how an object was declared.

The results of the second study, shown in Table 2, indicate that about 72% of the simple sub-terms of complex expressions which were not recognized by LaTeXML have been explicitly introduced in the discourse. The declaration of most of these, again, appears together with the first occurrence of the expression, and, again, unqualified declarations were more frequent. The remaining 27% of unknown symbols were not declared in the documents, so assigning them a role automatically based on the discourse context would perhaps require sophisticated inferences based on the context of the other occurrences.

Finally, Table 3 shows the results of semantic relatedness of locally occurring structurally similar expressions. Indeed, in most cases, 89%, structurally similar expressions which share the root identifier are also semantically related. We were unable to relate the expressions in 5% of the cases.

4 Discussion

The results of the study show that mathematicians do indeed tend to explicitly introduce object-denoting symbols which they use in their writings. While it is somewhat surprising that symbol declarations occur past the first mention of a symbol (that is, symbols are used before they have been introduced), overall, the context of the first mention accounts for the majority of symbol declarations.

The findings of the first and the second study also indicate that the global discourse context is a good starting point in an automated interpretation (and disambiguation) of symbolic expressions in mathematical scientific documents. From the point of view of computational processing of mathematical discourse, this means that if the linguistic context in which a symbol appears can be parsed and interpreted (in particular, the first-mention context), then the intended usage of the symbol at hand, i.e. the symbol’s meaning, can be recognized. An interpretation recovered in this way would, in turn, help complete the information in the (semantic) mark-up of mathematical expressions.

Table 2. Results of the second study

Category                  Subcategory   Occurrence   n (%)         N (%)
Explicitly declared       unqualified   1st          331 (53.5%)   449 (72.5%)
                                        2nd          22 (3.5%)
                                        3rd          23 (3.7%)
                                        4th          7 (1.1%)
                                        5th          20 (3.2%)
                          elaborated    −            46 (7.4%)
Not explicitly declared                                            170 (27.5%)

Table 3. Results of the third study

Category            N (%)
Same concept        441 (88.9%)
Different concept   28 (5.6%)
Not classified      27 (5.4%)

Now, the last study shows that the structural similarity of mathematical expressions and their discourse proximity can be exploited in propagating the interpretation of mathematical symbols. That is, assuming that a set of structurally similar expressions can be identified in a local discourse context and we can find the interpretation of one of them (for instance, using methods such as those proposed in [8]), then the interpretation of the related symbols can, with a large likelihood, be assumed to be the same. This can be seen as analogous to the “one sense per discourse” tendency in well-written prose (see [7]).
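The propagation idea can be sketched in a few lines, under our own assumptions: groups of structurally similar expressions from one section are given as sets of strings, known interpretations as a dictionary, and a label is propagated only when the known members of a group agree. The function name and the labels are hypothetical.

```python
def propagate(groups, known):
    """Copy a known interpretation to the other members of its group.

    groups: iterable of sets of expressions from one local discourse context.
    known:  dict mapping some expressions to an interpretation label.
    Groups whose known members disagree are left untouched.
    """
    inferred = dict(known)
    for group in groups:
        labels = {known[e] for e in group if e in known}
        if len(labels) == 1:
            label = next(iter(labels))
            for e in group:
                inferred.setdefault(e, label)
    return inferred


print(propagate([{"c", "c1", "c2"}], {"c": "constant"}))
```

Given the roughly 89% agreement reported in Table 3, such propagated labels would be best treated as defaults that later evidence can override rather than as certain interpretations.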

5 Conclusion and Further Work

In this paper, we presented the design and the results of three corpus-based studies on mathematical symbols in scientific papers, which were concerned with explicit declarations of symbols’ denotations. The results of the studies empirically motivate methods of automated disambiguation of mathematical expressions based on the discourse context in which the symbols appear. While the data set we used was not large, the preliminary results we obtained are encouraging and suggest the need for comprehensive incremental interpretation as the methodology for semantic processing of mathematical documents. We are planning to implement the results of the studies as part of a larger architecture for mathematical expression disambiguation.

We are planning a number of follow-up studies. A natural continuation of the presented experiments would be to investigate the way symbolic mathematical expressions are declared from the linguistic point of view, that is, to study the language of symbol declarations in mathematical discourse. While a number of lexico-syntactic patterns for symbol declarations can be anticipated based on general familiarity with mathematical writing (the obvious one being “Let SYMBOL be a mathematical concept-denoting term”), given the size of the arXMLiv corpus we should be able to discover a variety of verbalizations.
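The obvious pattern mentioned above can be written down as a regular expression. This is a deliberately naive sketch, assuming that symbols appear as plain whitespace-delimited tokens in the sentence (in the corpus they are XMath elements) and that the concept term runs up to the next punctuation mark; the pattern and function names are hypothetical.

```python
import re

# "Let SYMBOL be a/an/the CONCEPT", with CONCEPT ending at . , or ;
LET_PATTERN = re.compile(
    r"\bLet\s+(?P<symbol>\S+)\s+be\s+(?:an?|the)\s+(?P<concept>[^.,;]+)"
)


def find_declarations(sentence):
    """Return (symbol, concept phrase) pairs matched by the Let-pattern."""
    return [
        (m.group("symbol"), m.group("concept").strip())
        for m in LET_PATTERN.finditer(sentence)
    ]


print(find_declarations("Let F be a Hermitian vector bundle over W."))
print(find_declarations("Consider the cylinder U = M × [−ε, 0)."))
```

The second sentence, an elaborated declaration, is not matched, which illustrates why a single hand-written pattern is unlikely to be sufficient and why discovering further verbalizations from the corpus is of interest.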

Another natural follow-up direction which we are planning to pursue is to look in more detail into the set of symbols for which we have not found explicit declarations in the documents. Is there a system to which symbols tend to be left unexplained because their interpretation can be assumed to be obvious? It is common knowledge that there are certain notational conventions in the usage of symbols, in mathematics in general and within sub-areas of mathematics (e.g. the use of mnemonics). Can we automatically recognize these conventions based on corpus analysis focused on symbol declarations? Finally, aside from the knowledge of notational conventions, what other kinds of knowledge would be required to automatically find the interpretations of the remaining undeclared symbolic expressions in mathematical scientific documents?

Acknowledgments. We would like to thank Deyan Ginev of Jacobs University Bremen, without whose many preprocessing scripts it would not have been possible to conduct this study with such ease. We would also like to thank the four anonymous reviewers for their helpful comments.

References

1. LaTeXML Manual. http://dlmf.nist.gov/LaTeXML/manual/, Retrieved June 2010.

2. arXiv.org e-Print archive. http://www.arxiv.org, Retrieved June 2010.

3. arXMLiv Project. http://arxmliv.kwarc.info/, Retrieved April 2010.

4. Ron Ausbrooks, Stephen Buswell, David Carlisle, Giorgi Chavchanidze, Stéphane Dalmas, Stan Devitt, Angel Diaz, Sam Dooley, Roger Hunter, Patrick Ion, Michael Kohlhase, Azzeddine Lazrek, Paul Libbrecht, Bruce Miller, Robert Miner, Murray Sargent, Bruce Smith, Neil Soiffer, Robert Sutor, and Stephen Watt. Mathematical Markup Language (MathML) version 3.0. W3C Working Draft of 24 September 2009, World Wide Web Consortium, 2009.

5. Stephen Buswell, Olga Caprotti, David P. Carlisle, Michael C. Dewar, Marc Gaetano, and Michael Kohlhase. The OpenMath standard, version 2.0. Technical report, The OpenMath Society, 2004.

6. Herbert H. Clark. Arenas of Language Use. University of Chicago Press, 1993.

7. William A. Gale, Kenneth W. Church, and David Yarowsky. One sense per discourse. In Proceedings of the HLT-91 Workshop on Speech and Natural Language, pages 233–237, 1992.

8. Mihai Grigore, Magdalena Wolska, and Michael Kohlhase. Towards context-based disambiguation of mathematical expressions. In The Joint Conference of ASCM 2009 and MACIS 2009, volume 22 of Math-for-Industry, COE Lecture Note, pages 262–271, 2009.

9. Donald Ervin Knuth, Tracy Larrabee, and Paul M. Roberts. Mathematical writing. The Mathematical Association of America, 1989.

10. Michael Kohlhase, Stefan Anca, Constantin Jucovschi, Alberto González Palomo, and Ioan A. Sucan. MathWebSearch 0.4, A Semantic Search Engine for Mathematics. http://search.mathweb.org/index.xhtml (Retrieved April 2010), 2008.

11. Michael Kohlhase and Ioan Sucan. A search engine for mathematical formulae. In: Tetsuo Ida, Jacques Calmet, and Dongming Wang, editors, Proceedings of Artificial Intelligence and Symbolic Computation, AISC 2006, number 4120 in LNAI, pages 241–253. Springer Verlag, 2006.

12. Steven George Krantz. A primer of mathematical writing. The American Mathematical Society, 1997.

13. W3C Math Home. http://www.w3.org/Math/, Retrieved April 2010.

14. Bruce Miller. LaTeXML: A LaTeX to XML converter. Web Manual at http://dlmf.nist.gov/LaTeXML/, seen April 2010.

15. Heinrich Stamerjohanns, Michael Kohlhase, Deyan Ginev, Catalin David, and Bruce Miller. Transforming large collections of scientific publications to XML. Mathematics in Computer Science, 3(3):299–307, 2010.



Author Index

Borbinha, José 11
Bouche, Thierry 11
Córcoles, César 95
Filej, Miha 57
Grigore, Mihai 119
Grimm, José 105
Hatlapatka, Radim 45
Holtkamp, Annette 83
Huertas, Antonia 95
Mele, Salvatore 83
Nevěřilová, Zuzana 63
Nowinski, Aleksander 11
Ruddy, David 27
Růžička, Michal 57, 69
Šárfy, Martin 57
Sexton, Alan 37
Šimko, Tibor 83
Smith, Tim 83
Sojka, Petr 3, 11, 45, 57, 69
Sorge, Volker 37
Suzuki, Masakazu 7, 37
Sylwestrzak, Wojtek 11
Wolska, Magdalena 119
Zelati, Vittorio Coti 79

Colophon

The DML 2010 proceedings were produced from the authors’ electronic manuscripts. Following the guidelines, the authors prepared their papers using LaTeX markup, with one exception.

Contributions were edited into the uniform markup of the Springer llncs style and custom-written TeX macros, and were processed by the proceedings editor in Brno. One paper was converted into LaTeX from Microsoft Word.

Michal Růžička helped with entering hundreds of spelling and typographical corrections into the text corpora of the LaTeX files.

The proceedings were typeset in Palatino by Hermann Zapf and in AMS Euler fonts, named after the pioneering mathematician Leonhard Euler. The book was typeset using the pdfTeX typesetting system, primarily developed by Hàn Thế Thành during his studies in Brno (1990–2001). The microtypographical extensions that pdfTeX implements were used, and the book was composed with the LaTeX macro package in a single TeX run. The hypertext version of the proceedings in PDF was generated from the same source files.

The main editing, typesetting and proofreading steps were undertaken at the Natural Language Processing Laboratory of the Faculty of Informatics, Masaryk University in Brno.

The proceedings editor sincerely thanks all the authors for their contributions and everybody who was involved in the book production. Without their hard and diligent work the proceedings would not have been in such good shape and ready on time for the DML 2010 workshop.

Brno, July 2010
Petr Sojka



DML 2010
Towards a Digital Mathematics Library

Paris, France
July 7–8th, 2010

Proceedings

Petr Sojka (editor)

Published by Masaryk University, Brno in 2010

Typesetting, cover design: Petr Sojka

Illustrations: Jiří Franek

Data editing: Michal Růžička, Petr Sojka

First edition, 2010

ISBN 978-80-210-5242-0


