
Proceedings of Recent Advances in Natural Language Processing, pages 473–482, Varna, Bulgaria, Sep 4–6 2017.

https://doi.org/10.26615/978-954-452-049-6_063

Inforex – a Collaborative System for Text Corpora Annotation and Analysis

Michał Marcinczuk, Marcin Oleksy, Jan Kocon

G4.19 Research Group
Department of Computational Intelligence
Faculty of Computer Science and Management
Wrocław University of Technology, Wrocław, Poland
{michal.marcinczuk,marcin.oleksy,jan.kocon}@pwr.edu.pl

Abstract

We report the first major upgrade of Inforex, a web-based system for qualitative and collaborative text corpora annotation and analysis. Inforex is part of the Polish CLARIN infrastructure¹. It is integrated with a digital repository for storing and publishing language resources² and allows users to visualize, browse and annotate text corpora stored in the repository. As a result of a series of workshops for researchers in the humanities and social sciences, we improved the graphical interface to make the system friendlier and more readable for inexperienced users. We also implemented new functionality for gold standard annotation, which includes private annotations and annotation agreement by a super-annotator.

1 Introduction

Digital humanities (DH) create new demands and challenges for the development of new and existing tools and systems for the manipulation, processing, analysis and visualization of text documents. CLARIN-PL, the Polish part of the CLARIN infrastructure, tries to meet the challenges that DH poses for the Polish language. Among many other issues, there is a need for an intuitive and easy-to-use system for qualitative text corpora management, annotation, analysis and visualization. To fulfill these needs we are developing such a system, called Inforex. In this article we present the current state of its development.

The decision to create a system for text corpora annotation was taken in 2009, when no existing system supported collaborative work. At that time the only available tools were desktop applications for individual work, such as GATE (Cunningham et al., 2011) or Manufakturzysta Luna (Marciniak et al., 2010). Since 2010 several systems have emerged, such as WebAnno 3 (Eckart de Castilho et al., 2016) and GATE Teamware (Bontcheva et al., 2013).

¹ http://clarin-pl.eu
² http://clarin-pl.eu/dspace

The first version of Inforex was released in 2010. Its initial role was to construct corpus-based linguistic resources for various natural language processing tasks, including named entity recognition (Marcinczuk et al., 2011), shallow parsing (Radziszewski and Piasecki, 2010), word sense disambiguation (Bas et al., 2008) and recognition of semantic relations between named entities (Marcinczuk and Ptak, 2012). It was used to develop two resources for Polish that were major at the time: the Corpus of Wrocław University of Technology, called KPWr (Broda et al., 2012) (within the NEKST³ project), and the Corpus of Economic News (CEN) (Marcinczuk et al., 2013) (within the SyNaT project⁴). Later, in 2013, Inforex was used to construct another major resource, the Polish Corpus of Suicide Notes (PCSN)⁵ (Marcinczuk et al., 2011), under the guidance of Monika Zasko-Zielinska (2013). The system is still used to access this corpus; access is granted on request after obtaining permission from Wrocław University.

In 2013 Poland joined CLARIN, the European Research Infrastructure for Language Resources and Technology. The goal of CLARIN is to make language technologies more accessible to researchers from the humanities and social sciences, who in most cases do not have the technical skills to use many of the tools on their own. At that time we decided to make Inforex a part of the Polish CLARIN infrastructure. In 2015–2017 we organized several workshops for researchers in the humanities and social sciences. The workshops revealed several user experience issues. First, the system GUI turned out to be insufficiently intuitive for inexperienced users and needed to be simplified. The second problem concerned methodology. Researchers use various tools for corpus analysis (including spreadsheets), and Inforex may be treated as a kind of pre-processing tool for preparing a corpus for further analysis. Data export was possible but complicated, and it required access to the database. User feedback showed that an easy form of data export is one of the crucial needs. Through the workshops (and accompanying questionnaires) we also gathered information about other important needs, such as the ability to define custom annotation schemas and data visualisation. Some of these have already been implemented and the others are under construction.

³ http://nekst.ipipan.waw.pl/
⁴ http://www.synat.pl/
⁵ http://pcsn.uni.wroc.pl/

2 Inforex Features Overview

In the following sections we present the main functionalities and features of the Inforex system.

2.1 Web-based Access

Inforex is a web-based tool that does not require installation. It can be accessed with any web browser that supports JavaScript. Although Inforex is built on several widely used JavaScript libraries and frameworks (jQuery, jQuery extensions and Bootstrap), we suggest using Chrome or Firefox, as these two browsers are used to test the system on a daily basis. Other browsers can be used as well; however, we are not able to validate all functions in every available browser, so some minor issues might occur.

2.2 Authorized and Public Access

Corpora stored in Inforex can be accessed by authorized and unauthorized users. The manager of a corpus (the owner or a user with specific privileges) decides what type of information from the corpus is publicly available. For instance, only authorized users may access the documents' content and modify the corpus annotations, while unauthorized users may have access to some statistics or annotation frequency lists.

2.3 Integration with DSpace as a Part of Polish CLARIN Infrastructure

The Inforex system is available at http://inforex.clarin-pl.eu and is part of the Polish CLARIN infrastructure. This installation is integrated with the official repository for language resources in Polish CLARIN⁶. The repository runs on the DSpace system⁷. When a user registers at https://clarin-pl.eu/dspace/, they also gain access to Inforex. At this stage accounts are automatically synchronized. In the future both systems will use unified federated authorization.

2.4 Collaboration

Inforex offers several ways of working collaboratively on a single corpus. The first is access to the same corpus for different authorized users. The second is selective, task-oriented access to the same document. For instance, different groups of users can have access to a document's metadata. The third is "2+1" annotation, in which two or more users annotate the same set of documents independently and a super-annotator creates the final set of annotations based on their input. This type of collaboration is described in more detail in Section 3.2.

2.5 Qualitative Document Annotation

Inforex was designed for qualitative document annotation. This means it does not offer fast and robust search over large corpora with millions of documents. Such functionality is provided by other existing tools designed for that purpose, for instance Sketch Engine (Kilgarriff et al., 2014) or NoSketch Engine (Rychly, 2007). Inforex is suited to medium-size corpora (containing thousands of small documents) and to manually describing documents in terms of their metadata, annotations (types of phrases organized in a hierarchy), annotation attributes, relations between annotations and annotation frames.

2.6 Language-independent

Inforex is language-independent in the sense that it can handle documents in any natural language. So far it has been used to annotate Polish, English and Hebrew texts (see Section 4).

⁶ https://clarin-pl.eu/dspace/
⁷ https://github.com/ufal/clarin-dspace


Figure 1: Corpus overview

2.7 Document Visualisation

Inforex can handle documents in two formats: plain text and XML. The content of XML documents can be displayed in a visually formatted way. This makes it possible to highlight the document structure, which improves the user experience while browsing and annotating documents. Sample visualizations of different types of documents are presented in Figure 3.

2.8 Document Description

Inforex supports four types of information units which can be used to describe document content:

1. Metadata: an information unit assigned to a whole document (author name, document creation time, source, etc.).

2. Annotation: an information unit assigned to a sequence of words in the document content. Each annotation is described by a category (categories can be organized in a hierarchy) and a set of attributes. The set of attributes depends on the semantic interpretation of the annotation category. For instance, for named entities it can be a lemma, for temporal expressions a normalized value of the expression, and for event mentions an event modality.

3. Relation: an information unit assigned to a pair of annotations. It is a directed link of some category between two annotations.

4. Frame: an information unit assigned to a set of annotations. A frame consists of a set of annotations with roles assigned to them. This type of structure can be used for event annotation (LCD, 2005).
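The four information units listed above can be pictured as a small data model. The following Python sketch is only an illustration under our own assumptions: the class names, field names, character-offset representation and the sample sentence are hypothetical and do not reflect the actual Inforex database schema or API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Annotation:
    """A categorized span of the document text with optional attributes."""
    start: int       # character offset where the span begins (inclusive)
    end: int         # character offset where the span ends (exclusive)
    category: str    # hypothetical category name, e.g. "person" or "date"
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class Relation:
    """A directed link of some category between two annotations."""
    category: str
    source: Annotation
    target: Annotation


@dataclass
class Frame:
    """A set of annotations, each assigned a named role."""
    category: str
    roles: Dict[str, Annotation] = field(default_factory=dict)


@dataclass
class Document:
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)  # document-level units
    annotations: List[Annotation] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)
    frames: List[Frame] = field(default_factory=list)

    def covered_text(self, annotation: Annotation) -> str:
        """Return the text fragment covered by an annotation."""
        return self.text[annotation.start:annotation.end]


# Hypothetical example document with one unit of each type.
doc = Document(text="Anna moved to Wrocław in 2013.", metadata={"source": "example"})
person = Annotation(0, 4, "person", {"lemma": "Anna"})
city = Annotation(14, 21, "city", {"lemma": "Wrocław"})
date = Annotation(25, 29, "date", {"value": "2013"})
doc.annotations += [person, city, date]
doc.relations.append(Relation("located_in", person, city))
doc.frames.append(Frame("movement", {"agent": person, "destination": city, "time": date}))
```

In this sketch the attribute dictionary carries the category-dependent information mentioned in the list (a lemma for a named entity, a normalized value for a temporal expression), while relations and frames refer to annotation objects rather than copying their spans.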

3 Recent Improvements

In the following sections we present the recent major improvements to the Inforex system.

3.1 Modern Layout

A series of workshops carried out from 2015 to 2017 showed that the user interface needed to be adjusted to a new group of users: researchers in the humanities and social sciences who are not involved in the development of NLP tools. New users reported confusion caused by the large amount of information and the number of available functions. The interface needed to be simplified while the functionality of the system remained unchanged. The Inforex layout was therefore upgraded and modernized. This involved not only a facelift of the user interface but also changes to the navigation panels. A comparison of the old and new layouts is presented in Figure 4.

Figure 2: Document annotation view

3.2 Annotation Agreement

Reliability is a key value in the creation of good-quality corpora for training and testing NLP tools. The current version of Inforex enables simultaneous and independent annotation of the same text sample by more than one annotator. Moreover, the coordinator of the annotation process can keep track of inter-annotator agreement between two raters thanks to the Agreement module, which uses the Positive Specific Agreement (PSA) measure (Hripcsak and Rothschild, 2005) to calculate reliability (see Figure 5). The view configuration makes it possible to define the annotation layers, subsets or categories, the users and the set of documents to be analysed. The coordinator can also specify a comparison mode: whether the system should take into consideration annotation boundaries only, or boundaries and categories. The comparison may also include annotation lemmas. Inter-annotator agreement is a very important indicator of the clarity and coherence of the annotation guidelines. Keeping track of changes in inter-annotator agreement between subsequent annotation iterations helps to improve the quality of the guidelines. The Agreement module makes that process easier and faster.
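To make the measure concrete: if a is the number of annotations on which both raters agree, and b and c are the numbers of annotations produced by only one of them, then PSA = 2a / (2a + b + c). The Python sketch below computes this value for the comparison modes described above. It is a minimal illustration, not the Inforex implementation; the Span type, the mode names and the sample annotations are our own assumptions.

```python
from typing import Iterable, NamedTuple, Tuple


class Span(NamedTuple):
    """Hypothetical representation of one annotation made by a rater."""
    start: int
    end: int
    category: str
    lemma: str = ""


def comparison_key(span: Span, mode: str) -> Tuple:
    """Keep only the fields that must match in the given comparison mode."""
    if mode == "borders":
        return (span.start, span.end)
    if mode == "borders+category":
        return (span.start, span.end, span.category)
    if mode == "borders+category+lemma":
        return (span.start, span.end, span.category, span.lemma)
    raise ValueError(f"unknown comparison mode: {mode}")


def positive_specific_agreement(
    rater_a: Iterable[Span], rater_b: Iterable[Span], mode: str = "borders+category"
) -> float:
    """PSA = 2a / (2a + b + c) for two sets of annotations."""
    keys_a = {comparison_key(s, mode) for s in rater_a}
    keys_b = {comparison_key(s, mode) for s in rater_b}
    both = len(keys_a & keys_b)      # a: annotations made by both raters
    only_a = len(keys_a - keys_b)    # b: annotations made only by rater A
    only_b = len(keys_b - keys_a)    # c: annotations made only by rater B
    denominator = 2 * both + only_a + only_b
    # Convention assumed here: two empty annotation sets count as full agreement.
    return 2 * both / denominator if denominator else 1.0


# Hypothetical annotations of the same document by two raters.
rater_a = [Span(0, 4, "person"), Span(14, 21, "city"), Span(25, 29, "date")]
rater_b = [Span(0, 4, "person"), Span(14, 21, "country"), Span(22, 29, "date")]

psa_borders = positive_specific_agreement(rater_a, rater_b, "borders")
psa_categories = positive_specific_agreement(rater_a, rater_b, "borders+category")
```

With these sample data the raters share two of three spans when only boundaries are compared, and one of three when categories must match as well, so the stricter mode yields a lower score. PSA is symmetric in the two raters and coincides with the F-measure obtained by treating one rater as the reference.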

The Inforex system also supports curation of the annotation process (see Figure 6). The curator can choose between the differing choices of two annotators, or even reject consistent but incorrect annotations. Thanks to this module several gold standard projects have been carried out, e.g. the Polish Coreference Corpus (Ogrodniczuk et al., 2015) for the annotation of definite descriptions and the Polish Spatial Texts corpus for the annotation of dynamic spatial expressions.
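The curation step can be sketched as a set operation: annotations on which both annotators agree enter the gold standard unless the curator rejects them, and disputed annotations enter it only if the curator accepts them. The Python code below illustrates this logic under our own assumptions; the function and type names are hypothetical and the paper does not describe how Inforex implements curation internally.

```python
from typing import Iterable, List, NamedTuple


class Span(NamedTuple):
    """Hypothetical representation of one annotation."""
    start: int
    end: int
    category: str


def curate(
    annotator_a: Iterable[Span],
    annotator_b: Iterable[Span],
    accepted: Iterable[Span] = (),
    rejected: Iterable[Span] = (),
) -> List[Span]:
    """Build a gold-standard annotation set from two annotators and curator decisions.

    accepted: disputed annotations (made by exactly one annotator) that the curator keeps.
    rejected: agreed annotations that the curator considers incorrect.
    """
    set_a, set_b = set(annotator_a), set(annotator_b)
    agreed = set_a & set_b      # both annotators made the same annotation
    disputed = set_a ^ set_b    # only one of the annotators made it
    unknown = set(accepted) - disputed
    if unknown:
        raise ValueError(f"accepted annotations were not proposed as disputed: {sorted(unknown)}")
    gold = (agreed - set(rejected)) | set(accepted)
    return sorted(gold)


# Hypothetical input: the annotators disagree on the category of the second span.
annotator_a = [Span(0, 4, "person"), Span(14, 21, "city"), Span(25, 29, "date")]
annotator_b = [Span(0, 4, "person"), Span(14, 21, "country"), Span(25, 29, "date")]

gold = curate(annotator_a, annotator_b, accepted=[Span(14, 21, "city")])
```

Here the curator resolves the single conflict in favour of the first annotator, so the resulting set contains the two agreed annotations plus the accepted one. Passing an agreed annotation in rejected corresponds to rejecting a consistent but incorrect annotation.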

4 Applications

In the following sections we present several practical applications of the Inforex system.

4.1 KPWr

KPWr (Polish Corpus of Wrocław University of Technology) (Broda et al., 2012) is a corpus of written and spoken documents available under a Creative Commons license. It is intended primarily as training and testing material for NLP tools being developed at Wrocław University of Science and Technology. It is being successively enriched with annotation layers. Inforex recently supported manual text annotation within layers such as temporal expressions and their normalizations, events (with a description of event attributes), spatial expressions and semantic roles. To prepare the annotation of temporal expressions (Kocon et al., 2015), a new annotation scheme based on TimeML was added. Its categories refer to a date, a time of day, a duration and the frequency of an event. The annotation lemmas perspective was used to provide normalized temporal expressions, which shows that the term 'lemma' in Inforex may function as a broad concept. The Annotator perspective also supports event annotation (Marcinczuk et al., 2015). There are seven coarse-grained categories of events: action, state, reporting, perception, aspectual, intensional action and intensional state. The categorization was based on the TimeML guidelines with some modifications, and it also involved the creation of a new annotation scheme. The flexibility in adding new annotation layers (defining new annotation categories) is one of the most important features of the system. The possibility of establishing relations between annotated fragments is no less relevant. It was crucial, for example, for the annotation of spatial expressions, whose main goal was to extract the different ways of distributing spatial information throughout a sentence by reviewing the lexical and grammatical signals of various relations between objects (Marcinczuk et al., 2016).

Figure 3: Sample document visualizations: (a) Facebook conversation, (b) Wikipedia article, (c) Hebrew document.

4.2 European Legal Texts

Practice shows that although Inforex was primarily developed for Polish, it can also be used to work with documents written in other languages. Its features and functionalities proved useful, for example, in examining current EU official literature on territorial development and urban planning. The authors of that analysis first uploaded the EU Territorial Policy Documents 2007-2016⁸ to the CLARIN-PL DSpace repository and then imported them into Inforex. The corpus was divided into four subcorpora and prepared for qualitative and quantitative analysis. A review of the key strands enabled the identification of eight core values (or principles) for further statistical and contextual analysis. After textual triggers (word forms) had been assigned to each category, a quantitative analysis was performed using word frequency lists generated by Inforex. Manual annotation with a newly defined set of annotations, together with the Annotation Browser and its data export option, greatly supported the qualitative analysis: a detailed contextual analysis of the corpus focused on two crucial categories, Participation and Communication.
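The quantitative step described above, counting how often the trigger word forms of each category occur, can be sketched in a few lines of Python. This is an illustration only: the sample sentences and the trigger word forms are invented for the example, and the tokenization is a simple assumption rather than the way Inforex builds its frequency lists.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List


def word_frequencies(documents: Iterable[str]) -> Counter:
    """Build a case-insensitive word frequency list over a set of documents."""
    counts: Counter = Counter()
    for text in documents:
        counts.update(re.findall(r"\w+", text.lower()))
    return counts


def category_frequencies(
    word_counts: Counter, triggers: Dict[str, List[str]]
) -> Dict[str, int]:
    """Sum the frequencies of all trigger word forms assigned to each category."""
    return {
        category: sum(word_counts[form.lower()] for form in forms)
        for category, forms in triggers.items()
    }


# Hypothetical documents and trigger lists; only the category names come from the study.
documents = [
    "Citizens' participation strengthens local planning.",
    "Communication with citizens and public participation matter.",
]
triggers = {
    "Participation": ["participation", "participate"],
    "Communication": ["communication", "dialogue"],
}

frequencies = category_frequencies(word_frequencies(documents), triggers)
```

Because triggers are matched as exact word forms, every inflected form of interest has to be listed explicitly, which mirrors the description of triggers as word forms rather than lemmas.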

⁸ http://hdl.handle.net/11321/316


Figure 4: Comparison of Inforex layouts: (a) before modernization, (b) after modernization.

4.3 Hebrew Corpus

Inforex supports manual annotation even if the text is written in a non-Latin alphabet and in a right-to-left script. One application of the system involved a corpus of Hebrew gravestone inscriptions, which also required the creation of a new annotation schema. The categories referred mainly to the pragmatic level of communication (e.g. initial and final expressions, laudations, death circumstances). The annotation lemmas perspective was used to enter Polish translations of the annotated fragments, which again showed that the lemma attribute may be a broad term, especially in practical applications of the system.

4.4 Other Corpora

Inforex was used to prepare the training data for our participation in the BSNLP 2017 shared task on multilingual named entity recognition, which was aimed at recognizing mentions of named entities in web documents in Slavic languages, their normalization/lemmatization, and cross-language matching (Marcinczuk et al., 2017). The system also supported the annotation of corpora constructed for specific natural language processing tasks, e.g. the Polish Coreference Corpus for the annotation of definite descriptions and the Polish Spatial Texts corpus for the annotation of dynamic spatial expressions. This involved the creation of dedicated annotation layers. More importantly, the new module of the system (Annotation Agreement and "2+1" annotation) was used in these tasks for the first time, which significantly reduced the time needed to prepare the annotated training and testing corpora.


5 Summary

The Inforex system, as a part of the CLARIN-PL infrastructure, is being gradually developed. Although its initial role was to construct qualitative linguistic resources for various natural language processing tasks, it has recently also been used by scientists for other purposes. We received important and constructive feedback from users during and after workshops on CLARIN-PL tools and resources. As users have different needs, we identified the common functionalities and are implementing them as soon as possible in order to support their research tasks and open up new possibilities. We were also confronted with the fact that many researchers in the digital humanities are not experienced users of such systems, so we have worked to make Inforex as easy and intuitive as possible.

Acknowledgments

Work financed as part of the investment in the CLARIN-PL research infrastructure funded by the Polish Ministry of Science and Higher Education.

References

Dominik Bas, Bartosz Broda, and Maciej Piasecki. 2008. Towards Word Sense Disambiguation of Polish. In Proceedings of the International Multiconference on Computer Science and Information Technology, IMCSIT 2008, Wisla, Poland, 20-22 October 2008. IEEE, pages 73–78. https://doi.org/10.1109/IMCSIT.2008.4747220.

Kalina Bontcheva, Hamish Cunningham, Ian Roberts, Angus Roberts, Valentin Tablan, Niraj Aswani, and Genevieve Gorrell. 2013. GATE Teamware: a web-based, collaborative text annotation framework. Language Resources and Evaluation 47(4):1007–1029.

Bartosz Broda, Michał Marcinczuk, Marek Maziarz, Adam Radziszewski, and Adam Wardynski. 2012. KPWr: Towards a Free Corpus of Polish. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Ugur Dogan, Bente Maegaard, Joseph Mariani, Jan Odijk, and Stelios Piperidis, editors, Proceedings of LREC'12. ELRA, Istanbul, Turkey.

Hamish Cunningham, Diana Maynard, Kalina Bontcheva, Valentin Tablan, Niraj Aswani, Ian Roberts, Genevieve Gorrell, Adam Funk, Angus Roberts, Danica Damljanovic, Thomas Heitz, Mark A. Greenwood, Horacio Saggion, Johann Petrak, Yaoyong Li, and Wim Peters. 2011. Text Processing with GATE (Version 6). http://tinyurl.com/gatebook.

Richard Eckart de Castilho, Eva Mujdricza-Maydt, Seid Muhie Yimam, Silvana Hartmann, Iryna Gurevych, Anette Frank, and Chris Biemann. 2016. A web-based tool for the integrated annotation of semantic and syntactic structures. In Proceedings of the workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH) at COLING 2016. pages 76–84.

George Hripcsak and Adam S. Rothschild. 2005. Agreement, the f-measure, and reliability in information retrieval. J. of Am. Medical Informatics Association 12(3):296–298.

Adam Kilgarriff, Vit Baisa, Jan Busta, Milos Jakubicek, Vojtech Kovar, Jan Michelfeit, Pavel Rychly, and Vit Suchomel. 2014. The Sketch Engine: ten years on. Lexicography.

Jan Kocon, Michał Marcinczuk, Marcin Oleksy, Tomasz Bernas, and Michał Wolski. 2015. Temporal expressions in Polish corpus KPWr. Cognitive Studies | Etudes cognitives (15):293–317.

LCD. 2005. ACE (Automatic Content Extraction) English Annotation Guidelines for Events. Technical report, Linguistic Data Consortium.

M. Marcinczuk and M. Ptak. 2012. Preliminary study on automatic induction of rules for recognition of semantic relations between proper names in Polish texts, volume 7499 LNAI.

Michał Marcinczuk, Jan Kocon, and Marcin Oleksy. 2017. Liner2 – a generic framework for named entity recognition. In Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing. Association for Computational Linguistics, Valencia, Spain, pages 86–91. http://www.aclweb.org/anthology/W17-1413.

Michał Marcinczuk, Marcin Oleksy, Tomasz Bernas, Jan Kocon, and Michał Wolski. 2015. Towards an event annotated corpus of Polish. Cognitive Studies | Etudes cognitives (15):253–267.

Michał Marcinczuk, Michał Stanek, Maciej Piasecki, and Adam Musiał. 2011. Rich Set of Features for Proper Name Recognition in Polish Texts. In SIIS 2011. Springer.

Michał Mirosław Marcinczuk, Marcin Oleksy, and Jan Wieczorek. 2016. Towards recognition of spatial relations between entities for Polish. Cognitive Studies | Etudes cognitives (16):119–132.

Małgorzata Marciniak, Agnieszka Mykowiecka, and Katarzyna Głowinska. 2010. Anotowany korpus dialogow telefonicznych. In Małgorzata Marciniak, editor, Anotowany korpus dialogow telefonicznych, Akademicka Oficyna Wydawnicza EXIT, Warsaw, chapter Anotacja korpusu LUNA–WOZ.PL, pages 217–230.

Michał Marcinczuk, Jan Kocon, and Maciej Janicki. 2013. Liner2 – a customizable framework for proper names recognition for Polish. In Robert Bembenik, Lukasz Skonieczny, Henryk Rybinski, Marzena Kryszkiewicz, and Marek Niezgodka, editors, Intelligent Tools for Building a Scientific Information Platform, pages 231–253.

Michał Marcinczuk, Monika Zasko-Zielinska, and Maciej Piasecki. 2011. Structure annotation in the Polish corpus of suicide notes. In Ivan Habernal and Vaclav Matousek, editors, Text, Speech and Dialogue, Springer Berlin Heidelberg, volume 6836 of Lecture Notes in Computer Science, pages 419–426.

Maciej Ogrodniczuk, Katarzyna Głowinska, Mateusz Kopec, Agata Savary, and Magdalena Zawisławska. 2015. Coreference in Polish: Annotation, Resolution and Evaluation. Walter De Gruyter. http://www.degruyter.com/view/product/428667.

Adam Radziszewski and Maciej Piasecki. 2010. A Preliminary Noun Phrase Chunker for Polish. Proceedings of the Intelligent Information Systems pages 169–180.

Pavel Rychly. 2007. Manatee/bonito - a modular corpus manager. In 1st Workshop on Recent Advances in Slavonic Natural Language Processing. Masarykova univerzita, Brno, pages 65–70.

M. Zasko-Zielinska. 2013. Listy pozegnalne: w poszukiwaniu lingwistycznych wyznacznikow autentycznosci tekstu. Quaestio. https://books.google.pl/books?id=QG60ngEACAAJ.


Figure 5: Summary of annotation agreement for a set of documents


Figure 6: User agreement verification for a single document
