Results of the Ontology Alignment Evaluation Initiative 2016⋆

Manel Achichi1, Michelle Cheatham2, Zlatan Dragisic3, Jérôme Euzenat4, Daniel Faria5, Alfio Ferrara6, Giorgos Flouris7, Irini Fundulaki7, Ian Harrow8, Valentina Ivanova3, Ernesto Jiménez-Ruiz9,10, Elena Kuss11, Patrick Lambrix3, Henrik Leopold12, Huanyu Li3, Christian Meilicke11, Stefano Montanelli6, Catia Pesquita13, Tzanina Saveta7, Pavel Shvaiko14, Andrea Splendiani15, Heiner Stuckenschmidt11, Konstantin Todorov1, Cássia Trojahn16, and Ondřej Zamazal17

1 LIRMM/University of Montpellier, France
2 Data Semantics (DaSe) Laboratory, Wright State University, USA
3 Linköping University & Swedish e-Science Research Center, Linköping, Sweden {zlatan.dragisic,valentina.ivanova,patrick.lambrix}@liu.se
4 INRIA & Univ. Grenoble Alpes, Grenoble, France
5 Instituto Gulbenkian de Ciência, Lisbon, Portugal
6 Università degli studi di Milano, Italy {alfio.ferrara,stefano.montanelli}@unimi.it
7 Institute of Computer Science-FORTH, Heraklion, Greece {jsaveta,fgeo,fundul}@ics.forth.gr
8 Pistoia Alliance Inc., USA
9 Department of Informatics, University of Oslo, Norway
10 Department of Computer Science, University of Oxford, UK
11 University of Mannheim, Germany {christian,elena,heiner}@informatik.uni-mannheim.de
12 Vrije Universiteit Amsterdam, The Netherlands
13 LASIGE, Faculdade de Ciências, Universidade de Lisboa, Portugal
14 TasLab, Informatica Trentina, Trento, Italy
15 Novartis Institutes for Biomedical Research, Basel, Switzerland
16 IRIT & Université Toulouse II, Toulouse, France {cassia.trojahn}@irit.fr
17 University of Economics, Prague, Czech Republic

Abstract. Ontology matching consists of finding correspondences between semantically related entities of two ontologies. OAEI campaigns aim at comparing ontology matching systems on precisely defined test cases. These test cases can use ontologies of different nature (from simple thesauri to expressive OWL ontologies) and use different modalities, e.g., blind evaluation, open evaluation, or consensus. OAEI 2016 offered 9 tracks with 22 test cases, and was attended by 21 participants. This paper is an overall presentation of the OAEI 2016 campaign.

⋆ The official results of the campaign are on the OAEI web site.

1 Introduction

The Ontology Alignment Evaluation Initiative1 (OAEI) is a coordinated international initiative, which organises the evaluation of an increasing number of ontology matching systems [18,21]. Its main goal is to compare systems and algorithms openly and on the same basis, in order to allow anyone to draw conclusions about the best matching strategies. Furthermore, our ambition is that, from such evaluations, tool developers can improve their systems.

The first two events were organised in 2004: (i) the Information Interpretation and Integration Conference (I3CON) held at the NIST Performance Metrics for Intelligent Systems (PerMIS) workshop and (ii) the Ontology Alignment Contest held at the Evaluation of Ontology-based Tools (EON) workshop of the annual International Semantic Web Conference (ISWC) [41]. Then, a unique OAEI campaign occurred in 2005 at the workshop on Integrating Ontologies held in conjunction with the International Conference on Knowledge Capture (K-Cap) [4]. From 2006 onwards, the OAEI campaigns have been held at the Ontology Matching workshop, collocated with ISWC [19,17,6,14,15,16,2,9,12,8], which this year took place in Kobe, Japan2.

Since 2011, we have been using an environment for automatically processing evaluations (§2.2), which has been developed within the SEALS (Semantic Evaluation At Large Scale) project3. SEALS provided a software infrastructure for automatically executing evaluations, and evaluation campaigns for typical semantic web tools, including ontology matching. In OAEI 2016, all systems were executed and evaluated with the SEALS client in all tracks. This year we welcomed two new tracks: the Disease and Phenotype track, sponsored by the Pistoia Alliance Ontologies Mapping project, and the Process Model Matching track. Additionally, the Instance Matching track featured a total of 7 matching tasks based on all new data sets. On the other hand, the OA4QA track was discontinued this year.

This paper synthesises the 2016 evaluation campaign. The remainder of the paper is organised as follows: in Section 2, we present the overall evaluation methodology that has been used; Sections 3-11 discuss the settings and the results of each of the test cases; Section 12 overviews lessons learned from the campaign; and finally, Section 13 concludes the paper.

2 General methodology

We first present the test cases proposed this year to the OAEI participants (§2.1). Then, we discuss the resources used by participants to test their systems and the execution

1 http://oaei.ontologymatching.org
2 http://om2016.ontologymatching.org
3 http://www.development.seals-project.eu


environment used for running the tools (§2.2). Finally, we describe the steps of the OAEI campaign (§2.3-2.5) and report on the general execution of the campaign (§2.6).

2.1 Tracks and test cases

This year's OAEI campaign consisted of 9 tracks gathering 22 test cases, with different evaluation modalities:

The benchmark track (§3): Like in previous campaigns, a systematic benchmark series has been proposed. The goal of this benchmark series is to identify the areas in which each matching algorithm is strong or weak by systematically altering an ontology. This year, we generated a new benchmark based on the original bibliographic ontology and another benchmark using a film ontology.

The expressive ontology track offers alignments between real world ontologies expressed in OWL:

Anatomy (§4): The anatomy test case is about matching the Adult Mouse Anatomy (2744 classes) and a small fragment of the NCI Thesaurus (3304 classes) describing the human anatomy.

Conference (§5): The goal of the conference test case is to find all correct correspondences within a collection of ontologies describing the domain of organising conferences. Results were evaluated automatically against reference alignments and by using logical reasoning techniques.

Large biomedical ontologies (§6): The largebio test case aims at finding alignments between large and semantically rich biomedical ontologies such as FMA, SNOMED-CT, and NCI. The UMLS Metathesaurus has been used as the basis for reference alignments.

Disease & Phenotype (§7): The disease & phenotype test case aims at finding alignments between two disease ontologies (DOID and ORDO) as well as between human (HPO) and mammalian (MP) phenotype ontologies. The evaluation was semi-automatic: consensus alignments were generated based on those produced by the participating systems, and the unique mappings found by each system were evaluated manually.

Multilingual

Multifarm (§8): This test case is based on a subset of the Conference data set, translated into ten different languages (Arabic, Chinese, Czech, Dutch, French, German, Italian, Portuguese, Russian, and Spanish) and the corresponding alignments between these ontologies. Results are evaluated against these alignments.

Interactive matching

Interactive (§9): This test case offers the possibility to compare different matching tools which can benefit from user interaction. Its goal is to show if user interaction can improve matching results, which methods are most promising and how many interactions are necessary. Participating systems are evaluated on the conference data set using an oracle based on the reference alignment, which can generate erroneous responses to simulate user errors (a sketch of such an oracle is given at the end of this section).

Instance matching (§10): The track aims at evaluating the performance of matching tools when the goal is to detect the degree of similarity between pairs of items/instances expressed in the form of OWL ABoxes. Three independent tasks are defined:


test           formalism  relations  confidence  modalities    language                                     SEALS
benchmark      OWL        =          [0 1]       blind         EN                                           √
anatomy        OWL        =          [0 1]       open          EN                                           √
conference     OWL        =, <=      [0 1]       open+blind    EN                                           √
largebio       OWL        =          [0 1]       open          EN                                           √
phenotype      OWL        =          [0 1]       blind         EN                                           √
multifarm      OWL        =          [0 1]       open+blind    AR, CZ, CN, DE, EN, ES, FR, IT, NL, RU, PT   √
interactive    OWL        =, <=      [0 1]       open          EN                                           √
instance       OWL        =          [0 1]       open(+blind)  EN(+IT)                                      √
process model  OWL        <=         [0 1]       open+blind    EN                                           √

Table 1. Characteristics of the test cases (open evaluation is made with already published reference alignments and blind evaluation is made by organisers from reference alignments unknown to the participants).

SABINE: The task is articulated in two sub-tasks called inter-lingual mapping and data linking. Both sub-tasks are based on OWL ontologies containing topics as instances of the class "Topic". In inter-lingual mapping, two ontologies are given, one containing topics in the English language and one containing topics in the Italian language. The goal is to discover mappings between English and Italian topics. In data linking, the goal is to discover the DBpedia entity which best corresponds to each topic belonging to a source ontology.

SYNTHETIC: The task is articulated in two sub-tasks called UOBM and SPIMBENCH. In UOBM, the goal is to recognize when two OWL instances belonging to different data sets, i.e., ontologies, describe the same individual. In SPIMBENCH, the goal is to determine when two OWL instances describe the same Creative Work. Data sets are produced by altering a set of original data.

DOREMUS: The DOREMUS task contains real world data coming from the French National Library (BnF) and the Philharmonie de Paris (PP). Data are about classical music works and follow the DOREMUS model (one single vocabulary for both datasets). Three sub-tasks are defined, called nine heterogeneities, four heterogeneities, and false-positive trap, characterized by different degrees of heterogeneity in the work descriptions.

Process Model Matching (§11): The track is concerned with the application of ontology matching techniques to the problem of matching process models. It is based on a data set used in the Process Model Matching Campaign 2015 [3], which has been converted to an ontological representation. The data set contains nine process models which represent the application process for a master program of German universities, as well as reference alignments between all pairs of models.

Table 1 summarises the variation in the proposed test cases.
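As an illustration of the interactive modality mentioned above, the following is a minimal sketch of a simulated user (oracle): it answers membership queries against the reference alignment and flips its answer with a configurable error rate. The class name, the query protocol and the error model are illustrative assumptions, not the actual evaluation code.

```python
import random

class SimulatedOracle:
    """Simulated user for interactive matching (illustrative sketch)."""

    def __init__(self, reference, error_rate=0.0, seed=42):
        self.reference = reference    # set of (entity1, entity2) pairs deemed correct
        self.error_rate = error_rate  # probability of an erroneous answer
        self.rng = random.Random(seed)
        self.questions = 0            # number of interactions, for reporting

    def ask(self, entity1, entity2):
        """Return whether a candidate correspondence is correct, possibly erroneously."""
        self.questions += 1
        truth = (entity1, entity2) in self.reference
        if self.rng.random() < self.error_rate:
            return not truth          # simulate a user error
        return truth
```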

2.2 The SEALS client

Since 2011, tool developers had to implement a simple interface and to wrap their tools in a predefined way including all required libraries and resources. A tutorial for tool


wrapping was provided to the participants, describing how to wrap a tool and how to use the SEALS client to run a full evaluation locally. This client is then executed by the track organisers to run the evaluation. This approach ensures the reproducibility and comparability of the results of all systems.
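The actual SEALS tool interface is Java-based and documented in the tutorial mentioned above. Purely as an illustration of the output side of the contract, here is a sketch that serialises correspondences in the RDF/XML Alignment format commonly used by OAEI tools; the function name, the tuple layout and the omission of XML escaping are assumptions.

```python
def write_alignment(correspondences, path):
    """Serialise (entity1, entity2, relation, confidence) tuples
    in the RDF/XML Alignment format (sketch; no character escaping)."""
    cells = "\n".join(
        f'  <map><Cell>\n'
        f'    <entity1 rdf:resource="{e1}"/>\n'
        f'    <entity2 rdf:resource="{e2}"/>\n'
        f'    <relation>{rel}</relation>\n'
        f'    <measure rdf:datatype="http://www.w3.org/2001/XMLSchema#float">{conf}</measure>\n'
        f'  </Cell></map>'
        for e1, e2, rel, conf in correspondences)
    with open(path, "w", encoding="utf-8") as out:
        out.write(
            '<?xml version="1.0" encoding="utf-8"?>\n'
            '<rdf:RDF xmlns="http://knowledgeweb.semanticweb.org/heterogeneity/alignment"\n'
            '         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">\n'
            '<Alignment>\n  <xml>yes</xml>\n  <level>0</level>\n  <type>11</type>\n'
            f'{cells}\n'
            '</Alignment>\n</rdf:RDF>\n')
```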

2.3 Preparatory phase

Ontologies to be matched and (where applicable) reference alignments have been provided in advance during the period between June 1st and June 30th, 2016. This gave potential participants the occasion to send observations, bug corrections, remarks and other test cases to the organisers. The goal of this preparatory period is to ensure that the delivered tests make sense to the participants. The final test base was released on July 15th, 2016. The (open) data sets did not evolve after that.

2.4 Execution phase

During the execution phase, participants used their systems to automatically match the test case ontologies. In most cases, ontologies are described in OWL-DL and serialised in the RDF/XML format [11]. Participants can self-evaluate their results either by comparing their output with reference alignments or by using the SEALS client to compute precision and recall. They can tune their systems with respect to the non-blind evaluation as long as the rules published on the OAEI web site are satisfied. This phase was conducted between July 15th and August 31st, 2016. Unlike previous years, we requested a mandatory registration of systems and a preliminary evaluation of wrapped systems by July 31st. This reduced the cost of debugging systems with respect to issues with the SEALS client during the evaluation phase, as happened in the past.

2.5 Evaluation phase

Participants were required to submit their wrapped tools by August 31st, 2016. Tools were then tested by the organisers, and minor problems were reported to some tool developers, who were given the opportunity to fix their tools and resubmit them.

Initial results were provided directly to the participants between September 23rd and October 15th, 2016. The final results for most tracks were published on the respective pages of the OAEI website by October 15th, although some tracks were delayed.

The standard evaluation measures are usually precision and recall computed against the reference alignments. More details on the evaluation are given in the sections for the test cases.
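Concretely, with alignments represented as sets of correspondences, these measures can be computed as below; this is a minimal sketch, and the representation of correspondences as hashable triples is an assumption.

```python
def precision_recall_f(system, reference):
    """Standard evaluation against a reference alignment.

    `system` and `reference` are sets of hashable correspondences,
    e.g. (entity1, entity2, relation) triples.
    """
    true_positives = len(system & reference)
    precision = true_positives / len(system) if system else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall > 0 else 0.0)
    return precision, recall, f_measure
```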

2.6 Comments on the execution

Following the recent trend, the number of participating systems has remained approximately constant at slightly over 20 (see Figure 1). This year was no exception, as we counted 21 participating systems (out of 30 registered systems). Remarkably, participating systems have changed considerably between editions, and new systems keep emerging. For example, this year 10 systems had not participated in any of the previous OAEI campaigns. The list of participants is summarised in Table 2. Note that some systems were also evaluated with different versions and configurations as requested by developers (see test case sections for details).


Fig. 1. Number of systems participating in OAEI per year.

3 Benchmark

The goal of the benchmark data set is to provide a stable and detailed picture of each algorithm. For that purpose, algorithms are run on systematically generated test cases.

3.1 Test data

The systematic benchmark test set is built around a seed ontology and many variations of it. Variations are artificially generated by discarding and modifying features from a seed ontology. Considered features are names of entities, comments, the specialisation hierarchy, instances, properties and classes. This test focuses on the characterisation of the behaviour of the tools rather than having them compete on real-life problems. A full description of the systematic benchmark test set can be found on the OAEI web site.

Since OAEI 2011.5, the test sets are generated automatically from different seed ontologies [20]. This year, we used two ontologies:

biblio The bibliography ontology used in the previous years, which concerns bibliographic references and is freely inspired by BibTeX;

film A movie ontology developed in the MELODI team at IRIT (FilmographieV14). It uses fragments in French and labels in French and English.

The characteristics of these ontologies are described in Table 3.

The film data set was not available to participants when they submitted their systems. The tests were also blind for the organisers, since we did not look into them before running the systems.

The reference alignments are still restricted to named classes and properties and use the "=" relation with confidence of 1.

4 https://www.irit.fr/recherches/MELODI/ontologies/FilmographieV1.owl


[Table 2 is a participation matrix whose per-system tick marks did not survive extraction. The 21 participating systems are: Alin, AML, CroLOM, CroMatcher, DiSMatch, DKP-AOM, DKP-AOM-Lite, FCA-Map, Lily, LogMap, LogMap-Bio, LogMapLt, LPHOM, LYAM++, NAISC, PhenoMF, PhenoMM, PhenoMP, RiMOM, SimCat and XMap. The per-row counts are: confidence 13; benchmarks 6; anatomy 13; conference 13; largebio 13; phenotype 11; multifarm 7; interactive 4; process model 4; instance 4; 77 submissions in total.]

Table 2. Participants and the state of their submissions. Confidence stands for the type of results returned by a system: it is ticked when the confidence is a non-boolean value.

Test set       biblio  film
classes+prop.  33+64   117+120
instances      112     47
entities       209     284
triples        1332    1717

Table 3. Characteristics of the two seed ontologies used in benchmarks.


3.2 Results

In order to avoid the discrepancy of last year, all systems were run in the simplest homogeneous setting. So, this year, we can state anew that all tests have been run entirely under the same conditions, with the same strict protocol.

Evaluations were run on a Debian Linux virtual machine configured with four processors and 8GB of RAM, hosted on a Dell PowerEdge T610 with 2×Intel Xeon Quad Core 2.26GHz E5607 processors and 32GB of RAM, under Linux ProxMox 2 (Debian). All matchers were run under the SEALS client using Java 1.8 and a maximum heap size of 8GB.

As a result, many systems were not able to properly match the benchmark. Evaluator availability is not unbounded, and it was not possible to pay as much attention to each system as necessary.

Participation Of the 21 systems participating in OAEI this year, only 10 provided results for this track, and several of these encountered problems. One very slow matcher (LogMapBio) was run anyway. RiMOM did not terminate, but was able to provide (empty) alignments for biblio, not for film. No timeout was explicitly set.

Reported figures are the average of 5 runs. As has already been shown in [20], there is not much variance in compliance measures across runs.

Compliance Table 4 synthesises the results obtained by matchers.
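For reference, the harmonic mean used to aggregate a measure x over the n tests of the benchmark is, in its plain unweighted form (the evaluation tooling may weight tests differently; this is an assumption about the exact aggregation):

H(x_1, \ldots, x_n) = \frac{n}{\sum_{i=1}^{n} 1/x_i}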

             biblio                           film
Matcher      Prec.      F-m.       Rec.       Prec.      F-m.       Rec.
edna         .35 (.58)  .41 (.54)  .51 (.50)  .43 (.68)  .47 (.58)  .50 (.50)
AML          1.0        .38        .24        1.0        .32        .20
CroMatcher   .96 (.60)  .89 (.54)  .83 (.50)  NaN
Lily         .97 (.45)  .89 (.40)  .83 (.36)  .97 (.39)  .81 (.31)  .70 (.26)
LogMap       .93 (.90)  .55 (.53)  .39 (.37)  .83 (.79)  .13 (.12)  .07 (.06)
LogMapLt     .43        .46        .50        .62        .51        .44
PhenoMF      .03        .01        .01        .03        .01        .01
PhenoMM      .03        .01        .01        .03        .01        .01
PhenoMP      .02        .01        .01        .03        .01        .01
XMap         .95 (.98)  .56 (.57)  .40 (.40)  .78 (.84)  .60 (.62)  .49 (.49)
LogMapBio    .48 (.48)  .32 (.30)  .24 (.22)  .59 (.58)  .07 (.06)  .03 (.03)

Table 4. Aggregated benchmark results: harmonic means of precision, F-measure and recall, along with their confidence-weighted values (in parentheses).

Systems that participated previously (AML, CroMatcher, Lily, LogMap, LogMapLite, XMap) still obtain the best results, with Lily and CroMatcher still achieving an impressive .89 F-measure (against .90 and .88 last year). They combine very high precision (.96 and .97) with high recall (.83). The PhenoXX suite of systems returns huge but poor alignments. It is surprising that some of the systems (AML, LogMapLite) do not clearly outperform edna (our edit distance baseline).
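For orientation, an edna-style baseline can be approximated by matching entities whose labels are close in normalised edit distance. The sketch below makes several assumptions (the lower-casing, the threshold value and the all-pairs strategy are illustrative, not the actual baseline implementation):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,                  # deletion
                               current[j - 1] + 1,               # insertion
                               previous[j - 1] + (ca != cb)))    # substitution
        previous = current
    return previous[-1]

def edna_like(labels1, labels2, threshold=0.9):
    """Match entities whose labels have normalised edit similarity >= threshold.

    labels1/labels2 map entity URIs to labels; similarity is used as confidence.
    """
    matches = set()
    for e1, l1 in labels1.items():
        for e2, l2 in labels2.items():
            a, b = l1.lower(), l2.lower()
            sim = 1.0 - levenshtein(a, b) / max(len(a), len(b), 1)
            if sim >= threshold:
                matches.add((e1, e2, "=", round(sim, 2)))
    return matches
```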

On the film data set (which was not known to the participants when they submitted their systems, and was actually generated afterwards), the results on biblio are


fully confirmed: (1) those systems able to return results were still able to do so, except CroMatcher, and those unable were still not able; (2) the order between these systems and their performances are commensurate. Point (1) shows that these are robust systems. Point (2) shows that the performances of these systems are consistent across data sets, hence we are indeed measuring something. However, the exceptions to (2) are LogMap and LogMapBio, whose precision is roughly preserved but whose recall drops dramatically. A tentative explanation is that film contains many labels in French and these two systems rely too much on WordNet. In any case, these systems and CroMatcher seem to show some overfitting to biblio.

Polarity Except for LogMapLite, all systems have higher precision than recall, as usual, and usually very high precision, as shown in the triangle graph for biblio (Figure 2). This can be compared with last year.

[Figure 2: triangle plot residue removed; the plotted alignments are refalign (b/f), AML (b), CroMatcher (b), Lily (b), LogMap (b), XMap (b), AML (f), Lily (f), LogMapLite (f) and XMap (f).]

Fig. 2. Triangle view on the benchmark data sets (biblio=(b), film=(f), run 5; systems not shown have an F-measure below .5).

The precision/recall graph (Figure 3) confirms that, as usual, there is a level of recall unreachable by any system, and this is where some of them go to catch their good F-measure.

Concerning confidence-weighted measures, there are two types of systems: those (CroMatcher, Lily) which obviously threshold their results but keep low confidence values, and those (LogMap, XMap, LogMapBio) which provide relatively faithful measures. The former show a strong degradation of the measured values, while the latter resist


[Figure 3: precision/recall plot residue removed; the legend lists refalign 1.00, edna 0.50, AML 0.24, XMap 0.40, CroMatcher 0.81, Lily 0.82, LogMap 0.38, LogMapLite 0.50, LogMapBio 0.19, PhenoMF 0.01, PhenoMM 0.01 and PhenoMP 0.01.]

Fig. 3. Precision/recall plots on biblio.

very well, with XMap even improving its score. This measure, which is supposed to reward systems able to provide accurate confidence values, is beneficial to these faithful systems.
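A common formulation of such confidence-weighted measures, stated here as an assumption about the exact definitions used (A is the system alignment with confidences conf(c), R the reference alignment), is:

P_{\mathit{conf}} = \frac{\sum_{c \in A \cap R} \mathit{conf}(c)}{\sum_{c \in A} \mathit{conf}(c)}, \qquad R_{\mathit{conf}} = \frac{\sum_{c \in A \cap R} \mathit{conf}(c)}{|R|}

Under these definitions, a system that thresholds aggressively but reports low confidences sees its weighted scores degrade, while a system reporting faithful confidences is barely affected.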

Speed Aside from LogMapBio, which uses alignment repositories on the web to find matches, all matchers complete the task in less than 40 minutes for biblio and 12 hours for film. There is still a large discrepancy between matchers concerning the time spent, from less than two minutes for LogMapLite, AML and XMap to nearly two hours for LogMapBio (on biblio).

             biblio                      film
Matcher      time    stdev  F-m./s.     time       stdev  F-m./s.
AML          120     ±13%   .32         183        ±1%    .17
CroMatcher   1100    ±3%    .08         NaN
Lily         2211    ±1%    .04         2797       ±1%    .03
LogMap       194     ±5%    .28         40609      ±33%   .00
LogMapLt     96      ±10%   .48         116        ±0%    .44
PhenoMF      1632    ±8%    .00         1798       ±7%    .00
PhenoMM      1743    ±7%    .00         1909       ±7%    .00
PhenoMP      1833    ±7%    .00         1835       ±7%    .00
XMap         123     ±9%    .46         2981       ±21%   .02
LogMapBio    54439   ±6%    .00         193763419  ±32%   .00

Table 5. Aggregated benchmark results: time (in seconds), standard deviation on time, and points of F-measure per second spent on the two data sets.

Table 5 provides the average time, the standard deviation on time, and the points of F-measure (in hundredths) provided per second by each matcher. The F-measure points provided per second show


that the efficient matchers are, like two years ago, LogMapLite and XMap, followed by AML and LogMap. The correlation between time and F-measure only holds for these systems.

The time taken by most systems is far larger on film than on biblio, and the deviation from the average increased as well.

3.3 Conclusions

This year, there is no increase or decrease in the performance of the best matchers, which are roughly the same as in previous years. Precision is still preferred to recall by the best systems. It seems difficult for other matchers to catch up, both in terms of robustness and performance. This confirms the trend observed last year.

4 Anatomy

The anatomy test case confronts matchers with a specific type of ontologies from the biomedical domain. We focus on two fragments of biomedical ontologies which describe the human anatomy5 and the anatomy of the mouse6. This data set has been used since 2007 with some improvements over the years.

4.1 Experimental Setting

We conducted experiments by executing each system in its standard setting and we compare precision, recall, F-measure and recall+. The measure recall+ indicates the amount of detected non-trivial correspondences: the matched entities in a non-trivial correspondence do not have the same normalised label. The approach that generates only trivial correspondences is depicted as the baseline StringEquiv in the following section.
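A minimal sketch of recall+ under these definitions follows; the label normalisation shown here is an illustrative assumption, not the exact normalisation used by the track.

```python
def normalise(label):
    """Illustrative normalisation: lower-case and keep alphanumerics only."""
    return "".join(ch for ch in label.lower() if ch.isalnum())

def recall_plus(system, reference, labels):
    """Recall restricted to non-trivial reference correspondences.

    `system` and `reference` are sets of (entity1, entity2) pairs and
    `labels` maps each entity to its label.
    """
    non_trivial = {(a, b) for (a, b) in reference
                   if normalise(labels[a]) != normalise(labels[b])}
    if not non_trivial:
        return 0.0
    return len(system & non_trivial) / len(non_trivial)
```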

We ran the systems on a server with 3.46 GHz (6 cores) and 8GB RAM allocated to each matching system. Further, we used the SEALS client to execute our evaluation. However, we slightly changed the way precision and recall are computed, i.e., the results generated by the SEALS client vary in some cases by 0.5% compared to the results presented below. In particular, we removed trivial correspondences in the oboInOwl namespace like:

http://...oboInOwl#Synonym = http://...oboInOwl#Synonym

as well as correspondences expressing relations different from equivalence. Using the Pellet reasoner we also checked whether the generated alignment is coherent, i.e., that there are no unsatisfiable classes when the ontologies are merged with the alignment.
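A coherence check along these lines can be reproduced, for example, with owlready2, which can run the Pellet reasoner. This is a sketch under the assumption that the alignment has been materialised as owl:equivalentClass axioms in its own ontology file; file names are placeholders.

```python
from owlready2 import get_ontology, default_world, sync_reasoner_pellet

# Load the two ontologies and the alignment into the shared default world.
mouse = get_ontology("file://mouse.owl").load()          # placeholder paths
human = get_ontology("file://human.owl").load()
alignment = get_ontology("file://alignment.owl").load()  # equivalence axioms

# Classify the merged model with Pellet and list unsatisfiable classes.
sync_reasoner_pellet()
unsatisfiable = list(default_world.inconsistent_classes())
print(f"{len(unsatisfiable)} unsatisfiable classes")
```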

4.2 Results

Table 6 reports all the 13 participating systems that could generate an alignment. As in previous years, some of the systems participated with different versions. LogMap participated with LogMap, LogMapBio and a lightweight version LogMapLite that uses only some core components. Similarly, DKP-AOM also participated with two versions, DKP-AOM and DKP-AOM-Lite. Several systems participated in the anatomy track for the first time: Alin, FCA_Map, LPHOM and LYAM. There are also systems that have participated for several years in a row; LogMap has been a constant participant since 2011.

5 http://www.cancer.gov/cancertopics/cancerlibrary/terminologyresources/

6 http://www.informatics.jax.org/searches/AMA_form.shtml


AML and XMap joined the track in 2013. DKP-AOM, Lily and CroMatcher participate for the second year in a row in this track. Lily had participated in the track back in 2011, and CroMatcher participated in 2013 but did not produce an alignment within the given time frame. Thus, this year we have 10 different systems (not counting different versions) which generated an alignment. For more details, we refer the reader to the papers presenting the systems.

Matcher        Runtime  Size  Precision  F-measure  Recall  Recall+  Coherent
AML            47       1493  0.95       0.943      0.936   0.832    √
CroMatcher     573      1442  0.949      0.925      0.902   0.773    -
XMap           45       1413  0.929      0.896      0.865   0.647    √
LogMapBio      758      1531  0.888      0.892      0.896   0.728    √
FCA_Map        117      1361  0.932      0.882      0.837   0.578    -
LogMap         24       1397  0.918      0.88       0.846   0.593    √
LYAM           799      1539  0.863      0.869      0.876   0.682    -
Lily           272      1382  0.87       0.83       0.794   0.515    -
LogMapLite     20       1147  0.962      0.828      0.728   0.288    -
StringEquiv    -        946   0.997      0.766      0.622   0.000    -
LPHOM          1601     1555  0.709      0.718      0.727   0.497    -
Alin           306      510   0.996      0.501      0.335   0.0      √
DKP-AOM-Lite   372      207   0.99       0.238      0.135   0.0      √
DKP-AOM        379      207   0.99       0.238      0.135   0.0      √

Table 6. Comparison, ordered by F-measure, against the reference alignment; runtime is measured in seconds, and the "size" column refers to the number of correspondences in the generated alignment.

Unlike the last two editions of the track, when 6 systems generated an alignment in less than 100 seconds, this year only 4 were able to complete the alignment task in this time frame: AML, XMap, LogMap and LogMapLite. Similarly to the last 4 years, LogMapLite has the shortest runtime, followed by LogMap, XMap and AML. Depending on the specific version of the systems, they require between 20 and 50 seconds to match the ontologies. The table shows that there is no correlation between the quality of the generated alignment, in terms of precision and recall, and the required runtime. This result has also been observed in previous OAEI campaigns.

The table also shows the results for precision, recall and F-measure. In terms of F-measure, the top 5 ranked systems are AML, CroMatcher, XMap, LogMapBio and FCA_Map. LogMap is sixth, with an F-measure very close to FCA_Map's. All the long-term participants in the track showed results comparable (in terms of F-measure) to their last year's results, and at least as good as the results of the best systems in OAEI 2007-2010. LogMap and XMap generated practically the same number of correspondences as last year (XMap generated one correspondence more). AML and LogMapBio generated slightly different numbers: 16 correspondences more for AML and 18 fewer for LogMapBio.

The results for the DKP-AOM systems are identical this year; by contrast, last year the lite version performed significantly better in terms of the observed measures. While Lily had improved its 2015 results in comparison to 2011 (precision: from 0.814 to


0.870, recall: from 0.734 to 0.793, and F-measure: from 0.772 to 0.830), this year it performed similarly to last year. CroMatcher improved its results in comparison to last year. Out of all systems participating in the anatomy track, CroMatcher showed the largest improvement in the observed measures in comparison to its values from the previous edition of the track.

Comparing the F-measures of the new systems, FCA_Map (0.882) scored very close to one of the track's long-term participants, LogMap. Another of the new systems, LYAM, also achieved a good F-measure (0.869), which ranked sixth. As for the other two new systems, LPHOM achieved a slightly lower F-measure than the baseline (StringEquiv), whereas Alin was considerably below the baseline.

This year, 9 out of 13 systems achieved an F-measure higher than the baseline, which is based on (normalised) string equivalence (StringEquiv in the table). This is a slightly better result (percentage-wise) than last year's (9 out of 15) and similar to 2014's (7 out of 10). Two of the new participants in the track and the two DKP-AOM systems achieved an F-measure lower than the baseline. LPHOM scored under the StringEquiv baseline but, at the same time, it is the system that produced the highest number of correspondences. Its precision is significantly lower than that of the other three systems which scored under the baseline and generated only trivial correspondences.

This year seven systems produced coherent alignments, which is comparable to the last two years, when 7 out of 15 and 5 out of 10 systems achieved this. From the five best systems, only FCA_Map produced an incoherent alignment.

4.3 Conclusions

Like for OAEI in general, the number of participating systems in the anatomy track this year was lower than in 2015 and 2013 but higher than in 2014, and there was a combination of newly-joined systems and long-term participants.

The systems that participated in the previous edition scored similarly to their previous results, indicating that no substantial developments were made with regard to this track. Of the newly-joined systems, FCA_Map and LYAM ranked 4th and 6th with respect to the F-measure.

5 Conference

The conference test case requires matching several moderately expressive ontologies from the conference organisation domain.

5.1 Test data

The data set consists of 16 ontologies in the domain of organising conferences. These ontologies have been developed within the OntoFarm project7.

The main features of this test case are:
– Generally understandable domain. Most ontology engineers are familiar with organising conferences. Therefore, they can create their own ontologies as well as evaluate the alignments among their concepts with enough erudition.
– Independence of ontologies. Ontologies were developed independently and based on different resources; they thus capture the issues in organising conferences from different points of view and with different terminologies.

7 http://owl.vse.cz:8080/ontofarm/


– Relative richness in axioms. Most ontologies were equipped with OWL DL axioms of various kinds; this opens a way to use semantic matchers.

Ontologies differ in their numbers of classes and properties, in expressivity, but also in underlying resources.

5.2 Results

We provide results in terms of F-measure, comparison with baseline matchers, results from previous OAEI editions, and a precision/recall triangular graph based on sharp reference alignments. This year we can also provide a comparison between OAEI editions of results based on the uncertain version of the reference alignment and on violations of the consistency and conservativity principles.

Evaluation based on sharp reference alignments We evaluated the results of participants against blind reference alignments (labelled as rar2). This includes all pairwise combinations between 7 different ontologies, i.e., 21 alignments.

These reference alignments have been made in two steps. First, we generated them as a transitive closure computed on the original reference alignments. In order to obtain a coherent result, conflicting correspondences, i.e., those causing unsatisfiability, have been manually inspected and removed by evaluators. The resulting reference alignments are labelled as ra2. Second, we detected violations of conservativity using the approach from [39] and resolved them by an evaluator. The resulting reference alignments are labelled as rar2. As a result, the degree of correctness and completeness of the new reference alignments is probably slightly better than for the old ones. However, the differences are relatively limited. Whereas the new reference alignments are not open, the old reference alignments (labelled as ra1 on the conference web page) are available. These represent close approximations of the new ones.

Table 7 shows the results of all participants with regard to the reference alignment rar2. F0.5-measure, F1-measure and F2-measure are computed for the threshold that provides the highest average F1-measure. F1 is the harmonic mean of precision and recall where both are equally weighted; F2 weights recall higher than precision, and F0.5 weights precision higher than recall. The matchers shown in the table are ordered according to their highest average F1-measure. We employed two baseline matchers: edna (string edit distance matcher) is used within the benchmark test case and, with regard to performance, it is very similar to the previously used baseline2 in the conference track; StringEquiv is used within the anatomy test case. This year these baselines divide matchers into two performance groups. The first group consists of matchers (CroMatcher, AML, LogMap, XMap, LogMapBio, FCA_Map, DKP-AOM, NAISC and LogMapLite) having better (or the same) results than both baselines in terms of the highest average F1-measure. The other matchers (Lily, LPHOM, Alin and LYAM) performed worse than both baselines. The performance of all matchers (except LYAM) regarding their precision, recall and F1-measure is visualised in Figure 4. Matchers are represented as squares or triangles; baselines are represented as circles.
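The three scores are instances of the standard F_β measure:

F_\beta = (1+\beta^2) \cdot \frac{P \cdot R}{\beta^2 P + R}, \qquad \beta \in \{0.5, 1, 2\}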

Further, we evaluated the performance of matchers separately on classes and properties. We compared the position of tools within the overall performance groups and within the classes-only and properties-only performance groups. We observed that while the position of matchers changed slightly in the overall performance groups in comparison with


Matcher      Prec.  F0.5-m.  F1-m.  F2-m.  Rec.  Inc.Align.  Conser.V.  Consist.V.
CroMatcher   0.74   0.72     0.69   0.67   0.65  8           98         25
AML          0.78   0.74     0.69   0.65   0.62  0           52         0
LogMap       0.77   0.72     0.66   0.6    0.57  0           30         0
XMap         0.8    0.73     0.65   0.59   0.55  0           23         0
LogMapBio    0.72   0.67     0.61   0.56   0.53  0           30         0
FCA_Map      0.71   0.65     0.59   0.53   0.5   12          46         150
DKP-AOM      0.76   0.68     0.58   0.51   0.47  0           35         0
NAISC        0.77   0.67     0.57   0.49   0.45  20          321        701
edna         0.74   0.66     0.56   0.49   0.45
LogMapLite   0.68   0.62     0.56   0.5    0.47  6           99         81
StringEquiv  0.76   0.65     0.53   0.45   0.41
Lily         0.54   0.53     0.52   0.51   0.5   13          148        167
LPHOM        0.69   0.57     0.46   0.38   0.34  0           0          0
Alin         0.87   0.59     0.4    0.3    0.26  0           0          0
LYAM         0.4    0.31     0.23   0.18   0.16  1           75         3

Table 7. The highest average F[0.5|1|2]-measure and the corresponding precision and recall for each matcher with its F1-optimal threshold (ordered by F1-measure). Inc.Align. is the number of incoherent alignments; Conser.V. is the total number of conservativity principle violations; Consist.V. is the total number of consistency principle violations.

the classes-only performance groups, a couple of matchers (DKP-AOM and FCA_Map) worsened their position from the overall performance groups with regard to their position in the properties-only performance groups, due to the fact that they do not match properties at all (Alin and Lily also fall into this category). More details about these evaluation modalities are on the conference web page.

Comparison with previous years with regard to rar2 Seven matchers also participated in this test case in OAEI 2015. The largest improvement was achieved by CroMatcher (precision increased from .57 to .74 and recall increased from .47 to .65).

Evaluation based on uncertain version of reference alignments The confidence values of all matches in the sharp reference alignments for the conference track are all 1.0. For the uncertain version of this track, the confidence value of a match has been set equal to the percentage of a group of people who agreed with the match in question (this uncertain version is based on the reference alignment labelled ra1). One key thing to note is that the group was only asked to validate matches that were already present in the existing reference alignments, so some matches had their confidence value reduced from 1.0 to a number near 0, but no new match was added.

There are two ways that we can evaluate matchers according to these "uncertain" reference alignments, which we refer to as discrete and continuous. The discrete evaluation considers any match in the reference alignment with a confidence value of 0.5 or greater to be fully correct, and those with a confidence less than 0.5 to be fully incorrect. Similarly, a matcher's match is considered a "yes" if the confidence value is greater than or equal to the matcher's threshold and a "no" otherwise. In essence, this is the same as the "sharp" evaluation approach, except that some matches have been removed because less than half of the crowdsourcing group agreed with them. The continuous evaluation


[Figure 4: triangular graph residue removed; the plotted matchers are Alin, AML, CroMatcher, DKP-AOM, FCA-Map, Lily, LogMap, LogMapBio, LogMapLite, LPHOM, NAISC and XMap, plus the baselines edna and StringEquiv.]

Fig. 4. Precision/recall triangular graph for the conference test case. Dotted lines depict levels of precision/recall, while values of F1-measure are depicted by areas bordered by the corresponding lines F1-measure=0.[5|6|7].

strategy penalises a matcher more if it misses a match on which most people agree than if it misses a more controversial match. For instance, if A ≡ B has a confidence of 0.85 in the reference alignment and a matcher gives that correspondence a confidence of 0.40, then that is counted as 0.85 × 0.40 = 0.34 of a true positive and 0.85 − 0.40 = 0.45 of a false negative.
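A sketch of this continuous scoring, following the worked example above, is given below. The treatment of matches found by the system but absent from the reference is not spelled out in the text; the symmetric handling used here is an assumption.

```python
def continuous_scores(system, reference):
    """Continuous (uncertain) evaluation sketch.

    `system` and `reference` map correspondences to confidences in [0, 1].
    """
    tp = sum(r * system.get(c, 0.0) for c, r in reference.items())            # e.g. 0.85 * 0.40 = 0.34
    fn = sum(max(r - system.get(c, 0.0), 0.0) for c, r in reference.items())  # e.g. 0.85 - 0.40 = 0.45
    fp = sum(max(s - reference.get(c, 0.0), 0.0) for c, s in system.items())  # assumed symmetric handling
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```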

Out of the 13 matchers, three (DKP-AOM, FCA-Map and LogMapLite) use 1.0 as the confidence value for all matches they identify. Two of the remaining ten (Alin and CroMatcher) have some variation in confidence values, though the majority are 1.0. The rest of the systems have a fairly wide variation of confidence values. Last year, the majority of these values were near the upper end of the [0,1] range. This year we see much more variation in the average confidence values. For example, LogMap's confidence values range from 0.29 to 1.0 and average 0.78, whereas Lily's range from 0.22 to 0.41 with an average of 0.33.

Discussion When comparing the performance of the matchers on the uncertain reference alignments versus that on the sharp version, we see that in the discrete case all matchers performed slightly better. Improvement in F-measure ranged from 1 to 8 percentage points over the sharp reference alignment. This was driven by increased recall, which is a result of the presence of fewer "controversial" matches in the uncertain version of the reference alignment.

The performance of most matchers is similar regardless of whether a discrete or continuous evaluation methodology is used (provided that the threshold is optimised to achieve the highest possible F-measure in the discrete case). The primary exceptions


             Sharp                Discrete             Continuous
Matcher      Prec.  F1-m.  Rec.  Prec.  F1-m.  Rec.   Prec.  F1-m.  Rec.
Alin         0.89   0.40   0.26  0.89   0.48   0.33   0.89   0.48   0.33
AML          0.84   0.74   0.66  0.79   0.78   0.77   0.80   0.77   0.74
CroMatcher   0.79   0.73   0.68  0.71   0.74   0.77   0.72   0.74   0.77
DKP-AOM      0.82   0.62   0.50  0.78   0.67   0.59   0.78   0.69   0.61
FCA-Map      0.75   0.61   0.52  0.71   0.66   0.61   0.69   0.65   0.61
Lily         0.59   0.56   0.53  0.59   0.57   0.56   0.59   0.32   0.22
LogMap       0.82   0.69   0.59  0.78   0.73   0.68   0.80   0.67   0.57
LogMapBio    0.77   0.65   0.56  0.73   0.68   0.64   0.75   0.62   0.53
LogMapLite   0.73   0.59   0.50  0.73   0.67   0.62   0.72   0.67   0.63
LPHOM        0.76   0.47   0.34  0.81   0.59   0.46   0.48   0.47   0.47
Light YAM++  0.38   0.22   0.15  0.35   0.24   0.18   0.09   0.15   0.38
NAISC        0.85   0.61   0.47  0.87   0.69   0.57   0.34   0.45   0.68
XMap         0.85   0.68   0.57  0.81   0.73   0.67   0.83   0.74   0.67

Table 8. Precision, F-measure, and recall of the different matchers when evaluated using the sharp (ra1), discrete uncertain, and continuous uncertain metrics.

to this are Lily and NAISC. These matchers perform significantly worse when evaluated using the continuous version of the metrics. In Lily's case, this is because it assigns very low confidence values to some matches in which the labels are equivalent strings, which many crowdsourcers agreed with unless there was a compelling technical reason not to. This hurts recall, but using a low threshold value in the discrete version of the evaluation metrics "hides" this problem. NAISC has the opposite issue: it assigns relatively high confidence values to some matches that most people disagree with, such as "Assistant" and "Listener" (confidence value of 0.89). This hurts precision in the continuous case, but is taken care of by using a high threshold value (1.0) in the discrete case.

Seven matchers from this year also participated last year, and thus we are able to make some comparisons over time. The F-measures of all matchers either held constant or improved when evaluated against the uncertain reference alignments. Most matchers made modest gains (in the neighbourhood of 1 to 6 percentage points). CroMatcher made the largest improvement, and it is now the second-best matcher when evaluated in this way. AgreementMakerLight remains the top performer.

Perhaps more importantly, the difference in the performance of most matchers between the discrete and continuous evaluation has shrunk between this year and last year. This is an indication that more matchers are providing confidence values that reflect the disagreement of humans on various matches.

Evaluation based on violations of consistency and conservativity principles We performed an evaluation based on the detection of conservativity and consistency violations [39,40]. The consistency principle states that correspondences should not lead to unsatisfiable classes in the merged ontology; the conservativity principle states that correspondences should not introduce new semantic relationships between concepts from one of the input ontologies.

Table 7 summarises statistics per matcher. The table shows the number of unsatisfiable TBoxes after the ontologies are merged (Inc.Align.), the total number of all


conservativity principle violations within all alignments (Conser.V.) and the total number of all consistency principle violations (Consist.V.).

Seven tools (Alin, AML, DKP-AOM, LogMap, LogMapBio, LPHOM and XMap) have no consistency principle violations (in comparison to five last year), and one tool (LYAM) generated only one incoherent alignment. There are two tools (Alin, LPHOM) that have no conservativity principle violations, and four more that have an average of only one conservativity principle violation (XMap, LogMap, LogMapBio and DKP-AOM). We should note that these conservativity principle violations can be "false positives", since the entailment in the aligned ontology can be correct even though it was not derivable in the single input ontologies.

In conclusion, this year eight matchers performed better than both baselines on a reference alignment which is not only consistent but also conservative. Further, this year seven matchers generated coherent alignments (against five matchers last year and four matchers the year before). This confirms the trend that matchers increasingly generate coherent alignments. Based on the uncertain reference alignments, more matchers are providing confidence values that reflect the disagreement of humans on various matches.

6 Large biomedical ontologies (largebio)

The largebio test case requires matching the large and semantically rich biomedical ontologies FMA, SNOMED-CT, and NCI, which contain 78,989, 306,591 and 66,724 classes, respectively.

6.1 Test data

The test case has been split into three matching problems: FMA-NCI, FMA-SNOMED and SNOMED-NCI. Each matching problem has been further divided into 2 tasks involving differently sized fragments of the input ontologies: small overlapping fragments versus whole ontologies (FMA and NCI) or large fragments (SNOMED-CT).

The UMLS Metathesaurus [5] has been selected as the basis for reference alignments. UMLS is currently the most comprehensive effort for integrating independently-developed medical thesauri and ontologies, including FMA, SNOMED-CT, and NCI.

Although the standard UMLS distribution does not directly provide alignments (in the sense of [21]) between the integrated ontologies, it is relatively straightforward to extract them from the information provided in the distribution files (see [25] for details).

It has been noticed, however, that although the creation of UMLS alignments combines expert assessment and auditing protocols, they lead to a significant number of logical inconsistencies when integrated with the corresponding source ontologies [25].

Since alignment coherence is an aspect of ontology matching that we aim to promote, in previous editions we provided coherent reference alignments by refining the UMLS mappings using the Alcomo (alignment) debugging system [31], LogMap's (alignment) repair facility [24], or both [26].

However, concerns were raised about the validity and fairness of applying automated alignment repair techniques to make reference alignments coherent [35]. It is clear that using the original (incoherent) UMLS alignments would be penalising to ontology matching systems that perform alignment repair. However, using automatically repaired alignments would penalise systems that do not perform alignment repair and


also systems that employ a repair strategy that differs from that used on the reference alignments [35].

Thus, as of the 2014 edition, we arrived at a compromise solution that should be fair to all ontology matching systems. Instead of repairing the reference alignments as normal, by removing correspondences, we flagged the incoherence-causing correspondences in the alignments by setting the relation to "?" (unknown). These "?" correspondences are neither considered positive nor negative when evaluating the participating ontology matching systems; they are simply ignored. This way, systems that do not perform alignment repair are not penalised for finding correspondences that (despite causing incoherences) may or may not be correct, and systems that do perform alignment repair are not penalised for removing such correspondences.
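In scoring terms, the flagged correspondences are simply excluded from both sides of the comparison; a minimal sketch (the set-based representation is an assumption):

```python
def evaluate_ignoring_flags(system, reference_equiv, reference_flagged):
    """Precision/recall with '?'-flagged correspondences ignored:
    they count neither as positives nor as negatives."""
    considered = system - reference_flagged    # drop flagged mappings from the output
    tp = len(considered & reference_equiv)
    precision = tp / len(considered) if considered else 0.0
    recall = tp / len(reference_equiv) if reference_equiv else 0.0
    return precision, recall
```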

To ensure that this solution was as fair as possible to all alignment repair strategies, we flagged as unknown all correspondences suppressed by any of Alcomo, LogMap or AML [?], as well as all correspondences suppressed from the reference alignments of last year's edition (using Alcomo and LogMap combined). Note that we have used the (incomplete) repair modules of the above-mentioned systems.

The flagged UMLS-based reference alignment for the OAEI 2016 campaign is summarised in Table 9.

Reference alignment  "=" corresp.  "?" corresp.
FMA-NCI              2,686         338
FMA-SNOMED           6,026         2,982
SNOMED-NCI           17,210        1,634

Table 9. Respective sizes of the reference alignments.

6.2 Evaluation setting, participation and success

We have run the evaluation on an Ubuntu laptop with an Intel Core i7-4600U CPU @ 2.10GHz x 4, allocating 15GB of RAM. Precision, recall and F-measure have been computed with respect to the UMLS-based reference alignment. Systems have been ordered in terms of F-measure.

This year, out of the 21 systems participating in OAEI 2016, 13 were registered to participate in the largebio track, and 11 of these were able to cope with at least one of the largebio tasks within a 2-hour time frame. However, only 6 systems were able to complete more than one task, and only 4 systems completed all 6 tasks in this time frame.

6.3 Background knowledge

Regarding the use of background knowledge, LogMap-Bio uses BioPortal as a mediating ontology provider, that is, it retrieves from BioPortal the top-10 most suitable ontologies for the matching task.

LogMap uses normalisations and spelling variants from the general (biomedical) purpose UMLS Lexicon.

AML has three sources of background knowledge which can be used as mediators between the input ontologies: the Uber Anatomy Ontology (Uberon), the Human Disease Ontology (DOID) and the Medical Subject Headings (MeSH).


              FMA-NCI          FMA-SNOMED       SNOMED-NCI
System        Task 1  Task 2   Task 3  Task 4   Task 5  Task 6   Average  #
LogMapLite    1       10       2       18       8       18       10       6
AML           35      72       98      166      537     376      214      6
LogMap        10      80       60      433      177     699      243      6
LogMapBio     1,712   1,188    1,180   2,156    3,757   4,322    2,386    6
XMap          17      116      54      366      267     -        164      5
FCA-Map       236     -        1,865   -        -       -        1,051    2
Lily          699     -        -       -        -       -        699      1
LYAM          1,043   -        -       -        -       -        1,043    1
DKP-AOM       1,547   -        -       -        -       -        1,547    1
DKP-AOM-Lite  1,698   -        -       -        -       -        1,698    1
Alin          5,811   -        -       -        -       -        5,811    1
# Systems     11      6        5       5        5       4        1,351    36

Table 10. System runtimes (s) and task completion.

XMap uses synonyms provided by the UMLS Metathesaurus. Note that matching systems using the UMLS Metathesaurus as background knowledge have a notable advantage, since the largebio reference alignment is also based on the UMLS Metathesaurus.

6.4 Alignment coherence

Together with precision, recall, F-measure and runtimes, we have also evaluated the coherence of alignments. We report (1) the number of unsatisfiabilities when reasoning with the input ontologies together with the computed alignments, and (2) the ratio of unsatisfiable classes with respect to the size of the union of the input ontologies.
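That is, with C(O_i) denoting the classes of input ontology O_i, the reported ratio is:

\mathrm{degree} = \frac{\#\{\text{unsatisfiable classes}\}}{|C(O_1)| + |C(O_2)|}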

We have used the OWL 2 reasoner HermiT [33] to compute the number of unsatisfiable classes. For the cases in which HermiT could not cope with the input ontologies and the alignments (in less than 2 hours), we have provided a lower bound on the number of unsatisfiable classes (indicated by ≥) using the OWL 2 EL reasoner ELK [27].

In this OAEI edition, only three distinct systems have shown alignment repair facilities: AML, LogMap (with its LogMap-Bio variant), and XMap (which reuses the repair techniques from Alcomo [31]). Tables 11-12 (see the last two columns) show that even the most precise alignment sets may lead to a huge number of unsatisfiable classes. This proves the importance of using techniques to assess the coherence of the generated alignments if they are to be used in tasks involving reasoning. We encourage ontology matching system developers to develop their own repair techniques or to use state-of-the-art techniques such as Alcomo [31], the repair module of LogMap (LogMap-Repair) [24] or the repair module of AML [?], which have worked well in practice [26,22].

6.5 Runtimes and task completion

Table 10 shows which systems were able to complete each of the matching tasks in less than 24 hours, together with the required computation times. Systems have been ordered with respect to the number of completed tasks and the average time required to complete them. Times are reported in seconds.

The last column reports the number of tasks that a system could complete; for example, 4 systems were able to complete all six tasks. The last row shows the number of systems that could finish each of the tasks. The tasks involving SNOMED were the hardest with respect to both computation times and the number of systems that completed them.


Task 1: small FMA and NCI fragments

System | Time (s) | # Corresp. | Prec. | F-m. | Rec. | Unsat. | Degree
XMap^8 | 17 | 2,649 | 0.98 | 0.94 | 0.90 | 2 | 0.019%
FCA-Map | 236 | 2,834 | 0.95 | 0.94 | 0.92 | 4,729 | 46.0%
AML | 35 | 2,691 | 0.96 | 0.93 | 0.90 | 2 | 0.019%
LogMap | 10 | 2,747 | 0.95 | 0.92 | 0.90 | 2 | 0.019%
LogMapBio | 1,712 | 2,817 | 0.94 | 0.92 | 0.91 | 2 | 0.019%
LogMapLite | 1 | 2,483 | 0.97 | 0.89 | 0.82 | 2,045 | 19.9%
Average | 1,164 | 2,677 | 0.85 | 0.80 | 0.78 | 2,434 | 23.7%
LYAM | 1,043 | 3,534 | 0.72 | 0.80 | 0.89 | 6,880 | 66.9%
Lily | 699 | 3,374 | 0.60 | 0.66 | 0.72 | 9,273 | 90.2%
Alin | 5,811 | 1,300 | 1.00 | 0.62 | 0.46 | 0 | 0.0%
DKP-AOM-Lite | 1,698 | 2,513 | 0.65 | 0.61 | 0.58 | 1,924 | 18.7%
DKP-AOM | 1,547 | 2,513 | 0.65 | 0.61 | 0.58 | 1,924 | 18.7%

Task 2: whole FMA and NCI ontologies

System | Time (s) | # Corresp. | Prec. | F-m. | Rec. | Unsat. | Degree
XMap^8 | 116 | 2,681 | 0.90 | 0.87 | 0.85 | 9 | 0.006%
AML | 72 | 2,968 | 0.84 | 0.85 | 0.87 | 10 | 0.007%
LogMap | 80 | 2,693 | 0.85 | 0.83 | 0.80 | 9 | 0.006%
LogMapBio | 1,188 | 2,924 | 0.82 | 0.83 | 0.84 | 9 | 0.006%
Average | 293 | 2,948 | 0.82 | 0.82 | 0.84 | 5,303 | 3.6%
LogMapLite | 10 | 3,477 | 0.67 | 0.74 | 0.82 | 26,478 | 18.1%

Table 11. Results for the FMA-NCI matching problem.


6.6 Results for the FMA-NCI matching problem

Table 11 summarises the results for the tasks in the FMA-NCI matching problem. XMap and FCA-Map achieved the highest F-measure in Task 1; XMap and AML in Task 2. Note, however, that the use of background knowledge based on the UMLS Metathesaurus has an important impact on the performance of XMap^8. The use of background knowledge led to an improvement in recall for LogMap-Bio over LogMap in both tasks, but this came at the cost of precision, resulting in the two variants of the system having identical F-measures.

Note that the effectiveness of the systems decreased from Task 1 to Task 2. One reason for this is that with larger ontologies there are more plausible mapping candidates, and thus it is harder to attain both a high precision and a high recall. Another reason is that the very scale of the problem constrains the matching strategies that systems can employ: AML, for example, forgoes its computationally more complex matching algorithms when handling very large ontologies, due to efficiency concerns.

^8 Uses background knowledge based on the UMLS Metathesaurus, which is the basis of the largebio reference alignments.


Task 3: small FMA and SNOMED fragments

System | Time (s) | # Corresp. | Prec. | F-m. | Rec. | Unsat. | Degree
XMap^8 | 54 | 7,311 | 0.99 | 0.91 | 0.85 | 0 | 0.0%
FCA-Map | 1,865 | 7,649 | 0.94 | 0.86 | 0.80 | 14,603 | 61.8%
AML | 98 | 6,554 | 0.95 | 0.82 | 0.73 | 0 | 0.0%
LogMapBio | 1,180 | 6,357 | 0.94 | 0.80 | 0.70 | 1 | 0.004%
LogMap | 60 | 6,282 | 0.95 | 0.80 | 0.69 | 1 | 0.004%
Average | 543 | 5,966 | 0.96 | 0.76 | 0.66 | 2,562 | 10.8%
LogMapLite | 2 | 1,644 | 0.97 | 0.34 | 0.21 | 771 | 3.3%

Task 4: whole FMA ontology with SNOMED large fragment

System | Time (s) | # Corresp. | Prec. | F-m. | Rec. | Unsat. | Degree
XMap^8 | 366 | 7,361 | 0.97 | 0.90 | 0.84 | 0 | 0.0%
AML | 166 | 6,571 | 0.88 | 0.77 | 0.69 | 0 | 0.0%
LogMap | 433 | 6,281 | 0.84 | 0.72 | 0.63 | 0 | 0.0%
LogMapBio | 2,156 | 6,520 | 0.81 | 0.71 | 0.64 | 0 | 0.0%
Average | 627 | 5,711 | 0.87 | 0.69 | 0.60 | 877 | 0.4%
LogMapLite | 18 | 1,822 | 0.85 | 0.34 | 0.21 | 4,389 | 2.2%

Table 12. Results for the FMA-SNOMED matching problem.


The size of Task 2 proved to be a problem for several systems, which were unable to complete it within the allotted time: FCA-Map, LYAM, Lily, Alin, DKP-AOM-Lite and DKP-AOM.

6.7 Results for the FMA-SNOMED matching problem

Table 12 summarises the results for the tasks in the FMA-SNOMED matching problem. XMap produced the best results in terms of both recall and F-measure in Task 3 and Task 4, but again, we must highlight that it uses background knowledge based on the UMLS Metathesaurus. Among the other systems, FCA-Map and AML achieved the highest F-measure in Tasks 3 and 4, respectively.

Overall, the quality of the results was lower than that observed in the FMA-NCI matching problem, as the matching problem is considerably larger. Indeed, several systems were unable to complete even the smaller Task 3 within the allotted time: LYAM, Lily, Alin, DKP-AOM-Lite and DKP-AOM.

As in the FMA-NCI matching problem, the effectiveness of all systems decreases as the ontology size increases from Task 3 to Task 4; FCA-Map could complete the former but not the latter.

6.8 Results for the SNOMED-NCI matching problem

Table 13 summarises the results for the tasks in the SNOMED-NCI matching problem.

AML achieved the best results in terms of both recall and F-measure in Tasks 5 and 6, while LogMap and AML achieved the best results in terms of precision in Tasks 5 and 6, respectively.


Task 5: small SNOMED and NCI fragments

System | Time (s) | # Corresp. | Prec. | F-m. | Rec. | Unsat. | Degree
AML | 537 | 13,584 | 0.90 | 0.80 | 0.71 | 0 | 0.0%
LogMap | 177 | 12,371 | 0.92 | 0.77 | 0.66 | 0 | 0.0%
LogMapBio | 3,757 | 12,960 | 0.90 | 0.77 | 0.68 | 0 | 0.0%
Average | 949 | 13,302 | 0.91 | 0.75 | 0.64 | ≥12,090 | ≥16.1%
XMap^8 | 267 | 16,657 | 0.91 | 0.70 | 0.56 | 0 | 0.0%
LogMapLite | 8 | 10,942 | 0.89 | 0.69 | 0.57 | ≥60,450 | ≥80.4%

Task 6: whole NCI ontology with SNOMED large fragment

System | Time (s) | # Corresp. | Prec. | F-m. | Rec. | Unsat. | Degree
AML | 376 | 13,175 | 0.90 | 0.77 | 0.67 | ≥2 | ≥0.001%
LogMapBio | 4,322 | 13,477 | 0.84 | 0.72 | 0.64 | ≥6 | ≥0.003%
Average | 1,353 | 12,942 | 0.85 | 0.72 | 0.62 | 37,667 | 19.9%
LogMap | 699 | 12,222 | 0.87 | 0.71 | 0.60 | ≥4 | ≥0.002%
LogMapLite | 18 | 12,894 | 0.80 | 0.66 | 0.57 | ≥150,656 | ≥79.5%

Table 13. Results for the SNOMED-NCI matching problem.

The overall performance of the systems was lower than in the FMA-SNOMED case, as this test case is even larger. As such, Lily, DKP-AOM-Lite, DKP-AOM, FCA-Map, Alin and LYAM could not complete even the smaller Task 5 within 2 hours.

As in the previous matching problems, effectiveness decreases as the ontology size increases, and XMap completed Task 5 but failed to complete Task 6 within the given time frame.

Unlike in the FMA-NCI and FMA-SNOMED matching problems, the use of the UMLS Metathesaurus did not positively impact the performance of XMap, which obtained lower results than expected.

7 Disease and Phenotype Track (phenotype)

The Pistoia Alliance Ontologies Mapping project team^9 has organised this track based on a real use case where it is required to find alignments between disease and phenotype ontologies. Specifically, the selected ontologies are the Human Phenotype Ontology (HPO), the Mammalian Phenotype Ontology (MP), the Human Disease Ontology (DOID), and the Orphanet and Rare Diseases Ontology (ORDO).

7.1 Test data

There are two tasks in this track, which comprise the pairwise alignment of:

– Human Phenotype Ontology (HPO) to Mammalian Phenotype Ontology (MP), and
– Human Disease Ontology (DOID) to Orphanet and Rare Diseases Ontology (ORDO).

The first task is important for translational science, since mammalian model animals such as mice are widely used to study human diseases and their underlying genetics.

9 http://www.pistoiaalliance.org/projects/ontologies-mapping/


Mapping human phenotypes to mammalian phenotypes greatly facilitates the extrapolation from model animals to humans.

The second task is critical to ensure interoperability between two disease ontologies, the more generic DOID and the more specific ORDO, in the domain of rare human diseases. These are fundamental for understanding how genetic variation can cause disease.

Currently, mappings between these ontologies are mostly curated by bioinformatics and disease experts, who would benefit from the integration of automated ontology matching algorithms into their workflows.

7.2 Evaluation setting

We have run the evaluation on an Ubuntu Laptop with an Intel Core i7-4600U CPU @ 2.10GHz x 4, allocating 15Gb of RAM.

In the OAEI 2016 phenotype track, 11 out of the 21 participating OAEI 2016 systems were able to cope with at least one of the tasks within a 24-hour time frame.

7.3 Evaluation criteria

Systems have been evaluated according to the following criteria:

– Semantic precision and recall with respect to silver standards automatically generated by voting, based on the outputs of all participating systems (we have used vote=2 and vote=3)^10; a sketch of this voting scheme is given after this list.
– Semantic recall with respect to manually generated correspondences for three areas (carbohydrate, obesity and breast cancer).
– Manual assessment of a subset of the generated correspondences, especially the ones that are not suggested by other systems, i.e., unique mappings.
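To make the voting scheme concrete, here is a minimal sketch (not the track's actual tooling); it assumes each system family contributes one alignment reduced to a set of canonical mapping strings such as "srcIRI|tgtIRI".

```java
import java.util.*;

public class SilverStandard {
    /** Keep every mapping proposed by at least minVotes system families
        (vote=2 or vote=3 in this track's setting). */
    static Set<String> build(Collection<Set<String>> alignments, int minVotes) {
        Map<String, Integer> votes = new HashMap<>();
        for (Set<String> alignment : alignments)   // one entry per system family
            for (String mapping : alignment)
                votes.merge(mapping, 1, Integer::sum);
        Set<String> silver = new HashSet<>();
        for (Map.Entry<String, Integer> e : votes.entrySet())
            if (e.getValue() >= minVotes)
                silver.add(e.getKey());
        return silver;
    }
}
```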

We have used the OWL 2 reasoner HermiT to calculate the semantic precision and recall. For example, a positive hit means that a mapping in the reference has been (explicitly) included in the output mappings, or that it can be inferred by reasoning from the input ontologies and the output mappings. The use of semantic values for precision and recall also allowed us to provide a fair comparison for the systems PhenoMF, PhenoMM and PhenoMP, which discover many subsumption mappings that are not explicitly in the silver standards but may still be valid, i.e., inferred.
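A common way to formalise these semantic variants, consistent with the description above (the exact formulation used by the track may differ), is the following, where O1 and O2 are the input ontologies, A the output alignment and R the reference:

\[
\mathrm{Prec}_{sem} = \frac{|\{\, a \in A : O_1 \cup O_2 \cup R \models a \,\}|}{|A|},
\qquad
\mathrm{Rec}_{sem} = \frac{|\{\, r \in R : O_1 \cup O_2 \cup A \models r \,\}|}{|R|}
\]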

7.4 Use of background knowledge

LogMapBio uses BioPortal as a mediating ontology provider, that is, it retrieves from BioPortal the top 10 most suitable ontologies for the matching task.

LogMap uses normalisations and spelling variants from the general (biomedical) purpose UMLS Lexicon.

AML has three sources of background knowledge which can be used as mediators between the input ontologies: the Uber Anatomy Ontology (Uberon), the Human Disease Ontology (DOID) and the Medical Subject Headings (MeSH). Additionally, for the HPO-MP test case, it uses the logical definitions of both ontologies, which define some of their classes as a combination of an anatomic term, i.e., a class from either FMA or Uberon, with a phenotype modifier term, i.e., a class from the Phenotypic Quality Ontology.

10 When there are several systems of the same family, only one of them votes, to avoid bias. There can still be some bias through systems exploiting the same resource, e.g., UMLS.


Table 14. Results against silver standard with vote 2 and 3.


XMap uses synonyms provided by the UMLS Metathesaurus. PhenoMM, PhenoMF and PhenoMP rely on different versions of the PhenomeNET^11 ontology with variable complexity.

7.5 Results

AML, FCA-Map, LogMap, LogMapBio, and PhenoMF produced the most complete results according to both the automatic and the manual evaluation.

Results against the silver standards The silver standards with vote 2 and 3 for HP-MP contain 2,308 and 1,588 mappings, respectively, while for DOID-ORDO they include 1,883 and 1,617 mappings, respectively. Table 14 shows the results achieved by each of the participating systems. We deliberately did not rank the systems, since the silver standards only allow us to assess how systems perform in comparison with one another. On the one hand, some of the mappings in the silver standard may be erroneous (false positives), as all it takes for that is for 2 or 3 systems to agree on part of the erroneous mappings they find. On the other hand, the silver standard is not complete, as there will likely be correct mappings that no system is able to find; and, as we will show in the manual evaluation, there are a number of mappings found by only one system (and therefore not in the silver standard) which are correct. Nevertheless, the results with respect to the silver standards do provide some insights into the performance of the systems, which is why we highlighted in the table the 5 systems that produce results closest to the silver standards: AML, FCA-Map, LogMap, LogMapBio, and PhenoMF.

Results against manually created mappings The manually generated mappings for three areas (carbohydrate, obesity and breast cancer) include 29 mappings between HP and MP and 60 mappings between DOID and ORDO. Most of them represent subsumption relationships. Table 15 shows the results in terms of recall for each of the systems. PhenoMF, PhenoMP and PhenoMM achieve very good results for HP-MP, since they discover a large number of subsumption mappings. However, for DOID-ORDO, only LogMap, LogMapBio and DiSMatch discover some of the mappings in the curated set.

11 http://aber-owl.net/ontology/PhenomeNET


Table 15. Recall against manually created mappings.

Table 16. Unique mappings in the HP-MP task.

Table 17. Unique mappings in the DOID-ORDO task.


Manual assessment of unique mappings Tables 16 and 17 show the precision results of the manual assessment of the unique mappings generated by the participating systems. Unique mappings are correspondences that no other system (explicitly) provided in the output. We manually evaluated up to 30 mappings, and we focused the assessment on unique equivalence mappings.

For example, DiSMatch's output contains 291 unique mappings in the HP-MP task. The manual assessment revealed an (estimated) precision of 0.8333. In order to also take into account the number of unique mappings that a system is able to discover, Tables 16 and 17 also include the positive and negative contribution of the unique mappings with respect to the total unique mappings discovered by all participating systems.



8 MultiFarm

The MultiFarm data set [32] aims at evaluating the ability of matching systems to deal with ontologies in different natural languages. This data set results from the translation of 7 ontologies from the conference track (cmt, conference, confOf, iasted, sigkdd, ekaw and edas) into 10 languages: Arabic, Chinese, Czech, Dutch, French, German, Italian, Portuguese, Russian, and Spanish. It is composed of 55 pairs of languages (see [32] for details on how the original MultiFarm data set was generated). For each pair, taking into account the alignment direction (cmt_en–confOf_de and cmt_de–confOf_en, for instance, count as two distinct matching tasks), we have 49 matching tasks (7 × 7 ontology combinations). The whole data set is thus composed of 55 × 49 matching tasks.

8.1 Experimental setting

Part of the data set is used for blind evaluation. This subset includes all matching tasks involving the edas and ekaw ontologies (resulting in 55 × 24 matching tasks). This year, we have conducted a minimalistic evaluation and focused on the blind data set. Participants were able to test their systems on the available subset of matching tasks (open evaluation), available via the SEALS repository. The open subset covers 45 × 25 tasks; it does not include the Italian translations.

We distinguish two types of matching tasks: (i) those tasks where two different ontologies (cmt–confOf, for instance) have been translated into two different languages; and (ii) those tasks where the same ontology (cmt–cmt) has been translated into two different languages. For the tasks of type (ii), good results are not directly related to the use of specific techniques for dealing with cross-lingual ontologies, but to the ability to exploit the identical structure of the ontologies.

For the sake of simplicity, in the following we refer to systems implementing cross-lingual matching strategies as cross-lingual systems, and to those without such a feature as non-cross-lingual systems.

This year, there were 7 cross-lingual systems (out of 21): AML, CroLOM-Lite, IOMAP (renamed SimCat-Lite), LogMap, LPHOM, LYAM++, and XMap. Among these systems, only CroLOM-Lite and SimCat-Lite are specifically designed for this task. The reader can refer to the OAEI papers for a detailed description of the strategies adopted by each system.

The number of participants has in fact increased with respect to the last campaign (5 in 2015, 3 in 2014, 7 in 2013, and 7 in 2012).

Following the OAEI evaluation rules, all systems should be evaluated in all tracks, although it is expected that some systems produce bad or no results. For this track, we observed different behaviours:

– CroMatcher and LYAM experienced internal errors but were able to generate alignments for less than half of the tasks;
– Alin and Lily generated no errors but empty alignments for all tasks;
– DKP-AOM and DKP-AOM-Lite were executed without errors but generated alignments for less than half of the tasks;


– NAISC mostly generated erroneous correspondences (for very few tasks), and RiMOM basically generated correspondences between ontology annotations;
– dedicated systems (FCA-Map, LogMapBio, PhenoMF, PhenoMM and PhenoMP) required more than 30 minutes (on average) to complete a single task and were not evaluated.

In the following, we report the results for the systems dedicated to the task or that were able to provide non-empty alignments for some tasks. This amounts to 12 systems (out of 21 participants).

8.2 Execution setting and runtime

The systems have been executed on an Ubuntu Linux machine configured with 8GB of RAM and an Intel Core 2.00GHz x4 processor. All measurements are based on a single run. As Table 18 shows, there are large differences in the time required for a system to complete the 55 × 24 matching tasks. Note as well that the concurrent access to the SEALS repositories during the evaluation period may have an impact on the time required for completing the tasks.

8.3 Evaluation results

Table 18 presents the aggregated results for the matching tasks involving the edas and ekaw ontologies. They have been computed using the Alignment API 4.6 and can slightly differ from those computed with the SEALS client. We have not applied any threshold on the generated alignments. They are measured in terms of classical precision and recall (future evaluations should include weighted and semantic metrics).

For both types of tasks, most systems favour precision to the detriment of recall. The exception is LPHOM, which generated huge sets of correspondences (together with LYAM). As expected, (most) cross-lingual systems outperform the non-cross-lingual ones (the exceptions are LPHOM, LYAM and XMap, which have low performance for different reasons, i.e., many internal exceptions or a poor ability to deal with the specifics of the task). On the other hand, this year, many non-cross-lingual systems dealing with matching at schema level were executed with errors (CroMatcher, GA4OM) or were not able to deal with the tasks at all (Alin, Lily, NAISC). Hence, their structural strategies could not in fact be evaluated (tasks of type ii). For both tasks, DKP-AOM and DKP-AOM-Lite have good performance in terms of precision but generate few correspondences, for less than half of the matching tasks.

In particular, for the tasks of type (i), AML outperforms all other systems in terms of F-measure, followed by LogMap, CroLOM-Lite and SimCat-Lite. However, LogMap outperforms all systems in terms of precision, keeping a relatively good performance in terms of recall. For tasks of type (ii), AML decreases in performance, with LogMap keeping its good results and outperforming all systems, followed by CroLOM-Lite and SimCat-Lite.

With respect to the pairs of languages for test cases of type (i), for the sake of brevity, we do not present them here. The reader can refer to the OAEI results web page for detailed results for each of the 55 pairs. While non-cross-lingual systems were not able to deal with many pairs of languages (in particular those involving the ar, cn, and ru languages), only 4 cross-lingual systems were able to deal with all pairs of languages


Type (i): 22 tests per pair; Type (ii): 2 tests per pair.

System | Time | #pairs | Size (i) | Prec. (i) | F-m. (i) | Rec. (i) | Size (ii) | Prec. (ii) | F-m. (ii) | Rec. (ii)
AML | 102 | 55 | 13.45 | .51(.51) | .45(.45) | .40(.40) | 39.99 | .92(.92) | .31(.31) | .19(.19)
CroLOM-Lite | 5501 | 55 | 8.56 | .55(.55) | .36(.36) | .28(.28) | 38.76 | .89(.90) | .40(.40) | .26(.26)
LogMap | 166 | 55 | 7.27 | .71(.71) | .37(.37) | .26(.26) | 52.81 | .96(.96) | .44(.44) | .30(.30)
LPHOM | 2497 | 34 | 84.22 | .01(.02) | .02(.04) | .08(.08) | 127.91 | .13(.22) | .13(.21) | .13(.13)
LYAM | 1367 | 24 | 177.30 | .01(.00) | .006(.01) | .00(.00) | 283.95 | .03(.07) | .02(.07) | .03(.03)
SimCat-Lite | 3938 | 54 | 7.07 | .59(.60) | .34(.35) | .25(.25) | 30.11 | .90(.93) | .33(.34) | .21(.21)
XMap | 134 | 31 | 3.93 | .30(.54) | .01(.01) | .00(.00) | .00 | .00(.00) | .00(.00) | .00(.00)
CroMatcher | 65 | 25 | 2.91 | .29(.64) | .004(.01) | .00(.00) | .00 | .00(.00) | .00(.00) | .00(.00)
DKP-AOM | 34 | 24 | 2.58 | .42(.98) | .03(.08) | .02(.02) | 4.37 | .49(1.0) | .01(.03) | .01(.07)
DKP-AOM-Lite | 35 | 24 | 2.58 | .42(.98) | .03(.08) | .02(.02) | 4.37 | .49(1.0) | .01(.03) | .01(.01)
LogMapLite | 21 | 55 | 1.16 | .35(.35) | .04(.09) | .02(.02) | 94.50 | .01(.01) | .01(.01) | .01(.01)
NAISC | 905 | 55 | 1.94 | .00(.00) | .00(.01) | .00(.00) | 1.84 | .01(.01) | .00(.01) | .00(.01)

Table 18. MultiFarm aggregated results per matcher, for each type of matching task: different ontologies (i) and same ontologies (ii). Time is measured in minutes (for completing the 55 × 24 matching tasks). #pairs indicates the number of pairs of languages for which the tool is able to generate (non-empty) alignments. Size indicates the average number of generated correspondences for the tests where a (non-empty) alignment has been generated. Two kinds of results are reported: those not distinguishing empty and erroneous (or not generated) alignments, and those, indicated between parentheses, considering only non-empty generated alignments for a pair of languages.

(AML, CroLOM-Lite, LogMap and SimCat-Lite). LPHOM particularly experienced problems with the pairs involving cn and cz.

Non-cross-lingual systems take advantage of the similarities in the lexicon of some languages, in the absence of specific strategies. This can be corroborated by the fact that most of them generate their best F-measure for the pair es-pt (followed by de-en, fr-pt and it-pt). This (expected) fact has been observed along the campaigns. Another previously observed behaviour is that, although it is likely harder to find correspondences between cz-pt than es-pt, for some non-cross-lingual systems this pair is among their top F-measures.

Comparison with previous campaigns. The number of cross-lingual participants increased this year with respect to the last 2 campaigns (7 in 2016, 5 in 2015, 3 in 2014, 7 in 2013 and 2012, and 3 in 2011.5). This year, 4 systems also participated last year (AML, LogMap, LYAM, and XMap), and there are 3 new systems (CroLOM-Lite, LPHOM, SimCat-Lite).

Comparing with the results from last year, in terms of F-measure and with respect to the blind evaluation (cases of type i), AML slightly decreased its performance (.45 in 2016 and .47 in 2015). LogMap (and LogMapLite) maintained its performance (.37), while XMap decreased considerably in terms of recall but largely improved its execution time. The newcomers specifically dedicated to the task, CroLOM-Lite and SimCat-Lite, obtained F-measures near that of LogMap.

With respect to non-cross-lingual systems, last year CroMatcher finished the task without errors, which explains its better performance then, while DKP-AOM kept the same results.


8.4 Conclusion

Of the 21 participants, half have been evaluated in this track. While some cross-lingual systems were not able to fully deal with the difficulties of the task, some others were not able to complete many tests due to internal errors, which is also the case for some non-cross-lingual systems.

In terms of performance, the F-measure for blind tests remains relatively stable across campaigns. AML and LogMap keep their positions with respect to the previous campaigns, followed this year by the new systems CroLOM-Lite and SimCat-Lite. Still, all systems privilege precision to the detriment of recall.

As expected, systems implementing specific methods for dealing with ontologies in different languages outperform non-specific systems. Still, cross-lingual approaches are mainly based on translation strategies, and the combination of other resources (such as cross-lingual links in Wikipedia or BabelNet) and strategies (machine learning, indirect alignment composition) remains underexploited. For most systems, the strategy consists of integrating one translation step before the matching itself.

Finally, this year, a minimalistic evaluation has been conducted (results have not been reported for the open data set). Furthermore, systems should also be evaluated using weighted and semantic measures. Multilingual tasks should also be considered and compared against cross-lingual settings.

9 Interactive matching

The interactive matching track was organised at OAEI 2016 for the fourth time. The goal of this evaluation is to simulate interactive matching [34,13], where a human expert is involved to validate correspondences found by the matching system. In the evaluation, we look at how interacting with the user improves the matching results. Further, we look at how the results of the matching systems are influenced when the experts make mistakes. Currently, this track evaluates neither the user experience nor the user interfaces of the systems [23].

9.1 Data sets

In this edition, we expanded the Interactive track and used data sets from four other OAEI tracks: Anatomy (Section 4), Conference (Section 5), LargeBio (Section 6), and Phenotype (Section 7). For details on the data sets, please refer to their respective sections.

9.2 Experimental setting

The Interactive track relies on the SEALS client's oracle class to simulate user interactions. An interactive matching system can present a correspondence to the oracle, which will tell the system whether that correspondence is correct or wrong. This year we have extended this functionality by allowing a system to present a collection of mappings simultaneously to the oracle.

To simulate the possibility of user errors, the oracle can be set to reply with a given error probability (randomly, from a uniform distribution). We evaluated systems with four different error rates: 0.0 (perfect oracle), 0.1, 0.2, and 0.3.
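The following sketch illustrates the mechanism (it is not the SEALS client's actual oracle class; the names are invented for the example):

```java
import java.util.Random;
import java.util.Set;

/** Simulated validator: answers whether a correspondence is correct,
    giving the wrong answer with probability errorRate. */
public class NoisyOracle {
    private final Set<String> reference;   // correct mappings, canonically encoded
    private final double errorRate;        // 0.0, 0.1, 0.2 or 0.3 in this evaluation
    private final Random rnd = new Random();

    public NoisyOracle(Set<String> reference, double errorRate) {
        this.reference = reference;
        this.errorRate = errorRate;
    }

    public boolean check(String mapping) {
        boolean truth = reference.contains(mapping);
        // Uniform draw: flip the true answer with probability errorRate
        return rnd.nextDouble() < errorRate ? !truth : truth;
    }
}
```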

The evaluations of the Conference and Anatomy data sets were run on a server with 3.46 GHz (6 cores) and 8GB RAM allocated to the matching systems. Each system was run ten times, and the final result of a system for each error rate represents the average of these runs. This is the same configuration as was used in the non-interactive version of the Anatomy track, so runtimes in the interactive version of this track are directly comparable. For the Conference data set with the ra1 alignment, we considered the macro-average of precision and recall over the different ontology pairs, while the number of interactions represents the total number of interactions in all tasks. Finally, the ten runs are averaged.

The Phenotype and LargeBio evaluations were run on an Ubuntu Laptop with an Intel Core i7-4600U CPU @ 2.10GHz x 4, allocating 15Gb of RAM. Each system was run only once due to the time required to run some of the systems. Since errors are randomly introduced, we expect only minor variations between runs.

9.3 Evaluation

The results are presented for each data set separately: Tables 19-22 and Figure 5 for the Anatomy data set, Tables 23-26 and Figure 6 for the Conference data set, Tables 27-34 and Figures 7-8 for the Disease and Phenotype data set^12, and Tables 35-42 and Figures 9-10 for the LargeBio data set^13.

For the tables we present the following information (column names in parentheses).

– The running time of the systems (Time) in seconds.
– The number of unsatisfiable classes resulting from the alignments, computed as detailed in Section 6 – only for the LargeBio data set.
– The performance of the systems, measured using Precision (Prec.), Recall (Rec.) and F-measure (F-m.) with respect to the fixed reference alignment. For the Anatomy track we also present Recall+ (Rec.+) as in Section 4.
– To be able to compare the systems with and without interaction, we also provide the performance results from the original tracks in Precision non-interactive (Prec. non), Recall non-interactive (Rec. non), F-measure non-interactive (F-m. non) and Recall+ non-interactive (Rec.+ non). For ease of reading, this information is duplicated in each table.
– When the oracle makes mistakes, it essentially uses a modified reference alignment. The performance of the system with respect to this modified reference alignment is given in Precision oracle (Prec. oracle), Recall oracle (Rec. oracle) and F-measure oracle (F-m. oracle). We note that for a perfect oracle these values are the same as the Precision (Prec.), Recall (Rec.) and F-measure (F-m.) values, respectively.
– Total requests (Tot. Reqs.) represents the number of distinct user interactions with the tool, where each interaction can contain one or more correspondences that can be analysed simultaneously.

^12 Alin could not complete any of the Phenotype tasks, while XMap did not request any user interaction in the HP-MP data set and thus only participated de facto in the DOID-ORDO data set.

^13 We have used only the small FMA-NCI and SNOMED-NCI matching tasks of the LargeBio track (see Section 6) for interactive evaluation. Alin could only complete the small FMA-NCI task.


Fig. 5. Time intervals between requests to the user/oracle for the Anatomy data set (whiskers: Q1-1.5IQR, Q3+1.5IQR, IQR=Q3-Q1). The labels under the system names show the average number of requests and the mean time between the requests for the ten runs.

– In distinct mappings (Dist. Mapps) the mappings that are not conflicting are counted individually; and if more than three mappings are given, they are all counted independently, regardless of whether they are conflicting.
– We provide the true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) regarding the distinct mapping requests.
– Finally, we provide the performance of the oracle in positive precision (Pos. Prec.) and negative precision (Neg. Prec.). These are the fractions of positive, respectively negative, answers given by the oracle that are correct (see the formulas below). We note that for a perfect oracle these values are always equal to 1.
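In terms of the distinct-mapping counts above, these two values can be written as:

\[
\mathrm{Pos.\,Prec.} = \frac{TP}{TP + FP},
\qquad
\mathrm{Neg.\,Prec.} = \frac{TN}{TN + FN}
\]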

The figures show the time intervals between the questions to the user/oracle for the different systems and error rates. Different runs are depicted with different colours.

9.4 Discussion

In this paper we provide our general observations and lessons learned. For more details we refer to the OAEI 2016 web site.

The different systems use different strategies for using the oracle. While LogMap, XMap and AML make use of user interactions exclusively in the post-matching steps to filter their candidate mappings, Alin can also add new candidate mappings to its initial set.


Tool | Time | Prec. | Rec. | F-m. | Rec.+ | Prec. non | Rec. non | F-m. non | Rec.+ non | Prec. oracle | Rec. oracle | F-m. oracle | Tot. Reqs. | Dist. Mapps | TP | TN | FP | FN | Pos. Prec. | Neg. Prec.
Alin | 505 | 0.99 | 0.75 | 0.85 | 0.35 | 0.98 | 0.34 | 0.50 | 0.00 | 0.99 | 0.75 | 0.85 | 803 | 1221 | 626 | 594 | 0 | 0 | 1.00 | 1.00
AML | 48 | 0.97 | 0.95 | 0.96 | 0.86 | 0.95 | 0.94 | 0.94 | 0.83 | 0.97 | 0.95 | 0.96 | 241 | 240 | 51 | 189 | 0 | 0 | 1.00 | 1.00
LogMap | 27 | 0.98 | 0.85 | 0.91 | 0.60 | 0.91 | 0.85 | 0.88 | 0.59 | 0.98 | 0.85 | 0.91 | 590 | 590 | 287 | 303 | 0 | 0 | 1.00 | 1.00
XMap | 49 | 0.93 | 0.87 | 0.90 | 0.65 | 0.93 | 0.87 | 0.90 | 0.65 | 0.93 | 0.87 | 0.90 | 35 | 35 | 5 | 30 | 0 | 0 | 1.00 | 1.00

Table 19. Anatomy data set – perfect oracle

Tool | Time | Prec. | Rec. | F-m. | Rec.+ | Prec. non | Rec. non | F-m. non | Rec.+ non | Prec. oracle | Rec. oracle | F-m. oracle | Tot. Reqs. | Dist. Mapps | TP | TN | FP | FN | Pos. Prec. | Neg. Prec.
Alin | 489 | 0.95 | 0.70 | 0.81 | 0.31 | 0.98 | 0.34 | 0.50 | 0.00 | 0.99 | 0.74 | 0.85 | 769 | 1123 | 554 | 451 | 51 | 67 | 0.92 | 0.87
AML | 50 | 0.95 | 0.95 | 0.95 | 0.86 | 0.95 | 0.94 | 0.94 | 0.83 | 0.97 | 0.95 | 0.96 | 273 | 272 | 47 | 194 | 23 | 6 | 0.67 | 0.97
LogMap | 24 | 0.96 | 0.83 | 0.89 | 0.57 | 0.91 | 0.85 | 0.88 | 0.59 | 0.96 | 0.83 | 0.89 | 612 | 612 | 258 | 290 | 35 | 28 | 0.88 | 0.91
XMap | 46 | 0.93 | 0.87 | 0.90 | 0.65 | 0.93 | 0.87 | 0.90 | 0.65 | 0.93 | 0.87 | 0.90 | 35 | 35 | 4.2 | 27.5 | 2.7 | 0.8 | 0.61 | 0.97

Table 20. Anatomy data set – error rate 0.1

Tool | Time | Prec. | Rec. | F-m. | Rec.+ | Prec. non | Rec. non | F-m. non | Rec.+ non | Prec. oracle | Rec. oracle | F-m. oracle | Tot. Reqs. | Dist. Mapps | TP | TN | FP | FN | Pos. Prec. | Neg. Prec.
Alin | 481 | 0.91 | 0.66 | 0.77 | 0.27 | 0.98 | 0.34 | 0.50 | 0.00 | 0.99 | 0.74 | 0.85 | 750 | 1077 | 493 | 368 | 94 | 121 | 0.84 | 0.75
AML | 48 | 0.94 | 0.94 | 0.94 | 0.85 | 0.95 | 0.94 | 0.94 | 0.83 | 0.97 | 0.95 | 0.96 | 300 | 299 | 46 | 193 | 46 | 13 | 0.50 | 0.94
LogMap | 24 | 0.94 | 0.82 | 0.88 | 0.54 | 0.91 | 0.85 | 0.88 | 0.59 | 0.94 | 0.81 | 0.87 | 645 | 645 | 225 | 287 | 70 | 61 | 0.76 | 0.82
XMap | 47 | 0.93 | 0.87 | 0.90 | 0.65 | 0.93 | 0.87 | 0.90 | 0.65 | 0.93 | 0.87 | 0.90 | 35 | 35 | 4.3 | 25.1 | 5.4 | 0.7 | 0.45 | 0.97

Table 21. Anatomy data set – error rate 0.2

Tool | Time | Prec. | Rec. | F-m. | Rec.+ | Prec. non | Rec. non | F-m. non | Rec.+ non | Prec. oracle | Rec. oracle | F-m. oracle | Tot. Reqs. | Dist. Mapps | TP | TN | FP | FN | Pos. Prec. | Neg. Prec.
Alin | 472 | 0.87 | 0.62 | 0.72 | 0.23 | 0.98 | 0.34 | 0.50 | 0.00 | 0.99 | 0.73 | 0.84 | 740 | 1058 | 430 | 311 | 134 | 182 | 0.76 | 0.63
AML | 47 | 0.93 | 0.94 | 0.93 | 0.84 | 0.95 | 0.94 | 0.94 | 0.83 | 0.97 | 0.95 | 0.96 | 308 | 307 | 40 | 177 | 71 | 18 | 0.36 | 0.91
LogMap | 24 | 0.93 | 0.82 | 0.87 | 0.54 | 0.91 | 0.85 | 0.88 | 0.59 | 0.93 | 0.80 | 0.86 | 650 | 650 | 202 | 256 | 106 | 84 | 0.66 | 0.75
XMap | 47 | 0.93 | 0.87 | 0.90 | 0.65 | 0.93 | 0.87 | 0.90 | 0.65 | 0.93 | 0.86 | 0.90 | 35 | 35 | 3.1 | 21.8 | 8.9 | 1.9 | 0.27 | 0.92

Table 22. Anatomy data set – error rate 0.3


Tool | Time | Prec. | Rec. | F-m. | Prec. non | Rec. non | F-m. non | Prec. oracle | Rec. oracle | F-m. oracle | Tot. Reqs. | Dist. Mapps. | TP | TN | FP | FN | Pos. Prec. | Neg. Prec.
Alin | 101 | 0.96 | 0.74 | 0.83 | 0.89 | 0.26 | 0.40 | 0.96 | 0.74 | 0.83 | 326 | 574 | 144 | 429 | 0 | 0 | 1.00 | 1.00
AML | 29 | 0.91 | 0.71 | 0.80 | 0.84 | 0.66 | 0.74 | 0.91 | 0.71 | 0.80 | 271 | 270 | 47 | 223 | 0 | 0 | 1.00 | 1.00
LogMap | 26 | 0.89 | 0.61 | 0.72 | 0.82 | 0.59 | 0.69 | 0.89 | 0.61 | 0.72 | 142 | 142 | 49 | 93 | 0 | 0 | 1.00 | 1.00
XMap | 21 | 0.84 | 0.57 | 0.68 | 0.84 | 0.57 | 0.68 | 0.84 | 0.57 | 0.68 | 4 | 4 | 0 | 4 | 0 | 0 | 0.00 | 1.00

Table 23. Conference data set – perfect oracle

Tool | Time | Prec. | Rec. | F-m. | Prec. non | Rec. non | F-m. non | Prec. oracle | Rec. oracle | F-m. oracle | Tot. Reqs. | Dist. Mapps. | TP | TN | FP | FN | Pos. Prec. | Neg. Prec.
Alin | 101 | 0.79 | 0.67 | 0.73 | 0.89 | 0.26 | 0.40 | 0.96 | 0.74 | 0.84 | 315 | 557 | 124 | 375 | 42 | 15 | 0.75 | 0.96
AML | 30 | 0.85 | 0.70 | 0.77 | 0.84 | 0.66 | 0.74 | 0.92 | 0.73 | 0.82 | 285 | 279 | 51 | 204 | 18 | 5 | 0.74 | 0.98
LogMap | 26 | 0.85 | 0.60 | 0.70 | 0.82 | 0.59 | 0.69 | 0.86 | 0.59 | 0.70 | 140 | 140 | 45 | 81 | 10 | 3 | 0.82 | 0.97
XMap | 22 | 0.84 | 0.57 | 0.68 | 0.84 | 0.57 | 0.68 | 0.84 | 0.57 | 0.68 | 4 | 4 | 0 | 3.6 | 0.4 | 0 | 0.00 | 1.00

Table 24. Conference data set – error rate 0.1

Tool | Time | Prec. | Rec. | F-m. | Prec. non | Rec. non | F-m. non | Prec. oracle | Rec. oracle | F-m. oracle | Tot. Reqs. | Dist. Mapps. | TP | TN | FP | FN | Pos. Prec. | Neg. Prec.
Alin | 100 | 0.67 | 0.62 | 0.64 | 0.89 | 0.26 | 0.40 | 0.96 | 0.75 | 0.84 | 303 | 538 | 108 | 321 | 81 | 27 | 0.57 | 0.92
AML | 33 | 0.77 | 0.68 | 0.72 | 0.84 | 0.66 | 0.74 | 0.93 | 0.75 | 0.83 | 290 | 277 | 53 | 170 | 42 | 11 | 0.56 | 0.94
LogMap | 26 | 0.82 | 0.59 | 0.69 | 0.82 | 0.59 | 0.69 | 0.83 | 0.58 | 0.68 | 143 | 143 | 38 | 75 | 18 | 10 | 0.68 | 0.88
XMap | 21 | 0.84 | 0.57 | 0.68 | 0.84 | 0.57 | 0.68 | 0.84 | 0.57 | 0.68 | 4 | 4 | 0 | 3.2 | 0.8 | 0 | 0.00 | 1.00

Table 25. Conference data set – error rate 0.2

Tool | Time | Prec. | Rec. | F-m. | Prec. non | Rec. non | F-m. non | Prec. oracle | Rec. oracle | F-m. oracle | Tot. Reqs. | Dist. Mapps. | TP | TN | FP | FN | Pos. Prec. | Neg. Prec.
Alin | 99 | 0.57 | 0.57 | 0.57 | 0.89 | 0.26 | 0.40 | 0.97 | 0.77 | 0.86 | 303 | 535 | 93 | 279 | 120 | 42 | 0.44 | 0.87
AML | 30 | 0.72 | 0.65 | 0.68 | 0.84 | 0.66 | 0.74 | 0.93 | 0.75 | 0.83 | 284 | 269 | 47 | 143 | 58 | 20 | 0.45 | 0.88
LogMap | 26 | 0.80 | 0.59 | 0.68 | 0.82 | 0.59 | 0.69 | 0.80 | 0.56 | 0.66 | 144 | 144 | 33 | 67 | 28 | 15 | 0.54 | 0.82
XMap | 22 | 0.84 | 0.57 | 0.68 | 0.84 | 0.57 | 0.68 | 0.84 | 0.57 | 0.68 | 4 | 4 | 0 | 2.9 | 1.1 | 0 | 0.00 | 1.00

Table 26. Conference data set – error rate 0.3


Tool | Time | Prec. | Rec. | F-m. | Prec. non | Rec. non | F-m. non | Prec. oracle | Rec. oracle | F-m. oracle | Tot. Reqs. | Dist. Mapps. | TP | TN | FP | FN | Pos. Prec. | Neg. Prec.
AML | 308 | 0.945 | 0.899 | 0.921 | 0.9 | 0.851 | 0.875 | 0.945 | 0.899 | 0.921 | 388 | 388 | 192 | 196 | 0 | 0 | 1 | 1
LogMap | 329 | 0.96 | 0.96 | 0.96 | 0.755 | 0.957 | 0.844 | 0.96 | 0.96 | 0.96 | 1,928 | 1,928 | 551 | 1,377 | 0 | 0 | 1 | 1

Table 27. Phenotype: HP-MP data set – perfect oracle

Tool | Time | Prec. | Rec. | F-m. | Prec. non | Rec. non | F-m. non | Prec. oracle | Rec. oracle | F-m. oracle | Tot. Reqs. | Dist. Mapps. | TP | TN | FP | FN | Pos. Prec. | Neg. Prec.
AML | 306 | 0.932 | 0.886 | 0.908 | 0.9 | 0.851 | 0.875 | 0.945 | 0.899 | 0.921 | 388 | 388 | 171 | 176 | 20 | 21 | 0.895 | 0.893
LogMap | 346 | 0.888 | 0.932 | 0.909 | 0.755 | 0.957 | 0.844 | 0.912 | 0.912 | 0.912 | 1,891 | 1,891 | 498 | 1,208 | 132 | 53 | 0.79 | 0.958

Table 28. Phenotype: HP-MP data set – error rate 0.1

Tool | Time | Prec. | Rec. | F-m. | Prec. non | Rec. non | F-m. non | Prec. oracle | Rec. oracle | F-m. oracle | Tot. Reqs. | Dist. Mapps. | TP | TN | FP | FN | Pos. Prec. | Neg. Prec.
AML | 309 | 0.923 | 0.868 | 0.895 | 0.9 | 0.851 | 0.875 | 0.944 | 0.894 | 0.918 | 358 | 358 | 144 | 140 | 32 | 42 | 0.818 | 0.769
LogMap | 367 | 0.836 | 0.915 | 0.874 | 0.755 | 0.957 | 0.844 | 0.871 | 0.871 | 0.871 | 1,855 | 1,855 | 440 | 1,042 | 262 | 111 | 0.627 | 0.904

Table 29. Phenotype: HP-MP data set – error rate 0.2

Tool | Time | Prec. | Rec. | F-m. | Prec. non | Rec. non | F-m. non | Prec. oracle | Rec. oracle | F-m. oracle | Tot. Reqs. | Dist. Mapps. | TP | TN | FP | FN | Pos. Prec. | Neg. Prec.
AML | 299 | 0.905 | 0.856 | 0.88 | 0.9 | 0.851 | 0.875 | 0.944 | 0.899 | 0.921 | 390 | 390 | 124 | 138 | 58 | 70 | 0.681 | 0.663
LogMap | 263 | 0.796 | 0.907 | 0.848 | 0.755 | 0.957 | 0.844 | 0.83 | 0.83 | 0.83 | 1,827 | 1,827 | 387 | 892 | 384 | 164 | 0.502 | 0.845

Table 30. Phenotype: HP-MP data set – error rate 0.3


Tool | Time | Prec. | Rec. | F-m. | Prec. non | Rec. non | F-m. non | Prec. oracle | Rec. oracle | F-m. oracle | Tot. Reqs. | Dist. Mapps. | TP | TN | FP | FN | Pos. Prec. | Neg. Prec.
AML | 525 | 0.929 | 0.993 | 0.96 | 0.824 | 0.99 | 0.899 | 0.929 | 0.993 | 0.96 | 413 | 413 | 115 | 298 | 0 | 0 | 1 | 1
LogMap | 440 | 0.994 | 0.972 | 0.983 | 0.904 | 0.932 | 0.918 | 0.994 | 0.972 | 0.983 | 1,602 | 1,602 | 780 | 822 | 0 | 0 | 1 | 1
XMap | 2,352 | 0.933 | 0.714 | 0.809 | 0.977 | 0.622 | 0.76 | 0.933 | 0.714 | 0.809 | 11 | 11 | 3 | 8 | 0 | 0 | 1 | 1

Table 31. Phenotype: DOID-ORDO data set – perfect oracle

Tool | Time | Prec. | Rec. | F-m. | Prec. non | Rec. non | F-m. non | Prec. oracle | Rec. oracle | F-m. oracle | Tot. Reqs. | Dist. Mapps. | TP | TN | FP | FN | Pos. Prec. | Neg. Prec.
AML | 507 | 0.912 | 0.988 | 0.948 | 0.824 | 0.99 | 0.899 | 0.93 | 0.992 | 0.96 | 413 | 413 | 108 | 266 | 32 | 7 | 0.771 | 0.974
LogMap | 492 | 0.949 | 0.927 | 0.938 | 0.904 | 0.932 | 0.918 | 0.961 | 0.939 | 0.95 | 1,677 | 1,677 | 698 | 815 | 82 | 82 | 0.895 | 0.909
XMap | 2,603 | 0.931 | 0.713 | 0.808 | 0.977 | 0.622 | 0.76 | 0.932 | 0.713 | 0.808 | 11 | 11 | 3 | 7 | 1 | 0 | 0.75 | 1

Table 32. Phenotype: DOID-ORDO data set – error rate 0.1

Tool | Time | Prec. | Rec. | F-m. | Prec. non | Rec. non | F-m. non | Prec. oracle | Rec. oracle | F-m. oracle | Tot. Reqs. | Dist. Mapps. | TP | TN | FP | FN | Pos. Prec. | Neg. Prec.
AML | 513 | 0.899 | 0.979 | 0.937 | 0.824 | 0.99 | 0.899 | 0.931 | 0.992 | 0.961 | 413 | 413 | 94 | 242 | 56 | 21 | 0.627 | 0.92
LogMap | 428 | 0.906 | 0.91 | 0.908 | 0.904 | 0.932 | 0.918 | 0.913 | 0.892 | 0.902 | 1,699 | 1,699 | 621 | 716 | 203 | 159 | 0.754 | 0.818
XMap | 2,302 | 0.931 | 0.712 | 0.807 | 0.977 | 0.622 | 0.76 | 0.932 | 0.713 | 0.808 | 11 | 11 | 1 | 7 | 1 | 2 | 0.5 | 0.778

Table 33. Phenotype: DOID-ORDO data set – error rate 0.2

Tool | Time | Prec. | Rec. | F-m. | Prec. non | Rec. non | F-m. non | Prec. oracle | Rec. oracle | F-m. oracle | Tot. Reqs. | Dist. Mapps. | TP | TN | FP | FN | Pos. Prec. | Neg. Prec.
AML | 540 | 0.881 | 0.975 | 0.926 | 0.824 | 0.99 | 0.899 | 0.933 | 0.992 | 0.962 | 413 | 413 | 88 | 204 | 93 | 28 | 0.486 | 0.879
LogMap | 427 | 0.883 | 0.904 | 0.893 | 0.904 | 0.932 | 0.918 | 0.864 | 0.845 | 0.854 | 1,760 | 1,760 | 555 | 681 | 299 | 225 | 0.65 | 0.752
XMap | 2,260 | 0.931 | 0.712 | 0.807 | 0.977 | 0.622 | 0.76 | 0.932 | 0.713 | 0.808 | 11 | 11 | 1 | 7 | 1 | 2 | 0.5 | 0.778

Table 34. Phenotype: DOID-ORDO data set – error rate 0.3


Tool | Time | Unsat. | Prec. | Rec. | F-m. | Prec. non | Rec. non | F-m. non | Prec. oracle | Rec. oracle | F-m. oracle | Tot. Reqs. | Dist. Mapps. | TP | TN | FP | FN | Pos. Prec. | Neg. Prec.
Alin | 5,859 | 2 | 0.996 | 0.63 | 0.772 | 0.995 | 0.455 | 0.624 | 0.996 | 0.63 | 0.772 | 653 | 1,019 | 470 | 549 | 0 | 0 | 1 | 1
AML | 60 | 2 | 0.99 | 0.913 | 0.95 | 0.963 | 0.902 | 0.932 | 0.99 | 0.913 | 0.95 | 449 | 447 | 217 | 230 | 0 | 0 | 1 | 1
LogMap | 38 | 2 | 0.992 | 0.901 | 0.944 | 0.944 | 0.897 | 0.92 | 0.992 | 0.901 | 0.944 | 1,131 | 1,131 | 594 | 537 | 0 | 0 | 1 | 1
XMap | 50 | 2 | 0.991 | 0.9 | 0.943 | 0.977 | 0.901 | 0.937 | 0.991 | 0.9 | 0.943 | 188 | 188 | 114 | 74 | 0 | 0 | 1 | 1

Table 35. LargeBio: FMA-NCI small data set – perfect oracle

Tool | Time | Unsat. | Prec. | Rec. | F-m. | Prec. non | Rec. non | F-m. non | Prec. oracle | Rec. oracle | F-m. oracle | Tot. Reqs. | Dist. Mapps. | TP | TN | FP | FN | Pos. Prec. | Neg. Prec.
Alin | 5,616 | 85 | 0.971 | 0.614 | 0.752 | 0.995 | 0.455 | 0.624 | 0.996 | 0.63 | 0.772 | 629 | 932 | 426 | 420 | 43 | 43 | 0.908 | 0.907
AML | 61 | 222 | 0.98 | 0.908 | 0.943 | 0.963 | 0.902 | 0.932 | 0.99 | 0.914 | 0.95 | 497 | 484 | 224 | 219 | 26 | 15 | 0.896 | 0.936
LogMap | 39 | 2 | 0.98 | 0.881 | 0.928 | 0.944 | 0.897 | 0.92 | 0.983 | 0.892 | 0.935 | 1,209 | 1,209 | 536 | 582 | 33 | 58 | 0.942 | 0.909
XMap | 51 | 2 | 0.988 | 0.895 | 0.939 | 0.977 | 0.901 | 0.937 | 0.99 | 0.9 | 0.943 | 187 | 187 | 100 | 68 | 4 | 15 | 0.962 | 0.819

Table 36. LargeBio: FMA-NCI small data set – error rate 0.1

Tool | Time | Unsat. | Prec. | Rec. | F-m. | Prec. non | Rec. non | F-m. non | Prec. oracle | Rec. oracle | F-m. oracle | Tot. Reqs. | Dist. Mapps. | TP | TN | FP | FN | Pos. Prec. | Neg. Prec.
Alin | 5,398 | 152 | 0.958 | 0.593 | 0.733 | 0.995 | 0.455 | 0.624 | 0.996 | 0.624 | 0.767 | 605 | 881 | 370 | 353 | 63 | 95 | 0.855 | 0.788
AML | 63 | 2 | 0.974 | 0.894 | 0.932 | 0.963 | 0.902 | 0.932 | 0.987 | 0.91 | 0.947 | 450 | 450 | 166 | 185 | 43 | 56 | 0.794 | 0.768
LogMap | 38 | 2 | 0.967 | 0.874 | 0.918 | 0.944 | 0.897 | 0.92 | 0.964 | 0.875 | 0.917 | 1,247 | 1,247 | 488 | 558 | 95 | 106 | 0.837 | 0.84
XMap | 58 | 2 | 0.988 | 0.892 | 0.938 | 0.977 | 0.901 | 0.937 | 0.99 | 0.899 | 0.942 | 187 | 187 | 92 | 67 | 6 | 22 | 0.939 | 0.753

Table 37. LargeBio: FMA-NCI small data set – error rate 0.2

Tool | Time | Unsat. | Prec. | Rec. | F-m. | Prec. non | Rec. non | F-m. non | Prec. oracle | Rec. oracle | F-m. oracle | Tot. Reqs. | Dist. Mapps. | TP | TN | FP | FN | Pos. Prec. | Neg. Prec.
Alin | 5,347 | 91 | 0.937 | 0.58 | 0.716 | 0.995 | 0.455 | 0.624 | 0.996 | 0.623 | 0.767 | 589 | 855 | 335 | 293 | 99 | 128 | 0.772 | 0.696
AML | 63 | 2 | 0.966 | 0.894 | 0.929 | 0.963 | 0.902 | 0.932 | 0.981 | 0.911 | 0.945 | 450 | 450 | 160 | 174 | 53 | 63 | 0.751 | 0.734
LogMap | 39 | 2 | 0.963 | 0.872 | 0.915 | 0.944 | 0.897 | 0.92 | 0.935 | 0.849 | 0.89 | 1,327 | 1,327 | 429 | 572 | 161 | 165 | 0.727 | 0.776
XMap | 53 | 2 | 0.985 | 0.887 | 0.933 | 0.977 | 0.901 | 0.937 | 0.99 | 0.899 | 0.942 | 188 | 188 | 80 | 59 | 14 | 35 | 0.851 | 0.628

Table 38. LargeBio: FMA-NCI small data set – error rate 0.3


Tool | Time | Unsat. | Prec. | Rec. | F-m. | Prec. non | Rec. non | F-m. non | Prec. oracle | Rec. oracle | F-m. oracle | Tot. Reqs. | Dist. Mapps. | TP | TN | FP | FN | Pos. Prec. | Neg. Prec.
AML | 730 | 0 | 0.972 | 0.726 | 0.831 | 0.904 | 0.713 | 0.797 | 0.972 | 0.726 | 0.831 | 2,730 | 2,730 | 1,657 | 1,073 | 0 | 0 | 1 | 1
LogMap | 628 | 0 | 0.985 | 0.669 | 0.797 | 0.922 | 0.663 | 0.771 | 0.985 | 0.669 | 0.797 | 5,596 | 5,596 | 3,742 | 1,854 | 0 | 0 | 1 | 1
XMap | 984 | 35,869 | 0.924 | 0.59 | 0.72 | 0.911 | 0.564 | 0.697 | 0.924 | 0.59 | 0.72 | 11,932 | 11,689 | 10,090 | 1,599 | 0 | 0 | 1 | 1

Table 39. LargeBio: SNOMED-NCI small data set – perfect oracle

Tool | Time | Unsat. | Prec. | Rec. | F-m. | Prec. non | Rec. non | F-m. non | Prec. oracle | Rec. oracle | F-m. oracle | Tot. Reqs. | Dist. Mapps. | TP | TN | FP | FN | Pos. Prec. | Neg. Prec.
AML | 759 | 0 | 0.967 | 0.717 | 0.823 | 0.904 | 0.713 | 0.797 | 0.972 | 0.724 | 0.83 | 2,730 | 2,730 | 1,495 | 979 | 92 | 164 | 0.942 | 0.857
LogMap | 619 | 16 | 0.974 | 0.651 | 0.78 | 0.922 | 0.663 | 0.771 | 0.971 | 0.656 | 0.783 | 6,201 | 6,201 | 3,357 | 2,263 | 196 | 385 | 0.945 | 0.855
XMap | 957 | 35,455 | 0.923 | 0.591 | 0.721 | 0.911 | 0.564 | 0.697 | 0.84 | 0.568 | 0.678 | 11,931 | 11,694 | 9,095 | 1,512 | 89 | 998 | 0.99 | 0.602

Table 40. LargeBio: SNOMED-NCI small data set – error rate 0.1

Tool | Time | Unsat. | Prec. | Rec. | F-m. | Prec. non | Rec. non | F-m. non | Prec. oracle | Rec. oracle | F-m. oracle | Tot. Reqs. | Dist. Mapps. | TP | TN | FP | FN | Pos. Prec. | Neg. Prec.
AML | 762 | 0 | 0.961 | 0.707 | 0.815 | 0.904 | 0.713 | 0.797 | 0.972 | 0.721 | 0.828 | 2,730 | 2,730 | 1,331 | 891 | 181 | 327 | 0.88 | 0.732
LogMap | 625 | 16 | 0.965 | 0.64 | 0.77 | 0.922 | 0.663 | 0.771 | 0.948 | 0.639 | 0.763 | 6,737 | 6,737 | 2,977 | 2,505 | 490 | 765 | 0.859 | 0.766
XMap | 943 | 35,968 | 0.921 | 0.591 | 0.72 | 0.911 | 0.564 | 0.697 | 0.754 | 0.541 | 0.63 | 11,911 | 11,682 | 8,052 | 1,403 | 204 | 2,023 | 0.975 | 0.41

Table 41. LargeBio: SNOMED-NCI small data set – error rate 0.2

Tool | Time | Unsat. | Prec. | Rec. | F-m. | Prec. non | Rec. non | F-m. non | Prec. oracle | Rec. oracle | F-m. oracle | Tot. Reqs. | Dist. Mapps. | TP | TN | FP | FN | Pos. Prec. | Neg. Prec.
AML | 758 | 0 | 0.955 | 0.697 | 0.806 | 0.904 | 0.713 | 0.797 | 0.972 | 0.719 | 0.827 | 2,730 | 2,730 | 1,184 | 798 | 264 | 484 | 0.818 | 0.622
LogMap | 635 | 16 | 0.959 | 0.635 | 0.764 | 0.922 | 0.663 | 0.771 | 0.92 | 0.62 | 0.741 | 7,159 | 7,159 | 2,607 | 2,563 | 854 | 1,135 | 0.753 | 0.693
XMap | 984 | 36,619 | 0.919 | 0.592 | 0.72 | 0.911 | 0.564 | 0.697 | 0.676 | 0.514 | 0.584 | 11,903 | 11,693 | 7,090 | 1,266 | 347 | 2,990 | 0.953 | 0.297

Table 42. LargeBio: SNOMED-NCI small data set – error rate 0.3


Fig. 6. Average time between requests per task in the Conference data set (whiskers: Q1-1.5IQR, Q3+1.5IQR, IQR=Q3-Q1). The labels under the system names show the average number of requests and the mean time between the requests (calculated by taking the average of the average request intervals per task) for the ten runs and all tasks.

LogMap and AML both request feedback on only selected mapping candidates (based on their similarity patterns or their involvement in unsatisfiabilities) and only present one mapping at a time to the user. XMap also presents one mapping at a time and asks mainly for true negatives. Only Alin employs the new feature in this year's evaluation (analysing several conflicting mappings simultaneously), whereby a system can present up to three mappings together to the oracle, provided that each mapping presented has a mapped entity, i.e., class or property, in common with at least one other mapping presented.
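As an illustration of this kind of grouping (a sketch under our own assumptions, not Alin's actual algorithm), the following builds one oracle request from up to three related candidate mappings:

```java
import java.util.*;

class RequestBuilder {
    static final class Mapping {
        final String source, target;
        Mapping(String source, String target) { this.source = source; this.target = target; }
    }

    /** Take up to three pending mappings such that each shares a mapped
        entity (source or target) with at least one other mapping taken. */
    static List<Mapping> nextRequest(Deque<Mapping> pending) {
        List<Mapping> group = new ArrayList<>();
        for (Iterator<Mapping> it = pending.iterator(); it.hasNext() && group.size() < 3; ) {
            Mapping m = it.next();
            boolean related = group.isEmpty() || group.stream().anyMatch(
                    g -> g.source.equals(m.source) || g.target.equals(m.target));
            if (related) {          // conflicting candidates can be judged together
                group.add(m);
                it.remove();
            }
        }
        return group;               // one oracle request with one to three mappings
    }
}
```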

The performance of the systems improves when interacting with a perfect oracle compared to no interaction. Although systems' performance deteriorates when moving towards larger error rates, there are still benefits from the user interaction: some of the systems' measures stay above their non-interactive values even for the larger error rates. For the Anatomy track, Alin detects only trivial correspondences in the non-interactive version, while user interactions led to detecting some non-trivial correspondences.

The impact of the oracle's errors is linear for Alin, AML and XMap and supra-linear for LogMap for all data sets. The "Positive Precision" value affects the true positives and false positives, and the "Negative Precision" value affects the true negatives and false negatives. The more a system relies on the oracle, the more sensitive it will be to its errors.


Fig. 7. Time between requests per task in the HP-MP data set (whiskers: Q1-1.5IQR, Q3+1.5IQR, IQR=Q3-Q1). The labels under the system names show the number of requests and the mean time between the requests.


In general, XMap performs very few requests to the oracle compared to the other systems.

Two models for system response times are frequently used in the literature [10]: Shneiderman and Seow take different approaches to categorising response times. Shneiderman takes a task-centred view and sorts response times into four categories according to task complexity: typing and mouse movement (50-150 ms), simple frequent tasks (1 s), common tasks (2-4 s) and complex tasks (8-12 s). He suggests that the user is more tolerant of delays as the complexity of the task at hand grows. Unfortunately, no clear definition is given for how to determine task complexity. Seow's model looks at the problem from a user-centred perspective by considering the user's expectations towards the execution of a task: instantaneous (100-200 ms), immediate (0.5-1 s), continuous (2-5 s), captive (7-10 s). Ontology alignment is a cognitively demanding task and can fall into the third or fourth category in both models. In this regard, the response times (request intervals, as we call them above) observed in all data sets fall into the tolerable and acceptable response times, and even into the first categories, in both models. The request intervals for both AML and LogMap stay under 3 ms for all data sets.


Fig. 8. Time between requests per task in the DOID-ORDO data set (whiskers: Q1-1.5IQR, Q3+1.5IQR, IQR=Q3-Q1). The labels under the system names show the number of requests and the mean time between the requests.

Alin's request intervals are higher, but still in the tenths-of-a-second range. It could however be the case that the user cannot take advantage of very low response times, because the task complexity may result in a higher user response time (analogously, this measures the time the user needs to respond to the system after the system is ready).

Regarding the number of unsatisfiable classes resulting from the alignments, we observe some expected variations as the error increases. We note that, with interaction, the alignments produced by the systems are typically larger than without interaction, which makes the repair process harder. The introduction of oracle errors complicates the process further, and may make an alignment irreparable if the system follows the oracle's feedback blindly.

10 Instance matching

The instance matching track aims at evaluating the performance of matching tools when the goal is to detect the degree of similarity between pairs of items/instances expressed in the form of RDF data. The track is organized in three independent tasks called SABINE, SYNTHETIC and DOREMUS. Each test is based on two data sets called source and target, and the goal is to discover the matching pairs, i.e., mappings, among the instances in the source data set and the instances in the target data set.


Fig. 9. Time between requests per task in the FMA-NCI data set (whiskers: Q1-1.5IQR, Q3+1.5IQR, IQR=Q3-Q1). The labels under the system names show the number of requests and the mean time between the requests.

For the sake of clarity, we split the presentation of the task results into three different sections as follows.

10.1 Results of the SABINE task

SABINE is a modular benchmark in the domain of European politics for Social Business Intelligence (SBI), and it includes an ontology with 500 topics, in both the English and Italian languages. The task is articulated in two sub-tasks called inter-lingual mapping and data linking.

In inter-lingual mapping, source and target datasets are OWL ontologies containing topics as instances of the class "Topic". The source ontology contains topics in the English language; the target ontology contains other topics in the Italian language. The goal is to discover mappings between English and Italian topics, also defining the kind of relation which is most suitable for describing the discovered mapping between two matching topics.

In data linking, just the source dataset is defined, and it is given to the participants as an OWL ontology containing topics as instances of the class "Topic". The goal is to discover the best corresponding DBpedia entity for each topic in the source ontology.


Fig. 10. Time between requests per task in the SNOMED-NCI data set (whiskers: Q1-1.5IQR, Q3+1.5IQR, IQR=Q3-Q1). The labels under the system names show the number of requests and the mean time between the requests.

The SABINE sub-tasks are defined as open tests, meaning that the set of expected mappings, i.e., the reference alignment, is given in advance to the participants and constitutes the gold standard for result evaluation. The task size is around 23K ontology instances to consider. The gold standard has been defined through crowdsourcing validation via the Argo system^14. For creating the gold standard, workers are called to recognize and confirm the mapping between instance topics of the source and target ontologies. In particular, a task is represented as a multiple-choice question in which a topic of the source ontology is specified and a number of instance topics of the target ontology are provided as possible mappings. A worker receiving a task to execute has to consider the source topic and to choose the most appropriate mapping with a target topic among those provided as possible options. Multi-worker task assignment and consensus evaluation techniques are defined in Argo for quality assessment of the task result. A task is assigned to a group G of 6 different workers. A group member autonomously executes a task and independently produces the answer according to her/his personal feeling and judgement. Given a task, its result is defined as an answer agreement, i.e., consensus, among the members of the group that executed the task. Two workers agree on a task result when they selected the same target topic as the mapping for the given source topic.

14 http://island.ricerca.di.unimi.it/projects/argo/ (in Italian).


The mapping between the source and the target topics is confirmed and inserted in the gold standard when the task answer having the highest degree of consensus within the group G is supported by a qualified majority larger than 50%. Conversely, when a qualified majority of workers is not found in G, the task is uncommitted and it is scheduled for re-execution by a different group of workers with higher reliability. Further details on the Argo techniques for task management are provided in [7]. The gold standard of the SABINE task contains 249 crowd-validated mappings for the inter-lingual sub-task and 338 crowd-validated mappings for the data linking sub-task.
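The qualified-majority rule is simple enough to state in code. Below is a minimal sketch under our own reading of the description above (the function and example topics are hypothetical, not Argo's API): a mapping is committed when the most-voted target topic exceeds 50% of the group's answers.

```python
from collections import Counter

def consensus(answers, majority=0.5):
    """Given one target-topic answer per worker in the group, return the
    committed mapping, or None when no qualified majority is reached
    (in which case the task is rescheduled for another group)."""
    votes = Counter(answers)
    top_topic, top_count = votes.most_common(1)[0]
    if top_count / len(answers) > majority:
        return top_topic
    return None

# Group of 6 workers: 4 agree on "tgt:Brexit", so the mapping is committed.
print(consensus(["tgt:Brexit"] * 4 + ["tgt:EU_referendum"] * 2))  # tgt:Brexit
# A 3-3 split is exactly 50%, not a qualified majority: the task stays open.
print(consensus(["tgt:A", "tgt:A", "tgt:A", "tgt:B", "tgt:B", "tgt:B"]))  # None
```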

The participants in the SABINE sub-tasks are LogMapIm, AML, LogMapLite, and RiMOM. Results are shown in Table 43. For each test, the tool performances are expressed in terms of precision, recall, and F-measure.

             Inter-lingual mapping            Data linking
             Precision  F-measure  Recall     Precision  F-measure  Recall
LogMapIm       0.012      0.014    0.016        NaN        NaN      0.0
AML            0.919      0.917    0.916        0.926      0.889    0.855
LogMapLite     0.358      0.214    0.153        NaN        NaN      0.0
RiMOM          0.955      0.943    0.932        0.424      0.580    0.917

Table 43. Instance matching results on the SABINE sub-tasks.

We focus our considerations on AML and RiMOM, which provided high-value results for precision, recall, and F-measure on both the inter-lingual mapping and data linking sub-tasks. In particular, RiMOM outperforms AML on the inter-lingual mapping sub-task, in that both precision and recall values of RiMOM are higher than the corresponding values of AML. However, both tools are over 90% in precision and recall, meaning that mapping corresponding instances across languages is a task successfully addressed by both RiMOM and AML. In the data linking sub-task, AML outperforms RiMOM on precision, and the difference between the tools on this value is significant (AML >90% and RiMOM <50% precision). Conversely, for recall, the RiMOM value is higher than the AML value, and the result of both tools is very positive (>85%). We argue that these results on the data linking sub-task are due to the problem of selecting the most appropriate mapping when a number of possible alternatives are available. Both AML and RiMOM are successful in providing a set of candidate DBpedia entities as target mappings for a given OWL instance (hence the high recall). Conversely, the capability to choose the most appropriate mapping among the set of available options is still challenging, and only AML succeeds in providing high-quality results on this aspect (hence its high precision).
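The selection problem just described can be illustrated with a toy sketch (our own illustration, not either tool's actual strategy): returning every candidate above a low threshold boosts recall at the expense of precision, while keeping only the top-scoring candidate per topic trades recall for precision.

```python
def select_mappings(scores, strategy="top1", threshold=0.3):
    """scores: dict mapping (topic, candidate) pairs to similarity in [0, 1].
    'all' keeps every candidate above the threshold (recall-oriented);
    'top1' keeps only the best-scoring candidate per topic (precision-oriented)."""
    if strategy == "all":
        return {pair for pair, s in scores.items() if s >= threshold}
    best = {}
    for (topic, candidate), s in scores.items():
        if topic not in best or s > best[topic][1]:
            best[topic] = (candidate, s)
    return {(t, c) for t, (c, s) in best.items() if s >= threshold}

scores = {("topic:Brexit", "dbr:Brexit"): 0.9,
          ("topic:Brexit", "dbr:EU_Referendum_Act_2015"): 0.6}
print(select_mappings(scores, "all"))   # both candidates kept
print(select_mappings(scores, "top1"))  # only dbr:Brexit kept
```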

10.2 Results of the SYNTHETIC task

The SYNTHETIC task comprises two evaluation sub-tasks, SPIMBENCH and UOBM, where the goal is to determine when two OWL instances describe the same real-world object. For the SPIMBENCH task, the data sets have been produced by altering a set of source data with the SPIMBENCH generator [37], which creates descriptions of the same entity by applying value-based, structure-based, and semantics-aware transformations to produce the target data. For the UOBM task, the data sets have been generated with the University Ontology Benchmark (UOBM) [30] and transformed with the LANCE benchmark generator [36].

For both tasks, the transformations applied were a combination of value-based, structure-based, and semantics-aware test cases. The value-based transformations mainly consider typographical errors and different data formats; the structure-based transformations are applied to the structure of object and datatype properties; and the semantics-aware transformations operate at the instance level, taking the TBox information into account. The latter are used to examine whether the matching systems take RDFS and OWL semantics into account in order to discover correspondences between instances that can be found only by considering information in the TBox.
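As a flavour of what a value-based transformation looks like, the sketch below injects a typographical error and changes a date format. This is our own minimal illustration; the actual SPIMBENCH and LANCE generators implement a much wider and configurable set of transformations.

```python
import random
from datetime import datetime

def typo(value: str, rng: random.Random) -> str:
    """Swap two adjacent characters: a classic typographical alteration."""
    if len(value) < 2:
        return value
    i = rng.randrange(len(value) - 1)
    return value[:i] + value[i + 1] + value[i] + value[i + 2:]

def change_date_format(value: str) -> str:
    """Rewrite an ISO date into another common format (different data format)."""
    return datetime.strptime(value, "%Y-%m-%d").strftime("%d/%m/%Y")

rng = random.Random(42)
print(typo("Semantic Web", rng))          # e.g. 'Semantic Wbe'
print(change_date_format("2016-10-17"))   # '17/10/2016'
```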

We stress that an instance in the source data set can have zero or one matching counterparts in the target data set. A data set is composed of a TBox and a corresponding ABox. Source and target data sets share almost the same TBox (with differences in the properties, due to the structure-based transformations). For SPIMBENCH, the sandbox scale is 10K triples (≈380 instances), while the mainbox scale is 50K triples (≈1800 instances). We asked the participants to match the Creative Works instances (NewsItem, BlogPost, and Programme) in the source data set against the instances of the corresponding class in the target data set. For UOBM, the sandbox scale is 14K triples (≈2.5K instances), while the mainbox scale is 60K triples (≈10K instances). We asked the participants to match all the instances that are not common to the two data sets. For both tasks, we expected to receive a set of links denoting the pairs of matching instances found to refer to the same entity.

The participants in these tasks are LogMap, AML, and RiMOM. For evaluation, we built a ground truth containing the set of expected links, where an instance i1 in the source data set is associated with the instance in the target data set that has been generated as an altered description of i1.

The transformations were performed by applying value-based, structure-based, and semantics-aware transformations on different triples pertaining to a single class instance.

The systems were judged on the basis of the precision, recall, and F-measure results shown in Tables 44 and 45.

          Sandbox task                        Mainbox task
          Precision  F-measure  Recall       Precision  F-measure  Recall
LogMap      0.958      0.851    0.766         0.981      0.814     0.695
AML         0.907      0.820    0.749         0.900      0.816     0.747
RiMOM       0.984      0.992    1.000         0.991      0.995     1.000

Table 44. Results of the SPIMBENCH task.

LogMap performs well on the SPIMBENCH task, while its performance drops when matching the data sets of the UOBM task. LogMap is fully automatic and, in contrast to AML and RiMOM, does not require the definition of a configuration file.


          Sandbox task                        Mainbox task
          Precision  F-measure  Recall       Precision  F-measure  Recall
LogMap      0.701      0.320    0.207         0.625      0.044     0.023
AML         0.785      0.665    0.577         0.509      0.512     0.515
RiMOM       0.771      0.821    0.877         0.443      0.477     0.516

Table 45. Results of the UOBM task.

AML likewise performs well on the SPIMBENCH task, while its performance drops when matching the data sets of the UOBM task. AML had to turn off its reasoner in order to handle missing information about the domain and range of TBox properties.

LogMap and AML produce links that are quite often correct (resulting in a good precision) but fail to capture a large number of the expected links (resulting in a lower recall).

RiMOM performs better than any other system on most of the tasks; it performs excellently in the case of SPIMBENCH and, although it exhibits the best results for the Sandbox track of UOBM, its performance drops for the Mainbox track. For RiMOM, the probability of capturing a correct link is high, but the probability that a retrieved link is correct is lower, resulting in a high recall but not a high precision.

The main observations for the SPIMBENCH and UOBM tasks are:

– LogMap and AML have consistent behaviour across Sandbox and Mainbox.
– RiMOM has a consistent behaviour for the SPIMBENCH task and an inconsistent behaviour for the UOBM task.
– All systems performed well on the SPIMBENCH task.
– The UOBM data sets seem to be more difficult for the IM systems, and this difficulty stems from the data set itself rather than from the transformations imposed by LANCE. In particular, an important source of difficulty for the systems is that the URIs of the instances in the data set look very similar to each other, so even a small change to a URI can lead to false positives or false negatives.

10.3 Results of the DOREMUS task

The DOREMUS task, having its premiere at OAEI, contains real-world data sets coming from two major French cultural institutions: the BnF (French National Library) and the PP (Philharmonie de Paris). The data are about classical music works and follow the DOREMUS model (a single vocabulary for both data sets) issued from the DOREMUS project15. Each data entry, or instance, is a bibliographical record about a musical piece, containing properties such as the composer, the title(s) of the work, the year of creation, the key, the genre, and the instruments, to name a few. These data have been converted to RDF from their original UNI- and INTER-MARC formats and anchored to the DOREMUS ontology and a set of domain controlled vocabularies with the help of the marc2rdf converter16, developed for this purpose within the DOREMUS project (for more details on the conversion method and on the ontology we refer to [1] and [29]). Note that these data are highly heterogeneous. We have selected works described both at the BnF and at the PP with different degrees of heterogeneity in their descriptions. The data sets have been organized in three sub-tasks.

15 http://www.doremus.org
16 https://github.com/DOREMUS-ANR/marc2rdf

Nine heterogeneities. This task consists in aligning two small data sets, BnF-1 and PP-1, containing about 40 instances each, by discovering 1:1 equivalence relations between their instances. The data manifest 9 types of heterogeneities identified by music library experts, such as multilingualism, differences in catalogues, differences in spelling, and different degrees of description (number of properties).

Four heterogeneities. This task consists in aligning two larger data sets, BnF-2 and PP-2, containing about 200 instances each, by discovering 1:1 equivalence relations between the instances that they contain. The data manifest 4 types of heterogeneities, selected from the nine in Task 1 as the most problematic ones: 1) orthographical differences, 2) multilingual titles, 3) missing properties, 4) missing titles.

The False Positives Trap. This task consists in correctly disambiguating the instances contained in two data sets, BnF-3 and PP-3, by discovering 1:1 equivalence relations between the instances that they contain. We have selected several groups of pairs of works with highly similar descriptions, where there exists only one correct match in each group. The goal is to challenge the linking tools' capacity to avoid generating false positives and to match instances correctly in the presence of highly similar but still distinct candidates.

               9 heterogeneities        4 heterogeneities        False positives trap
               Prec.  F-m.   Rec.       Prec.  F-m.   Rec.       Prec.  F-m.   Rec.
AML (th=0.2)   0.966  0.918  0.875      0.934  0.848  0.776      0.921  0.886  0.854
AML (th=0.6)   0.962  0.862  0.781      0.943  0.830  0.741      0.853  0.773  0.707
RiMOM          0.813  0.813  0.813      0.746  0.746  0.746      0.707  0.707  0.707

Table 46. Results of the DOREMUS task.

Results. Only two systems returned results on this track: AML and RiMOM. Note that AML has been configured with two different thresholds. Their performances, evaluated using precision, recall, and F-measure on each of the three sub-tasks, are shown in Table 46. The best performance in terms of F-measure is achieved by AML with a threshold of 0.2 on all sub-tasks.

11 Process Model Matching

In 2013 and 2015, the community interested in business process modelling conducted an evaluation campaign similar to OAEI [3]. Instead of matching ontologies, the task was to match process models described in different formalisms such as BPMN and Petri nets. Within this track, we offer a subset of the tasks from the Process Model Matching Contest as an OAEI track by converting the process models to an ontological representation. By offering this track, we hope to gain insights into how far ontology matching systems are capable of solving the more specific problem of matching process models. This track is also motivated by the discussions at the end of the 2015 Ontology Matching workshop, where many participants showed interest in such a track.

11.1 Experimental Settings

We used the first data set from the 2015 Process Model Matching Contest. This data set deals with processing applications to a university. It consists of nine different process models, each describing the concrete process of a specific German university. The models are encoded as BPMN process models. We converted the BPMN representation of the process models to a set of assertions (ABox) using the vocabulary defined in the BPMN 2.0 ontology (TBox). The resulting matching task is therefore an instance matching task where each ABox is described by the same TBox. For each pair of processes, manually generated reference alignments are available. Typical activities within this domain are Sending acceptance, Invite student for interview, or Wait for response. These examples illustrate one of the main differences from the ontology matching task: the labels are usually verb-object phrases that are sometimes extended with more words. Another important difference is the existence of an execution order, i.e., the model is a complex sequence of activities, which can be understood as the counterpart of a type hierarchy. A sketch of the conversion is given below.
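The sketch asserts one activity of a process model as an ABox individual typed by a class from a BPMN TBox. It is a hedged illustration: the namespace URIs, class and property names are placeholders of our own, not necessarily those of the actual BPMN 2.0 ontology used in the track.

```python
from rdflib import Graph, Literal, Namespace, RDF, RDFS

# Placeholder namespaces; the track used the BPMN 2.0 ontology as the TBox.
BPMN = Namespace("http://example.org/bpmn20#")     # hypothetical TBox namespace
PROC = Namespace("http://example.org/university#") # hypothetical ABox namespace

g = Graph()
# One activity from a university admission process, typed against the TBox.
activity = PROC["sending_acceptance"]
g.add((activity, RDF.type, BPMN["task"]))
g.add((activity, RDFS.label, Literal("Sending acceptance", lang="en")))
# Execution order: a sequence flow connecting this activity to the next one.
nxt = PROC["wait_for_response"]
g.add((nxt, RDF.type, BPMN["task"]))
g.add((activity, BPMN["has_outgoing_sequence_flow"], nxt))

print(g.serialize(format="turtle"))
```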

Only a few systems were marked as capable of generating alignments for the Process Model Matching track. We tried to execute all of these systems; however, some of them generated only trivial TBox mappings instead of mappings between activities. After contacting the developers, we received the feedback that these systems had been marked mistakenly and are designed for terminological matching only, so we excluded them from the evaluation. Moreover, we tried to run all systems that were marked as instance matching tools and submitted as executable SEALS bundles. One of these tools (LogMap) generated meaningful results and was added to the set of systems that we evaluated. In the end, we evaluated three systems (AML, LogMap, and DKP); one of them, DKP, was configured in two different settings related to the treatment of event-to-activity mappings, so we distinguish between DKP and DKP*.

In our evaluation, we computed standard precision and recall, as well as their harmonic mean, the F-measure. The data set we used consists of several test cases; we aggregated the results and present the micro-average results. The gold standard we used for our first set of evaluation experiments is based on the gold standard that was also used at the Process Model Matching Contest in 2015 [3]. We corrected only some minor mistakes (resulting in changes of less than 0.5 percentage points). In order to compare the results to those obtained by the process model matching community, we also present the recomputed values of the submissions to the 2015 contest.

Moreover, we extended our evaluation ("Standard" in Table 47) with a new evaluation measure that makes use of a probabilistic reference alignment ("Probabilistic" in Table 47). This probabilistic measure is based on a gold standard which was manually and independently generated by several domain experts. The numbers of votes of these annotators are used as support values in the probabilistic evaluation. For a detailed discussion, please refer to [28].
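The following sketch conveys the idea behind such a support-weighted evaluation. It is our own simplified reading, not necessarily the exact metric of [28]: each reference correspondence carries a support value in [0, 1] (the fraction of annotators who voted for it), and precision and recall credit a produced correspondence by its support rather than by a binary hit.

```python
def probabilistic_eval(produced, reference_support, tau=0.0):
    """Simplified support-weighted precision/recall (a sketch under our own
    assumptions, not necessarily the exact definitions in [28]).
    reference_support: dict mapping correspondences to support in [0, 1];
    tau: correspondences with support below tau are dropped from the reference."""
    ref = {c: s for c, s in reference_support.items() if s >= tau}
    hit = sum(ref.get(c, 0.0) for c in produced)
    prop = hit / len(produced) if produced else 0.0          # ProP
    pror = hit / sum(ref.values()) if ref else 0.0           # ProR
    profm = (2 * prop * pror / (prop + pror)) if prop + pror else 0.0
    return prop, pror, profm

reference_support = {("a1", "b1"): 1.0, ("a2", "b2"): 0.5, ("a3", "b3"): 0.25}
print(probabilistic_eval({("a1", "b1"), ("a2", "b2")}, reference_support))
print(probabilistic_eval({("a1", "b1"), ("a2", "b2")}, reference_support, tau=0.5))
```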


11.2 Results

Table 47 summarises the results of our evaluation. "P" abbreviates precision, "R" recall, "FM" F-measure, and "Rk" rank. The prefix "Pro" indicates the probabilistic versions of precision, recall, F-measure, and the associated rank; these metrics are explained below. Participants of the Process Model Matching Contest in 2015 are identified by the label PMMC-15 in the Contest column, while OAEI 2016 participants are labelled OAEI-16. The OAEI participants are ranked at positions 1, 8, 9, and 11 out of the 16 systems listed in the table (when using the standard metrics). Note that AML-PM at the PMMC 2015 was a matching system based on a predecessor of the AML participating at OAEI 2016. The good results of AML are surprising, since we expected that matching systems specifically developed for the purpose of process model matching would outperform ontology matching systems applied to this special case. While AML also contains components specifically designed for the process matching task (a flooding-like structural matching algorithm), its relevant main components were developed for ontology matching and the sub-problem of instance matching.

                                  -------- Standard --------   ------ Probabilistic ------
Matcher        Contest    Size      P      R      FM     Rk      ProP   ProR   ProFM   Rk
AML            OAEI-16     221    0.719  0.685  0.702     1     0.742  0.283  0.410     2
AML-PM         PMMC-15     579    0.269  0.672  0.385    14     0.377  0.398  0.387     4
BPLangMatch    PMMC-15     277    0.368  0.440  0.401    12     0.532  0.272  0.360     8
DKP            OAEI-16     177    0.621  0.474  0.538     8     0.686  0.219  0.333     9
DKP*           OAEI-16     150    0.680  0.440  0.534     9     0.772  0.211  0.331    10
KnoMa-Proc     PMMC-15     326    0.337  0.474  0.394    13     0.506  0.302  0.378     5
KMatch-SSS     PMMC-15     261    0.513  0.578  0.544     6     0.563  0.274  0.368     7
LogMap         OAEI-16     267    0.449  0.517  0.481    11     0.594  0.291  0.390     3
Match-SSS      PMMC-15     140    0.807  0.487  0.608     4     0.761  0.192  0.307    12
OPBOT          PMMC-15     234    0.603  0.608  0.605     5     0.648  0.258  0.369     6
pPalm-DS       PMMC-15     828    0.162  0.578  0.253    16     0.210  0.335  0.258    16
RMM-NHCM       PMMC-15     220    0.691  0.655  0.673     2     0.783  0.297  0.431     1
RMM-NLM        PMMC-15     164    0.768  0.543  0.636     3     0.681  0.197  0.306    13
RMM-SMSL       PMMC-15     262    0.511  0.578  0.543     7     0.516  0.242  0.329    11
RMM-VM2        PMMC-15     505    0.216  0.470  0.296    15     0.309  0.294  0.301    14
TripleS        PMMC-15     230    0.487  0.483  0.485    10     0.486  0.210  0.293    15

Table 47. Results of the process model matching track.

In the probabilistic evaluation, however, the OAEI participants are ranked at positions 2, 3, 9, and 10, respectively. LogMap rises from position 11 to 3. The (probabilistic) precision improves over-proportionally for this matcher, because LogMap generates many correspondences which are not included in the binary gold standard but are included in the probabilistic one. The ranking of LogMap demonstrates that a strength of the probabilistic metric lies in the broadened definition of the gold standard, where weak mappings are included but softened (via the support values).

Page 50: Results of the Ontology Alignment Evaluation Initiative 2016disi.unitn.it › ~pavel › om2016 › papers › oaei16_paper0.pdf · 15 Novartis Institutes for Biomedical Research,

Figures 11(a)-(b) show the probabilistic precision (ProP) and the probabilistic recall (ProR) with rising threshold τ on the reference alignment (0.000, 0.375, 0.500, 0.750). The matcher LogMap mainly identifies correspondences with high support (of which many are not included in the binary gold standard). This can be observed from the minor change in ProP and the significant increase in ProR with higher τ. For the matcher AML, the opposite effect can be observed: ProP decreases dramatically with rising τ (accompanied by a weak increase of ProR). This indicates that the matcher computes a high fraction of correspondences with low support values (which are partly included in the binary gold standard). For the matchers DKP and DKP*, with increasing τ, a minor decrease in ProP and a minor increase in ProR can be observed. ProP decreases, since the number of correspondences in the non-binary gold standard decreases with rising τ. At the same time, ProR increases with the lower number of reference correspondences. Figure 11(c) displays the probabilistic F-measure (ProFM) with rising threshold τ on the reference alignment. AML achieves its best result with τ = 0.375, since this matcher identifies a high fraction of correspondences with low support values (which can also be trivial correspondences). For details about the probabilistic metric, please refer to [28].

The results depicted in Table 47 and Figure 11 indicate that the progress made in ontology matching also has a positive impact on related matching problems, as is the case for process model matching. While it might require reconfiguring, adapting, and extending some parts of an ontology matching system, such a system seems to offer a good starting point that can be turned, with a reasonable amount of work, into a good process matching tool. We have to emphasise that our observations are so far based on only one data set. Moreover, only three participants decided to apply their systems to the new process model matching track. Thus, we have to be cautious in generalising the results observed so far. In the future, we may be able to attract more participants and integrate more data sets into the evaluation.

12 Lessons learned and suggestions

The lessons learned from running OAEI 2016 were the following:

A) This year, as suggested in previous campaigns, we requested tool registration in June and preliminary submission of wrapped systems by the end of July. This measure was successful in reducing the number of systems with errors and incompatibilities with the SEALS client during the evaluation phase, as had happened in the past. However, not all systems complied with the deadlines, and some did have problems, which still delayed the evaluation. In future editions, we must be more strict in enforcing the participation protocol.

B) Thanks in part to the new submission schedule, this marked the first OAEI edition where all participants and all tracks were evaluated using the SEALS client. Nevertheless, some system developers still struggled to get their systems working with the client, mostly due to incompatible versions of libraries. This recurring problem, plus the effort required to update the SEALS client's libraries, led to the consideration of whether it would not be better to develop a simpler, more streamlined evaluation solution.


Fig. 11. Change in metric values with rising threshold τ: (a) probabilistic precision, (b) probabilistic recall, (c) probabilistic F-measure.


C) The continued absence of the SEALS web portal did not seem to affect participation, as the Google Drive solution for submission was well received by the participants. OAEI may move towards a cloud-based solution.

D) While the number of participants this year was similar to that of recent years, their distribution across the tracks was uneven. Long-standing tracks had no shortage of participants, but alas the same was not true for the Interactive, Process Model (new), or Instance (new data sets) tracks. One reason for this is that the OAEI data sets were released too close to the submission deadline to allow system developers to tackle them all—the timing is barely sufficient for serious development focusing on one new data set. Thus, with prize money on offer in one of the new tracks, it is no surprise that system developers were polarised towards that track and eschewed the other new ones. We should consider moving the initial release of the OAEI data sets to an earlier date, particularly for new ones, in order to give system developers more time to tackle them, thereby increasing participation.

E) The increasing variety of OAEI tracks also poses difficulties to system developers in configuring their systems to handle different types of tasks. It is noteworthy that only two systems, both of which are long-term OAEI participants, tackled all tracks—and one of them did so using external configuration files specifying the type of task. One solution to facilitate participation in multiple tracks would be to have the evaluation client transmit the specifications of the task to the system, e.g., whether classes, properties, and/or individuals are to be matched, and whether only a specific subset of them is to be matched. This would also make the tasks more realistic, in the sense that in normal use, a user would provide this type of information to the ontology matching system.

F) With regard to the low participation in the Process Model and Instance tracks, it merits considering whether enforcing adherence to the SEALS client and ontology-based data sets were not deterrent factors. It should be noted that the Process Model Matching Contest (PMMC) received a much larger number of participants in 2015 than did the Process Model track, and that there is a considerable number of publications on data interlinking systems, yet only one such system participated in the Instance track.

G) In previous years, we identified the need for considering non-binary forms of evaluation, namely in cases where there is uncertainty about some of the reference mappings. A first non-binary evaluation type was implemented in last year's Conference track, and this year two new tracks followed suit: Disease and Phenotype, where the evaluation was semantic, and Process Model, where it was probabilistic. These new strategies should provide a fairer evaluation of the systems in complex test cases.

The lessons learned in the various OAEI 2016 tracks were the following:

largebio: While the current reference alignments, with incoherence-causing mappings flagged as uncertain, make the evaluation fair to all systems, they are only a compromise solution, not an ideal one. Thus, we should aim for manually repairing and validating the reference alignments for future editions.


phenotype: The prize offered in this track, thanks to the kind sponsorship of the Pistoia Alliance Ontologies Mapping project, was positively received by the community and helped attract new participants. However, it also had a polarising effect, with some systems focusing exclusively on this track. In future editions, we will consider including a prize across OAEI tracks in order to motivate developers to participate successfully in more than one track.

interactive: The new functionality of the Oracle, allowing systems to submit a set of up to three conflicting mappings rather than one mapping at a time, was successfully exploited by one new participating system. Nevertheless, this track's participation has remained low, as most systems participating in OAEI focus exclusively on fully automatic matching. We hope to draw more participants to this track in the future and will continue to expand it so as to better approximate real user interactions.

process model: The results of the new Process Model track have shown that the participating ontology matching systems are capable of generating very good results for the specific problem of process model matching. This shows that the basic components of an ontology matching system can also be successfully applied to other kinds of matching problems.

instance: In order to attract more instance matching systems to participate in the value semantics (val-sem), value structure (val-struct), and value structure semantics (val-struct-sem) tasks, we need to produce benchmarks that have fewer instances (in the order of 10,000) of the same type (in our benchmark, we asked systems to compare instances of different types). To balance those aspects, we must then produce data sets with more complex transformations.

13 Conclusions

OAEI 2016 saw the same number (21) of participants as in recent years, with a healthy mix of new and returning systems. While some new participants were mainly drawn by the allure of prize money in the new Disease and Phenotype track, the very fact that there was prize money on offer shows that interest in ontology matching is not waning, which bodes well for the future of OAEI. All the test cases were performed on the SEALS client, including those in the instance matching track, which is good news regarding the interoperability of matching systems. Furthermore, the fact that the SEALS client can be used for such a variety of tasks is a good sign of its relevance.

Unlike previous years, this year there was no noticeable improvement with regard to system run times—for instance, the distribution of run times in Anatomy and Large Biomedical Ontologies was approximately the same as last year. There was also no progress with regard to the ability to handle large ontologies and data sets, as the number of systems able to cope with the Large Biomedical Ontologies data set in full was the same as last year, and all systems able to cope with the Instance Synthetic data set were established systems already known for their ability to handle large data sets. Finally, there was no progress with regard to alignment repair systems, with only a few returning systems employing them. As a consequence, incoherent alignments are common.

With regard to F-measure, some returning systems showed substantial improvements, but overall, the improvements in F-measure were subtle in Anatomy and Large Biomedical Ontologies, and non-existent in Conference. As has been the trend, most systems favour precision over recall.


Most of the participants have provided a description of their systems and their experience in the evaluation. These OAEI papers, like the present one, have not been peer reviewed. However, they are full contributions to this evaluation exercise and reflect the hard work and clever insight people put into the development of participating systems. Reading the papers of the participants should help people involved in ontology matching find out what makes these algorithms work and what could be improved.

The Ontology Alignment Evaluation Initiative will strive to continue to be a reference for the ontology matching community by improving both the test cases and the testing methodology to better reflect the actual needs of the community. Evaluating ontology matching systems remains a challenging but critical topic, which is essential to enable the progress of this field [38]. More information can be found at http://oaei.ontologymatching.org.

Acknowledgements

We warmly thank the participants of this campaign. We know that they have worked hard to have their matching tools executable in time, and they provided useful reports on their experience. The best way to learn about the results remains to read the papers that follow.

We would also like to thank the Pistoia Alliance9, which sponsored the Disease and Phenotype track and funded the prize for the winners.

We are very grateful to the Universidad Politecnica de Madrid (UPM), especially to Nandana Mihindukulasooriya and Asuncion Gomez Perez, for moving, setting up, and providing the necessary infrastructure to run the SEALS repositories.

We are also grateful to Martin Ringwald and Terry Hayamizu for providing the reference alignment for the anatomy ontologies, and we thank Elena Beisswanger for her thorough support on improving the quality of the data set.

We thank Khiat Abderrahmane for his support in the Arabic data set and Catherine Comparot for her feedback and support in the MultiFarm test case.

We also thank for their support the other members of the Ontology Alignment Evaluation Initiative steering committee: Yannis Kalfoglou (Ricoh Laboratories, UK), Miklos Nagy (The Open University, UK), Natasha Noy (Stanford University, USA), Yuzhong Qu (Southeast University, CN), York Sure (Leibniz Gemeinschaft, DE), Jie Tang (Tsinghua University, CN), and George Vouros (University of the Aegean, GR).

Michelle Cheatham has been supported by the National Science Foundation award ICER-1440202 "EarthCube Building Blocks: Collaborative Proposal: GeoLink".

Jerome Euzenat, Ernesto Jimenez-Ruiz, Christian Meilicke, Heiner Stuckenschmidt, and Cassia Trojahn dos Santos have been partially supported by the SEALS (IST-2009-238975) European project in previous years.

Daniel Faria was supported by the ELIXIR-EXCELERATE project (INFRADEV-3-2015).

Ernesto Jimenez-Ruiz has also been partially supported by the Seventh Framework Program (FP7) of the European Commission under Grant Agreement 318338, "Optique", the EPSRC projects DBOnto and ED3, the Research Council of Norway project BigMed, and the Centre for Scalable Data Access (SIRIUS).

Catia Pesquita was supported by the FCT through the LASIGE Strategic Project (UID/CEC/00408/2013) and the research grant PTDC/EEI-ESS/4633/2014.

Ondrej Zamazal has been supported by the CSF grant no. 14-14076P.


References

1. Manel Achichi, Rodolphe Bailly, Cecile Cecconi, Marie Destandau, Konstantin Todorov, and Raphael Troncy. DOREMUS: Doing reusable musical data. In ISWC PD: International Semantic Web Conference Posters and Demos, 2015.

2. Jose Luis Aguirre, Bernardo Cuenca Grau, Kai Eckert, Jerome Euzenat, Alfio Ferrara, Robert Willem van Hague, Laura Hollink, Ernesto Jimenez-Ruiz, Christian Meilicke, Andriy Nikolov, Dominique Ritze, Francois Scharffe, Pavel Shvaiko, Ondrej Svab-Zamazal, Cassia Trojahn, and Benjamin Zapilko. Results of the ontology alignment evaluation initiative 2012. In Proc. 7th ISWC ontology matching workshop (OM), Boston (MA US), pages 73–115, 2012.

3. Goncalo Antunes, Marzieh Bakhshandeh, Jose Borbinha, Joao Cardoso, Sharam Dadashnia, Chiara Di Francescomarino, Mauro Dragoni, Peter Fettke, Avigdor Gal, Chiara Ghidini, Philip Hake, Abderrahmane Khiat, Christopher Klinkmuller, Elena Kuss, Henrik Leopold, Peter Loos, Christian Meilicke, Tim Niesen, Catia Pesquita, Timo Peus, Andreas Schoknecht, Eitam Sheetrit, Andreas Sonntag, Heiner Stuckenschmidt, Tom Thaler, Ingo Weber, and Matthias Weidlich. The process model matching contest 2015. In 6th International Workshop on Enterprise Modelling and Information Systems Architectures, September 3-4, 2015, Innsbruck, Austria, pages 127–155, 2015.

4. Benjamin Ashpole, Marc Ehrig, Jerome Euzenat, and Heiner Stuckenschmidt, editors. Proc. K-Cap Workshop on Integrating Ontologies, Banff (Canada), 2005.

5. Olivier Bodenreider. The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Research, 32:267–270, 2004.

6. Caterina Caracciolo, Jerome Euzenat, Laura Hollink, Ryutaro Ichise, Antoine Isaac, Veronique Malaise, Christian Meilicke, Juan Pane, Pavel Shvaiko, Heiner Stuckenschmidt, Ondrej Svab-Zamazal, and Vojtech Svatek. Results of the ontology alignment evaluation initiative 2008. In Proc. 3rd ISWC ontology matching workshop (OM), Karlsruhe (DE), pages 73–120, 2008.

7. Silvana Castano, Alfio Ferrara, Lorenzo Genta, and Stefano Montanelli. Combining crowd consensus and user trustworthiness for managing collective tasks. Future Generation Computer Systems, 54, 2016.

8. Michelle Cheatham, Zlatan Dragisic, Jerome Euzenat, Daniel Faria, Alfio Ferrara, Giorgos Flouris, Irini Fundulaki, Roger Granada, Valentina Ivanova, Ernesto Jimenez-Ruiz, et al. Results of the ontology alignment evaluation initiative 2015. In 10th ISWC workshop on ontology matching (OM), pages 60–115, 2015.

9. Bernardo Cuenca Grau, Zlatan Dragisic, Kai Eckert, Jerome Euzenat, Alfio Ferrara, Roger Granada, Valentina Ivanova, Ernesto Jimenez-Ruiz, Andreas Oskar Kempf, Patrick Lambrix, Andriy Nikolov, Heiko Paulheim, Dominique Ritze, Francois Scharffe, Pavel Shvaiko, Cassia Trojahn dos Santos, and Ondrej Zamazal. Results of the ontology alignment evaluation initiative 2013. In Pavel Shvaiko, Jerome Euzenat, Kavitha Srinivas, Ming Mao, and Ernesto Jimenez-Ruiz, editors, Proc. 8th ISWC workshop on ontology matching (OM), Sydney (NSW AU), pages 61–100, 2013.

10. Jim Dabrowski and Ethan V. Munson. 40 years of searching for the best computer system response time. Interacting with Computers, 23(5):555–564, 2011.

11. Jerome David, Jerome Euzenat, Francois Scharffe, and Cassia Trojahn dos Santos. The Alignment API 4.0. Semantic Web Journal, 2(1):3–10, 2011.

12. Zlatan Dragisic, Kai Eckert, Jerome Euzenat, Daniel Faria, Alfio Ferrara, Roger Granada, Valentina Ivanova, Ernesto Jimenez-Ruiz, Andreas Oskar Kempf, Patrick Lambrix, Stefano Montanelli, Heiko Paulheim, Dominique Ritze, Pavel Shvaiko, Alessandro Solimando, Cassia Trojahn dos Santos, Ondrej Zamazal, and Bernardo Cuenca Grau. Results of the ontology alignment evaluation initiative 2014. In Proceedings of the 9th International Workshop on Ontology Matching collocated with the 13th International Semantic Web Conference (ISWC), Riva del Garda, Trentino, Italy, pages 61–104, 2014.

13. Zlatan Dragisic, Valentina Ivanova, Patrick Lambrix, Daniel Faria, Ernesto Jimenez-Ruiz, and Catia Pesquita. User validation in ontology alignment. In The Semantic Web – ISWC 2016 – 15th International Semantic Web Conference, Kobe, Japan, October 17-21, 2016, Proceedings, Part I, pages 200–217, 2016.

14. Jerome Euzenat, Alfio Ferrara, Laura Hollink, Antoine Isaac, Cliff Joslyn, Veronique Malaise, Christian Meilicke, Andriy Nikolov, Juan Pane, Marta Sabou, Francois Scharffe, Pavel Shvaiko, Vassilis Spiliopoulos, Heiner Stuckenschmidt, Ondrej Svab-Zamazal, Vojtech Svatek, Cassia Trojahn dos Santos, George Vouros, and Shenghui Wang. Results of the ontology alignment evaluation initiative 2009. In Proc. 4th ISWC ontology matching workshop (OM), Chantilly (VA US), pages 73–126, 2009.

15. Jerome Euzenat, Alfio Ferrara, Christian Meilicke, Andriy Nikolov, Juan Pane, Francois Scharffe, Pavel Shvaiko, Heiner Stuckenschmidt, Ondrej Svab-Zamazal, Vojtech Svatek, and Cassia Trojahn dos Santos. Results of the ontology alignment evaluation initiative 2010. In Proc. 5th ISWC ontology matching workshop (OM), Shanghai (CN), pages 85–117, 2010.

16. Jerome Euzenat, Alfio Ferrara, Robert Willem van Hague, Laura Hollink, Christian Meilicke, Andriy Nikolov, Francois Scharffe, Pavel Shvaiko, Heiner Stuckenschmidt, Ondrej Svab-Zamazal, and Cassia Trojahn dos Santos. Results of the ontology alignment evaluation initiative 2011. In Proc. 6th ISWC ontology matching workshop (OM), Bonn (DE), pages 85–110, 2011.

17. Jerome Euzenat, Antoine Isaac, Christian Meilicke, Pavel Shvaiko, Heiner Stuckenschmidt, Ondrej Svab, Vojtech Svatek, Willem Robert van Hage, and Mikalai Yatskevich. Results of the ontology alignment evaluation initiative 2007. In Proc. 2nd ISWC ontology matching workshop (OM), Busan (KR), pages 96–132, 2007.

18. Jerome Euzenat, Christian Meilicke, Pavel Shvaiko, Heiner Stuckenschmidt, and Cassia Trojahn dos Santos. Ontology alignment evaluation initiative: six years of experience. Journal on Data Semantics, XV:158–192, 2011.

19. Jerome Euzenat, Malgorzata Mochol, Pavel Shvaiko, Heiner Stuckenschmidt, Ondrej Svab, Vojtech Svatek, Willem Robert van Hage, and Mikalai Yatskevich. Results of the ontology alignment evaluation initiative 2006. In Proc. 1st ISWC ontology matching workshop (OM), Athens (GA US), pages 73–95, 2006.

20. Jerome Euzenat, Maria Rosoiu, and Cassia Trojahn dos Santos. Ontology matching benchmarks: generation, stability, and discriminability. Journal of Web Semantics, 21:30–48, 2013.

21. Jerome Euzenat and Pavel Shvaiko. Ontology matching. Springer-Verlag, Heidelberg (DE), 2nd edition, 2013.

22. Daniel Faria, Ernesto Jimenez-Ruiz, Catia Pesquita, Emanuel Santos, and Francisco M. Couto. Towards annotating potential incoherences in BioPortal mappings. In 13th International Semantic Web Conference, volume 8797 of Lecture Notes in Computer Science, pages 17–32. Springer, 2014.

23. Valentina Ivanova, Patrick Lambrix, and Johan Aberg. Requirements for and evaluation of user support for large-scale ontology alignment. In The Semantic Web. Latest Advances and New Domains: 12th European Semantic Web Conference, ESWC 2015, Portoroz, Slovenia, May 31 – June 4, 2015, Proceedings, pages 3–20, 2015.

24. Ernesto Jimenez-Ruiz and Bernardo Cuenca Grau. LogMap: Logic-based and scalable ontology matching. In Proc. 10th International Semantic Web Conference (ISWC), Bonn (DE), pages 273–288, 2011.

25. Ernesto Jimenez-Ruiz, Bernardo Cuenca Grau, Ian Horrocks, and Rafael Berlanga. Logic-based assessment of the compatibility of UMLS ontology sources. J. Biomed. Sem., 2, 2011.


26. Ernesto Jimenez-Ruiz, Christian Meilicke, Bernardo Cuenca Grau, and Ian Horrocks. Evaluating mapping repair systems with large biomedical ontologies. In Proc. 26th Description Logics Workshop, 2013.

27. Yevgeny Kazakov, Markus Krotzsch, and Frantisek Simancik. Concurrent classification of EL ontologies. In Proc. 10th International Semantic Web Conference (ISWC), Bonn (DE), pages 305–320, 2011.

28. Elena Kuss, Henrik Leopold, Han van der Aa, Heiner Stuckenschmidt, and Hajo A. Reijers. Probabilistic evaluation of process model matching techniques. In Conceptual Modeling: 35th International Conference, ER 2016, Gifu, Japan, November 14-17, 2016, Lecture Notes in Computer Science, pages 279–292, 2016.

29. Pasquale Lisena, Manel Achichi, Eva Fernandez, Konstantin Todorov, and Raphael Troncy. Exploring linked classical music catalogs with Overture. In ISWC PD: International Semantic Web Conference Posters and Demos, 2016.

30. L. Ma, Y. Yang, Z. Qiu, G. Xie, Y. Pan, and S. Liu. Towards a complete OWL ontology benchmark. In ESWC, 2006.

31. Christian Meilicke. Alignment Incoherence in Ontology Matching. PhD thesis, University of Mannheim, 2011.

32. Christian Meilicke, Raul Garcia Castro, Frederico Freitas, Willem Robert van Hage, Elena Montiel-Ponsoda, Ryan Ribeiro de Azevedo, Heiner Stuckenschmidt, Ondrej Svab-Zamazal, Vojtech Svatek, Andrei Tamilin, Cassia Trojahn, and Shenghui Wang. MultiFarm: A benchmark for multilingual ontology matching. Journal of Web Semantics, 15(3):62–68, 2012.

33. Boris Motik, Rob Shearer, and Ian Horrocks. Hypertableau reasoning for description logics. Journal of Artificial Intelligence Research, 36:165–228, 2009.

34. Heiko Paulheim, Sven Hertling, and Dominique Ritze. Towards evaluating interactive ontology matching tools. In Proc. 10th Extended Semantic Web Conference (ESWC), Montpellier (FR), pages 31–45, 2013.

35. Catia Pesquita, Daniel Faria, Emanuel Santos, and Francisco Couto. To repair or not to repair: reconciling correctness and coherence in ontology reference alignments. In Proc. 8th ISWC ontology matching workshop (OM), Sydney (AU), pages 13–24, 2013.

36. Tzanina Saveta, Evangelia Daskalaki, Giorgos Flouris, Irini Fundulaki, Melanie Herschel, and Axel-Cyrille Ngonga Ngomo. LANCE: Piercing to the heart of instance matching tools. In International Semantic Web Conference, pages 375–391. Springer, 2015.

37. Tzanina Saveta, Evangelia Daskalaki, Giorgos Flouris, Irini Fundulaki, Melanie Herschel, and Axel-Cyrille Ngonga Ngomo. Pushing the limits of instance matching systems: A semantics-aware benchmark for linked data. In WWW, Companion Volume, 2015.

38. Pavel Shvaiko and Jerome Euzenat. Ontology matching: state of the art and future challenges. IEEE Transactions on Knowledge and Data Engineering, 25(1):158–176, 2013.

39. Alessandro Solimando, Ernesto Jimenez-Ruiz, and Giovanna Guerrini. Detecting and correcting conservativity principle violations in ontology-to-ontology mappings. In The Semantic Web – ISWC 2014, pages 1–16. Springer, 2014.

40. Alessandro Solimando, Ernesto Jimenez-Ruiz, and Giovanna Guerrini. Minimizing conservativity violations in ontology alignments: Algorithms and evaluation. Knowledge and Information Systems, 2016.

41. York Sure, Oscar Corcho, Jerome Euzenat, and Todd Hughes, editors. Proc. ISWC Workshop on Evaluation of Ontology-based Tools (EON), Hiroshima (JP), 2004.

Montpellier, Dayton, Linkoping, Grenoble, Lisboa, Milano, Heraklion, Kent, Oslo, Oxford, Mannheim, Amsterdam, Trento, Basel, Toulouse, Prague

December 2016

