NeCO: Ontology Alignment using Near-miss Clone Detection

by

Paul Louis Geesaman

A thesis submitted to the School of Computing in conformity with the requirements for the degree of Master of Science

Queen's University
Kingston, Ontario, Canada
January 2014

Copyright © Paul Louis Geesaman, 2014

Abstract

The Semantic Web is an endeavour to enhance the web with the ability to represent knowledge. The knowledge is expressed through what are called ontologies. To make ontologies useful, it is important to be able to match the knowledge represented in different ontologies, a task commonly known as ontology alignment. Ontology alignment has been studied, but it remains an open problem, with an annual competition dedicated to measuring alignment tools' performance. Many alignment tools are computationally heavy, require training, or are useful only in a specific field of study. We propose an ontology alignment method, NeCO, that builds on clone detection techniques to align ontologies. NeCO inherits the features of clone detection: it is lightweight, requires no training, and is useful for any ontology.

Acknowledgments

This thesis would not have been possible without the help and support of those around me. There are too many people to list, as everyone who encouraged me to pursue a Master's degree helped. However, there are people whose contributions I feel it is important to mention below.

First and foremost, I would like to thank my supervisors, Dr. Jim Cordy and Dr. Amal Zouaq, for their support and guidance. Jim has been very supportive in helping me with NiCad and the low-level technical details. Amal has been a great help in understanding ontologies and provided excellent recommendations on how to move forward at all stages of my Master's. The support of my two supervisors made this research possible.

I would like to thank my family for having supported me until I could support myself, for helping me get through the difficult parts of a Master's program, and for their support in continuing my education.

I would also like to thank my labmates, who provided edits and help with figures, as well as constructive discussions and the occasional distraction. Karolina provided lots of help editing my thesis and was always keeping me in line. Doug was great at helping with some technical problems. Eric helped me a lot with administrative matters. Scott provided the kind of insight that can only be gained through many years of experience. I would also like to thank Tawhid, Mark, Elizabeth, Amal, Gehan, Charu, Matthew, and Andrew for the discussions we had in and out of the lab.

Finally, I would like to thank the professors who hired me as a research assistant while I was an undergraduate student, Dr. Doug Mewhort and Dr. Bennet Murdock. They provided me with the experience that made me fall in love with research.

Contents

Abstract
Acknowledgments
Contents
List of Tables
List of Figures

Chapter 1: Introduction
  1.1 Motivation
  1.2 Objective
  1.3 Contributions
  1.4 Outline of the Thesis
  1.5 Summary

Chapter 2: Background and Related Work
  2.1 Ontologies
  2.2 The Ontology Alignment Problem
  2.3 Ontology Alignment Tools
    2.3.1 Background Knowledge
    2.3.2 Rule-based Systems
    2.3.3 Information Retrieval
    2.3.4 String-based Methods
    2.3.5 Graph-based Methods
  2.4 Clone Detection
    2.4.1 Clone Types
    2.4.2 Contextualization
  2.5 Summary

Chapter 3: Overview
  3.1 The Ontology Alignment Challenge
  3.2 Extraction of OWL elements
  3.3 Contextualization
  3.4 Near-Miss Clone-Detection
  3.5 Best-Match Filtering
  3.6 Experiment
  3.7 Summary

Chapter 4: OWL Class Extraction
  4.1 TXL
    4.1.1 Grammar
    4.1.2 Transformation Rules
    4.1.3 Extracted OWL Classes
  4.2 Summary

Chapter 5: Contextualizing Ontology Entities
  5.1 Contextualization
  5.2 TXL Contextualization Rules
  5.3 Contextualized example
  5.4 Adding Source Tags
  5.5 Summary

Chapter 6: Detecting Concept Clones
  6.1 Clone Detection
  6.2 Longest Common Substring Problem
  6.3 NiCad
  6.4 Summary

Chapter 7: Filtering Process
  7.1 Filtering
  7.2 Summary

Chapter 8: An Experiment
  8.1 The Datasets
  8.2 Measurements
  8.3 Gold Standard
  8.4 Experiment Conditions
  8.5 Summary

Chapter 9: Results
  9.1 2011 Ontology Alignment Results
  9.2 Analysis
  9.3 2012 Ontology Alignment Results
  9.4 Run-time
  9.5 Blind Renaming
  9.6 Similarity Range
  9.7 Summary

Chapter 10: Summary and Conclusions
  10.1 Summary
  10.2 Contributions
  10.3 Limitations
  10.4 Future Work

Bibliography

Appendix A: The 2011 Biblio Source Ontology

List of Tables

9.1 Alignment tool results on the Biblio dataset
9.2 Traditional clone-detection results
9.3 NeCO's results
9.4 Average for test levels
9.5 Tools' precision on the 2011 Biblio dataset
9.6 Tools' recall for the 2011 dataset
9.7 Tools' F-measure results for the 2011 dataset
9.8 NeCO's statistics for different alterations to the ontology
9.9 NeCO results for the 2012 Biblio dataset
9.10 Statistics for NeCO's results for the 2012 Biblio dataset by test level
9.11 Tools' execution time for the 2011 Biblio dataset in minutes

List of Figures

2.1 Example excerpt from an OWL ontology in RDF/XML
3.1 Outline of our alignment method
3.2 An example OWL ontology
3.3 OWL elements extracted from the sample ontology
3.4 Example of the contextualization of the "MastersThesis" OWL class
3.5 Example of clone detection results
3.6 Best-matches selected from Figure 3.5
4.1 Extraction of OWL classes
4.2 Excerpt from the grammar - headers and common elements
4.3 Grammar rules for the extraction of classes
4.4 TXL transformation main function
4.5 Extracted classes from Figure 2.1
5.1 Contextualization process
5.2 OWL classes extracted from the botany ontology
5.3 TXL rules for inlining
5.4 Example OWL concept contextualization
5.5 The rule addSourceTags
5.6 Extracted classes with source tags
6.1 Clone-detection step
6.2 Two strings for finding a common substring
7.1 Filtering
7.2 Clone report returned by NiCad
8.1 An RDF alignment for two ontologies
8.2 An HTML alignment for two ontologies
9.1 Traditional clone detection's statistics for varying thresholds
9.2 NeCO's statistics for varying thresholds
9.3 Chart comparing alignment tools with the 2011 dataset

Chapter 1

Introduction

In this chapter, we introduce the thesis topic by providing our motivations, contributions, and an overview of the remainder of the thesis.

1.1 Motivation

The Semantic Web is an idea that proposes extending the current-day web to present data that can be comprehended by both humans and machines [6]. To accomplish this, the World Wide Web Consortium makes proposals for web standards such as the Extensible Markup Language (XML) [10], the Resource Description Framework Schema (RDFS) [11], and the Web Ontology Language (OWL) [4].

Within the Semantic Web context, ontologies are a way to describe knowledge so that the information within an ontology may be cross-referenced with other ontologies, enabling machines to reason over this knowledge. Thus, ontologies represent the backbone of the Semantic Web. OWL is the language recommended by the W3C to describe knowledge; it is built on top of XML and RDF [4].

When two ontologies are compared, the task of finding which parts of each ontology correspond to parts of the other is known as ontology alignment. Ontology alignment remains an open problem, which we address in this thesis.

Ontology alignment is of interest to researchers involved in data integration, querying, and question answering. Designing successful alignment algorithms is one task that may lead to the wide-spread usage of Semantic Web technologies [21]. Ontologies can be created by various independent domain experts rather than a central authority. The problem that arises with this multiplicity of ontologies is that each ontology author has his or her own biases and needs, which leads to many ontologies describing the same things. For this reason, it is important to provide automatic ontology alignment tools that work on any ontology.

1.2 Objective

We take a minimalistic approach to the alignment problem, in which large ontologies are aligned using few computational resources. Our aim is a fast, computationally light method of aligning ontologies that requires no prior training and can be used on any OWL ontology.

In this thesis, we propose to apply near-miss clone-detection techniques to the task of ontology alignment. We name our tool NeCO, short for Near-miss Clones Ontology Alignment. We present the results obtained by NeCO and compare them with those of other tools on the same dataset. We then apply NeCO to a new dataset for comparison with our results from the first dataset.

1.3 Contributions

This thesis provides two contributions. First, it proposes that the research done in the field of clone detection may be beneficial to the ontology alignment problem. We apply contextualization, a technique from the clone detection community, to a new problem.

Our second contribution is the notion of finding a single best answer from a number of clone pairs by comparing similarity values. This is important for ontology alignment, as the traditional clone-detection approach of returning all pairs above a given threshold is not desirable; rather, our task requires returning the fewest possible alignments in order to improve overall precision.
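The best-match idea can be illustrated with a short sketch (the pair data and function name below are hypothetical illustrations, not NeCO's actual implementation): given clone pairs reported as (source entity, target entity, similarity), keep only the highest-similarity match for each source entity instead of all pairs above a threshold.

```python
# Hedged sketch of best-match filtering over clone pairs.
# The pair data and names here are illustrative, not NeCO's real output.

def best_matches(clone_pairs):
    """Keep only the highest-similarity pair for each source entity.

    clone_pairs: iterable of (source, target, similarity) tuples,
    as a clone detector might report them.
    """
    best = {}
    for source, target, similarity in clone_pairs:
        if source not in best or similarity > best[source][1]:
            best[source] = (target, similarity)
    return {src: tgt for src, (tgt, _) in best.items()}

pairs = [
    ("Book", "Livre", 0.92),      # hypothetical aligned classes
    ("Book", "Monograph", 0.71),  # lower-similarity competitor, dropped
    ("Author", "Auteur", 0.88),
]
print(best_matches(pairs))  # {'Book': 'Livre', 'Author': 'Auteur'}
```

Returning one candidate per entity trades a little recall for precision, which is the trade-off discussed above.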

1.4 Outline of the Thesis

We begin by discussing the background knowledge and giving a general overview of the way our tool performs alignments. Chapter 2 discusses the background relevant to the Semantic Web, how other tools align ontologies, and clone detection. This is followed in Chapter 3 by a high-level overview of our alignment method.

Chapters 4 to 7 describe each step of our alignment method in detail. Chapter 4 describes how our method extracts relevant elements from ontologies for comparison. Chapter 5 describes contextualization, where we replace references to elements with a description of the element. Chapter 6 describes how two extracted fragments of code are flagged as possible alignments. Chapter 7 describes the process of taking all the results from the previous chapter and finding the best results from the elements returned by traditional clone-detection techniques, producing an alignment.

We discuss the evaluation of our implementation in Chapters 8 and 9. Chapter 8 describes how we evaluated NeCO and provides a description of the dataset. Chapter 9 lists the results obtained from NeCO under the experimental conditions described in the previous chapter.

We conclude the thesis with Chapter 10, which presents an overview of what was done and provides some ideas for future work.

1.5 Summary

This chapter gave an outline of the rest of the thesis and explained why the research done in this thesis is important. The next chapter presents the relevant background knowledge, other tools that have been used for ontology alignment, and the topic of clone detection.

Chapter 2

Background and Related Work

In the previous chapter, we presented a short introduction to the thesis, along with an outline of the rest of the thesis. This chapter presents the background material relevant to ontology alignment and clone detection, and some of the methods that have been used to approach this problem and similar problems.

2.1 Ontologies

Data on the web is structured so that a machine can present information. The web can link webpages, but it has no built-in mechanism to link relevant information contained within the webpages from one source to another. This means that the task of cross-referencing multiple webpages to find information is done by humans, rather than by a machine that integrates knowledge bases and queries multiple websites. For example, a banking website's layout may be represented in HTML, but the legal details are presented in a body of text between HTML tags. A user can search for specific terms to find more information, or read the documentation to understand the terms and conditions of the website. A user may also use an algorithm to search the text for a term and its relevant information, but the website is not made to represent knowledge; the information requires an agent to interpret the semantics.

Tim Berners-Lee envisions a future where the web becomes more machine-readable, so that a machine may link data across multiple data sources and find answers to queries [6]. He calls this future version of the internet the Semantic Web: a web where knowledge is presented in a "semantic" manner that is both machine- and human-readable for information exchange.

The Semantic Web's building blocks for representing knowledge are ontologies, which are collections of information represented in a way that fosters machine comprehension. An ontology describes a domain, or a particular subject, such as law, biomedical research, or mathematics. The Web Ontology Language (OWL) [4] is the representation of ontologies recommended by the World Wide Web Consortium (W3C) and allows for the linking of knowledge from other ontologies written in OWL [63].

OWL is built from DAML+OIL and is an extension of the Resource Description Framework (RDF) [4]. It comes in three varieties: OWL-Lite, a version of OWL with the smallest vocabulary, allowing a simpler reasoning system; OWL-DL, a version that provides maximum expressiveness while still guaranteeing a computable answer to all queries; and OWL-Full, a version that uses all aspects of RDF to be understood as an OWL document, with no guarantee that a given query will be computable [63]. OWL 2, an extension of OWL, is being developed by the W3C for information interchange on the Semantic Web [23, 25]. We do not discuss OWL 2 in this project, as results that are significant for OWL should also be significant for OWL 2.

We are primarily interested in doing research with OWL ontologies because OWL has greater expressiveness and reasoning capabilities than RDFS. The syntax of OWL is often expressed in RDF/XML, as shown in Figure 2.1. The components that make up OWL ontologies are classes, individuals, properties, and data values. Classes represent concepts; individuals are instances of a class; properties describe qualities that individuals have; and data values can be numbers, strings, or data in some other form.

Properties define restrictions and values that may be present within a domain. A property may be unrestricted, indicating merely the presence of a property, or carry some restriction, indicating a limit on the allowable values. Properties are either datatype properties, which define a data value for individuals of a class, or object properties, which form relationships between classes so that individuals from their respective classes are linked [4]. In the example in Figure 2.1, there are two properties: an ObjectProperty named hasSeeds and a DatatypeProperty named Color.

Individuals are instances of an object that are relevant in a domain. An individual represents a member of a class [63] and may be classified as part of a single class or of multiple classes. Each individual has the properties and characteristics defined by the class definition under which it is classified [39]. For example, "RomaTomatoes" in Figure 2.1 refers to an individual that belongs to the class "Tomato".

Classes describe concepts within a domain [4]. A class can be thought of as a set: the set contains the individuals described by the domain, with the declared properties and restrictions of the class [63]. For example, an ontology for the domain of botany has a class to describe a "Fruit" concept. A class describing the concept "Tomato" would include an XML subClassOf tag to indicate that every individual that fits under the class "Tomato" is also an individual in the class "Fruit". The named classes "Tomato" and "Fruit" are shown in Figure 2.1.

As a further example, "Literature" may be thought of as a domain. "Book" and "Research Paper" are concepts within the "Literature" domain, so the concepts "Book" and "Research Paper" are represented in the ontology as classes. An object property of the class "Book" is the existence of an author. A datatype property of the class "Book" is the "Genre" of a book. An individual is an instance of a class: "2001: A Space Odyssey" is an individual that fits under the concept "Book". The individual can also be assigned a "Genre" value of "Science Fiction", and would be linked to the individual "Arthur C. Clarke" of the class "Author".

2.2 The Ontology Alignment Problem

The Semantic Web aims to be an open project where anyone may make changes and additions to ontologies [6]. Many ontologies may exist that describe a particular domain or overlapping domains. Differences arise between the ontologies' descriptions because the many people who describe a domain do so with differing details, levels of precision, and terminology, and for dissimilar goals. For example, two lawyers describing "Maritime Law" could make slightly different ontologies despite the fact that they are describing the same domain. One lawyer may write an ontology that is far more precise in one aspect of maritime law, another may write an ontology that describes the domain in a general way, and a third ontology may omit sections of maritime law that are not required for the author's needs.

<owl:Class rdf:about="&BotanyExample;Fruit">
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:onProperty rdf:resource="&BotanyExample;hasSeeds"/>
      <owl:someValuesFrom rdf:resource="&BotanyExample;Seeds"/>
    </owl:Restriction>
  </rdfs:subClassOf>
  <rdfs:comment>Describes fruits through botany</rdfs:comment>
</owl:Class>

<owl:Class rdf:about="&BotanyExample;Seeds"/>

<owl:Class rdf:about="&BotanyExample;Tomato">
  <rdfs:label>Tomato</rdfs:label>
  <rdfs:subClassOf rdf:resource="&BotanyExample;Fruit"/>
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:onProperty rdf:resource="&BotanyExample;Color"/>
      <owl:someValuesFrom rdf:resource="&xsd;string"/>
    </owl:Restriction>
  </rdfs:subClassOf>
  <rdfs:comment>Describes a tomato as a plant</rdfs:comment>
</owl:Class>

<owl:ObjectProperty rdf:about="&BotanyExample;hasSeeds">
  <owl:ObjectProperty rdf:ID="hasSeeds">
    <rdfs:domain rdf:resource="#Fruit"/>
    <rdfs:range rdf:resource="#Seed"/>
  </owl:ObjectProperty>
</owl:ObjectProperty>

<owl:DatatypeProperty rdf:about="&BotanyExample;Color"/>

<owl:NamedIndividual rdf:about="&BotanyExample;RomaTomatoes">
  <rdf:type rdf:resource="&BotanyExample;Tomato"/>
  <Color rdf:datatype="&xsd;string">Red</Color>
  <hasSeeds rdf:resource="&BotanyExample;RomaTomatoes"/>
</owl:NamedIndividual>

Figure 2.1: Example excerpt from an OWL ontology in RDF/XML

Ontology alignment is the process of finding correspondences between elements. Each alignment describes the entities of the ontologies that are aligned; the similarity of the elements; and the alignment type: superset, subset, equality, or disjoint. For the sake of simplicity, entities are considered disjoint if no alignment is provided. Sometimes entities overlap somewhat but are still different. For example, a book editor may have many duties similar to those of a film producer. In aligning a book publisher ontology with a film studio ontology, the book editor and film producer may be partially aligned. Ontology alignment is an open problem inhibiting the wide-spread adoption of the Semantic Web.
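As a concrete illustration of what a single correspondence carries (the record layout and entity names below are our own hypothetical sketch, not a standard alignment format), an alignment entry can be modeled as the two entities, their similarity, and the relation type:

```python
# Minimal sketch of an alignment correspondence, assuming a simple
# (entity1, entity2, similarity, relation) structure.
# The entity names are hypothetical.

from dataclasses import dataclass

@dataclass
class Correspondence:
    entity1: str        # entity from the first ontology
    entity2: str        # entity from the second ontology
    similarity: float   # 0.0 (unrelated) .. 1.0 (identical)
    relation: str       # "equality", "subset", "superset", or "disjoint"

# A book editor and a film producer overlap only partially,
# so the similarity is well below 1.0.
c = Correspondence("publisher:BookEditor", "studio:FilmProducer", 0.6, "equality")
print(c.relation, c.similarity)
```

A full alignment between two ontologies is then just a collection of such records, with entities left out of the collection treated as disjoint.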

2.3 Ontology Alignment Tools

The Ontology Alignment Evaluation Initiative (OAEI) has run competitions on a yearly basis since 2003 to evaluate ontology alignment tools. The results of the 2011 OAEI competition [20] were available when we began our experiments, which led us to choose the 2011 dataset for our experimentation. The OAEI has since published the 2012 competition results [1]. In this section, we describe some of the tools used in previous competitions and give an overview of how these tools align ontologies.

2.3.1 Background Knowledge

One way for ontology matching tools to perform alignments is to use background knowledge, such as a machine-readable lexicon. This technique is quite common, and many ontology matchers use information external to the ontology to find semantically related words [13, 16, 27, 42, 43, 58]. Such approaches are also known as external techniques, as opposed to internal techniques, which use only information contained within an ontology [27]. Background knowledge techniques are typically used in conjunction with other techniques to obtain alignments.

There are several ways to access background knowledge. WordNet [40] is one such tool for finding synonyms, antonyms, and other lexical information about words. Linked data [7, 60] is an initiative to represent datasets in RDF and OWL and to make connections, or links, between different RDF and OWL datasets. The data should be available on the web, be machine- and human-readable, and be written according to W3C standards [5]. Linked datasets may then be combined with other datasets for querying and drawing inferences. DBpedia [3] is one such source of background knowledge.

DBpedia is an open dataset of information extracted from the infoboxes and links of Wikipedia articles. DBpedia's information may be combined with other open datasets to create a larger linked network of data.

The "Friend of a Friend" ontology [12] is a linked dataset that describes the relationships between agents: people, social groups, organizations, or other groups of people. The idea is to link these agents to each other. An entry for an agent may include information such as their work, membership in groups, interests, and other details about the agent.

WikiMatch is an example of an ontology alignment tool that uses information from Wikipedia [27]. The tool uses the links between a Wikipedia article in one language and its equivalents in other languages to identify translations, and uses Wikipedia's search function to find similarities between classes and properties. The matcher compares the terms that two Wikipedia articles share in common; if enough of the terms are the same, that is, if the similarity across ontologies is above a certain threshold, the classes or properties are aligned.

MaasMatch [53] is a matching algorithm that finds the synonyms of every extracted element's labels. The tool uses WordNet to calculate a similarity based on the synonyms that entities have in common, and returns an alignment based on the highest similarities.
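The synonym-overlap idea behind a matcher of this kind can be sketched as follows (the tiny synonym table is a hypothetical stand-in for WordNet, and the overlap coefficient is an illustrative choice, not MaasMatch's published measure):

```python
# Sketch of synonym-overlap similarity between two labels.
# SYNONYMS is a toy stand-in for a WordNet lookup, and the overlap
# coefficient below is an illustrative choice of measure.

SYNONYMS = {
    "book": {"book", "volume", "tome"},
    "volume": {"volume", "book", "loudness"},
    "car": {"car", "automobile", "auto"},
}

def synonym_similarity(label1, label2):
    """Overlap of the two labels' synonym sets, in 0.0..1.0."""
    s1 = SYNONYMS.get(label1, {label1})
    s2 = SYNONYMS.get(label2, {label2})
    return len(s1 & s2) / min(len(s1), len(s2))

print(synonym_similarity("book", "volume"))  # shares {book, volume} -> 2/3
print(synonym_similarity("book", "car"))     # no shared synonyms -> 0.0
```

Entities whose labels score above a threshold would then be proposed as an alignment.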

2.3.2 Rule-based Systems

Rule-based alignment tools use rules to infer ontology alignments [16, 17, 24, 36, 42]. Two examples of rule-based tools are described below.

CIDER [24] is an ontology matching system that finds lexical similarities in order to infer more about the semantics of the ontology. A similarity value is computed from the Levenshtein distance between labels, together with values based on vector space modelling [46] for the other inputs: the similarity of comments and other lines that make up a description; the similarity between each term's hyponyms; the similarity between each term's hypernyms; and the similarity of the properties between words described in the extracted features of an ontology. The alignment tool then feeds these features into the input layer of a neural network. There is one neural network for aligning classes and another for aligning properties. The output layer of the neural network produces a matrix of similarity values between terms. The tool then finds the highest similarity value for each term; if that value is above a specific threshold, CIDER returns an alignment between the two terms.
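The Levenshtein distance used in such label comparisons is the textbook edit-distance dynamic program (shown here as a generic sketch, not CIDER's own code):

```python
# Textbook Levenshtein (edit) distance via dynamic programming.
# A generic illustration of the measure, not CIDER's implementation.

def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))   # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                   # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]

print(levenshtein("Tomato", "Tomatoe"))  # 1: one insertion
print(levenshtein("kitten", "sitting"))  # 3: classic example
```

A matcher typically normalizes this distance by the longer label's length to obtain a similarity in 0..1.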

AROMA [17] is an algorithm that combines finding terms relevant to words with database schema matching methods for ontology alignment. The first step of the algorithm is to apply association rules [22]: the tool builds a list of relevant 'terms' associated with each 'concept' (in this case, a class or property) and computes a value for how related two concepts are based on their common terms. The tool also creates a hierarchy of terms so that it may relate a concept's terms to those of its children and parents to help with alignment. The second step relates how similar a particular concept and its children are to another concept and its children, finding the terms in common and taking into account the percentage of terms that the two concepts do not share. Also, if two concepts' parents are aligned with one another, then the child concepts are more likely to be aligned by the tool.

2.3.3 Information Retrieval

Information retrieval techniques find similarity values for related terms even if they

have not co-occurred. Some techniques group terms into categories [8]

which may reflect that the terms have a common grouping. Some alignment tools use

information retrieval methods to help provide alignments [28, 53, 58, 59].

One tool that uses information retrieval is AgreementMaker [16]. In its first layer,

AgreementMaker uses a lexicon to extract features of the ontologies, such as the

longest common substring, and computes a similarity matrix from term frequency-inverse

document frequency (TF-IDF) vectors and their cosine values, as well as other

features. The second layer matches structural similarities between ontologies. The

third layer weighs the similarity values from the first and second layers to provide a

1-1, 1-n, or n-m alignment.
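The TF-IDF and cosine measures mentioned above can be sketched as follows. This is a generic illustration of the technique, not AgreementMaker’s implementation:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build a sparse TF-IDF vector (dict) for each list of tokens."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0
```

Entity descriptions are tokenized into `docs`; a high cosine value between two vectors then marks the pair as a candidate alignment.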


2.3.4 String-based Methods

There are methods that treat the ontology as text [16, 24, 35] rather than trying to

treat OWL files semantically.

Hertuda [26] is a simple matcher that tokenizes strings, then uses the Damerau-Levenshtein

distance to return similarities between entities in the ontologies. If the

similarities are above a threshold, then the two entities are aligned.

DDSim [42] creates a similarity measure between every pair of entities of the same

type. The similarity is computed with the Jaccard index; the tool also finds hypernyms

and uses mathematical techniques to estimate values for missing and inconsistent

data.
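The Jaccard index used by DDSim is the ratio of shared features to total features; a minimal sketch:

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0  # two empty sets are conventionally identical
    return len(a & b) / len(a | b)
```

For example, two entities described by the token sets {fruit, seeds} and {fruit, color} share one of three distinct tokens, giving a Jaccard index of 1/3.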

2.3.5 Graph-based Methods

Many ontology alignment tools treat the ontology alignment task as a graph-based

problem [29, 33, 62]. These methods typically find similarities between some parts

of the ontologies, then look at the structure of the ontology as a whole, or at the

neighbours of aligned entities.

Ontology Mapping by Evolutionary Programming (MapEVO) and Ontology Map-

ping using Discrete Particle Swarm Optimisation (MapPSO) [9] treat alignment as

an optimization problem. MapPSO uses a particle swarm algorithm to search for an

optimal alignment. MapEVO selects alignments based on a measure of fitness.

MapSSS [13] is a tool that works in two steps, plus a third step that had yet to be

integrated at the time of the 2011 competition. If a match is completed at any of these

steps, the tool performs an alignment on the subgraphs resulting from the removal of

the aligned element. The first step is a syntactical comparison, which returns exact matches for


elements after some preprocessing. The second step, structural, assumes the previous

alignments are correct, then finds if there are any elements which logically must be

aligned. For example, if an element has two neighbours, and both neighbours are

aligned to one another in both ontologies, then the elements must be aligned.

Optima [58] treats the ontology as a directed graph with named classes for nodes,

and properties as edges between classes. For each potential alignment, a single value

is derived from syntactical features using the Smith-Waterman algorithm [56], which

obtains a value based on the longest common substrings; singular value decomposition,

which finds relationships between terms within WordNet; and other features. The tool

generates a matrix where the columns represent the classes from one ontology and

the rows represent the classes from the other. Each entry is initialized to 0, representing

no alignment. An entry is changed to 1 when an alignment is determined with

Dempster’s expectation-maximization algorithm [18], which finds the most likely

alignments using prior probabilities from the information already contained in the

alignment matrix, as well as by aligning neighbours of already aligned elements. The

expectation-maximization algorithm is run iteratively until the tool terminates with

a final alignment.
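The Smith-Waterman algorithm referenced above computes a local alignment score by dynamic programming. The following character-level sketch uses illustrative scoring parameters (Optima’s actual weights are not specified here):

```python
def smith_waterman(a: str, b: str, match=2, mismatch=-1, gap=-1) -> int:
    """Best local-alignment score between two sequences (Smith-Waterman)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # scoring matrix, floor of 0
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best
```

Because negative partial scores are clamped to zero, the score reflects the best-matching local region, so shared substrings are rewarded even when the surrounding text differs.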

2.4 Clone Detection

A clone is a type of code smell indicating that code was copied from one source

to another. Finding such copied fragments of code is the task of clone detection [50].

There are multiple ways to detect software clones [52]. Text-based clone-detection

methods compare two code fragments as strings and find the string distance between

them [50]. Token-based clone detection tokenizes the programming language and

compares the token sequences of potential clones. Abstract syntax tree (AST) techniques

create a parse tree for a program and find subtrees that correspond to each other.

Program Dependency Graph (PDG) techniques take a program’s dependency graph

and find subgraphs that correspond to each other.
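For instance, a token-based detector might normalize identifiers so that consistently renamed fragments compare equal. This simplified sketch also normalizes keywords, which a real tool would keep distinct:

```python
import re

TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")

def tokens(code: str, normalize_identifiers=True):
    """Tokenize a code fragment; optionally rename every identifier to 'id'
    so that consistently renamed (parameterized) clones compare equal."""
    out = []
    for tok in TOKEN.findall(code):
        if normalize_identifiers and re.fullmatch(r"[A-Za-z_]\w*", tok):
            out.append("id")
        else:
            out.append(tok)
    return out

def token_clone(code1: str, code2: str) -> bool:
    """Two fragments are token clones if their normalized streams match."""
    return tokens(code1) == tokens(code2)
```

Under this normalization, `x = a + b;` and `y = c + d;` are clones, while `x = a + b;` and `x = a - b;` are not, because the operator tokens differ.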

We examine the possibility of treating OWL as source-code, and we want to see

how effective clone-detection techniques are at aligning ontologies. Clone detection

aims to detect clones that are exact duplicates; clones that are structurally identical

but have some aspects renamed; and copied fragments of code with some lines added

or removed [50].

The detection of clones was used to find Web Service Description Language

(WSDL) clones with a process called contextualization [38]. WSDL source code

refers to services in other parts of the source code, and in order to reduce the number

of false positives, the research contextualized the clones so that the WSDL lines of

code that refer to other sections of the source code are replaced by the block of code

being referenced. The contextualized fragments of code have more information than

non-contextualized fragments for the detection of clones. Metric-based techniques use

metrics computed over the source code to find clones. Hybrid techniques combine some

of the techniques above to create new methods of finding clones [50].

Delta-p [57] is a tool for comparing software models that transforms UML dia-

grams into RDF for comparison and for querying.


2.4.1 Clone Types

There are a few different types of code clones [50]. Exact clones are code fragments

copied verbatim from one source to another; renamed clones have changes made to

whitespace and identifiers; parameterized clones have identifiers systematically renamed;

near-miss clones tolerate differences in newlines, whitespace, and variable naming

(parameterized and renamed clones are near-miss clones); and gapped clones have

some code removed or inserted. The clones listed above are the clones of interest to

this project.

Near-miss clone-detection is the process of finding approximate clones of source

code. NiCad [51], the near-miss string-based clone detector we use for our results,

compares code fragments line by line: if a single character differs between two lines,

the entire line is flagged as different. The tool then calculates what percentage of

lines are identical, and returns a list of code fragments whose percentage of identical

lines is above a threshold, which can be set by the user.
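The line-comparison idea can be sketched as follows. This is an illustration of the approach, not NiCad’s implementation:

```python
import difflib

def line_similarity(frag1: str, frag2: str) -> float:
    """Percentage of identical lines shared between two fragments,
    measured against the longer fragment (0-100)."""
    lines1 = [l.strip() for l in frag1.strip().splitlines()]
    lines2 = [l.strip() for l in frag2.strip().splitlines()]
    matcher = difflib.SequenceMatcher(None, lines1, lines2)
    same = sum(block.size for block in matcher.get_matching_blocks())
    longest = max(len(lines1), len(lines2))
    return 100.0 * same / longest if longest else 100.0

def is_clone_pair(frag1: str, frag2: str, threshold: float = 70.0) -> bool:
    """Report a clone pair when similarity meets the user-set threshold."""
    return line_similarity(frag1, frag2) >= threshold
```

With a threshold of 70, two fragments differing in one line out of three (about 67 percent identical) would not be reported, while fragments differing in one line out of ten would be.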

2.4.2 Contextualization

OWL refers to other fragments of code within an OWL file, and as a result, the

information is not localized. Because much of the information about a fragment of

OWL code is found elsewhere in the OWL file, it is necessary to find the references

to other parts of the OWL file, and insert the information so that it may be available

to the clone detector.

Contextualization is the process of describing the super-classes of each class in

order to describe a class in context. Contextualization was first used for WSDL [38],

which has the same problem of non-localized information. Many classes are


only a single line long, and do not contain enough information to be able to match a

class with another class.

2.5 Summary

We reviewed the ontology alignment challenge and the tools that approach the chal-

lenge. The next chapter provides an overview of our alignment process, including a

high-level description of how our alignment tool works.


Chapter 3

Overview

Chapter 2 detailed some background on the subject, other tools that are used for

ontology alignment, and a description of existing clone-detection techniques. In this

chapter, we provide an overview of the approach we use for the challenge of ontology

alignment.

3.1 The Ontology Alignment Challenge

As previously described, one challenge facing wide-scale use of ontologies is the dif-

ficulty in identifying what parts of different ontologies correspond to one another.

Ontology alignment, also known as the ontology matching problem, is “... the pro-

cess of finding relationships or correspondences between elements of different ontolo-

gies.” [21]. The task of finding these equivalent parts in ontologies is called the

ontology alignment challenge - the problem which we face in this thesis.

For example, a medical doctor may use an ontology describing concepts in medicine,

and a biochemist may use an ontology describing biochemistry. These domains have

overlapping concepts, and an alignment is desired. The alignment permits the medical

doctor to better understand the biochemistry terms, and computer programs that

use the medical ontology can apply the alignment and the biochemistry ontology to

their own purposes. The ontological components may be named differently, or the

ontologies may have differing levels of detail to suit the needs of their respective fields.

As previously described in Chapter 2, a number of solutions have been proposed

for the ontology alignment task. Many existing alignment tools have problems: some

are slow, some rely on large external datasets, some use matrix operations that require

large amounts of computer resources, and some require training, which may not always

be possible due to the unavailability of training data.

In this thesis we propose a lightweight approach, applying code similarity tech-

nology (“clone detection”) to find similar parts of the ontologies as candidates to be

aligned. Our process consists of four main phases, as shown in Figure 3.1.

In the first step, the classes of the two ontologies to be aligned are extracted. The

second step replaces references to other classes of the ontology by their actual defini-

tion to create self-contained fragments - a process called contextualization. The third

step puts both contextualized extracted fragments into a clone detector, and finds

similar ontology fragments. Step four filters entities in the first and second ontology

to find best-matches, returning the final output: a proposed ontology alignment.

Our tool is designed to work on any pair of ontologies in OWL, regardless of the

subject of the ontologies, and to return an answer in a minimal amount of time.

Each step of our ontology alignment method is described in this chapter. Subsequent

chapters go into detail of each step, represented by ovals in Figure 3.1.


Figure 3.1: Outline of our alignment method


3.2 Extraction of OWL elements

Clone detection requires a target granularity to be extracted and compared for

similarity. For example, with programming languages, code blocks, functions, or classes

are extracted, each representing a different granularity. In the ontology alignment

problem, the units to be aligned are classes, properties and individuals. Each ex-

tracted element is called a potential clone. This thesis presents the findings on the

extraction and alignment of classes. The first step of our process is the identification

and extraction of the OWL class elements as potential clones.

Figure 3.2 shows a simple example OWL ontology, and Figure 3.3 the OWL classes

extracted from this sample ontology. In the example from Figure 3.3, each potential

clone is contained within a source tag. Each clone tag keeps track of the potential

clone’s original file, and the location in the file with the startline and endline

attributes.

The extracted elements are passed into the next step of the algorithm.
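As a simplified illustration of this extraction step (the thesis implements it with TXL; this regex-based Python sketch handles only non-nested classes):

```python
import re

# Match either a full <owl:Class>...</owl:Class> element or a
# self-closing <owl:Class .../> element (no nesting supported).
CLASS_RE = re.compile(r"<owl:Class\b.*?</owl:Class>|<owl:Class\b[^>]*/>", re.S)

def extract_potential_clones(owl_text: str, filename: str = "Example.rdf"):
    """Extract class elements and wrap each in a <source> tag
    recording its file and line range, as in Figure 3.3."""
    clones = []
    for m in CLASS_RE.finditer(owl_text):
        start = owl_text.count("\n", 0, m.start()) + 1  # 1-based start line
        end = owl_text.count("\n", 0, m.end()) + 1      # 1-based end line
        clones.append(
            f'<source file="{filename}" startline="{start}" endline="{end}">\n'
            f"{m.group(0)}\n</source>"
        )
    return clones
```

Each extracted element becomes one potential clone, tagged with its origin so that later steps can report alignments in terms of the original files.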

3.3 Contextualization

Similarity using clone detection techniques can only work when all of the relevant

attributes of the units to be compared are directly present in the unit representation.

Unfortunately, most of the attributes of OWL elements are described in OWL notation

by reference, pointing to definitions elsewhere in the ontology.

One challenge we face is to increase the information in each class for the task of

ontology alignment.

As highlighted earlier, contextualization [38] is a process previously described for

the WSDL modelling language, in which referenced attributes are copied to their


<?xml version="1.0"?>

<!DOCTYPE rdf:RDF [ ]>

<rdf:RDF>

<owl:Ontology rdf:about="">

</owl:Ontology>

<owl:Class rdf:ID="Reference">

<rdfs:subClassOf>

<owl:Restriction>

<owl:onProperty rdf:resource="#date"/>

<owl:maxCardinality rdf:datatype=

"&xsd;nonNegativeInteger">

1

</owl:maxCardinality>

</owl:Restriction>

</rdfs:subClassOf>

<rdfs:subClassOf>

<owl:Restriction>

<owl:onProperty rdf:resource="#title"/>

<owl:maxCardinality rdf:datatype=

"&xsd;nonNegativeInteger">

1

</owl:maxCardinality>

</owl:Restriction>

</rdfs:subClassOf>

</owl:Class>

<owl:Class rdf:ID="Academic">

<rdfs:subClassOf rdf:resource="#Reference"/>

</owl:Class>

<owl:Class rdf:ID="MastersThesis">

<rdfs:subClassOf rdf:resource="#Academic" />

</owl:Class>

</rdf:RDF>

Figure 3.2: An example OWL ontology


<source file="Example.rdf" startline="7" endline="26">

<owl:Class rdf:ID="Reference">

<rdfs:subClassOf>

<owl:Restriction>

<owl:onProperty rdf:resource="#date"/>

<owl:maxCardinality rdf:datatype=

"&xsd;nonNegativeInteger">

1

</owl:maxCardinality>

</owl:Restriction>

</rdfs:subClassOf>

<rdfs:subClassOf>

<owl:Restriction>

<owl:onProperty rdf:resource="#title"/>

<owl:maxCardinality rdf:datatype=

"&xsd;nonNegativeInteger">

1

</owl:maxCardinality>

</owl:Restriction>

</rdfs:subClassOf>

</owl:Class>

</source>

<source file="Example.rdf" startline="27" endline="29">

<owl:Class rdf:ID="Academic">

<rdfs:subClassOf rdf:resource="#Reference"/>

</owl:Class>

</source>

<source file="Example.rdf" startline="30" endline="32">

<owl:Class rdf:ID="MastersThesis">

<rdfs:subClassOf rdf:resource="#Academic"/>

</owl:Class>

</source>

Figure 3.3: OWL elements extracted from the sample ontology


references in order to localize attributes and thus make the units to be compared

amenable to clone analysis.

The second step in our process contextualizes our extracted OWL elements in this

way, recursively inlining referenced OWL classes of each extracted OWL element to

localize all of its attributes. Figure 3.4 shows the result of contextualization of the

extracted elements of Figure 3.3.

After the contextualization step, each element has all the information contained

within it to provide a full description, rather than referring to other constructs.

3.4 Near-Miss Clone-Detection

Once the elements to be compared have been extracted and contextualized, we search

them for clones using near-miss clone detection. Near-miss clone detection is a method

for analyzing similarity in textual code fragments in which small differences are ig-

nored, allowing for variations in the representation. This corresponds well to the

alignment problem - two ontologies for the same concepts may vary in the level of

detail and may use different naming conventions, in much the same way that imple-

mentations of the same function in different web browsers’ source code may vary in

statement ordering, variable naming, comments and so on.

Near-miss clone-detection identifies sets of similar fragments called clone pairs.

For any particular code fragment, there may be many other code fragments which are

similar to varying degrees, up to the threshold difference. In our first experiments,

we treat all of the OWL elements identified as similar to an element as potential

alignments. While this can yield a high level of recall, that is, we find most of

the possibilities, it gives us very low precision, that is, we propose too many possible


<owl:Class rdf:ID="MastersThesis">

<rdfs:subClassOf rdf:resource="#Academic" />

</owl:Class>

(a) Unmodified example OWL class

<owl:Class rdf:ID="MastersThesis">

<rdfs:subClassOf>

<owl:Class rdf:ID="Academic">

<rdfs:subClassOf>

<owl:Class rdf:ID="Reference">

<rdfs:subClassOf>

<owl:Restriction>

<owl:onProperty rdf:resource="#date"/>

<owl:maxCardinality rdf:datatype=

"&xsd;nonNegativeInteger">

1

</owl:maxCardinality>

</owl:Restriction>

</rdfs:subClassOf>

<rdfs:subClassOf>

<owl:Restriction>

<owl:onProperty rdf:resource="#title"/>

<owl:maxCardinality rdf:datatype=

"&xsd;nonNegativeInteger">

1

</owl:maxCardinality>

</owl:Restriction>

</rdfs:subClassOf>

</owl:Class>

</rdfs:subClassOf>

</owl:Class>

</rdfs:subClassOf>

</owl:Class>

(b) Fully contextualized example OWL class from (a)

Figure 3.4: Example of the contextualization of the “MastersThesis” OWL class


<clone nlines="50" similarity="85">

<source file="Onto1.owl" startline="5" endline="10" pcid="1"></source>

<source file="Onto2.owl" startline="5" endline="9" pcid="3"></source>

</clone>

<clone nlines="45" similarity="77">

<source file="Onto1.owl" startline="12" endline="18" pcid="1"></source>

<source file="Onto2.owl" startline="5" endline="9" pcid="4"></source>

</clone>

<clone nlines="40" similarity="87">

<source file="Onto1.owl" startline="12" endline="18" pcid="2"></source>

<source file="Onto2.owl" startline="11" endline="16" pcid="4"></source>

</clone>

Figure 3.5: Example of clone detection results

answers, which is not as useful. Figure 3.5 shows the result of using clone detection

to find all of the elements in our example that are similar to another element.

The <clone> tag shows the details of a clone pair. A similarity value is shown

for each potential pair of fragments of ontologies within the <clone> tag. Within the

<source> tag, the clone detector assigns an individual identification number for each

potential clone, the pcid attribute.

The next step of our algorithm reduces the number of answers returned so as to

give our method better predictive power.

3.5 Best-Match Filtering

As previously explained, using near-miss clone-detection alone for the task of ontology

alignment yields too many results, proposing an alignment with very poor precision.

Step four of our algorithm addresses this issue by filtering the results of the clone

detection to identify only the set of similar elements that have the highest similarity,

ignoring those of lesser similarity. This technique, called best-match filtering, yields a


<clone nlines="50" similarity="85">

<source file="Onto1.owl" startline="5" endline="10" pcid="1"></source>

<source file="Onto2.owl" startline="5" endline="9" pcid="3"></source>

</clone>

<clone nlines="40" similarity="87">

<source file="Onto1.owl" startline="12" endline="18" pcid="2"></source>

<source file="Onto2.owl" startline="11" endline="16" pcid="4"></source>

</clone>

Figure 3.6: Best-matches selected from Figure 3.5

much more precise set of potential alignments - possibly at the cost of missing other

good ones.

Figure 3.6 shows the result of filtering the clone detection results in Figure 3.5

to yield only the best-match results, that is, for each fragment, the clone pair with

the highest similarity.
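The filtering step can be sketched as follows, using data mirroring Figure 3.5. The field names are hypothetical; the clone detector’s actual output is the XML format shown above:

```python
def best_match_filter(clone_pairs):
    """Keep, for each potential clone (pcid) in the first ontology,
    only the pair with the highest similarity."""
    best = {}
    for pair in clone_pairs:
        pcid = pair["source1_pcid"]
        if pcid not in best or pair["similarity"] > best[pcid]["similarity"]:
            best[pcid] = pair
    return sorted(best.values(), key=lambda p: p["source1_pcid"])

# Clone pairs mirroring Figure 3.5.
pairs = [
    {"source1_pcid": 1, "source2_pcid": 3, "similarity": 85},
    {"source1_pcid": 1, "source2_pcid": 4, "similarity": 77},
    {"source1_pcid": 2, "source2_pcid": 4, "similarity": 87},
]
```

Applied to `pairs`, the filter drops the 77-similarity pair because potential clone 1 already has an 85-similarity match, reproducing the selection shown in Figure 3.6.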

3.6 Experiment

Using the best-match clone detection technique, we evaluated our implementation of

our method, NeCO, against existing ontology alignment tools in Chapter 9.

3.7 Summary

In this chapter we have outlined our new process for ontology alignment based on near-

miss clone detection. Using a running example, we trace the steps of our process and

give a quick overview of each. In the following chapters, we describe the details of

each step and an experiment comparing our method to other ontology alignment tools.


Chapter 4

OWL Class Extraction

In the previous chapter, we provided an overview of the method we developed to

approach the ontology alignment problem. We discussed the steps in a high-level

manner. In this chapter we present TXL grammars to describe OWL documents and

the extraction rules for finding the OWL classes of interest. This process implements

the extraction shown in Figure 4.1: the input ontology is combined with an OWL

grammar and TXL rules to produce the output, the extracted OWL classes.

4.1 TXL

NiCad [51], the tool we use to compute a similarity value, discussed in Chapter 6,

requires its input to be in a certain format. In order to extract named classes, we must

choose a language that can extract text with patterns. This step could be done with

programming languages such as XSLT or Perl; however, we used TXL [14], a rule-based

programming language designed for text transformation. We chose TXL because our

clone detection technique requires that rules be written in TXL to extract the relevant

information.


Figure 4.1: Extraction of OWL classes

The clone detection technique will be discussed in Chapter 6.

4.1.1 Grammar

The use of TXL for the extraction of OWL classes requires a grammar for identifying

the OWL fragments of interest. OWL’s syntax was written as a TXL grammar in

order to be able to parse an OWL ontology [4]. The grammar can determine what

granularity should be extracted. In this research, we are only concerned with extract-

ing named classes. A class can be represented in one of many different ways [4]. A

named class is a class with a URI. Other classes include an enumeration of individuals


that make up a class, a list of restrictions on properties, the intersection of multiple

classes, the union of multiple classes, and the complement of a class. The alignment of

the other five types of classes is left for future work. The ontology alignment problem

refers to the alignment of classes, properties, and individuals, however in this work

we only align named classes.

The first type of class can be referenced by its URI with the subClassOf tag. A

subClassOf tag indicates that all the individuals belonging to a particular class also

belong to its parent classes [11].

Other granularities can be specified, including parts of classes, multiple classes,

instances, and properties. The alignment of properties and instances is beyond the

scope of this thesis, however the grammar parses them for future work. Figure 4.2

contains an excerpt of the grammar created for OWL ontologies.

The grammars for TXL are context-free grammars made of terminal and non-

terminal rules [14]. Rules are created with the define keyword, with non-terminal

tokens in [ ]. Every grammar requires a base non-terminal program, then a set of

rules defined for the specific language. In Figure 4.2, the OWL grammar contains the

information expressed at the beginning of every OWL file: namespaces, doctypes,

xmlVersion. Moreover, defining the pattern for comments was necessary for the

removal of all comments.

TXL removes the comments when running the program, a preprocessing task

most clone detection techniques perform [50], including the clone detection technique

we use.

The most important part for this experiment is how the grammar identifies named

classes, the non-terminal element namedClass. Figure 4.3 shows some of the grammar


define program

[xmlVersion][NL]

[opt namespaces][ontologyHeaders]

end define

comments

<!-- -->

<rdfs:comment> </rdfs:comment>

end comments

define namespaces

’<!DOCTYPE [opt doctypes] ’[ [NL][IN]

[repeat entity][EX]

’]> [NL]

end define

define doctypes

[SP] [doctypesSpecific] [SP]

end define

define xmlVersion

’<?xml [SP] ’version= [stringlit] [opt encoding] ’?>

end define

define entity

’<!ENTITY [SP] [id] [SP] [stringlit] [SP] ’> [NL]

end define

define ontologyHeaders

< ’rdf:RDF [NL][IN]

[repeat headers] [SP] ’> [NL]

[repeat axiom] [EX]

</ ’rdf:RDF> [NL]

end define

Figure 4.2: Excerpt from the grammar - headers and common elements


define axiom

[classAxiom]

| [propertyAxiom]

| [individualFacts]

end define

define owlClass

[namedClass]

| [unnamedClass]

end define

define namedClass

[classID]

|

[classAbout]

end define

define unnamedClass

<owl:Class> [NL][IN]

[repeat statement] [EX]

</owl:Class>[NL]

end define

define classID

’<owl:’Class [rdfId] ’/> [NL]

|

’<owl: ’Class [rdfId] ’> [NL][IN]

[repeat statement] [EX]

’</owl: ’Class> [NL]

end define

define classAbout

’<owl:Class [rdfAbout] ’/>

|

’<owl: ’Class [rdfAbout] ’> [NL][IN]

[repeat statement] [EX]

’</owl: ’Class> [NL]

end define

Figure 4.3: Grammar rules for the extraction of classes


that deals with classes.

A class element can have two forms: classID and classAbout. The difference

between these two forms lies in the identifying attribute, whether the class is named

with the rdf:ID attribute, or the rdf:about attribute. The rdf:ID attribute refers

to a local designation, and rdf:about refers to an absolute URI. For this thesis, we

change all rdf:about attributes to rdf:ID under an assumption that the classes are

local, or otherwise are contained within a different ontology.

4.1.2 Transformation Rules

The transformation rules format the fragments of OWL so that they may be analyzed

by the clone detector, explained in Chapter 6. Figure 4.4 shows the main function that

is run by TXL. The function calls rules to prepare the data for extraction, extract

the classes, and inline the program, discussed in Chapter 5.

Before the extraction of classes, all rdf:about attributes are changed to rdf:ID

attributes, represented in Figure 4.4 as the [AboutToID] rule. This is done to avoid

the need to create extra rules for extraction, so that all named classes are assumed

to have rdf:ID attributes.

The line _ [^ NewP] within Figure 4.4 finds all fragments of code that fit under

classID and keeps them stored in a variable called ClassesWithoutPreprocessing.

The contents of ClassesWithoutPreprocessing are shown in Figure 4.5, which con-

cludes the extraction process.


function main

replace [program]

P [program]

construct NewP [program]

P [convertSingletons][AboutToID]

construct ClassesWithoutPreprocessing [repeat classID]

_ [^ NewP]

construct ClassesToInline [repeat classID]

_ [^ NewP]

construct InlinedProgram [program]

ClassesToInline [inline ClassesWithoutPreprocessing]

by

InlinedProgram [addSourceTags][removeDuplicates]

end function

Figure 4.4: TXL transformation main function

<owl:Class rdf:ID="Fruit">

<rdfs:subClassOf>

<owl:Restriction>

<owl:onProperty rdf:resource="&BotanyExample;hasSeeds"/>

<owl:someValuesFrom rdf:resource="&BotanyExample;Seeds"/>

</owl:Restriction>

</rdfs:subClassOf>

</owl:Class>

<owl:Class rdf:ID="Tomato">

<rdfs:subClassOf rdf:resource="&BotanyExample;Fruit"/>

<rdfs:subClassOf>

<owl:Restriction>

<owl:onProperty rdf:resource="&BotanyExample;Color"/>

<owl:someValuesFrom rdf:resource="&xsd;string"/>

</owl:Restriction>

</rdfs:subClassOf>

</owl:Class>

Figure 4.5: Extracted classes from Figure 2.1


4.1.3 Extracted OWL Classes

Figure 4.5 shows the final step of the extraction process. Each <owl:Class> tag is a

named class from the ontology. In this step, we found the named classes that could

be aligned to another class from another ontology.

4.2 Summary

In this chapter, we described the input, the grammar created to describe an OWL

ontology, and began explaining the transformation rules used by TXL to extract

elements. These steps encompass the procedures shown in Figure 4.1. The next

chapter discusses the contextualization process, the manner in which we localize the

information contained within a particular extracted fragment of OWL code, and how

we formatted the code fragments to be used by a clone detection method, discussed

in Chapter 6.


Chapter 5

Contextualizing Ontology Entities

In the previous chapter, we discussed OWL ontologies and elements of interest for on-

tology alignment, the grammar created for TXL to describe OWL, and the extraction

process. In this chapter, we explain the process of contextualization. This process

means replacing references with their actual contents. Contextualization adds

information contained within the ontology so that more information is available for

a given class. This step is shown in Figure 5.1.

Figure 5.1: Contextualization process


5.1 Contextualization

Elements in OWL ontologies often refer to other elements within the same or another

OWL ontology. For instance, a class that is a subclass of another class may only be a

single line long. The referencing poses problems for clone detection (see Chapter 6),

because not enough information is available for making alignments. As discussed in

Chapter 2, one way to circumvent this problem is to use external sources to gather

more information about the referenced elements; however, gathering information from

external sources is typically slower than using the information contained within an

ontology. The way we approach this problem is through contextualization.

An OWL ontology contains references to other elements within the ontology.

When a line referring to its parent is encountered, we replace that line with a

description of the parent. As a result, OWL classes contain more lines of code and

hence more information for clone detection and, in turn, ontology alignment.
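As a sketch of this inlining idea (not the TXL implementation — the dictionary-of-fragments representation and all names here are illustrative), contextualization of single-line subClassOf references might look like:

```python
import re

def contextualize(owl_classes):
    """Replace each single-line subClassOf reference with the full
    definition of the class it refers to (one inlining pass).
    owl_classes: list of {"id": ..., "text": ...} fragments."""
    by_id = {c["id"]: c["text"] for c in owl_classes}

    def inline(match):
        ref = match.group(1).split(";")[-1]  # drop the "&Ontology;" prefix
        body = by_id.get(ref)
        if body is None:
            return match.group(0)  # unknown/external reference: keep as-is
        return "<rdfs:subClassOf>\n%s\n</rdfs:subClassOf>" % body

    pattern = r'<rdfs:subClassOf rdf:resource="([^"]+)"/>'
    return [{"id": c["id"], "text": re.sub(pattern, inline, c["text"])}
            for c in owl_classes]
```

Applied to the fragments of Figure 5.2, the "Tomato" fragment would absorb the definition of "Fruit", as in Figure 5.4.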

Figure 5.2 shows the input for this process of our ontology alignment method.

5.2 TXL Contextualization Rules

The <rdfs:subClassOf> tag can refer to other classes within the ontology. We create

TXL rules to find the classes that <rdfs:subClassOf> refers to, and to replace those

lines with the OWL code that constructs them.

We developed TXL rules to find all the subClassOf tags with an rdf:resource attribute referring to a named class. The rules then find the entire class being referenced and replace the subClassOf tag with one containing the referenced named class.

The rule is shown in Figure 5.3. In TXL, a rule is applied to a text file until

its conditions can no longer be met. The replace [subClassOf] keyword defines a


<owl:Class rdf:ID="Fruit">

<rdfs:subClassOf>

<owl:Restriction>

<owl:onProperty rdf:resource="&BotanyExample;hasSeeds"/>

<owl:someValuesFrom rdf:resource="&BotanyExample;Seeds"/>

</owl:Restriction>

</rdfs:subClassOf>

</owl:Class>

<owl:Class rdf:ID="&BotanyExample;Seeds">

</owl:Class>

<owl:Class rdf:ID="Tomato">

<rdfs:subClassOf rdf:resource="Fruit">

<rdfs:subClassOf>

<owl:Restriction>

<owl:onProperty rdf:resource="&BotanyExample;Color"/>

<owl:someValuesFrom rdf:resource="&xsd;string"/>

</owl:Restriction>

</rdfs:subClassOf>

</owl:Class>

Figure 5.2: OWL classes extracted from the botany ontology

pattern that must be changed. The deconstruct ClassResource, construct SubClassID [stringlit], and construct IDtoMatch [rdfId] sections find and construct a string identifying the class being referred to. The section deconstruct * [classID] AllClasses finds the referenced class before any contextualization has been done. Finally, by describes the final output of the inlining rule.

5.3 Contextualized example

Figure 5.2 shows extracted fragments. The extracted fragments represent the

classes that can be aligned to another element from another ontology.

Figure 5.4 shows the class “Tomato”. “Tomato” contains two <subClassOf> tags. The subClassOf tag with the attribute rdf:resource="&BotanyExample;Fruit"


rule inline AllClasses [repeat classID]

% Find each subclass reference

replace [subClassOf]

<rdfs: subClassOf ClassResource [rdfResource] />

% Get the name of the referred class

deconstruct ClassResource

rdf: resource= SubClassIDWithHash [stringlit]

construct SubClassID [stringlit]

SubClassIDWithHash [deleteHashPrefix]

construct IDtoMatch [rdfId]

rdf: ID= SubClassID

% Find the definition of that class in the set of all classes

deconstruct * [classID] AllClasses

BeginFilename [srcfilename] BeginLinenumber [srclinenumber]

<owl: Class IDtoMatch >

S [repeat statement]

EndFilename [srcfilename] EndLinenumber [srclinenumber]

</owl: Class>

% Inline the class definition in the subclass reference

by

<rdfs: subClassOf>

<owl: Class rdf: ID= SubClassID >

S

</owl: Class>

</rdfs: subClassOf>

end rule

Figure 5.3: TXL rules for inlining

refers to another named class within the ontology. The <subClassOf> tag without an attribute contains a local property unique to “Tomato”.

The <rdfs:subClassOf rdf:resource="&BotanyExample;Fruit"/> tag refers to the class “Fruit”. Contextualization changes the single-line subClassOf tag into a multi-line description of the subclass: an opening and a closing tag with the lines describing “Fruit” placed between them. The second <rdfs:subClassOf> tag does


<owl:Class rdf:about="&BotanyExample;Tomato">

<rdfs:subClassOf rdf:resource="&BotanyExample;Fruit"/>

<rdfs:subClassOf>

<owl:Restriction>

<owl:onProperty rdf:resource="&BotanyExample;Color"/>

<owl:someValuesFrom rdf:resource="&xsd;string"/>

</owl:Restriction>

</rdfs:subClassOf>

</owl:Class>

(a) Unmodified OWL class

<owl:Class rdf:about="&BotanyExample;Tomato">

<rdfs:subClassOf>

<owl:Class rdf:about="&BotanyExample;Fruit">

<rdfs:subClassOf>

<owl:Restriction>

<owl:onProperty rdf:resource="&BotanyExample;hasSeeds"/>

<owl:someValuesFrom rdf:resource="&BotanyExample;Seeds"/>

</owl:Restriction>

</rdfs:subClassOf>

</owl:Class>

</rdfs:subClassOf>

<rdfs:subClassOf>

<owl:Restriction>

<owl:onProperty rdf:resource="&BotanyExample;Color"/>

<owl:someValuesFrom rdf:resource="&xsd;string"/>

</owl:Restriction>

</rdfs:subClassOf>

</owl:Class>

(b) A contextualized OWL class

Figure 5.4: Example OWL concept contextualization


not refer to any outside elements, and it is fully contextualized. The class “Fruit”

does not contain any <rdfs:subClassOf> lines that refer to other classes, and hence

it is fully contextualized.

5.4 Adding Source Tags

In order to use the output of this section for the clone detector, discussed in Chap-

ter 6, we must prepare the data with source tags. This is done with the TXL rule

addSourceTags shown in Figure 5.5.

The rule addSourceTags is applied once to every extracted class. The rule finds

the lines where a named class appears in the source file, and encloses the fragment of

code within a source tag. The source tag has information about the file contained

within its attributes. The file attribute contains the name of the file containing the

code fragment. The startline and endline attributes record the line numbers where the code fragment begins and ends.

An example of the output obtained by applying the rule of Figure 5.5 to the code from Figure 5.2 can be seen in Figure 5.6.
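The effect of the rule can be sketched in Python (illustrative only; the actual transformation is the TXL rule of Figure 5.5):

```python
def add_source_tags(fragment, filename, startline, endline):
    """Wrap an extracted class in a <source> tag that records the file
    and line range the fragment came from."""
    return ('<source file="%s" startline="%d" endline="%d">\n%s\n</source>'
            % (filename, startline, endline, fragment))
```

Wrapping the "Fruit" fragment with file botanyExample.owl and lines 73 to 82 yields the first entry of Figure 5.6.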

5.5 Summary

In this chapter, we described the process of contextualization. Contextualization

is useful for the task of ontology alignment because it is a way of increasing the

information available for the clone detector. In Chapter 6, we present clone detectors,

and the particular clone detector we use for our experiment.


rule addSourceTags

% Find each class, exactly once

skipping [classID]

replace $ [classID]

BeginSrcFilename[srcfilename] BeginSrcLinenumber[srclinenumber]

<owl:Class rdf:ID= IDofClass [stringlit] >

Statements [repeat statement]

EndSrcFilename[srcfilename] EndSrcLinenumber[srclinenumber]

</owl:Class>

% Make sure the class is not empty

deconstruct not BeginSrcFilename

% empty

% Get the class’ original source coordinates in the OWL source file

construct BeginFilenameString [stringlit]

_ [quote BeginSrcFilename]

construct BeginSrcLineString [stringlit]

_ [quote BeginSrcLinenumber]

construct EndSrcLineString [stringlit]

_ [quote EndSrcLinenumber]

% Wrap the class in a <source> tag

by

<source file= BeginFilenameString startline= BeginSrcLineString

endline= EndSrcLineString >

<owl:Class rdf:ID= IDofClass >

Statements

</owl:Class>

</source>

end rule

Figure 5.5: The rule addSourceTags


<source file="botanyExample.owl" startline="73" endline="82">

<owl:Class rdf:ID="Fruit">

<rdfs:subClassOf>

<owl:Restriction>

<owl:onProperty rdf:resource="hasSeeds"/>

<owl:someValuesFrom rdf:resource="Seeds"/>

</owl:Restriction>

</rdfs:subClassOf>

</owl:Class>

</source>

<source file="botanyExample.owl" startline="88" endline="88">

<owl:Class rdf:ID="Seeds">

</owl:Class>

</source>

<source file="botanyExample.owl" startline="94" endline="104">

<owl:Class rdf:ID="Tomato">

<rdfs:subClassOf>

<owl:Class>

<rdfs:subClassOf>

<owl:Restriction>

<owl:onProperty rdf:resource="hasSeeds"/>

<owl:someValuesFrom rdf:resource="Seeds"/>

</owl:Restriction>

</rdfs:subClassOf>

</owl:Class>

</rdfs:subClassOf>

<rdfs:subClassOf>

<owl:Restriction>

<owl:onProperty rdf:resource="Color"/>

<owl:someValuesFrom rdf:resource="&xsd;string"/>

</owl:Restriction>

</rdfs:subClassOf>

</owl:Class>

</source>

Figure 5.6: Extracted classes with source tags


Chapter 6

Detecting Concept Clones

In the previous chapter, we outlined the need for contextualization and the way we contextualize classes. Contextualization alters the extracted fragments of code by replacing lines that reference another construct with the construct itself, so that each fragment contains all the information about its class that is available within the ontology. This chapter examines the detection phase of our method. We use NiCad, a near-miss clone-detection tool, on contextualized classes to find similar OWL entities across ontologies, which we treat as possible alignments. We explain how NiCad returns a list of potential alignments.

6.1 Clone Detection

Clone detection is the process of finding software clones: fragments of source code that have been copied [52]. A clone detector is a tool that uses an algorithm to find software clones. First, an extraction takes all fragments of code within a single project, or between projects; these fragments are called potential clones. When two potential clones are identified by the clone detector as likely software clones, the detector returns the match as a clone pair.


Figure 6.1: Clone-detection step

Near-miss clones are one particular type of code clone [50]. Near-miss clone detection has been used on source-code projects to find fragments of code that were copied from one section of a project to another with modifications [51]. For this reason, our idea is to approach ontology alignment as a near-miss clone-detection problem and discover whether treating the ontology document as source code can provide good alignments.

There are several ways to approach the detection of software clones, including

token-based, tree-based, and metric-based techniques [50]. Each method has its own


benefits; however, for the problem of ontology alignment we chose a text-based technique because it is light-weight, and text-based techniques perform better than the alternatives when elements of the ontology are removed.

6.2 Longest Common Subsequence Problem

The Longest Common Subsequence Problem takes two strings as input and seeks the longest string whose characters appear in both inputs in the same order, though not necessarily consecutively. An example is illustrated in Figure 6.2: both strings contain a b c d in that order, in a non-consecutive way.

A brute-force attempt would enumerate all possible subsequences to find the longest one that both strings have in common; however, this method has a worst-case complexity of O(2^n), where n is the number of characters in the larger string [15].

However, the Longest Common Subsequence Problem meets the two requirements for a dynamic programming solution: the problem can be split into sub-problems and the sub-problems are overlapping [15].

String 1:

a x b c y z d

String 2:

a b w c d

Longest Common Subsequence:

a b c d

Figure 6.2: Two strings for finding a common subsequence
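The dynamic-programming solution can be sketched in Python. This is the textbook O(mn) formulation for the example in Figure 6.2 (characters in order, not necessarily adjacent); it is a sketch, not NiCad's code:

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b.
    O(len(a) * len(b)) time, versus O(2^n) for brute-force enumeration."""
    m, n = len(a), len(b)
    # dp[i][j] holds the answer for the prefixes a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1  # extend the common subsequence
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])  # skip one element
    return dp[m][n]
```

For the two strings of Figure 6.2, lcs_length("axbcyzd", "abwcd") returns 4, matching the common a b c d.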


6.3 NiCad

NiCad [51] is a near-miss clone detector that uses text-based techniques to identify

clones. The tool is used to find intentional clones, clones that were copied on purpose

from one source to another with changes.

NiCad treats the ontology, a graph, as text. As a result, the alignment problem becomes a longest common subsequence problem instead of a graph isomorphism problem. The advantage is that the longest common subsequence algorithm is fast, whereas it is not known whether there exists an algorithm that can solve the graph isomorphism problem in polynomial time.

The traditional longest common subsequence algorithm has been modified to return the length of the longest common subsequence rather than the subsequence itself. NiCad treats a line in a file as the atomic unit of comparison; lines are defined by the grammar file together with the conventions of pretty-printing [30, 51]. In our grammar, each XML tag

represents a single line. The NiCad algorithm takes a threshold parameter, which is

the maximum percentage of lines that can be different for two potential clones to be

considered a clone pair. NiCad saves time by rejecting the pairing of two potential

clones if too many of their lines are different.

NiCad uses the extracted elements from the files as potential clones. It calculates

the percentage of lines that are identical, and returns the set of pairs of potential

clones that have a high enough percentage of identical lines of code. NiCad returns

the set as a list of clone pairs with a similarity value for each pair.
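A line-based similarity test of the kind described here can be sketched as follows. This is illustrative only: difflib stands in for the modified longest-common-subsequence computation, and the function names are assumptions, not NiCad's API:

```python
import difflib

def similarity(lines_a, lines_b):
    """Percentage of lines of the larger fragment that the two
    fragments have in common (line-for-line, in order)."""
    matcher = difflib.SequenceMatcher(None, lines_a, lines_b)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 100 * matched // max(len(lines_a), len(lines_b))

def is_clone_pair(lines_a, lines_b, threshold):
    """threshold is the maximum fraction of lines allowed to differ."""
    return similarity(lines_a, lines_b) >= 100 * (1 - threshold)
```

Two four-line fragments that differ in a single line have similarity 75, so they form a clone pair at threshold 0.30 but not at 0.10.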

We developed scripts to run NiCad over all conditions of the dataset so that many experiments could be performed. The results are stored in a file for later analysis.


6.4 Summary

This chapter described clone detection, as well as the algorithms used for NiCad.

NiCad allows us to find elements across ontologies which may correspond to each

other, called a clone pair, and assigns a similarity value for each clone pair. In

Chapter 7, we discuss how we reduce noise by filtering the results so that we only

obtain clone pairs that are of interest to ontology alignment.


Chapter 7

Filtering Process

The previous chapter described clone detection, how NiCad works, and how clone pairs are found. This chapter describes how we filter the results that NiCad returns so that our process returns fewer incorrect clone pairs. The result of filtering the clone detector's clone pairs to obtain the pair with the highest similarity is called a best-match alignment.

7.1 Filtering

When finding code clones, it may make sense to return many potential clones, as

a developer is not limited to making a single code clone. However, the nature of

software clones is different from that of ontology alignment.

A threshold changes the number of potential clones returned by a clone detector.

A high threshold is required when there are many changes across the clones, and a

low threshold is required if there are few changes. If a threshold is set too high, then

too many potential clones are returned, and if the threshold is set too low, too few

potential clones are returned. For this reason, an optimal threshold is necessary.


Figure 7.1: Filtering

A single alignment is preferable to returning multiple alignments, which suggests the threshold should be low; however, the ontologies being compared may have many changes, which requires a higher threshold. In order to create a general-purpose alignment tool, we must find a way to align elements that differ while returning as few alignments as possible. To achieve this goal, we created the filtering process. The filter returns the best clone pair for each element, that is, the clone pair whose fragments are most similar to each other. We label the result of the filtering process the best-match alignment.

Figure 7.1 gives a high-level view of the filtering process. The left side of the figure

has the clone pairs returned from Chapter 6. In this figure, there are three clone pairs


returned. The topmost clone pair shows the black potential clone matched with the

green potential clone with a similarity of 85. The middle clone pair shows the black

potential clone is matched with the blue potential clone with a similarity of 77. The

bottom clone pair is the purple potential clone matched with the blue potential clone

with a similarity of 87. Assuming that the black potential clone should be matched

with the green potential clone, and the purple potential clone should be matched with

the blue potential clone, then the goal is to filter out the clone pair that matches the

black potential clone with the blue potential clone.

Our early results suggested that using clone detection exactly as it is used on code would not be useful for ontology alignment, because using NiCad without any modifications did not yield accurate alignments. Traditional clone detection would identify a particular entity in the source file as matching multiple entities in the target file. Because the entities were contextualized, an entity would be aligned with the siblings of the gold standard's proposed alignment. The problem became more pronounced when a class contextualized many classes: if the class being contextualized contains another class that is large, an incorrect alignment would occur.

In Figure 7.2, both reported clone pairs would be identified as alignments, even though the pair with the higher similarity is more likely to be the correct one. Figure 7.2 shows a clone report returned by NiCad, in which there are two clone pairs, each contained within a clone XML tag. The clone tag's attributes include, most importantly, the similarity attribute, which indicates the percentage of lines of the potential clones that are identical. Within the clone tags, there are two source tags. Each source tag contains information about the


<clone nlines="132" similarity="100">

<source file="Onto1.owl" startline="424" endline="459" pcid="19"></source>

<source file="Onto2.owl" startline="424" endline="459" pcid="52"></source>

</clone>

<clone nlines="132" similarity="78">

<source file="Onto1.owl" startline="424" endline="459" pcid="19"></source>

<source file="Onto2.owl" startline="206" endline="222" pcid="38"></source>

</clone>

Figure 7.2: Clone report returned by NiCad

ontologies being aligned, with additional information about the element being aligned.

The pcid attribute gives a unique identifier to each potential clone. Lowering the threshold could remove alignments that are correct, while raising it could cause NiCad to return too many clone pairs. Our goal was to remove clone pairs that were not of interest from NiCad's results. The experimental conditions for evaluating our tool

are described in Chapter 8 and the results are further detailed in Chapter 9.

The filtering of NiCad's results returns the best-match clone pairs. For each concept in the source ontology, we return the concept in the target ontology with the highest similarity value. If more than one target concept shares the highest similarity value, then all of them are returned, because there is no way to determine which of the tied clone pairs is the better answer. By adding this extra step, our alignment method removes noise and improves accuracy.
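The best-match filter can be sketched as follows. This is a simplified stand-in: clone pairs are triples rather than entries in NiCad's XML report, and all names are illustrative:

```python
from collections import defaultdict

def best_match(clone_pairs):
    """For each source fragment, keep only the target(s) with the
    highest similarity; ties are all kept.
    clone_pairs: iterable of (source_id, target_id, similarity)."""
    best = defaultdict(list)
    for src, tgt, sim in clone_pairs:
        kept = best[src]
        if not kept or sim > kept[0][1]:
            best[src] = [(tgt, sim)]        # new best: discard the rest
        elif sim == kept[0][1]:
            kept.append((tgt, sim))         # tie: keep both candidates
    return {src: [tgt for tgt, _ in kept] for src, kept in best.items()}
```

For the three pairs of Figure 7.1 — (black, green, 85), (black, blue, 77), (purple, blue, 87) — the filter keeps black–green and purple–blue and drops black–blue.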

Alignments specify whether an element from the source ontology is equal to, more general than, or less general than the element from the target ontology. For the sake of simplicity, each alignment is assumed to be an equal alignment, that is, the entity from the source ontology is the same as the entity in the target ontology. There is also a confidence level between 0 and 1; we set the confidence level to 1 for


all alignments [21]. The use of other types of alignments and confidence levels is something to be addressed in future work.

For example, Figure 7.2 shows two clone pairs that refer to the same source potential clone, as is evident from their identical pcid attribute. In each clone pair, the target ontology's potential clone is different, as each target pcid is different. The first clone pair's similarity is higher than the second's. The filtering process would find that the highest similarity for pcid="19" is 100. The first clone pair matches this highest similarity value, but the second clone pair has a lower similarity, so it is not returned as an alignment.

7.2 Summary

This chapter presented how we filter NiCad's results. After filtering, each remaining clone pair is a best-match clone pair, which we treat as an ontology alignment. The next

chapter outlines the details of our experiment. We describe the dataset we used

for experimentation, as well as our results for differing thresholds, traditional clone-

detection, and best-match clone-detection.


Chapter 8

An Experiment

In the previous chapter, we described the filtering process, the final step of our align-

ment method. This chapter describes the design of our experiment which consists of

a description of the dataset, the measurements we use, the gold standard, and how

we evaluate NeCO.

8.1 The Datasets

Datasets for the evaluation of ontology alignment tools are provided by the Ontology Alignment Evaluation Initiative (OAEI)¹. The datasets allow ontology alignment researchers to evaluate the effectiveness, as well as the weaknesses, of their tools. These weaknesses may consist of difficulty in aligning ontologies from a particular domain or with a particular type of alteration to the dataset. There are websites for each year's evaluation²,³ where the dataset is made available so that previous competitions may be attempted. The OAEI seeks to make a standard evaluation dataset to improve

¹ http://oaei.ontologymatching.org/
² http://oaei.ontologymatching.org/2011/
³ http://oaei.ontologymatching.org/2012/


ontology alignment techniques and present findings at conferences⁴. The existence of

a standard dataset allows us to evaluate our tool, and gives us a benchmark through

which we can compare our alignment method with other tools.

We took particular interest in the Biblio dataset, one of many datasets through

which we can evaluate our alignment tool. The Biblio dataset is generated by an

automatic test data generation tool [48]. Each test transforms a source ontology systematically, for example by changing all labels of ontology elements to random strings. Using this dataset allows us to evaluate our alignment method's strengths and weaknesses. Each test is an ontology with one or many transformations performed on the source ontology, and each is assigned a number so that someone testing a tool may use the corresponding reference alignment to verify on which tests the tool performs well, and which transformations on the seed ontology make it perform poorly. We

began by using the 2011 version of the dataset which provides tests for aligning the

domain of ‘bibliography’.

8.2 Measurements

There are three standard statistics used to evaluate the effectiveness of an ontology

alignment method: precision, recall, and F-measure. These metrics are used for the

task of ontology alignment [20].

Precision is the percentage of the returned alignments that are correct. As shown in the equation below, a low precision means the tool has returned many alignments, but only a low percentage of them are actual alignments. A high precision means that the tool has provided

⁴ http://oaei.ontologymatching.org/


a high number of correct alignments with a low number of incorrect alignments.

precision = |{Filtered Elements} ∩ {Gold Standard's Pairs}| / |{Filtered Elements}|

Recall is the percentage of the gold standard's alignments that the tool discovered. A low recall occurs if a tool discovers few of the

alignments provided by the gold standard, whereas a high recall is obtained if a tool

discovers many of the alignments.

recall = |{Filtered Elements} ∩ {Gold Standard's Pairs}| / |{Gold Standard's Pairs}|

A tool may have high precision and low recall, in which case there are few returned

alignments, but a high percentage of the alignments are correct. A low precision and

high recall method returns many alignments, but few of the alignments are of interest.

The special case in calculating precision and recall occurs when NeCO does not

return any alignments for a particular test. When this happens, the precision for that

particular test is considered to be undefined, therefore we do not apply this precision

value to the average precision. However, the recall is 0, and these tests are included

for the average recall.

F-measure is the harmonic mean of precision and recall. If either precision or recall is low, then F-measure will be low as well, and if precision and recall are equal, then F-measure is equal to both. For this reason, we consider F-measure the most important measure of how successful an alignment has been.


<map>

<Cell>

<entity1 rdf:resource="onto1;Chapter"/>

<entity2 rdf:resource="onto2;Chapitre"/>

<measure rdf:datatype="http://www.w3.org/2001/XMLSchema#float">

1.0

</measure>

<relation>=</relation>

</Cell>

</map>

Figure 8.1: An RDF alignment for two ontologies

<dt>Chapter = Chapitre</dt>

<dd>1.0</dd>

Figure 8.2: An html alignment for two ontologies

F-measure = (2 × precision × recall) / (precision + recall)

The average F-measure for a series of tests with a certain parameter takes the

average precision and average recall as described above.
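The three measures, including the undefined-precision special case, can be computed as in this sketch (alignments are represented as sets of (source, target) tuples; the function name is illustrative):

```python
def evaluate(proposed, gold):
    """Precision, recall, and F-measure of a set of proposed alignment
    pairs against the gold standard's pairs."""
    correct = len(proposed & gold)
    # Precision is undefined when no alignments are returned at all.
    precision = correct / len(proposed) if proposed else None
    recall = correct / len(gold)
    if precision is None or precision + recall == 0:
        f_measure = 0.0
    else:
        f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

If one of two proposed pairs appears among two gold pairs, precision, recall, and F-measure are all 0.5.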

8.3 Gold Standard

A gold standard for the 2011 dataset is provided in both HTML format, displayed in Figure 8.2, and RDF format, shown in Figure 8.1.

We used the HTML file for the evaluation of our alignments; the RDF file shown in Figure 8.1 could also have been used. The script that computes the evaluation opens the HTML alignment and creates a hash table where the left-hand side of the equals sign is the key and the right-hand side is the value. For each alignment, the script finds


the source ontology's element and the target ontology's element in the clone pair. The script builds an array in which each element is a tuple of the source and target ontology elements. The size of the intersection of this array with the dictionary gives the numerator of the precision and recall equations; the size of the array of proposed alignments and the number of entries in the dictionary give the denominators for precision and recall, respectively.
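A minimal sketch of the script's first step, assuming gold-standard lines shaped like Figure 8.2 (the function name and regular expression are illustrative):

```python
import re

def parse_gold_html(html):
    """Build the hash table from the HTML gold standard: each
    <dt>Source = Target</dt> entry becomes one dictionary entry,
    keyed by the left-hand side of the equals sign."""
    table = {}
    for left, right in re.findall(r"<dt>\s*(.+?)\s*=\s*(.+?)\s*</dt>", html):
        table[left] = right
    return table
```

Parsing the entry of Figure 8.2 yields {"Chapter": "Chapitre"}.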

8.4 Experiment Conditions

The goal of this thesis is to assess the feasibility of using clone detection techniques for

ontology alignment. With this in mind, we seek to measure how well NeCO can align

classes in ontologies, and to compare how well the alignment of classes compares to

other tools which are trying to align all elements of OWL ontologies. The alignment

of object properties, datatype properties, and individuals are left for future work.

NeCO uses NiCad, which has several parameters. The most important is the threshold. In order to determine the most effective threshold for evaluation, we tried many values.

A parameter we kept consistent among all trials was the minimum number of lines

that a potential clone had to have in order to be evaluated by NiCad. We set the

minimum number of lines to 3, as there were cases where a class was 3 lines long before contextualization and grew longer after contextualization.

For a baseline, we ran the dataset with and without the filter, comparing NeCO's results with contextualized but unfiltered NiCad alignments. We also ran the best parameters for the


2011 dataset on the 2012 dataset to verify whether the results are comparable for the

two datasets.

8.5 Summary

This chapter discusses the experimental conditions we set for the ontology alignment.

We outline how we measure the tool’s results, as well as describe the dataset. The

next chapter describes the results of the experiment using the experimental conditions

explained in this chapter.


Chapter 9

Results

In the previous chapter, we described the details of our experiment. This chapter

presents the results of our experiments on the 2011 Biblio dataset, along with a

comparison of NeCO to other algorithms. We then ran NeCO on the 2012 Biblio

dataset and report those results.

9.1 2011 Ontology Alignment Results

The Ontology Alignment Evaluation Initiative (OAEI) publishes results for each year

they present data. The results of the tools that entered the competition are available

for us to use in comparison with our tool [20]. The dataset used for a baseline is the

Biblio dataset. This dataset has a base ontology which describes bibliographies. The

base ontology has systematic alterations performed on it to create a new, modified

ontology describing the same subject matter. The base ontology and its modified

version are aligned.

Table 9.1 shows the results for the tools that entered the 2011 competition, providing us with a benchmark with which to evaluate our tool.

We performed two tests to compare with the tools shown in Table 9.1. We ran


              edna                          AgreementMaker
Tests       Precision  F-measure  Recall   Precision  F-measure  Recall
100-level     1.00       1.00      1.00      1.00       1.00      1.00
200-level     0.50       0.51      0.51      0.98       0.71      0.56
H-mean        0.50       0.51      0.52      0.98       0.71      0.56

              Aroma                         CSA
Tests       Precision  F-measure  Recall   Precision  F-measure  Recall
100-level     1.00       1.00      1.00      1.00       1.00      1.00
200-level     0.93       0.68      0.53      0.82       0.72      0.64
H-mean        0.93       0.68      0.53      0.82       0.73      0.65

              CIDER                         CODI
Tests       Precision  F-measure  Recall   Precision  F-measure  Recall
100-level     0.96       1.00      0.96      0.25       0.19      0.15
200-level     0.89       0.70      0.58      0.94       0.74      0.61
H-mean        0.89       0.70      0.58      0.93       0.73      0.60

              LDOA                          Lily
Tests       Precision  F-measure  Recall   Precision  F-measure  Recall
100-level     0.77       0.86      0.97      1.00       1.00      1.00
200-level     0.51       0.51      0.51      0.93       0.70      0.57
H-mean        0.51       0.51      0.51      0.93       0.70      0.57

              LogMap                        MassMatch
Tests       Precision  F-measure  Recall   Precision  F-measure  Recall
100-level     0.99       1.00      1.00      0.99       0.96      0.92
200-level     0.54       0.31      0.21      0.63       0.63      0.62
H-mean        0.55       0.32      0.22      0.64       0.63      0.62

              MapEvo                        MapPSO
Tests       Precision  F-measure  Recall   Precision  F-measure  Recall
100-level     1.00       1.00      1.00      1.00       0.99      0.98
200-level     0.99       0.66      0.49      0.99       0.60      0.43
H-mean        0.99       0.67      0.50      0.99       0.61      0.44

              MapSSS                        YAM++
Tests       Precision  F-measure  Recall   Precision  F-measure  Recall
100-level     1.00       1.00      1.00      1.00       1.00      1.00
200-level     0.96       0.77      0.64      0.97       0.74      0.60
H-mean        0.96       0.77      0.64      0.97       0.74      0.60

              Optima
Tests       Precision  F-measure  Recall
100-level     1.00       1.00      1.00
200-level     0.59       0.55      0.52
H-mean        0.60       0.56      0.53

Table 9.1: Alignment tool results on the Biblio dataset


Threshold  Precision  F-measure  Recall
0.00       1.00       0.06       0.03
0.10       0.88       0.09       0.05
0.20       0.74       0.35       0.23
0.30       0.40       0.43       0.46
0.40       0.15       0.23       0.51
0.50       0.07       0.12       0.53
0.60       0.05       0.09       0.54
0.70       0.03       0.06       0.55
0.80       0.05       0.09       0.61
0.90       0.03       0.06       0.72

Table 9.2: Traditional clone-detection results

Threshold  Precision  F-measure  Recall
0.00       1.00       0.06       0.03
0.10       0.95       0.10       0.05
0.20       0.87       0.36       0.23
0.30       0.78       0.56       0.44
0.40       0.69       0.56       0.47
0.50       0.65       0.55       0.48
0.60       0.65       0.55       0.48
0.70       0.65       0.55       0.48
0.80       0.60       0.56       0.53
0.90       0.49       0.54       0.61

Table 9.3: NeCO's results

traditional clone-detection without contextualization and NeCO with varying thresh-

olds to find the best threshold. We then compare the two methods to verify whether

filtering is a practical step, or whether near-miss clone detection without filtering

works for ontology alignment.

Traditional clone detection's results without filtering are shown in Table 9.2 and

Figure 9.1. NeCO’s results for all the tests are shown in Table 9.3 and Figure 9.2.


Figure 9.1: Traditional clone detection’s statistics for varying thresholds

Figure 9.2: NeCO’s statistics for varying thresholds


            Clone Detection              NeCO
Tests       Precision F-measure Recall   Precision F-measure Recall
100-level   0.19      0.32      0.98     1.00      0.99      0.98
200-level   0.41      0.42      0.44     0.77      0.55      0.43
H-mean      0.40      0.43      0.46     0.79      0.56      0.44

Table 9.4: Average for test levels

We consider F-measure to be the best indicator of success for a particular threshold. F-measure is the harmonic mean of precision and recall. The best threshold for both

traditional clone-detection and NeCO is 0.30 according to our experiments shown in

Table 9.2 and Table 9.3. By comparing the results for Table 9.2 and Table 9.3, we find

that the filtering process greatly increases precision with a small cost to recall when

the threshold is 0.30. Because of the filtering step, NeCO has a higher F-measure,

and can therefore be considered the better method of finding clones.
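As a concrete illustration of how these figures relate (our own sketch, not part of NeCO), the F-measure and H-mean values in the tables above can be reproduced in a few lines of Python:

```python
def f_measure(precision, recall):
    """F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def harmonic_mean(values):
    """H-mean across test levels, as reported in the OAEI result tables."""
    if any(v == 0 for v in values):
        return 0.0
    return len(values) / sum(1.0 / v for v in values)

# NeCO at threshold 0.30 (Table 9.3): precision 0.78, recall 0.44
print(round(f_measure(0.78, 0.44), 2))  # 0.56, the reported F-measure
```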

The 100-level and 200-level test results are shown in Table 9.4 for comparison with other tools. The thresholds for traditional clone detection and NeCO are set to 0.30, as this value provided the highest F-measure over all the tests.

We compare traditional clone detection and NeCO to the other alignment tools shown in Table 9.1 in the chart shown in Figure 9.3. Each entry on the chart is one tool's precision and recall for the 2011 dataset. NeCO's results are similar to those of many tools, yielding a better precision than some; however, NeCO has worse recall than many of the other tools.

Table 9.5 shows the precision of the tools sorted in descending order. Clone

detection and NeCO are shown in bold with the threshold value set to 0.30. The

best-match alignment tool has a precision that is comparable to other alignment

tools. However, the traditional clone detection technique has the worst precision of


Figure 9.3: Chart comparing alignment tools with the 2011 dataset

all the alignment tools. Table 9.5 shows that the filtering process improves precision.

Table 9.6 shows the recall of all the tools entered in the 2011 competition. The

recall of our method is quite low, which is the greatest weakness of our method.

Finally, Table 9.7 shows the F-measure for our tool and the other tools that entered

the 2011 OAEI competition. NeCO performs better than some tools, and considerably better than traditional clone detection.


Tools sorted by Precision
LogMap                        0.99
MassMatch                     0.99
AgreementMaker                0.98
YAM++                         0.97
Aroma                         0.93
CODI                          0.93
Lily                          0.93
CIDER                         0.89
CSA                           0.89
MapSSS                        0.80
NeCO                          0.79
Optima                        0.60
MapEvo                        0.54
LDOA                          0.51
edna                          0.50
Traditional Clone Detection   0.30

Table 9.5: Tools' precision on the 2011 Biblio dataset

Tools sorted by Recall
CSA                           0.65
MapSSS                        0.62
CODI                          0.60
YAM++                         0.60
CIDER                         0.58
Lily                          0.57
AgreementMaker                0.56
Aroma                         0.53
Optima                        0.53
edna                          0.52
LDOA                          0.51
LogMap                        0.50
Traditional Clone Detection   0.45
NeCO                          0.44
MassMatch                     0.44
MapEvo                        0.22

Table 9.6: Tools' recall for the 2011 dataset


Tools sorted by F-measure
MapSSS                        0.77
CODI                          0.74
YAM++                         0.74
CSA                           0.73
AgreementMaker                0.71
CIDER                         0.70
Lily                          0.70
Aroma                         0.68
LogMap                        0.67
MassMatch                     0.61
NeCO                          0.56
Optima                        0.56
edna                          0.51
LDOA                          0.51
Traditional Clone Detection   0.43
MapEvo                        0.32

Table 9.7: Tools' F-measure results for the 2011 dataset

9.2 Analysis

The changes made to the source ontology were changes to the labels attached to

classes, either by changing the labels to a random string or translating the ontology’s

labels into French. Likewise, the removal of comments, of references to properties, and of individuals are possible alterations made to the seed ontology

for the generation of tests. Another systematic change done to the ontology was

changing the hierarchy of the ontology, either by flattening, expanding, or suppressing

the hierarchy. Finally, some tests included the removal of restrictions within classes,

which left no lines of code within the extracted classes.


Modification                    Precision  F-measure  Recall
Labels Changed                  0.78       0.56       0.44
Comments Removed                0.77       0.52       0.39
Instances Removed               0.75       0.52       0.40
Flattened Hierarchy             0.76       0.66       0.58
Expanded Hierarchy              0.52       0.32       0.23
Suppressed Hierarchy            0.81       0.16       0.09
Property Restrictions Removed   Undefined  Undefined  0

Table 9.8: NeCO's statistics for different alterations to the ontology

There are 102 tests in the 2011 dataset, made up of the changes previously mentioned and of different combinations of these changes.1 The transformations performed on the ontology and the resulting changes in NeCO's precision and recall are shown in Table 9.8.

The removal of comments did not pose any problems for NeCO because it removes comments automatically. Changing the labels of classes and properties led to a decrease in precision; however, no decrease in recall was recorded for that change, because NeCO is still able to find alignments through the structural similarities between the original class and its altered version. The removal of instances also caused no problems, because we do not align instances in this version of the tool.

Changes to the Biblio hierarchy change the structure of the graph representing the ontology. The ontology is represented as a graph without cycles, and changes to this graph had varying effects. Flattening the hierarchy reduces the number of vertices in the graph; expanding the hierarchy introduces intermediate classes to increase the number of vertices; suppressing the hierarchy changes the structure of the graph so that every element is the child of

1http://oaei.ontologymatching.org/2011/


the OWL construct called “Thing”. All changes in precision for these conditions can

be explained by the changing of the structure of extracted elements.
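For illustration only (this is our own sketch, not the benchmark generator), two of these alterations can be expressed over a simple child-to-parent map:

```python
# The class hierarchy is modelled as a child -> parent map.

def suppress_hierarchy(parents):
    """Make every class a direct child of the root construct "Thing"."""
    return {cls: "Thing" for cls in parents}

def flatten_hierarchy(parents, removed):
    """Drop intermediate classes, reparenting their children upward."""
    out = {}
    for cls, parent in parents.items():
        if cls in removed:
            continue
        while parent in removed:  # skip over removed intermediates
            parent = parents[parent]
        out[cls] = parent
    return out

parents = {"Article": "Publication", "Book": "Publication",
           "Publication": "Thing"}
print(flatten_hierarchy(parents, {"Publication"}))
# {'Article': 'Thing', 'Book': 'Thing'}
```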

Flattening the hierarchy led to an increase in recall. Expanding the hierarchy,

introducing intermediary classes, caused a reduction in precision and recall. We

believe the precision and recall fall because the increase in the number of intermediary

classes lowers the similarity value of the correct alignment, and increases the likelihood

that another element in the target ontology may be incorrectly aligned with the

element. The addition of lines of OWL code may also decrease the similarity of correct

alignments below the threshold, which leads to a decrease in recall. Suppressing the

hierarchy, removing all hierarchical information, caused NeCO great difficulty, yielding a slightly higher precision but a very low recall. We believe the similarity value of

elements falls substantially, which leads to many potential alignments falling below

the threshold, which then leads to a fall in recall. Much of NeCO’s information comes

from contextualization, so suppressing the hierarchy of the ontology means that NeCO

cannot properly contextualize elements.

The suppression of restrictions on classes also caused our tool great difficulty. Classes are defined as sets of individuals, and a class definition must include the restrictions on that class. Because the restrictions were removed, the extracted classes contained no information for the clone detector to work with, and NeCO did not return any alignments for tests that suppress restrictions.


Threshold  Precision  F-measure  Recall
0.30       0.52       0.32       0.23
0.40       0.43       0.35       0.29

Table 9.9: NeCO results for the 2012 Biblio dataset

Test Level  Precision  F-measure  Recall
100-level   1.00       1.00       1.00
200-level   0.41       0.33       0.27

Table 9.10: Statistics for NeCO's results for the 2012 Biblio dataset by test level

9.3 2012 Ontology Alignment Results

The 2012 Biblio dataset introduced some changes, and is available from the OAEI

2012 website2. More tools participated in the 2012 evaluation [1]. We concentrate on comparing our tool's 2012 results with its 2011 results.

We ran our tool with thresholds of 0.30 and 0.40, because 0.30 had provided our best results and 0.40 was not much worse. Our goal was to examine whether the

results from the 2011 dataset would be comparable to the results in the 2012 dataset.

The 2012 dataset uses the same source ontology with some syntactical changes. The

meaning is unchanged.

The 2012 dataset contained many more of the tests on which we did poorly, and

removed the easier tests from the 2011 dataset. The changes to the ontology are the

same we describe in Section 9.2. Our results are shown in Table 9.9 for the thresholds

0.30 and 0.40.

Table 9.10 shows how well the tool performs on the 100-level and 200-level tests, and is included for comparison with the tools shown in Table 9.1.

Our results fell considerably for the 2012 dataset. We found that a threshold of

2http://oaei.ontologymatching.org/2012/


0.40 was best, but we could not replicate the success we had with the 2011 dataset.

9.4 Run-time

We cannot directly compare the run-times of the OAEI competitions with the run-

time of our method, as we do not have access to a computer with the same specifications, we run a different operating system, and the scope of our alignments

is different. However, it makes sense to indicate our execution time given that our

initial requirement was to achieve a quick alignment.

The run-time of our method on a machine with a 2.2 GHz Intel Core i7 processor

with 4 GB of RAM running MacOS X version 10.7.5 is approximately 50 seconds. The

tests done for the 2011 alignment ran on a machine with a 3GHz Xeon 5472 processor

with 8 GB of RAM running Linux Fedora 8. Our machine is not as powerful as the

machine that ran the evaluation; however, we only perform alignments on a small part

of the ontology.

Table 9.11 shows the run-time for the Biblio dataset in minutes. Some tools did

not parse all the ontologies or failed to complete, and for these reasons, the run-time is

not included [20]. Because our tool took less than one minute on the weaker machine, we are confident that it would also run quickly on the evaluation machine, perhaps in less time than the fastest tool in the table.

9.5 Blind Renaming

Blind renaming refers to the process of ignoring differences in identifiers when compar-

ing potential clones in clone detection. Typically, blind renaming leads to a significant

increase in recall at the cost of precision. In some applications, such as software model


System                        Runtime
NeCO                          0.83
edna                          1.06
Aroma                         1.10
LogMap                        2.16
CSA                           2.61
YAM++                         6.68
MapEvo                        7.44
Lily                          8.76
CIDER                         30.30
MassMatch                     36.06
Optima                        149
MapPSO                        185
LDOA                          1020

Table 9.11: Tools' execution time for the 2011 Biblio dataset in minutes

clones, blind renaming yields much higher overall accuracy.

As an additional experiment, we attempted to perform an alignment with blind

renaming. However, in our case this technique leads to a decrease in both precision

and recall. Recall decreased due to the best-match process, where the correct answer was rejected because it did not have the highest similarity value.
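As an illustration of what blind renaming does (a simplified sketch, not NiCad's implementation), every identifier on a pretty-printed line can be replaced by the same placeholder before lines are compared:

```python
import re

def blind_rename(line):
    """Replace every identifier with the placeholder "x" (simplified sketch)."""
    return re.sub(r"[A-Za-z_][A-Za-z0-9_]*", "x", line)

# Two lines that differ only in names become identical after blind renaming.
print(blind_rename("<subClassOf rdf:resource='#Book'/>") ==
      blind_rename("<subClassOf rdf:resource='#Article'/>"))  # True
```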

9.6 Similarity Range

We also experimented with returning ranges of matches. Rather than picking a single best match, we defined a range parameter, which we called alpha. At first, we tried an absolute range, returning all alignments whose similarity is greater than or equal to the highest similarity minus alpha. For example, if

alpha was set at 5%, and the highest similarity for a particular element was 90%,

then every alignment at 85% or above was returned. We found that precision decreased and recall increased, but F-measure decreased overall.
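The absolute range can be sketched as follows (our own illustration; `candidates` maps target elements to similarity percentages):

```python
def absolute_range(candidates, alpha):
    """Return every target within `alpha` points of the best similarity.

    candidates: dict mapping target element -> similarity in percent.
    """
    best = max(candidates.values())
    return {t for t, s in candidates.items() if s >= best - alpha}

# alpha = 5 with a best similarity of 90 returns everything at 85 or above.
print(sorted(absolute_range({"a": 90.0, "b": 86.0, "c": 70.0}, 5)))  # ['a', 'b']
```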


Next, we tried a relative range. Alpha was set to a percentage, called beta, of the difference between 100 and the highest similarity. For example,

if beta was 10%, and the highest similarity was 80% for a particular element, then

alpha would be set to 2%, and an alignment would return all alignments that had

a similarity above 78%. Precision decreased to a smaller degree than our previous

experiment, and recall increased. F-measure fluctuated with different values for alpha

and beta, but remained about the same.

Finally, we created an offset variable. The idea was that if two elements have high

similarity, then a specific alignment is more certain. We used the alpha and beta

variables from the previous paragraph, but if two elements had a similarity within the offset of 100%, then no range was used. For example, with an offset of 10%, alpha at 10%, and beta at 10%, if the highest alignment similarity is 90%,

then the values for alpha and beta were not used. However, if a particular element from the source ontology had a highest similarity value of 80% with the same parameters, then the range would accept all alignments

with similarity 79% and above. The results of this experiment were similar to the

ones above, where F-measure fluctuated with different values, but did not show a

noticeable improvement.
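Under our reading of these rules (an illustrative sketch; the exact rounding in the 79% example above is not reproduced), the relative range with an offset looks like this:

```python
def offset_range(candidates, beta, offset):
    """Relative similarity range with an offset (illustrative sketch).

    If the best similarity is within `offset` points of 100, only the best
    match is returned; otherwise alpha is beta percent of (100 - best).
    """
    best_target = max(candidates, key=candidates.get)
    best = candidates[best_target]
    if best >= 100 - offset:
        return {best_target}
    alpha = beta / 100.0 * (100 - best)  # beta = 10, best = 80 -> alpha = 2
    return {t for t, s in candidates.items() if s >= best - alpha}

print(sorted(offset_range({"a": 80.0, "b": 78.5, "c": 70.0}, 10, 10)))  # ['a', 'b']
print(sorted(offset_range({"a": 90.0, "b": 89.0}, 10, 10)))             # ['a']
```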

9.7 Summary

This chapter provided the results of our experiment. We compared the results of

using clone detection for alignment with and without the filtering process. We found

that the filtering process increases F-measure by increasing precision with a small

cost to recall. We conclude that filtering results provides for better alignments.


Chapter 10

Summary and Conclusions

The previous chapter presented our method of evaluating our alignment tool, and

the settings we used for evaluation. This chapter summarizes the work, recapitulates

the results obtained by NeCO, discusses the limitations of our method, and indicates

directions for future work.

10.1 Summary

In this thesis, we presented an ontology alignment technique built on clone detection. We postulated that clone detection techniques can be used as a

general-purpose ontology alignment technique that is fast to use and does not require

any prior training.

Our technique consists of four phases: extraction, contextualization, clone detection, and filtering. We implement the method and call it NeCO, the Near-miss Clone Ontology alignment tool. We measure NeCO's precision, recall, and F-measure on the 2011 dataset, then run the 2012 dataset with the same parameters for comparison.

During the extraction phase, we prepare the data for evaluation, so that it can be


used with NiCad. The extraction is done with rules written in TXL, which find all the elements of interest within an ontology and store them for analysis by NiCad.
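The thesis performs this extraction with TXL rules; purely for illustration, an equivalent extraction of owl:Class definitions from an OWL/RDF document can be sketched with Python's standard XML parser:

```python
import xml.etree.ElementTree as ET

OWL = "{http://www.w3.org/2002/07/owl#}"
RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"

def extract_classes(rdf_xml):
    """Map each owl:Class identifier to its serialized definition."""
    root = ET.fromstring(rdf_xml)
    return {el.get(RDF + "ID") or el.get(RDF + "about"):
            ET.tostring(el, encoding="unicode")
            for el in root.iter(OWL + "Class")}

doc = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                  xmlns:owl="http://www.w3.org/2002/07/owl#">
  <owl:Class rdf:ID="Book"/>
  <owl:Class rdf:ID="Article"/>
</rdf:RDF>"""
print(sorted(extract_classes(doc)))  # ['Article', 'Book']
```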

The next phase is the contextualization phase. The extracted elements are con-

textualized, meaning the references to other elements are replaced with definitions of

the elements being referenced. NiCad is good at making matches within source code projects, and a previous attempt to find clones in WSDL, another language that contains references to outside constructs, had been successful using contextualization [38]. These contextualized elements are fed into NiCad.
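The idea can be sketched as follows (our own simplification; `definitions` holds each extracted class's pretty-printed lines and `parents` its subClassOf reference):

```python
def contextualize(definitions, parents):
    """Inline each ancestor's definition after a class's own lines.

    definitions: {class name: list of pretty-printed lines}
    parents:     {class name: referenced parent class}
    The ontology graph is acyclic, so the walk upward terminates.
    """
    out = {}
    for cls, lines in definitions.items():
        ctx = list(lines)
        parent = parents.get(cls)
        while parent in definitions:
            ctx += definitions[parent]
            parent = parents.get(parent)
        out[cls] = ctx
    return out

defs = {"Publication": ["label Publication"], "Book": ["label Book"]}
print(contextualize(defs, {"Book": "Publication"})["Book"])
# ['label Book', 'label Publication']
```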

Next, both ontologies’ contextualized classes are fed into a clone-detector. NiCad

runs with a threshold and finds all the elements from each ontology with a percentage

of different lines below the threshold. This step returns clone pairs; however, it returns too many possible alignments.
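NiCad's comparison is line-based; in the same spirit (though not NiCad's actual algorithm), a longest-common-subsequence similarity over pretty-printed lines can be computed with Python's difflib:

```python
import difflib

def similarity(lines_a, lines_b):
    """Fraction of lines shared between two pretty-printed elements."""
    sm = difflib.SequenceMatcher(a=lines_a, b=lines_b)
    matched = sum(block.size for block in sm.get_matching_blocks())
    return matched / max(len(lines_a), len(lines_b))

a = ["<owl:Class rdf:ID='Book'>", "<rdfs:label>Book</rdfs:label>", "</owl:Class>"]
b = ["<owl:Class rdf:ID='Livre'>", "<rdfs:label>Livre</rdfs:label>", "</owl:Class>"]
print(round(similarity(a, b), 2))  # 0.33: two of the three lines differ
```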

To improve the accuracy of the alignment tool, results are filtered so that only the

best alignments are returned. The best alignments are taken from NiCad’s output,

and by finding the highest similarity value for a particular element in the source

ontology.
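The filtering step amounts to a best-match selection over the reported clone pairs; a minimal sketch, with illustrative names:

```python
def best_match(clone_pairs):
    """Keep only the highest-similarity target for each source element.

    clone_pairs: iterable of (source element, target element, similarity).
    """
    best = {}
    for src, tgt, sim in clone_pairs:
        if src not in best or sim > best[src][1]:
            best[src] = (tgt, sim)
    return {src: tgt for src, (tgt, _) in best.items()}

pairs = [("Book", "Livre", 0.82), ("Book", "Reference", 0.41),
         ("Article", "Article", 0.95)]
print(best_match(pairs))  # {'Book': 'Livre', 'Article': 'Article'}
```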

We compared our results of aligning classes to the results of other tools’ alignments

of classes, properties, and individuals; we found that our tool was able to perform

alignments reasonably well. We found that NeCO has difficulty aligning elements

if the hierarchy of an ontology is suppressed or expanded. These changes to the

structure removed much of the information that NeCO requires to work properly.

NeCO does well when aligning ontologies that are actually near-miss clones of

each other. For example, if there is an original ontology, and two different authors


have supplemented the original ontology, then both modified versions can be aligned

with each other using this method. More work has to be done to determine whether

it can become a general purpose ontology alignment tool for ontologies that are not

near-miss clones of each other.

10.2 Contributions

This thesis provides two contributions. First, it proposes that research done in the field of clone detection may be beneficial to the ontology alignment problem: we apply contextualization, a technique from the clone detection community, to a new problem.

Our second contribution is the notion of finding a single best answer from a number

of clone pairs by comparing similarity values. This is important for ontology align-

ment, as the traditional clone-detection approach of returning all pairs over a given

threshold is not desirable; rather, our task requires us to return the smallest number of possible alignments in order to improve the overall precision.

10.3 Limitations

The effectiveness of our method is limited when a source ontology's hierarchy is altered, because NeCO uses the structure of extracted elements for alignment. NeCO relies on the information contained within the contextualized elements; if these elements are heavily altered, NeCO has no way to identify two elements as aligned.


10.4 Future Work

Near-miss clone detection was developed for the detection of intentional software clones [51].

As shown in Chapter 9, our method works quite well to align ontologies that are

modified versions of each other without altering the structure of the ontology. Based

on those results we see the following directions for future work:

• Extending the method to align other elements of interest, such as datatype

properties, object properties, and individuals.

• Improving the contextualization process by contextualizing every reference to a class, rather than only the subClassOf statements.

• Contextualizing datatype and object properties references with the definition

of the referenced properties.

• Ordering the XML tags so that tag order does not matter [2].

• Examining the use of consistent renaming.

• Testing ontologies that are not modified versions of each other.


Bibliography

[1] Jose Aguirre et al. Results of the ontology alignment evaluation initiative 2012. In Proceedings of the 7th ISWC Workshop on Ontology Matching, pages 73–115, 2012.

[2] Manar H Alalfi, James R Cordy, Thomas R Dean, Matthew Stephan, and Andrew Stevenson. Models are code too: Near-miss clone detection for Simulink models. In Software Maintenance (ICSM), 2012 28th IEEE International Conference on, pages 295–304. IEEE, 2012.

[3] Soren Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. DBpedia: A nucleus for a web of open data. The Semantic Web, pages 722–735, 2007.

[4] Sean Bechhofer et al. OWL web ontology language reference. http://www.w3.org/TR/owl-ref/, 2004. Last accessed 01-15-2014.

[5] Tim Berners-Lee. Design issues - linked data. http://www.w3.org/DesignIssues/LinkedData.html. Last accessed 01-15-2014.

[6] Tim Berners-Lee et al. The semantic web. Scientific American, 284(5):28–37, 2001.

[7] Tim Berners-Lee et al. Linked data - the story so far. International Journal on Semantic Web and Information Systems, 5(3):1–22, 2009.

[8] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022, 2003.

[9] Jurgen Bock, Carsten Danschel, and Matthias Stumpp. MapPSO and MapEVO results for OAEI 2011. Ontology Matching, pages 179–183, 2011.

[10] Tim Bray, Jean Paoli, C Michael Sperberg-McQueen, Eve Maler, and Francois Yergeau. Extensible markup language (XML). World Wide Web Journal, 2(4):27–66, 1997.

[11] Dan Brickley and Ramanathan V Guha. RDF vocabulary description language 1.0: RDF schema. http://www.w3.org/TR/rdf-schema/, 2004. Last accessed 01-15-2014.

[12] Dan Brickley and Libby Miller. FOAF vocabulary specification 0.98. http://xmlns.com/foaf/spec/, 2010. Last accessed 01-15-2014.

[13] Michelle Cheatham. MapSSS results for OAEI 2011. In Proceedings of the ISWC 2011 Workshop on Ontology Matching, pages 184–190, 2011.

[14] James R. Cordy. The TXL source transformation language. Science of Computer Programming, 61(3):190–210, 2006.

[15] T.H. Cormen. Introduction to Algorithms. The MIT Press, 2001.

[16] Isabel F Cruz, Flavio Palandri Antonelli, and Cosmin Stroe. AgreementMaker: Efficient matching for large real-world schemas and ontologies. Proceedings of the VLDB Endowment, 2(2):1586–1589, 2009.

[17] Jerome David, Fabrice Guillet, and Henri Briand. Matching directories and OWL ontologies with AROMA. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management, volume 6, pages 830–831, 2006.

[18] Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), pages 1–38, 1977.

[19] Pedro Domingos et al. Just add weights: Markov logic for the semantic web. Uncertainty Reasoning for the Semantic Web I, pages 1–25, 2008.

[20] Jerome Euzenat et al. Results of the ontology alignment evaluation initiative 2011. In Proceedings of the 6th ISWC Workshop on Ontology Matching, pages 85–110, 2011.

[21] Jerome Euzenat and Pavel Shvaiko. Ontology Matching, volume 18. Springer Berlin, 2007.

[22] Johannes Gehrke and Raghu Ramakrishnan. Database Management Systems. USA: McGraw-Hill Companies, Inc, 2003.

[23] Christine Golbreich, Evan K Wallace, and Peter F Patel-Schneider. OWL 2 web ontology language: New features and rationale. http://www.w3.org/TR/2009/WD-owl2-new-features-20090611, June 2009. Last accessed 01-15-2014.

[24] Jorge Gracia, Jordi Bernad, and Eduardo Mena. Ontology matching with CIDER: Evaluation report for OAEI 2011. Ontology Matching, pages 126–133, 2011.

[25] W3C OWL Working Group. OWL 2 web ontology language document overview (second edition). http://www.w3.org/TR/2012/REC-owl2-overview-20121211/, 2012. Last accessed 01-15-2014.

[26] Sven Hertling. Hertuda results for OAEI 2012. Ontology Matching, pages 141–144, 2012.

[27] Sven Hertling and Heiko Paulheim. WikiMatch results for OAEI 2012. Ontology Matching, pages 220–225, 2012.

[28] Wei Hu, Yuzhong Qu, and Gong Cheng. Matching large ontologies: A divide-and-conquer approach. Data & Knowledge Engineering, 67(1):140–160, 2008.

[29] Jakob Huber, Timo Sztyler, Jan Noessner, and Christian Meilicke. CODI: Combinatorial optimization for data integration–results for OAEI 2011. Ontology Matching, pages 134–141, 2011.

[30] James W Hunt and Thomas G Szymanski. A fast algorithm for computing longest common subsequences. Communications of the ACM, 20(5):350–353, 1977.

[31] James Wayne Hunt and M Douglas McIlroy. An Algorithm for Differential File Comparison. Bell Laboratories, 1976.

[32] Yves R Jean-Mary, E Patrick Shironoshita, and Mansur R Kabuka. Ontology matching with semantic verification. Web Semantics: Science, Services and Agents on the World Wide Web, 7(3):235–251, 2009.

[33] Ernesto Jimenez-Ruiz and Bernardo Cuenca Grau. LogMap: Logic-based and scalable ontology matching. The Semantic Web – ISWC 2011, pages 273–288, 2011.

[34] Marouen Kachroudi, Essia Ben Moussa, Sami Zghal, and Sadok Ben. LDOA results for OAEI 2011. Ontology Matching, pages 148–155, 2011.

[35] P. Kremen, M. Smid, and Z. Kouba. OWLDiff: A practical tool for comparison and merge of OWL ontologies. In Database and Expert Systems Applications (DEXA), 2011 22nd International Workshop on, pages 229–233. IEEE, 2011.

[36] Patrick Lambrix and He Tan. SAMBO - a system for aligning and merging biomedical ontologies. Web Semantics: Science, Services and Agents on the World Wide Web, 4(3):196–206, 2006.

[37] Juanzi Li, Jie Tang, Yi Li, and Qiong Luo. RiMOM: A dynamic multistrategy ontology alignment framework. Knowledge and Data Engineering, IEEE Transactions on, 21(8):1218–1232, 2009.

[38] Doug Martin and James R. Cordy. Analyzing web service similarity using contextual clones. In Proceedings of the 5th International Workshop on Software Clones, pages 41–46, 2011.

[39] Deborah L McGuinness et al. OWL web ontology language overview. http://www.w3.org/TR/owl-features/, 2004. Last accessed 01-15-2014.

[40] George A. Miller et al. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41, 1995.

[41] Boris Motik et al. OWL 2 web ontology language structural specification and functional-style syntax (second edition). http://www.w3.org/TR/owl2-syntax/. Last accessed 01-15-2014.

[42] Miklos Nagy, Maria Vargas-Vera, and Enrico Motta. DSSim - managing uncertainty on the semantic web. Ontology Matching, pages 160–169, 2007.

[43] DuyHoa Ngo and Zohra Bellahsene. YAM++: A multi-strategy based approach for ontology matching task. In Knowledge Engineering and Knowledge Management, volume 7603 of Lecture Notes in Computer Science, pages 421–425. Springer Berlin / Heidelberg, 2012.

[44] N.F. Noy et al. Ontology development 101: A guide to creating your first ontology. http://www.ksl.stanford.edu/people/dlm/papers/ontology-tutorial-noy-mcguinness.pdf, 2001. Last accessed 01-15-2014.

[45] Dave Raggett, Arnaud Le Hors, Ian Jacobs, et al. HTML 4.01 specification. http://www.w3.org/TR/REC-html40/, December 1999. Last accessed 01-15-2014.

[46] Vijay V Raghavan and SKM Wong. A critical analysis of vector space model for information retrieval. Journal of the American Society for Information Science, 37(5):279–287, 1986.

[47] Matthew Richardson and Pedro Domingos. Markov logic networks. Machine Learning, 62(1):107–136, 2006.

[48] Maria Rosoiu et al. Ontology matching benchmarks: Generation and evaluation. In Proceedings of the 6th ISWC Workshop on Ontology Matching, pages 73–84, 2011.

[49] Chanchal Roy. Detection and Analysis of Near-Miss Software Clones. PhD thesis, Queen's University, 2009.

[50] Chanchal K. Roy and James R. Cordy. A survey on software clone detection research. Queen's School of Computing TR, 541:115, 2007.

[51] Chanchal K. Roy and James R. Cordy. NiCad: Accurate detection of near-miss intentional clones using flexible pretty-printing and code normalization. In The 16th IEEE International Conference on Program Comprehension, pages 172–181. IEEE, 2008.

[52] Chanchal K Roy, James R Cordy, and Rainer Koschke. Comparison and evaluation of code clone detection techniques and tools: A qualitative approach. Science of Computer Programming, 74(7):470–495, 2009.

[53] Frederik C Schadd and Nico Roos. MaasMatch results for OAEI 2011. Ontology Matching, pages 171–178, 2011.

[54] Nigel Shadbolt, Wendy Hall, and Tim Berners-Lee. The semantic web revisited. Intelligent Systems, IEEE, 21(3):96–101, 2006.

[55] P. Shvaiko and J. Euzenat. Ontology matching: State of the art and future challenges. 2012.

[56] Temple F Smith and Michael S Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.

[57] Martin Soto. Delta-P: Model comparison using semantic web standards. Softwaretechnik-Trends, 2:27–30, 2007.

[58] Uthayasanker Thayasivam and Prashant Doshi. Optima results for OAEI 2011. In Proceedings of the ISWC 2011 Workshop on Ontology Matching, pages 204–211, 2011.

[59] Quang-Vinh Tran, Ryutaro Ichise, and Bao-Quoc Ho. Cluster-based similarity aggregation for ontology matching. In Proceedings of the 6th Ontology Matching Workshop, pages 142–147, 2011.

[60] W3C. Linked data. http://www.w3.org/standards/semanticweb/data. Last accessed 01-15-2014.

[61] W3C. OWL 2 web ontology language primer (second edition). http://www.w3.org/TR/2012/REC-owl2-primer-20121211/. Last accessed 01-15-2014.

[62] Peng Wang. Lily results on SEALS platform for OAEI 2011. In Proceedings of the 6th Ontology Matching Workshop, pages 156–162, 2011.

[63] Chris Welty, Deborah L McGuinness, and Michael K Smith. OWL web ontology language guide. http://www.w3.org/TR/owl-guide/, February 2004. Last accessed 01-15-2014.


Appendix A

The 2011 Biblio Source Ontology

This is a graphical representation of the seed ontology used in our experiments.1

1http://oaei.ontologymatching.org/2011/benchmarks/101/onto.rdf

