SCALING ONTOLOGY ALIGNMENT
by
RYAN E. FRECKLETON
B.S.C.p.E, University of Colorado Colorado Springs, 2008
A thesis submitted to the Graduate Faculty of the
University of Colorado Colorado Springs
in partial fulfillment of the
requirements for the degree of
Master of Science in Computer Science
Department of Computer Science
2013
© Copyright by Ryan E. Freckleton 2013
All Rights Reserved
This thesis for the Master of Science in Computer Science degree by
Ryan E. Freckleton
has been approved for the
Department of Computer Science
by
Dr. Jugal Kalita, Chair
Dr. Charles Shub
Dr. Lisa Hines
Dr. Suzette Stoutenburg
Date
Freckleton, Ryan E. (M.S.C.S., Computer Science)
Scaling Ontology Alignment
Thesis directed by Professor Dr. Jugal Kalita
Abstract
As ontologies become more prevalent in biomedicine and other fields, effective
ontology alignment is necessary for their economical and practical use. An ontology is a
group of concepts derived from a corpus of knowledge. Ontology alignment determines the
relationships between these concepts across different ontologies. Therefore ontology alignment is an area of active research, especially scaling ontology alignment, as the number and
size of ontologies increase dramatically.
This thesis describes an approach and implementation of ontology alignment called
Parallel Ontology Bridge, which maintains good alignment quality while increasing scala-
bility and speed of ontology alignment by matching linguistic and structural features in a
support vector machine. This approach is based on Ontology Bridge [1] and provides the
same advantages. It is able to handle non-equivalence relationships very effectively and is
a general approach to ontology alignment that can be used across many domains. Parallel
Ontology Bridge increases scalability by using map-reduce, an approach to breaking down
problems and running them in parallel. This thesis describes how this is done. Parallel
Ontology Bridge is almost two orders of magnitude faster than Ontology Bridge and shows
very good scalability while maintaining quality as measured through F-Measure.
The results of Parallel Ontology Bridge are compared against several other scalability
approaches, both with experimental data and theoretical maximum scalability. Parallel
Ontology Bridge is significantly more scalable in the experimental data and maintains this
advantage during theoretical analysis.
To my Montessori teacher, who always knew the joy of learning and understanding.
Acknowledgements
I’d like to acknowledge all the people who have positively affected the creation of this
thesis: my employer, The MITRE Corporation, my coworkers, my advisory committee and
my family.
I’d like to especially thank Dr. Suzette Stoutenburg. Without her help and previous
work in this area I would not have been able to create this thesis.
I’d especially like to thank my mother, Irene Freckleton and father, Grover Freckleton
for their emotional support as well as deep discussions on the concepts of ontology align-
ment and graphical presentation. I’d also like to thank my Aunt Karen, whose excitement
was infectious and whose way with words helped make this thesis succinct and clear.
My friend, Dr. Gregory Plett gave me incomparable help and advice with typesetting.
Tim Flink, my friend and fellow graduate student, saw architectural issues I was blind to.
My friend and colleague Dr. Norman Facas gave unparalleled advice on organization
and the appropriate layout of graphs and data.
Thank you Dr. Lisa Hines, for giving me one-on-one attention to get up to speed on
biology and medicine. Dr. Charlie Shub, thank you for your continued support and fo-
cus. Your mentorship during my undergraduate studies prepared me for this thesis and my
professional career.
Finally, I’d like to thank my advisor Dr. Jugal Kalita. His expertise in artificial intel-
ligence has been unparalleled.
Without their assistance, feedback and support this would not have been possible to complete.
It’s been a long, sometimes stressful journey on this path of knowledge. I appreciate all that
you’ve done for me.
Thank you.
Table of Contents
1 Background on Ontologies and
Ontology Alignment 1
1.1 Purpose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Ontologies Used . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Example Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.5 Scaling Ontology Alignment . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.6 Achievements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.7 Focus of This Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.8 Background and Organization of Thesis . . . . . . . . . . . . . . . . . . . 8
2 Motivation 9
3 Survey of the State of the Art in
Ontology Alignment 12
3.1 Developments in the State of the Art . . . . . . . . . . . . . . . . . . . . . 13
3.2 Approach to Comparison and Analysis . . . . . . . . . . . . . . . . . . . . 13
3.3 Comparing Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.4 Comparison of Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.4.1 LOOM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.4.2 AROMA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.4.3 SOBOM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.4.4 Falcon AO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.4.5 Stoutenburg Ontology Bridge . . . . . . . . . . . . . . . . . . . . 17
3.4.5.1 Branch and Bound Approach . . . . . . . . . . . . . . . 18
3.5 Survey Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4 Definitions 19
4.1 Performance Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.2 Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.3 Function Primitives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.4 Scaling Nomenclature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.5 Ontology Nomenclature . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
5 The Strategic Approach
of Parallel Ontology Bridge 23
5.1 Test Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5.1.1 Platelet Activation . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5.1.2 Mannose Binding . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
5.1.3 Immune System . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
5.1.4 Phenylalanine Conversion . . . . . . . . . . . . . . . . . . . . . . 24
5.1.5 Bone Remodeling . . . . . . . . . . . . . . . . . . . . . . . . . . 24
5.1.6 Bone Marrow . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.1.7 Osteoblast Differentiation . . . . . . . . . . . . . . . . . . . . . . 25
5.1.8 Osteoclast Differentiation . . . . . . . . . . . . . . . . . . . . . . 25
5.1.9 Behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.1.10 Circadian Rhythm . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.2 Attempts to Enhance Alignment . . . . . . . . . . . . . . . . . . . . . . . 25
5.2.1 Parallel Human Computation . . . . . . . . . . . . . . . . . . . . . 26
5.2.2 Information Entropy and Morpheme Based Extraction . . . . . . . 26
5.3 Summary of Parallel Ontology Bridge . . . . . . . . . . . . . . . . . . . . 27
5.3.1 MapReduce . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
5.4 Aligning Ontologies with MapReduce . . . . . . . . . . . . . . . . . . . . 31
5.5 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.6 Comparison to Ontology Bridge . . . . . . . . . . . . . . . . . . . . . . . 33
5.7 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.8 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.8.1 Issues With Java Implementation . . . . . . . . . . . . . . . . . . . 38
5.9 Parallel Ontology Bridge F-Measure . . . . . . . . . . . . . . . . . . . . . 39
5.10 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
6 Scalability Results 41
6.1 Scalability Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
6.2 Scalability Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
6.3 Other Systems Compared . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
6.3.1 Gross’s Approaches . . . . . . . . . . . . . . . . . . . . . . . . . 43
6.3.2 Zhang’s Approach . . . . . . . . . . . . . . . . . . . . . . . . . . 43
6.4 Input . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
6.5 Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
6.6 Comparison of Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . 45
6.7 Summary of Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
7 Conclusion 51
7.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
A Code Listings 54
A.1 Align Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
A.2 Parallel Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
A.3 Primitives Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 61
A.4 OpenCyc Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Bibliography 77
List of Tables
2.1 Biomedical Ontology Sizes . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.1 Results From Literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
5.1 Human Computation Experiment . . . . . . . . . . . . . . . . . . . . . . . 27
5.2 Chunking (segmenting strings based on information entropy) Results . . . . 28
5.3 Example Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . 35
5.4 Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.5 Unit Test Statement Coverage. This shows how much code in each one of
these modules is covered by automated unit tests. . . . . . . . . . . . . . . 38
5.6 Cross-Validation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
6.1 Input Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
6.2 Architecture Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
6.3 Scalability Metrics Comparison . . . . . . . . . . . . . . . . . . . . . . . 49
6.4 Speed of Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
List of Figures
1.1 Gene Ontology. The different shading represents different subdomains. . . 3
1.2 Mammalian Phenotype Ontology . . . . . . . . . . . . . . . . . . . . . . . 4
4.1 Precision and Recall
Public Domain image from WikiMedia . . . . . . . . . . . . . . . . . . . . 20
5.1 Ontology Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.2 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.3 Performance of Ontology Bridge variants at various numbers of ontology
pairs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
6.1 Other Systems Curve Fits . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
6.2 Comparison of scalability of ontology alignment approaches based on data
points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
6.3 Comparison of extrapolated scalability of ontology alignment approaches . 47
CHAPTER 1
Background on Ontologies and
Ontology Alignment
1.1 Purpose
Modern civilization is based on a dynamic and changing foundation of information. There
are 8.7 million species of lifeforms cataloged [2] as well as 19 million articles on Wikipedia
[3] in various languages and an endless number of entries in ontologies. Some of these fast-growing ontologies are in the biomedical field. An ontology is a group of concepts derived
from a corpus of knowledge.
Information about biomedicine is being updated on a daily basis [4]. At this rate
the amount of information continues to increase and it is becoming more difficult for re-
searchers to make sense of it [5]. Computer based reasoning holds great promise of a
solution to this problem of increasing information by allowing inferences and deductions
about information and knowledge. To do this economically and practically, it is necessary
to coordinate the effective development and reuse of ontologies. New tools are needed to
meet these new challenges [5].
One of these new tools is ontology alignment. Ontology alignment relates two existing
ontologies to each other. A general approach to ontology alignment is Ontology Bridge,
described by Stoutenburg in [1]. Ontology Bridge finds multiple relationships between
ontologies. Overlapping relationships and concepts are linked by semantic bridges based
on linguistic features, semantic information and structure. It gives good results, but can
be slow for large ontologies. This thesis builds on Ontology Bridge and increases the
scalability and speed of execution. The implementation of this thesis is called Parallel
Ontology Bridge.
Parallel Ontology Bridge improved upon the performance of Ontology Bridge by a
factor of 19 to 48, depending on the test, while maintaining F-Measure. There are few
approaches to scaling ontology alignment [6]. The scalability of Ontology Bridge was increased in [1] through a branch and bound optimization. In contrast, this
thesis uses parallel execution to increase scalability. The results achieved with this parallelization were good enough that adding branch and bound was judged not to be worthwhile,
since it would reduce the quality of alignment for no significant gain in speed.
There are many purposes for ontology alignment [7]. By finding relationships between
objects in disparate models, ontology alignment can be used for data integration. Agents
that use logic programming can use ontology alignment to learn about new domains and
integrate with systems. Aligned ontologies provide a basis for extracting information from
natural language documents.
Larger ontologies allow for more precise and detailed models of the world, which is
the limiting factor in many of these application areas. Ontology alignment also enables
interoperability of independently developed systems by providing an accurate shared se-
mantic vocabulary [7].
As the size of the ontologies being aligned increases, the number of computations necessary grows
quadratically. Therefore, effective approaches to scalability are paramount. The purpose of
this thesis is to provide a solution to scaling ontology alignment.
1.2 Ontologies Used
The two ontologies used in this thesis are the Gene Ontology (GO) [4] (Figure 1.1) and
Mammalian Phenotype Ontology (MP), [8] (Figure 1.2). As of the publication of this
thesis, it is estimated that the Gene Ontology consists of 32,481 OWL [9] classes. The MP
ontology consists of 6,516 ontology classes. OpenCyc, a “universal” ontology, consists
of 116,822 OWL classes.
Figure 1.1: Gene Ontology. The different shading represents different subdomains.
Figure 1.2: Mammalian Phenotype Ontology
The entirety of OpenCyc1 [10] was used as an upper ontology in Parallel Ontology
Bridge. To reduce execution time, 1,000-class subsets of the Gene Ontology and Mammalian
Phenotype Ontology were used for the majority of testing.
The Gene Ontology is a dynamic, controlled vocabulary for eukaryotic cells. It is
actively updated as daily discoveries are made. By gathering and sharing information on
the common genes and proteins, the hope is to help the grand unification of biology, which
is the understanding of all organisms. The information in GO provides strong inference to
the functions of other organisms. The goal is to enable the annotation of the genomes of all
organisms using a shared system of nomenclature and understanding.
1 http://www.cyc.com/platform/opencyc is an ontology and reasoning engine that aims to cover “common sense”. It defines concepts such as physical, temporal and conceptual entities and the relationships between them.
Individual species and organisms are not represented in GO, nor are specialized or-
gans or body parts. Instead, knowledge from GO is transferred to these specific contexts
through the use of species and anatomical databases. This transfer can be aided by ontology
alignment.
The Mammalian Phenotype Ontology provides a computationally accessible way to
annotate phenotype information to individual genotypes through a shared vocabulary to
describe concepts. Since annotations require more than simple vocabularies, it is imple-
mented as an ontology. Phenotype information tends to be complex and incomplete, so
these constraints are handled directly by MP.
Both these ontologies are continuously updated and exist in complementary, but sepa-
rate domains. Alignment of these two ontologies will enable further and better annotation
of genotype and gene expression information. Alignment will also benefit other uses of
these ontologies, such as hypothesis generation and collaboration.
1.3 Example Applications
Examples of small-scale ontology alignment include database migration and interoperabil-
ity between different systems. For example, a library that has records using the Dewey
Decimal System needs to be properly aligned to a library using a different system, such
as the Library of Congress Classification. One approach of solving this problem would be
to create an ontology describing each system and match them so that concepts could be
translated between the systems.
1.4 Issues
Often ontology alignment occurs at a small scale and is assisted by humans. This approach
is uneconomical for the large ontologies in the biomedical field. There already exist methods for matching ontologies [7]. Some use machine learning, such as AROMA [11], which
finds associations, and Codi [12], which uses Markov logic to detect similarities [13]. Oth-
ers use natural language processing, such as AgreementMaker [14], which finds lexical
similarity and FALCON-AO [15], which uses statistics on virtual documents [13]. Large
ontologies can take hours or days to complete alignment [16]. Doing ontology alignment
by human means is unfeasibly expensive and time consuming.
1.5 Scaling Ontology Alignment
Generally, the number of operations needed to align two ontologies, o1 and o2, grows
O(m · n), where m = |o1| and n = |o2|.
The number of concepts in each ontology has to be compared to concepts in the other
ontology to determine relationships. Using the sizes of ontologies described in Section 1.2,
gives a size on the order of
8 × 10⁶ · 6 × 10⁵ ≈ 5 × 10¹²
comparisons. Even if each comparison took only one microsecond, that would take ap-
proximately 60 days of processing.
Since each concept pair in the alignment can be compared independently, this is
amenable to a divide and conquer approach: first, through traditional branch and bound techniques as described in [1], and second, through standard distributed computing and parallelization
[17]. This thesis focuses on the latter technique, reducing the processing time by executing
the necessary comparisons in parallel. Parallel computation is now especially attractive
because of the large amount of cloud computing resources available. The system used in
this thesis was rented on Amazon Web Services for a total of $512.44 in cost and 172.8
hours of execution time.
1.6 Achievements
The creation of Parallel Ontology Bridge achieved several advances to the state of the art.
First, it is very scalable because it can take advantage of many processing units. The results
described in this thesis use a 16 core machine, but it is theoretically scalable up to dozens
of processing units.
Secondly, it is quite fast. Parallel Ontology Bridge can process 25,000±3000 pairs per
minute on the 16 core processor system. At this speed, it is able to complete the ontology
alignment of the GO and MP ontologies, which has 210 million possible pairs, in 5.8 days.
Due to a fault in the concurrency implementation, this entire run was not completely finished, but
enough results exist to show the scalability and validity of the approach.
The hardware used to run these tests was from Amazon’s new High Performance
Computing2 services, showing that cloud computing is an amenable fit to scaling ontology
alignment. A single instance with 16 cores was created on the Amazon Elastic Compute
Cloud, the software uploaded and installed, and the tests executed.
The method of parallelization used, dividing the problem using MapReduce [18] with
some shared resources for ontology graph lookup, is simple and unique in this domain.
The details of these results are elaborated in Chapter 6.
1.7 Focus of This Thesis
Ontology Bridge, described in [1], used feature extraction, upper ontologies and support
vector machines [19] to align ontologies. Scalability was done through a branch and bound
algorithm which sacrificed recall for speed.
Parallel Ontology Bridge, described in this thesis, uses the same approach to aligning
ontologies. However, instead of a branch and bound technique, Parallel Ontology Bridge
uses parallelism to increase speed of execution without sacrificing alignment quality.
Ontologies may have many types of relationships within them. Some items in ontolo-
gies are stated to be equivalent to each other; these are like synonyms. Non-equivalence
2 http://aws.amazon.com/hpc-applications/
relationships include all others, such as hypernymy, hyponymy, antonymy or relationships
that are ontology specific.
The algorithm described here operates on non-equivalence relationships. It can be
expanded to equivalence relationships with no loss of generality.
Stoutenburg’s work also described matching relationships defined in the ontologies.
For simplicity, these ontology-defined relationships were omitted in this work; only hyponymy relationships were used.
1.8 Background and Organization of Thesis
The work described in this thesis is a novel and practical approach to scalable ontology
alignment in the biomedical domain. First, the motivation for this work will be explained,
then the state of the art will be reviewed. Finally, the novel work will be described. The approach described in this thesis provides a scalable, fast and accurate ontology alignment
technique. This will be illustrated by examples in the biomedical domain.
CHAPTER 2
Motivation
There are many purposes for ontology alignment. According to the seminal source for
this topic, [7], some of the most interesting relate to better reasoning over ontologies and
integrating disparate data sources. Logic programming and artificial intelligence require a
good model of a domain to work effectively. As such, they are limited by the accuracy of the
ontologies they use. Better ontologies make logic programming and artificial intelligence
effective and usable.
More complete ontologies enable other applications. Some examples of these ap-
plications are machine translation, strong artificial intelligence and ontology extraction.
Ontology alignment also enables interoperability of independently developed systems.
By providing a shared semantic vocabulary, ontology alignment allows for the trans-
lation of data and information between systems. A simple example of this is two systems
with different schemas. If the schemas are matched, they can interoperate by translating
the appropriate fields.
Because ontology alignment is generally an O(n²) problem, scaling ontology align-
ment is important [1]. As described in Chapter 5, the alignment process can be easily
broken into parallel pieces. Biomedical ontologies are especially affected by scaling, since
they tend to be large. Some biomedical ontologies and their sizes are given in Table 2.1.
If a scalable approach to ontology alignment is not found, they cannot be aligned in a
reasonable amount of time [20].
Alignments must be high quality to be applied effectively. If mistakes are included
in the alignment, it will cause incorrect conclusions to be reached. Performance, scalability
and precision are some of the key measures of quality for an ontology alignment algorithm.
Performance is important because the ontologies must be aligned in a reasonable amount
of time.
Table 2.1: Biomedical Ontology Sizes

Ontology              Size
Influenza Ontology    1,368
Mammalian Phenotype   130,268
Gene Ontology         878,379
NCI Thesaurus         1,758,354
Specific use cases of ontology alignment include, but are not limited to:
• Catalog integration [7], offering products from different vendors on a single portal,
• Data integration [21], combining data sources into a single view for consumption,
• Data extraction from biomedical texts using natural language processing [5], for in-
stance, building biomedical ontologies from research papers or textbooks,
• Peer-to-peer information sharing [22], such as between online agents which solve
problems autonomously,
• Inference on biomedical information [23], like bioinformatics prediction,
• Data exchange among biomedical applications [24], for example health care databases,
• Computer reasoning with biomedical data [5], such as hypothesis generation,
• Decision support [25], such as automatic diagnostics,
• Federated databases[7], integrating multiple databases from different enterprises, and
• Encyclopedic knowledge [5][7], for example annotated Wikipedia and DBpedia.
The number of biomedical researchers interested in biomedical ontology has been rapidly
expanding and it is difficult to make sense of the new biomedical information available [5].
Practitioners hope that ontology alignment tools will be incorporated into BioPortal [26], a
website with many ontologies, and similar resources.
Ontologies could enable the large majority of data produced by the spectrum of life
sciences to be easily retrieved and understood by those working in these fields [27][5].
These benefits are similar to what happened in chemistry after the introduction of the pe-
riodic table. Scientists used the same symbols and categorizations for the elements so
researchers could understand the experiments. With ontologies, biomedical information
can similarly be understood through a universal model, allowing measurement and prediction
across biomedical sub-domains. Ontology alignment is one of the tools needed to handle
this huge task [5].
CHAPTER 3
Survey of the State of the Art in
Ontology Alignment
3.1 Developments in the State of the Art
In the past few years there have been great advances in ontology alignment. Scalability
has dramatically improved. Various research groups are increasing their efforts to align
biomedical ontologies [20].
Previously, many automatic ontology matchers took hours or days to align larger ontologies; in contrast, modern systems, such as Falcon-AO, take only minutes to complete
[6]. Fifteen different research groups took part in the OAEI1 2012 large scale ontology
tests, more than twice the number of total participants in the 2004 OAEI [28][29].
Only recently have larger scale tests, those with tens or hundreds of thousands of items
for ontology alignment, been created. This is still an area of active research and
development [29].
Over the years both the accuracy and performance of large-scale ontology alignment
have improved [28]. This is crucial because the size and number of ontologies used by
biomedical researchers continues to increase rapidly [28].
3.2 Approach to Comparison and Analysis
Ontology alignment is similar to information retrieval in that the quality is subjectively
measured; there is no perfect mathematical definition of correctness [28]. Because of this,
1 The Ontology Alignment Evaluation Initiative – an annual “competition” for ontology alignment algorithm researchers, described in [28]
there can be multiple correct alignments for two given ontologies. For the purposes of
the OAEI, it is assumed that there exists a unique and ideal reference alignment between
any two ontologies [28]. Ontologies have distinct and sometimes contradictory ways of
classifying data [30]. This makes it a challenge to align ontologies, especially when they
are designed for different purposes. For example, WordNet [31] does not relate the words
“renal” and “kidney” directly, but uses a special relationship called “pertainymy” to connect
them [30].
Various ontology alignment techniques have been tested through the Ontology Align-
ment Evaluation Initiative since 2007. Recently they have started looking at scalability
[28]. Table 3.1 shows a comparison of precision, recall and runtime for some of the algo-
rithms compared below. These algorithms generally ran on 2.0 to 3.1 GHz processors with
2 to 4 GB of RAM. These were run against the Adult Mouse Anatomy (2744 classes) [32]
and the NCI Thesaurus (3304 classes) [33] except when otherwise noted [20].
3.3 Comparing Approaches
The approaches described in Table 3.1 vary from heuristic approaches to machine learning.
Some of them are very well described, while others have details which are opaque in the
published literature [20]. In addition to the variation in approach, the ontologies and rela-
tionships used to test alignment also varied. This table gives a rough understanding of the
diversity of results, performance and approaches in ontology alignment.
Most of the state of the art systems take less than an hour to do equivalence relationship alignments on the OAEI O(3,000 · 3,000) test case. Precision tends to be much higher
than recall, ranging from 0.77 to 0.99. Recall is around 0.52 to 0.77, depending on the algorithm.
3.4 Comparison of Systems
The following subsections describe several ontology alignment projects selected for their
high accuracy, simplicity or uniqueness of approach. They represent a summary from the
state of the art for biomedical ontology alignment.
Table 3.1: Results From Literature

Project                          Precision  Recall  Runtime     Ontology Size
LOOM                             0.99       0.65    ?           O(3,000 · 3,000) [34]
AROMA                            0.775      0.678   ~1 minute   O(3,000 · 3,000) [20]
SOBOM                            0.952      0.777   19 minutes  O(3,000 · 3,000) [20]
Falcon AO                        0.964      0.591   12 minutes  O(3,000 · 3,000) [35]
Stoutenburg (super)              0.84       0.55    96 hours    O(1,000 · 1,000) [1]
Stoutenburg (sub)                0.93       0.54    96 hours    O(1,000 · 1,000) [1]
Stoutenburg (ontology defined)   0.62       0.52    96 hours    O(1,000 · 1,000) [1]
3.4.1 LOOM
Described in [34], LOOM is a simple approach to ontology alignment that uses string
normalization and string comparison, producing highly precise results with good recall.
Remarkably, this seemingly naive approach provides better results than some approaches
that use machine learning. This is likely due to the ontologies selected, the Adult Mouse
Anatomy and a part of the NCI Thesaurus. These ontologies have been going through a
process of label harmonization which increased the correlation of concepts within them2.
Its simplicity and precision make this an interesting and practical approach for align-
ing biomedical ontologies. The LOOM approach compares ontology classes using string
comparison in these two steps:
1. Normalize ontology class titles by removing all delimiters from strings (spaces and
punctuation) and normalize case.
2. Match strings approximately. Allow for a mismatch of no more than one character in
strings with length greater than four and no mismatches for shorter strings.
This heuristic (2) can be replaced with an exact string match to boost precision. In specific
instances precision was much higher than the OAEI reference alignment. Precision is the
strength of this algorithm.
2 http://oaei.ontologymatching.org/2012/anatomy/
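As a rough illustration of this two-step procedure, the following sketch (hypothetical names; it treats a mismatch as a single substituted character, whereas the original heuristic may also tolerate insertions or deletions) normalizes two class labels and compares them:

import re

def normalize(label):
    # Step 1: remove delimiters (spaces and punctuation) and normalize case.
    return re.sub(r"[^a-z0-9]", "", label.lower())

def approx_match(label_a, label_b):
    # Step 2: allow at most one mismatched character for normalized strings
    # longer than four characters; shorter strings must match exactly.
    a, b = normalize(label_a), normalize(label_b)
    if len(a) != len(b):
        return False
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches <= (1 if len(a) > 4 else 0)

print(approx_match("Platelet Activation", "platelet-activation"))  # True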
3.4.2 AROMA
AROMA [36] compares vocabularies used to describe ontologies through statistical analy-
sis. It measures the number of words used to describe concepts. If a concept, A, is described
with a subset of the words used to describe B, that implies that A is a more generic type of
B.
These relations are found through association mining [37], an unsupervised machine
learning algorithm that finds association rules. Association rules are inferences about what
concepts are found together and which concepts imply other concepts. For example, an
association rule of the form
{French fries, shakes} ⇒ Hamburger
states that when people buy French fries and shakes at a restaurant they also buy hamburgers.
These association rules are filtered using implication intensity, a measure of the num-
ber of expected and observed counter-examples [36]. This method is capable of finding
hyponymy and hypernymy relationships.
This method is incredibly fast; however, it does not have outstanding performance
for either precision or recall. It is also one of the few methods besides [38] and [1] that
describes hyponymy and hypernymy matching.
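The core subset-of-descriptions intuition can be sketched as follows; this is an illustrative simplification that omits the association-rule mining and implication-intensity filtering AROMA actually performs:

def more_generic(description_a, description_b):
    # If every word describing concept A also appears in the description of
    # concept B, treat A as the more generic concept (a hypernym candidate).
    return set(description_a.lower().split()) <= set(description_b.lower().split())

print(more_generic("platelet activation", "abnormal platelet activation"))  # True
print(more_generic("abnormal platelet activation", "platelet activation"))  # False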
3.4.3 SOBOM
SOBOM [38] uses a series of steps for alignment. Firstly, “anchor concepts” are found
between the ontologies. These are concepts that have precise equivalence and are not leaf
nodes. Using these anchors as roots, sub-ontologies are segmented and aligned. These
sub-ontologies are matched using a similarity propagation graph.
Secondly, additional semantic information is used to align non-superclass relation-
ships. The details of these non-superclass relationship matches are not clearly explained or
cited.
This method gives impressive results. The details of how this is accomplished are not
elaborated in the paper. Whether this is a domain-specific approach or one that can be used in other
areas is not known. This is one of the few methods besides [36] and [1] that describes
hyponymy and hypernymy matching.
3.4.4 Falcon AO
Falcon AO [15] combines graphical and linguistic methods for matching. First, good alignments are created for some objects; these are then expanded to match other items.
This allows for a partitioning approach to scalability, reducing the number of compar-
isons as large ontologies are aligned.
Some work has been done on using the VDoc algorithm of Falcon AO with a MapRe-
duce framework to increase scalability further [39].
3.4.5 Stoutenburg Ontology Bridge
The Stoutenburg Ontology Bridge algorithm [1] uses a combination of support vector ma-
chines (SVMs) [40], upper ontologies and natural language processing.
Pairs of concepts between the ontologies are enumerated and compared. Approxi-
mately two dozen features are extracted from each pair. These features are compared by
using a radial-basis function SVM [41] which infers what relations exists between the con-
cepts in the pair. The relationships supported are hyponymy, hypernymy and ontology
defined relations.
Using SVMs has several drawbacks. They are relatively slow and there are only a few
implementations available. Also, they require training and parameter tuning. The results
from an SVM cannot be explained. A numerical value is provided and normalized, but it
doesn’t necessarily reflect the quality of matches.
Upper ontologies enable better matching by mapping the meaning of labels to deep
semantic information. This allows for “common sense” reasoning about the ontologies as
they’re being aligned. Finding relations in these upper ontologies is often slow because of
their size and complexity. The OpenCyc project [10] software is especially complex and
memory intensive.
3.4.5.1 Branch and Bound Approach
In addition to the primary approach above, [1] describes a branch and bound algorithm for
scaling ontology alignment. This approach can trade recall for time so it allows for high
precision alignments with reduced execution time. This branch and bound algorithm relies
on the ontologies being well structured in order to select ontology pairs to compare and
align.
3.5 Survey Results
There still are only a few ontology alignment systems that handle non-equivalence relation-
ships, specifically Ontology Bridge, SOBOM and AROMA. The methods used for align-
ment have a large diversity of approaches. Some systems, such as Falcon-AO, combine
multiple approaches, which seems to give better results.
CHAPTER 4
Definitions
This chapter defines the technical terms used in this thesis.
4.1 Performance Measures
Ontology alignment uses precision, recall and F-measure to determine quality [42]. Algo-
rithms produce results which are compared against a “reference alignment”. A reference
alignment is an ontology alignment which has been verified to be correct.
In Figure 4.1, the filled in dots are all relevant pieces of information while the retrieved
information is within the oval. Errors are shown in gray.
Recall Measures how many of the relevant relationships the algorithm found. In Figure
4.1, it is denoted by R, the white oval area divided by the gray area to the left.
Precision Measures whether the alignments found are relevant. In Figure 4.1, it is denoted by
P, the white oval area divided by the gray oval area.
F-Measure Measures overall quality. It is the harmonic mean of precision and recall:
F = 2 · (precision · recall) / (precision + recall).
Figure 4.1: Precision and Recall. Public domain image from WikiMedia.
Defining these mathematically gives
Recall(A, R) = |R ∩ A| / |R|
and
Precision(A, R) = |R ∩ A| / |A|
where A is the set of alignments detected from the ontology alignment system (the dots and
circles within the oval in Figure 4.1) and R is the set of true alignments (the black dots on
the right hand side in Figure 4.1).
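A minimal sketch of these three measures, computed over sets of aligned concept pairs; the pairs shown are purely illustrative and it is assumed that at least one alignment was found and one is correct:

def evaluate(found, reference):
    # found:     set A of alignments produced by the system
    # reference: set R of alignments in the reference alignment
    true_positives = len(found & reference)
    precision = true_positives / len(found)
    recall = true_positives / len(reference)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

found = {("platelet activation", "abnormal platelet activation"),
         ("cell activation", "behavior phenotype")}
reference = {("platelet activation", "abnormal platelet activation")}
print(evaluate(found, reference))  # (0.5, 1.0, 0.666...)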
4.2 Tools
SVM Support Vector Machine [43]. This is a machine learning algorithm that creates a
high dimensional space based on many features. Using training data, it creates a
maximum margin hyperplane between the classes it is to discriminate between. For
this thesis, a two-class support vector machine is used. An SVM reports the distance
from the separating margin between the two classes; in some cases this can be used
to gauge how likely it is to be correct. For this thesis, the distance from the margin is
discarded during classification. The report from the SVM is only used to determine
which side of the margin the class is on.
Upper Ontology An ontology that describes abstract or broad terms that can be used
across contexts.
4.3 Function Primitives
map A function that takes in another function and executes it on all the items of a sequence.
map(f, [a, b, c, ...]) = [f(a), f(b), f(c), ...]
reduce A function that takes in another function and executes it on a sequence, taking the
previous result as the operand. reduce(f, [a, b, c, ...]) = f(a, f(b, f(c, ...)))
product A function that produces the Cartesian product of two sequences.
product([a, b, c, ...], [A, B, C, ...]) = [(A, a), (A, b), (A, c), (B, a), ...]
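In Python, the implementation language of this thesis, these primitives correspond to the built-in map and the standard-library functools.reduce and itertools.product; a brief illustration (note that itertools.product yields pairs in first-argument order, unlike the listing above):

from functools import reduce
from itertools import product

squares = list(map(lambda x: x * x, [1, 2, 3, 4]))  # [1, 4, 9, 16]
total = reduce(lambda a, b: a + b, squares)          # 30
pairs = list(product(["a", "b"], ["A", "B"]))        # [('a', 'A'), ('a', 'B'), ('b', 'A'), ('b', 'B')]
print(squares, total, pairs)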
4.4 Scaling Nomenclature
parallelism Running multiple calculations concurrently on separate processors to reduce
execution time.
parallelization Changing operations to work in parallel.
4.5 Ontology Nomenclature
Ontology An ontology consists of a vocabulary that describes a specific domain and the
definitions of the terms in that vocabulary in a formal manner [7]. Ontologies model
entities, assign their significances and group them based on relationships [44][30].
Ontology Alignment The process of creating a set of correspondences between ontologies
is called ontology alignment [7]. Concepts in each ontology are related to one another
by equivalence, hyponymy, hypernymy or other relations. This process is called
schema matching when it is done with format schemas instead of ontologies [7].
Hyponymy The relationship of being more specific: “Dog is a hyponym of animal.”
Hypernymy The relationship of being more general: “Animal is a hypernym of dog.”
MP Mammalian Phenotype Ontology1. This ontology covers mammalian phenotypes.
Phenotypes are the physical attributes of organisms. Most of its data is based on
experimental results from mice and rats. These experiments are on organisms that
have been selectively bred, genetically engineered or mutated to show certain traits.
GO Gene Ontology2. This ontology covers general information about gene expression,
metabolism, and cellular processes. It can be used to annotate results from bioinfor-
matic experiments. Much of the information in the Gene Ontology has come from
comparing the genomes of various organisms.
1 http://www.informatics.jax.org/searches/MP_form.shtml
2 http://www.geneontology.org/
CHAPTER 5
The Strategic Approach
of Parallel Ontology Bridge
This chapter discusses the approach to ontology alignment used in the work of this thesis.
It covers the experimental data, the algorithms and the implementation. Comparison with
Ontology Bridge from [1] is emphasized.
Parallel Ontology Bridge has the same general form as Ontology Bridge. Features are
extracted, SVMs are used and non-equivalence relations are detected. The main difference
is that Parallel Ontology Bridge executes these steps across multiple, concurrently running
jobs.
5.1 Test Data
The biomedical reference alignments generated for [1] were used in development and test-
ing. They provided the training data used throughout this thesis and were used to determine
the quality of alignment. These test cases are relatively small, with around 10 concepts
from each ontology. The hyponymy reference alignments were used for the testing of this
work. There are 10 biomedical test-cases covering various concepts and the relationships
between them. These concepts are as follows:
5.1.1 Platelet Activation
The conditions under which platelets cohere to one another and related activities, e.g. scab-
bing. This can be triggered by various events, such as the platelet encountering collagen or
other proteins. As a platelet activates, it changes shape to a more amorphous form, adheres
to other platelets and promotes coagulation reactions. This test case consists of 10 concepts
from the GO ontology such as cell activation, blood coagulation and platelet activation and
4 concepts from the MP ontology such as abnormal platelet physiology and hematopoietic
system phenotype.
5.1.2 Mannose Binding
The process of certain proteins binding to the surfaces of pathogens. Deficiency of mannose
binding is associated with higher rates of infection. This test case consists of six concepts
in GO relating to binding and 6 concepts in MP relating to immune system and protein
physiology.
5.1.3 Immune System
The system of the body which fights pathogens and foreign elements. This test had eight
concepts related to immune response in GO while the seven MP concepts relate
to immune system physiology, response and phenotype.
5.1.4 Phenylalanine Conversion
The processes of converting the amino acid phenylalanine into other amino acids, such as
tyrosine. For this test, 18 concepts from GO related to metabolic processes were selected,
while 6 concepts related to activity were selected from MP. These require semantic features
to appropriately match.
5.1.5 Bone Remodeling
The process of mature bone tissue being removed from the skeleton and new tissue being
formed in its place. This is done, respectively, by osteoclasts and osteoblasts. This test had
9 concepts related to regulation, resorption and remodeling in GO and 6 in MP related to
remodeling, physiology and increase/decrease of resorption.
5.1.6 Bone Marrow
The tissue inside of bones which creates various cells and components of blood. This test
case has 5 concepts in GO related to development and morphogenesis and 4 concepts in
MP related to development and morphology.
5.1.7 Osteoblast Differentiation
How the cells that create bone tissue are created. This test has 7 concepts from GO about
osteoblast differentiation, ossification and regulation of osteoblast differentiation. Only one
concept from MP was selected, abnormal osteoblast differentiation.
5.1.8 Osteoclast Differentiation
How the cells that destroy bone tissue are created. This test has 4 concepts from GO about
regulation of osteoclast differentiation and osteoclast differentiation. One concept from MP
was used, abnormal osteoclast differentiation.
5.1.9 Behavior
The actions of an organism in response to stimulus. The concepts in both ontologies were
generic, 4 concepts in GO related to behavior and regulation of behavior and 2 concepts in
MP: abnormal behavior and behavior phenotype.
5.1.10 Circadian Rhythm
Processes that have a daily cycle. In GO these concepts were circadian rhythm, regulation
of circadian rhythm and response to external stimulus. In MP the single concept selected
was abnormal circadian rhythm.
5.2 Attempts to Enhance Alignment
In addition to the primary thrust of increasing scalability, this thesis did some work on
novel ways to enhance alignment quality. These results are described in this section.
5.2.1 Parallel Human Computation
In human computation, sometimes called crowd-sourcing, a large number of people solve a
problem by connecting and collaborating through the Internet. A problem is broken down
into pieces that can be done in small increments by many participants.
This approach was attempted for ontology alignment. The Mechanical Turk1 service
of Amazon Web Services was used. Mechanical Turk offers a cheap crowd-sourcing plat-
form for companies and individuals to manage tasks that can be delegated to users across
the world. The labels and descriptions of potential matches were given to independent
participants who selected what relationships existed between the pairs.
These experiments gave unsatisfactory results. The results were much worse than
random selections of alignments. Only one participant out of twenty may get an alignment
correct. This is likely due to lack of expertise in the biomedical domain for Mechanical
Turk participants. Unlike in simple domains, the population does not converge on a correct
result. The details of this experiment are given in Table 5.1. Participants took less than a
minute on average to determine results, and were given a 1 cent bounty. These tasks were
based on the MP and GO ontology test cases from [1], which had full results to compare
against.
There are several approaches that may help with this. Using confusion matrices [45],
an approach to noisy classifiers, has had good results in other domains. Breaking the prob-
lem into very small tasks, which are used as input to an SVM, may be helpful. This is
similar to the approach used in [46] for crowd-sourced elaborate editing tasks and [47]
which uses crowd-sourcing for training an SVM.
5.2.2 Information Entropy and Morpheme Based Extraction
An information entropy and morpheme based extraction was also attempted. Inspired by
[34], ontology concept labels were compared, weighing the characters by their information
entropy. Based on this entropy, they were grouped into “chunks” that were compared.
Words in titles were broken into chunks based on information entropy. Entropy measures
1 https://www.mturk.com/
Table 5.1: Human Computation Experiment

Variable                  Value
Number of participants    200
Average time per task     47 seconds
Bounty per match          $0.01
Matches per participant   1
Total number of pairs     200
how informative characters or strings of characters are in predicting the rest of the title.
This was not very successful, as can be seen in Table 5.2 on the following page.
The baseline results are F-Measures for the test cases used on the approach described
by Stoutenburg [1]. Top 4 Stems shows the performance of creating a feature vector based
on the top 4 most distinct stems extracted from the titles and descriptions. Chomp shows
extractions of morphemes at various cut-off levels.
? is shown for the cases where either the retrieved documents or relevant documents
retrieved were zero. The baseline is the F-Measure based on the existing features, without
the addition of the morpheme based extraction.
5.3 Summary of Parallel Ontology Bridge
The non-optimized approach of Ontology Bridge described in [48] can be summarized as
follows:
1. For the two ontologies being aligned, the classes were paired from each ontology.
This creates a Cartesian product of classes between the two ontologies.
2. For each pair of classes, features were extracted.
(a) Features include information in upper ontologies, linguistic features of labels
and structural features of ontologies. These features are all numerical in nature.
(b) These features were normalized into feature vectors for use in a radial-basis
function SVM.
Table 5.2: Chunking (segmenting strings based on information entropy) Results

Test Case                   Baseline  Top 4 Stems  Chomp Count 1  Chomp Count 2  Chomp Count 3
Platelet Activation         0.707     ?            0.717          0.513          0.717
Mannose Binding             0.685     ?            ?              ?              ?
Immune System               0.623     0.639        ?              ?              ?
Phenylalanine Conversion    0.424     ?            0.164          0.448          0.448
Bone Remodeling             0.709     ?            ?              ?              ?
Bone Marrow                 0.692     ?            ?              0.513          ?
Osteoblast Differentiation  0.707     ?            ?              ?              0.717
Osteoclast Differentiation  0.707     ?            ?              ?              ?
Behavior                    0.700     ?            0.710          ?              0.710
Circadian Rhythm            0.700     ?            0.513          0.710          0.710
3. Based on these features, pairs that have relationships were selected using an SVM.
The SVM has two classes for each relationship: that the relationship exists or that it does
not exist.
Parallel Ontology Bridge, the implementation of this thesis, re-implements the approach
above with the additional enhancement of running feature extraction and pair selection with
an SVM on multiple processors. This dramatically increases scalability. For additional
performance, some feature calculations are optimized by storing precomputed information
in lookup tables.
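A minimal sketch of the SVM classification step using scikit-learn, the library used in the implementation; the feature vectors and labels below are illustrative, not taken from the actual training data:

from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Each row is the feature vector for one (GO class, MP class) pair;
# each label is 1 if the relationship holds, 0 otherwise (made-up values).
train_features = [[1, 4, 0, 0, 1], [0, 0, 0, 0, 0], [2, 6, 1, 1, 1], [0, 1, 0, 1, 0]]
train_labels = [1, 0, 1, 0]

scaler = MinMaxScaler()              # normalize features into the range [0, 1]
classifier = SVC(kernel="rbf")       # radial-basis function SVM
classifier.fit(scaler.fit_transform(train_features), train_labels)

candidate = [[1, 3, 0, 0, 1]]        # features extracted for a new pair
print(classifier.predict(scaler.transform(candidate)))  # e.g. [1] -> relationship exists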
5.3.1 MapReduce
Parallel Ontology Bridge is an implementation of the Ontology Bridge algorithm which
runs in parallel over multiple processing units. It does this by using MapReduce. MapRe-
duce is described in [18].
The name MapReduce comes from the two concurrency primitives used during exe-
cution. The function map takes two parameters, a function and a sequence. It runs the
function on every item in the sequence. The function map can be executed on multiple
processors with little effort. Input data is partitioned, scheduled and executed on a number
of workers. This is carried out by a master process which manages workers and assigns
tasks. Each worker receives input data and processes it using the function passed into map.
Similarly, reduce also has parameters that are a function and a sequence. But unlike map,
reduce returns a single value by executing the function over the pairs of the sequence.
Here MapReduce is illustrated with an example analysis of a sum of squares expres-
sion. The sum of squares is used when calculating various quantities in statistics. In com-
pact notation it is
S = Σ_i^N x_i².
Calculating this by hand can be done by “unrolling” the summation and doing each
operation in sequence:
Σ_i^N x_i² = x_1² + x_2² + x_3² + ... + x_N².
This expression has a few interesting properties. One, each squaring of the terms is independent of all the others. We could give each x_i² term to a separate processor and then add
them together.
The other property of note is that items can be added in any order. We could divide
the problem into pieces such as
Σ_i^N x_i² = (x_1² + x_2² + ... + x_{⌊N/2⌋}²) + (x_{⌊N/2⌋+1}² + ... + x_N²).
This means the problem can be broken into small pieces and executed concurrently
and reassembled easily with no change in the calculations that have occurred. MapReduce
does this by rewriting the problem in terms of two higher order functions. Higher order
functions are functions which take functions as arguments. For the squaring, rewriting the
sequence [x_1², x_2², x_3², ..., x_N²] with
f(x) = x²
gives
[x_1², x_2², x_3², ..., x_N²] = [f(x_1), f(x_2), f(x_3), ..., f(x_N)]
= map(f, [x_1, x_2, x_3, ..., x_N]),
an expression in terms of map.
Similarly, the addition can be rewritten in terms of reduce. With
g(a, b) = a + b,
x_1 + x_2 + x_3 + ... + x_N = g(x_1, g(x_2, g(x_3, g(..., x_N))))
= reduce(g, [x_1, x_2, x_3, ..., x_N]).
If an algorithm can be written in this form, using map and reduce, then it can
be trivially parallelized across many processors with minimal contention between shared
resources.
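As a concrete illustration, the sum of squares above can be expressed this way in Python, with the map step distributed over a pool of worker processes (a sketch, not code from the thesis implementation):

from functools import reduce
from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == "__main__":
    values = list(range(1, 1001))
    with Pool(processes=4) as pool:
        squared = pool.map(square, values)       # the map step, run in parallel
    total = reduce(lambda a, b: a + b, squared)  # the reduce step
    print(total)                                 # 333833500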
5.4 Aligning Ontologies with MapReduce
Ontology alignment can be thought of in the following manner (see Figure 5.1). Each
ontology is represented as a graph. Each concept is represented as a vertex in a graph. A
vertex in Ontology A is matched to a vertex in Ontology B. When a new relationship exists
between these two vertices, it is represented as an edge.
Figure 5.1: Ontology Alignment
Rewriting Ontology Bridge in pseudo-code gives Algorithm 5.1. Algorithm 5.1 can
be explained as follows:
Step 1 For each pair of classes between the two ontologies,
Step 2 extract features from them.
Step 3 If there is a relationship based on these features, add a match between these
classes.
This approach can be rewritten using higher order functions, such as map (Algorithm
5.2 where align_pair is described in Algorithm 5.3).
Algorithm 5.1 Naive Matching
for (c1, c2) in product(o1, o2):        # Step 1
    features = get_features(c1, c2)     # Step 2
    if has_relation(features):          # Step 3
        add_match(c1, c2)
Algorithm 5.2 Matching with higher-order functions
map(align_pair, product(o1, o2))
Algorithm 5.3 align_pair implementation
def align_pair(pair):
    c1, c2 = pair
    features = get_features(c1, c2)
    if has_relation(features):
        add_match(c1, c2)
In this work, map was implemented as a pool of processes, called “jobs”, running
on separate processors, thereby allowing parallelization of ontology alignment. These jobs
were given batches of 100 ontology class pairs to process at a time. These batches were
collected in a “master” process which pushed the jobs to waiting “worker” processes through an
inter-process communication queue. Results were similarly fed back to the master through an
inter-process communication queue.
get_features and add_match were potential bottlenecks for parallelization be-
cause they made use of shared resources.
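A simplified sketch of this arrangement follows. It stands in for the queue-based master/worker exchange with a process pool, and the get_features and has_relation functions are stubs for illustration only; the actual implementation is listed in Appendix A.

from itertools import product
from multiprocessing import Pool

BATCH_SIZE = 100

def get_features(c1, c2):
    # Stub standing in for the real feature extraction (linguistic, structural
    # and upper ontology lookups); here it only checks for a shared last word.
    return [int(c1.split()[-1] == c2.split()[-1])]

def has_relation(features):
    # Stub standing in for the SVM decision.
    return features[0] == 1

def align_batch(batch):
    # Worker job: extract features and classify each pair in the batch.
    return [(c1, c2) for c1, c2 in batch if has_relation(get_features(c1, c2))]

def align(o1, o2):
    pairs = list(product(o1, o2))
    batches = [pairs[i:i + BATCH_SIZE] for i in range(0, len(pairs), BATCH_SIZE)]
    with Pool(processes=16) as pool:     # one worker job per core
        results = pool.map(align_batch, batches)
    # The master process collects and flattens the workers' results.
    return [match for batch in results for match in batch]

if __name__ == "__main__":
    go = ["platelet activation", "cell activation"]
    mp = ["abnormal platelet activation"]
    print(align(go, mp))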
5.5 Architecture
Figure 5.2 shows the architecture. This was based on [1]. Two input ontologies are given
to the system in OWL XML format, pairs are created from these ontologies and distributed
to the worker jobs. These pairs consist of one class from each ontology. Each worker
job extracts feature primitives and determines the relationships that exist using an SVM.
Figure 5.2: Architecture
Finally, these results from all the workers are combined to create a new alignment. The
features extracted look up information in upper ontologies such as OpenCyc and WordNet.
The code is organized into small modules, simplifying the system architecture and
aiding debugging, analysis and modification.
The code as implemented is in Appendix A. This system was run on a machine with
two Intel Xeon E5-2670 Sandy Bridge processors. Each one of these processors has 8
cores, giving a total of 16 processing units available. During execution, 16 worker jobs
were created, with one “master” process coordinating data between them. This gave the best
performance, likely due to the master node taking advantage of the IO concurrency.
5.6 Comparison to Ontology Bridge
Scalability was evaluated in the same manner as [1]. Subsets of the ontologies being aligned
were selected randomly, sampling 100, 500 and 1,000 classes from each ontology. The
execution time was measured as these tests were run with various numbers of concurrent
jobs. These results were analyzed to determine the scalability of the system in Chapter 6.
The performance (i.e. time taken) of Ontology Bridge, Ontology Bridge with Branch
and Bound and Parallel Ontology Bridge is shown in Figure 5.3. Ontology Bridge’s time
increases quadratically with the number of class pairs, while Ontology Bridge with Branch
Figure 5.3: Performance of Ontology Bridge variants at various numbers of ontology pairs
and Bound does not grow as fast. The curve for Parallel Ontology Bridge is almost
flat; although it also grows quadratically, it grows at a much slower rate.
5.7 Feature Extraction
As illustrated in Figure 5.2, classes are extracted from the ontologies. Features are
extracted from these class entries and fed into a support vector machine. OpenCyc is accessed
through a cache, significantly reducing the time needed to use the OpenCyc ontology when
creating features. Software timers and calculations of alignment metrics are also integrated.
Various graphs and plain-text outputs of results were generated for diagnosis.
Features are turned into a vector that is normalized during SVM processing.
Table 5.3a shows an example ontology pair. The row “Origin Ontology” contains the ontology
the class came from, the row “Class Label” contains the label given to the class, and
                   Class A                                       Class B
Origin Ontology    GO                                            MP
Class Label        negative regulation of platelet activation    abnormal platelet activation
Parent Classes                                                   abnormal platelet physiology

(a) Ontology Pair for Feature Extraction

Primitive Name            Value
count_opencyc_synonyms    1
count_wordnet_synonym     4
has_matching_labels       0
has_same_first_word       0
has_same_last_word        1

(b) Feature Results

Table 5.3: Example Feature Extraction
the row “Parent Classes” is a comma-delimited list of superclasses of the class.
Table 5.3b shows the features extracted for these two classes. An example feature extraction
would take the ontology pair shown in Table 5.3a and turn it into the vector [1, 4, 0, 0, 1]
using the features in Table 5.3b.
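The sketch below illustrates this step with two simplified primitives; the real primitives are those listed in Table 5.4, and the labels come from Table 5.3a.

# Sketch of turning one class pair into a feature vector; the two primitives
# here are simplified stand-ins for the full set in Table 5.4.
def has_same_first_word(label1, label2):
    return int(label1.split()[0] == label2.split()[0])

def has_same_last_word(label1, label2):
    return int(label1.split()[-1] == label2.split()[-1])

label_a = "negative regulation of platelet activation"   # class from GO
label_b = "abnormal platelet activation"                  # class from MP

primitives = [has_same_first_word, has_same_last_word]
vector = [p(label_a, label_b) for p in primitives]
print(vector)   # [0, 1], matching the last two entries of Table 5.3b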
The individual vectors are made up of binary values, integers or real numbers. They
are all normalized during SVM training to be in the range [0, 1]. It is relatively easy to
create additional features. Table 5.4 has the list of features used for this thesis. To allow for
valid comparisons, these features are the same as used in Stoutenburg’s work [1]. This re-
quired porting from Java to the Python programming language and some tuning by trial and
error to find the appropriate normalization schemes. These features were originally created
based on analysis of biomedical ontologies assisted by biology and medical experts. These
features capture patterns of relationships between biomedical ontologies. For example,
labels that end with the same words, such as “platelet activation” and “abnormal platelet
activation”, are likely related by hyponymy, and labels that contain synonyms of one another
are likely to be equivalent or otherwise related. Since these features are consistent with the
heuristics developed by these experts, they are a good choice for this domain.
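As an illustration only (the exact normalization scheme was tuned by trial and error, as noted above), min-max scaling with scikit-learn maps each feature column into the [0, 1] range:

# Illustration only: one way to scale each feature column into [0, 1]; the
# thesis selected its own normalization scheme by trial and error.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1, 4, 0, 0, 1],
              [0, 2, 1, 1, 0],
              [3, 0, 0, 1, 1]], dtype=float)

X_scaled = MinMaxScaler().fit_transform(X)   # every column now lies in [0, 1]
print(X_scaled)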
Ontology Bridge and Parallel Ontology Bridge include linguistic features, such as the
number of words in the labels; structural features, for example how many child concepts a
class has; and features looked up in upper ontologies, such as how many synsets the words in
two labels share in WordNet or how many concepts they share in OpenCyc.
5.8 Implementation
The software was implemented in Python using the scikit-learn2 and joblib3 libraries. The
input ontologies were in Web Ontology Language (OWL) format. This is a W3C standard-
ized format for ontologies that can be serialized into XML and other text formats. Two
approaches were used to access upper ontologies. WordNet was accessed through the
natural language toolkit (nltk4), further described in [50]. OpenCyc, which consists of an
upper ontology in OWL format and a reasoning engine, was accessed through its ontology
file. The ontology was extracted into a graph data structure with references to the necessary
relations put into a fast lookup table. The reasoner was not used in this implementation.
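A minimal sketch of a WordNet lookup through nltk is shown below (the WordNet corpus must be downloaded first, e.g. with nltk.download); the word pair is illustrative.

# Sketch of a WordNet lookup via nltk, as used by the synonym-counting
# primitives; requires the WordNet corpus (nltk.download('wordnet')).
from nltk.corpus import wordnet

def count_shared_synsets(word1, word2):
    # number of WordNet synsets the two words have in common
    return len(set(wordnet.synsets(word1)) & set(wordnet.synsets(word2)))

print(count_shared_synsets("car", "automobile"))   # >= 1: they share car.n.01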
The scikit-learn library provided the necessary interface to libSVM, a well-performing
implementation of SVMs. It includes grid search over the parameters of the SVM kernel,
which was used to determine these constants during training.
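A minimal sketch of this kind of parameter search with scikit-learn follows; the parameter grid mirrors the one in Listing A.1, the toy data is a placeholder, and newer scikit-learn versions expose GridSearchCV from sklearn.model_selection rather than the sklearn.grid_search module used in the listings.

# Sketch of grid search over SVM kernel parameters; X and y are toy placeholders
# and the grid mirrors Listing A.1 (older scikit-learn: sklearn.grid_search).
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4], 'C': [1, 10, 100, 1000]},
              {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

X = [[0, 0], [0, 1], [1, 0], [1, 1], [4, 4], [4, 5], [5, 4], [5, 5]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)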
The joblib library provided the MapReduce-style implementation used in this thesis; the
number of jobs used during a run is a parameter to this method. In addition, joblib's
caching was used to reduce the load times of various files and of the training data.
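The caching is a thin wrapper; a minimal sketch with joblib.Memory is shown below (the function and path are illustrative, and newer joblib versions spell the cache-directory argument location rather than cachedir).

# Sketch of joblib's on-disk memoization used to avoid re-parsing files between
# runs; load_training_file and the cache directory are illustrative.
from joblib import Memory

memory = Memory("cache", verbose=0)   # thesis-era joblib: Memory(cachedir="cache")

@memory.cache
def load_training_file(path):
    # the expensive parse runs only on the first call for a given path
    with open(path) as f:
        return f.read().splitlines()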
The software used for this thesis has progressed through several iterations of develop-
ment. The final version is written in the Python programming language.
Test-driven development, writing automated test cases before production code [51],
has been used whenever possible. As such, a suite of unit tests exists for most of the
functionality of this software. Table 5.5 shows the test coverage of the various modules of
the implementation in this thesis.
2 http://scikit-learn.org/stable/
3 http://packages.python.org/joblib/
4 http://nltk.googlecode.com/svn/trunk/doc/howto/wordnet.html
Feature Primitive Name             Description
count_opencyc_hypernyms            Number of words that are hypernyms through OpenCyc
count_opencyc_hyponyms             Number of words that are hyponyms through OpenCyc
count_opencyc_synonyms             Number of words that are synonyms through OpenCyc
count_wordnet_hypernyms            Number of words that are hypernyms through WordNet
count_wordnet_hyponyms             Number of words that are hyponyms through WordNet
count_wordnet_synonym              Number of words that are synonyms through WordNet
has_matching_labels                Whether any labels match
has_opencyc_subclass_synonym       Whether there is a synonym of a concept subclass through OpenCyc
has_opencyc_superclass_hypernym    Whether there is a hypernym of a concept superclass through OpenCyc
has_opencyc_superclass_hyponym     Whether there is a hyponym of a concept superclass through OpenCyc
has_opencyc_superclass_synonym     Whether there is a synonym of a concept superclass through OpenCyc
has_opencyc_synonym                Whether any OpenCyc synonym exists
has_same_beginning                 Whether there is a shared substring at the start
has_same_ending                    Whether there is a shared substring at the end
has_same_first_word                Whether the first word is the same
has_same_label                     Whether the primary label matches
has_same_last_word                 Whether the last word is the same
has_stoilos_similarity             Stoilos similarity metric [49]
has_sub_prefix                     Starts with "sub"
has_superclass_1                   First concept has superclasses
has_superclass_2                   Second concept has superclasses
has_wordnet_hypernym               Has any hypernym in common in WordNet
has_wordnet_hyponym                Has any hyponym in common in WordNet
has_wordnet_subclass_synonym       Whether there is a synonym of a concept subclass through WordNet
has_wordnet_superclass_hyperym     Whether there is a hypernym of a concept superclass through WordNet
has_wordnet_superclass_hyponym     Whether there is a hyponym of a concept superclass through WordNet
has_wordnet_superclass_synonym     Whether there is a synonym of a concept superclass through WordNet
has_wordnet_synonym                Whether any WordNet synonym exists
is_opencyc_hypernym                Whether any OpenCyc hypernym exists
is_opencyc_hyponym                 Whether any OpenCyc hyponym exists

Table 5.4: Features
Module Name   Statements   Missed   Coverage
align         49           15       69%
import_csv    53           28       47%
opencyc       21           9        57%
primitives    245          149      39%
utils         25           13       48%
Total         393          214      54%

Table 5.5: Unit Test Statement Coverage. This shows how much code in each of these modules is covered by automated unit tests.
There are 32 tests which take 1.85 seconds to run. These tests primarily cover the
functionality of feature primitives, with a few tests for metric calculation, training data
parsing and simple scenarios of alignment.
5.8.1 Issues With Java Implementation
Initially, this software was written in Java, taking advantage of several existing libraries to
deal with ontologies. However, this proved untenable for the following reasons:
Jena This library, the industry standard for dealing with ontologies in Java, was unable to
handle the large number of ontology pairs used in this thesis; its performance was
poor. The Python equivalent used in this thesis, rdflib, easily handled large
ontologies.
WordNet This resource was difficult to incorporate with the rest of the Java software. It
uses external files in non-standard ways, which makes distribution in a jar file very
difficult, and its WordNet functions are not idempotent or reentrant, which made multi-
threading very difficult. The Python library did not run into these difficulties.
OpenCyc The author found the OpenCyc reasoning engine extremely slow to use. The
alternative, the OpenCyc OWL file, was very large and cumbersome. This was mitigated
by caching: the OpenCyc OWL file was loaded into a Python dictionary and
serialized into a b-tree-based file, which allowed very rapid lookup.
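A minimal sketch of such a persistent lookup table, using the standard-library shelve module that Listing A.4 employs, is shown below; the concept and ancestor labels are illustrative.

# Sketch of the persistent OpenCyc lookup cache; keys are concept labels and
# values are ancestor labels. The entries here are illustrative.
import shelve

db = shelve.open("opencyc_cache")          # one-time build step
db["Platelet"] = ["BloodCell", "Cell", "BiologicalLivingObject"]
db.close()

db = shelve.open("opencyc_cache")          # fast lookups at alignment time
if "Platelet" in db:
    print(db["Platelet"])
db.close()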
5.9 Parallel Ontology Bridge F-Measure
As described in [1], cross-validation was used on the reference alignments to determine
the F-measure for the hyponymy relationship (shown in Table 5.6). Ten-way cross-validation
separates the test data into 10 disjoint sets and runs tests on each one, using the remaining
9 sets of data as training. These results are generally consistent with the best results from
[1], which had an average F-Measure of 0.68 for the same test cases. This implies that
the implementation of Parallel Ontology Bridge performs as well as Ontology Bridge for
F-measure.
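As a sketch of this kind of evaluation using scikit-learn's cross-validation helpers (the thesis's own folds are driven by the ten test cases, as in Listing A.2), with placeholder data:

# Sketch of ten-fold cross-validated F-measure for an SVM classifier; the
# feature matrix X and labels y are random placeholders, not thesis data.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.rand(200, 5)                          # 200 class pairs, 5 feature primitives
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)     # stand-in hyponymy labels

scores = cross_val_score(SVC(), X, y, cv=10, scoring='f1')
print(scores.mean())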
Based on a cursory analysis of the data, it seems that more generic sub-domains, such
as Behavior and Immune System, have better results than more specific sub-domains such as
Mannose Binding and Bone Modeling.
Phenylalanine Conversion is the only test case in which several OpenCyc-related features
appear; because of this, no comparable training data is available from the other domains, so it
does significantly more poorly than the others.
Table 5.6: Cross-Validation Results
(a) Parallel Ontology Bridge
Test Case                    F-Measure
Platelet Activation          0.707
Mannose Binding              0.685
Immune System                0.623
Phenylalanine Conversion     0.424
Bone Modeling                0.709
Bone Marrow                  0.692
Osteoblast Differentiation   0.707
Osteoclast Differentiation   0.707
Behavior                     0.7
Circadian Rhythm             0.7
Average                      0.67
(b) Ontology Bridge [1]
                    F-Measure
Biomedical Test 1   0.7
Biomedical Test 2   0.69
Biomedical Test 3   0.58
Average             0.68
5.10 Contributions
This thesis implemented Parallel Ontology Bridge, a parallel method of executing the
Ontology Bridge algorithm. The algorithm is broken into map and reduce steps to run
alignments on individual classes concurrently. This approach shows good F-measure on
hyponymy relationships and is generic enough to be used with other ontology alignment
methods and features.
CHAPTER 6
Scalability Results
The scalability of a system refers to its ability to handle larger problems, sometimes by
adding additional resources. In this thesis, it refers specifically to how a system performs
as more computation resources are added.
6.1 Scalability Metrics
For computer systems, there are two major metrics that affect scalability: contention delay
and coherency delay. Contention delay is caused by sequential execution of computations,
due either to the structure of the problem or to contention over shared resources.
Coherency delay is the time required for caches and memory hierarchies to be updated with
the appropriate data. Coherency delays are always due to the implementation of the system.
These relationships are captured in Gunther's Universal Scalability Law [52], a model
for scalability that is used throughout this thesis. Gunther's Universal Scalability Law
is
\[ C(p) = \frac{p}{1 + \sigma\,(p-1) + \kappa\,p\,(p-1)} \]
where p is the number of processing units, \(\sigma\) is the contention delay and \(\kappa\) is the coherency
delay. To determine the scalability of a system, data points are fit to this model. The
maximum performance of a system is reached at \(p^{*}\) processing units, given by
\[ p^{*} = \left\lfloor \sqrt{\frac{1-\sigma}{\kappa}} \right\rfloor . \]
Adding more processes above p* either leaves performance unchanged or reduces it. Adding
processes beyond p* is counterproductive because the performance of the system falls as the
number of processors grows past p*.
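A small sketch of evaluating this model is given below; the sigma and kappa values used are the ones reported for Parallel Ontology Bridge in Table 6.3.

# Sketch of evaluating the Universal Scalability Law; sigma and kappa are the
# Parallel Ontology Bridge values from Table 6.3.
import math

def usl(p, sigma, kappa):
    return p / (1 + sigma * (p - 1) + kappa * p * (p - 1))

sigma, kappa = 1.01e-9, 0.000272
p_star = int(math.sqrt((1 - sigma) / kappa))   # floor of the square root
print(p_star)                                   # 60 processing units
print(round(usl(p_star, sigma, kappa), 1))      # ~30.6x speedup at p*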
6.2 Scalability Experiments
To analyze the scalability of the Parallel Ontology Bridge system, several experiments
were run with various numbers of computation units available. These data points were
fit to Gunther's Universal Scalability Law using least-squares regression. Least-squares
regression is a technique for fitting curves by minimizing the squared error between the data
points and the model.
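A minimal sketch of such a fit with SciPy is shown below; the measured (jobs, speedup) points are illustrative, not the actual experimental data.

# Sketch of least-squares fitting of the Universal Scalability Law to measured
# speedups; the data points below are illustrative, not the thesis measurements.
import numpy as np
from scipy.optimize import curve_fit

def usl(p, sigma, kappa):
    return p / (1 + sigma * (p - 1) + kappa * p * (p - 1))

jobs = np.array([1, 2, 4, 8, 12, 16], dtype=float)
speedup = np.array([1.0, 1.9, 3.7, 7.0, 10.1, 12.8])

(sigma_fit, kappa_fit), _ = curve_fit(usl, jobs, speedup, p0=[0.01, 0.001])
print(sigma_fit, kappa_fit)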
These data points are shown in Figure 6.1a. Similar data already existed for the other
systems compared; these data are shown in the other sub-figures of Figure 6.1. The Parallel
Ontology Bridge data come from running tests on an Amazon HPC instance, one of the many
cloud computing services offered by Amazon, with 60.5 GB of RAM and two eight-core Intel
Xeon E5-2670 "Sandy Bridge" processors.
In addition to the subsets described below, a full alignment of GO and MP was at-
tempted. This failed due to a rarely occurring contention fault: a mutable data
structure in some of the library code that was used does not have an appropriate mutex.
This full scalability test would have compared 211 million pairs in 5.8 days, but only 84
million pairs were successfully processed. Of these 84 million pairs, 8 million were shown
to have a relationship.
Since the algorithm remains unchanged from the test cases described in Section 5.9,
additional human evaluation is unnecessary. For commercial application, additional tuning
and testing is necessary, but these tests should be sufficient to show the scalability of the
approach.
6.3 Other Systems Compared
6.3.1 Gross’s Approaches
Figure 6.1c shows Gross’s [53] intra-node system performance data points and the least
squares fitted curve, while Figure 6.1b shows the same for Gross’s inter-node system [53].
Intra-node means “within one node”, in this case, one computer, while inter-node means
“between nodes”, in this case, multiple computers. Gross's intra-node approach runs various
“matchers” on ontology pairs in parallel on a single computer; matchers include string
similarity measures, structural comparisons and semantic lookups. Gross's inter-node sys-
tem takes the same approach, except that it runs across multiple machines
in a distributed system. The paper describes only the infrastructure to run the matchers, not
to combine their outputs.
This system ran on the Adult Mouse Anatomy1 (MA) and NCI Thesaurus2 ontologies as its
test data. These ontologies are each approximately 3,000 concepts in size and are used as the
Anatomy track of the OAEI.
6.3.2 Zhang’s Approach
Zhang's approach is a Hadoop-based system [39]. Its performance data points and least-
squares fitted curve are shown in Figure 6.1d. This approach constructs “virtual docu-
ments” and uses a term frequency-inverse document frequency (TF-IDF) [54] metric for
ontology alignment, implemented in map-reduce. TF-IDF is a measure from document
retrieval that balances the frequency of specific words against their rarity across
documents.
This system ran on the FMA3 and GALEN4 ontologies.
1 http://www.obofoundry.org/cgi-bin/detail.cgi?id=adult_mouse_anatomy
2 http://ncit.nci.nih.gov/
3 http://sig.biostr.washington.edu/projects/fm/AboutFM.html
4 http://www.co-ode.org/galen/
Figure 6.1: Curve fits for each system: (a) Parallel Ontology Bridge, (b) Gross inter-node, (c) Gross intra-node, (d) Zhang.
6.4 Input
Table 6.1 shows the ontologies used for these scalability tests and their size. The dif-
ferent input data should not affect the scalability results, since all the values used in the
comparisons are normalized. Systems that use larger input datasets may, however, uncover
coherency delays if the approach to cache invalidation is not optimal.
The input data for Parallel Ontology Bridge was selected at random from all the pos-
sible pairs.
6.5 Hardware
Table 6.2 shows a comparison of the hardware used in each system. Zhang's Hadoop
system had the largest number of processors and the most RAM available. Gross's intra-node had
System                     Ontology                     Size
Parallel Ontology Bridge   Gene Ontology Subset         1,000
                           Mammalian Phenotype Subset   1,000
Gross Inter                GO Molecular Function        9,395
                           GO Biological Process        17,104
Gross Intra                Adult Mouse Anatomy          3,289
                           NCI Thesaurus (Anatomy)      2,737
Zhang                      FMA                          ~40,000
                           GALEN                        ~40,000
Table 6.1: Input Data
the fewest processors and the least RAM. Zhang's approach and Gross's
inter-node were both distributed systems over a network.
System                     CPUs            Cache   Memory       Architecture
Parallel Ontology Bridge   16 x 2.60 GHz   20 MB   60.5 GB      Single Computer
Gross Inter                16 x 2.66 GHz   8 MB    4 x 4 GB     Network
Gross Intra                4 x 2.66 GHz    8 MB    4 GB         Single Computer
Zhang                      40 x 2.4 GHz    12 MB   10 x 24 GB   Network
Table 6.2: Architecture Comparison
6.6 Comparison of Scalability
Figure 6.2 shows the scalability of Parallel Ontology Bridge, Gross’s intra-node, Gross’s
inter-node and Zhang’s Hadoop approaches. The ideal system is one that has no contention
or coherency delay. Parallel Ontology Bridge starts to fall below the ideal possible
scalability at more than 15 concurrent jobs. Since Parallel Ontology Bridge is the
closest to the ideal, it has the best scalability. Gross’s inter-node approach is the next clos-
est, so it shows the second best scalability. Gross’s inter-node approach performs worse
than the ideal at more than 4 concurrent jobs. Gross’s intra-node approach performs better
than Zhang’s Hadoop approach when fewer than 8 jobs are executed in parallel.
This poorer performance may be due to the lower-quality hardware used in these other
systems, or to poorer software architecture that causes more contention.
Figure 6.2: Comparison of scalability of ontology alignment approaches based on data points
Figure 6.3 shows the system behavior extrapolated beyond the experimental data. Par-
allel Ontology Bridge has maximum effectiveness at approximately 60 jobs and a 30 times
speedup. Gross's inter-node approach continues to increase in performance slowly, reaching
maximum effectiveness only with a very large number of jobs executing in parallel. Gross's
intra-node approach declines rapidly after roughly ten jobs, while Zhang's approach starts to
decline slowly at approximately 100 jobs.
The retrograde performance of Parallel Ontology Bridge is due to its coherency delay.
Since Gross's inter-node approach has significantly lower coherency delay but higher con-
tention delay, its performance continues to improve slowly as more processors are added.
Zhang's approach has poor coherency and contention delays compared to the other
approaches, which is why it does not increase in performance early on and then plateaus.
Gross's intra-node approach has better contention delay than coherency delay, which is
why it shows improvement early on but has significant retrograde performance as more
processors are added.
Figure 6.3: Comparison of extrapolated scalability of ontology alignment approaches
Table 6.3 shows the scalability metrics for these systems. Parallel Ontology Bridge
has very small contention delay and small coherency delay. In contrast, Gross’s inter-node
system has moderate contention delays and very low coherency delays. This is why Gross’s
inter-node approach does not scale as well as Parallel Ontology Bridge when running a few
jobs, but continues to scale after Parallel Ontology Bridge shows retrograde performance.
Coherency causes performance to drop as the number of jobs increases.
Gross’s inter-node and Parallel Ontology Bridge had dramatically better scalability
than the other two approaches that were compared. This better performance is evident in
both graphs, both experimentally at a small scale and theoretically at a large scale. This is
due to their better scalability metrics.
Zhang's Hadoop approach has significantly more contention delay than any of the
other systems that were compared. This is possibly due to the use of a single "housekeeping"
node which holds all of the persisted data during execution.
Gross’s intra-node approach has some contention delay and the largest coherency de-
lay. This large coherency delay is what causes it to have retrograde performance so early
on.
Parallel Ontology Bridge has the highest theoretical speedup of any system, a speedup
of 30.6 when 60 processors are used, giving an efficiency of 51%. Gross's inter-node
approach has the second-best speedup, 29.7, but only at 128,309 processors, giving the worst
efficiency of any system at 0.02%.
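Efficiency here is the speedup divided by the number of processing units that produce it:
\[ \text{efficiency}(p^{*}) = \frac{C(p^{*})}{p^{*}}, \qquad \frac{30.6}{60} \approx 51\%, \qquad \frac{29.7}{128{,}309} \approx 0.02\%. \]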
Gross’s intra-node system also has good efficiency. The efficiency of Gross’s intra-
node system and Parallel Ontology Bridge may be due to them communicating over a
system bus instead of the network (see Table 6.2).
Parallel Ontology Bridge is much faster overall than Ontology Bridge.
Table 6.4 shows the execution speed of Ontology Bridge (96 hours) versus that of Parallel
Ontology Bridge (40 minutes).
6.7 Summary of Scalability
This chapter compared results of ontology alignment at the system level. This is the only
appropriate comparison to be made with existing results, since the underlying hardware and
system architecture are different. The causes of scalability performance depend on many
factors: how the problem is broken down, the processor architecture used, how the processors
are connected, the speed of these connections, how contention is handled, and how much and
what type of memory is available.
The Parallel Ontology Bridge system showed better scalability than many similar ap-
proaches that have been published.
System                     Contention (σ)   Coherency (κ)   p*        Speedup at p*   Efficiency at p*
Parallel Ontology Bridge   1.01 × 10^-9     0.000272        60        30.6            51%
Gross Inter                0.0336           5.87 × 10^-11   128,309   29.7            0.02%
Gross Intra                1.93 × 10^-9     0.0123          9         4.77            53%
Zhang                      0.0953           0.000115        88        8.65            9.8%

Table 6.3: Scalability Metrics Comparison
Table 6.4: Speed of Execution
Project                    Runtime      Ontology Size
Ontology Bridge            96 hours     O(1,000 · 1,000) [1]
Parallel Ontology Bridge   40 minutes   O(1,000 · 1,000) [1]
CHAPTER 7
Conclusion
This work describes Parallel Ontology Bridge, an approach to ontology alignment using
support vector machines to find non-equivalence relationships that scales through the use
of parallelization. Parallel Ontology Bridge maintained the alignment quality of the previ-
ous work, Ontology Bridge, at an F-Measure of 0.67, while reducing execution time from
96 hours to 2 hours. In addition, it showed a theoretical scalability factor of 30 with an ef-
ficiency of 51%. This shows that Parallel Ontology Bridge is very scalable. The other sys-
tems compared only had a maximum scalability factor of 29 and efficiency of 9.8%. Like
Ontology Bridge, Parallel Ontology Bridge aligns ontologies by matching linguistic and
structural features in a support vector machine. However, unlike Ontology Bridge, Parallel
Ontology Bridge can be scaled across many processing units. By using MapReduce with
shared memory, Parallel Ontology Bridge offers a simple, domain-independent method of
parallelizing ontology alignment.
This work explored the use of MapReduce, human computation, information entropy
and morpheme-based extraction, and cloud computing for ontology alignment. MapReduce
proved to be a straightforward method of implementing this parallelization. It is a clear
match for the Ontology Bridge system and has been proven in industry: Google uses
MapReduce for its PageRank algorithm, which determines the order of search results in
its search engine. Parallel Ontology Bridge shows that a similar approach of investing in
parallel infrastructure can solve the ontology alignment scalability problem. This parallel
infrastructure can come from cloud providers, such as Amazon1 or RackSpace2.
Scalable ontology alignment is a necessary technology for the Semantic Web [55],
both in its full operation and for integrating with existing systems. Scalable ontology
1 http://aws.amazon.com
2 http://rackspace.com/cloud
alignment is still an unsolved problem in the field. Parallel Ontology Bridge shows a
direction, mimicking some of the same outcomes that occurred in more straightforward
document analysis and search engine development.
Parallel Ontology Bridge combines a modular design, MapReduce distribution of jobs,
caching and cloud computing to provide an effective solution to scaling ontology align-
ment. With appropriate training data, it can automatically tune the support vector machine
to optimally align ontologies, eliminating one form of manual tuning.
There were several challenges with this system. The features from Ontology Bridge
had to be ported and appropriately tuned to match the F-measure that Ontology
Bridge achieved. Tuning primarily consisted of selecting appropriate feature normalization.
The goal of Parallel Ontology Bridge was to match the F-measure of Ontology Bridge.
This thesis is the first work in the ontology alignment community to discuss the
theoretical scalability of systems. None of the other papers compared addresses it. Some
papers describe approaches similar to this one, but none went into the detail of
analyzing theoretical scalability.
Success and failure of this project were judged on two measures: scalability and F-
measure. The project succeeded on both, maintaining F-measure while dramatically in-
creasing scalability.
This thesis contributes an end-to-end parallelization technique to the ontology align-
ment community, as well as the first reported use of cloud computing resources for ontology
alignment.
The work was implemented and empirically shown to scale. The approach shows
promise for scaling ontology alignment because of its good empirical results and simplicity
of approach. There may be additional bottlenecks when using a different hardware system
that cannot be detected from the research in this thesis.
Scaling ontology alignment is still a developing field [28] which has great potential
applications in biomedicine [30].
7.1 Future Work
Ontology alignment is a key component in enhancing the research uses of medical ontolo-
gies. Ontology alignment tools need to be better integrated into the workflows of ontology
researchers. Effective research-assistance and diagnostic tools need to be developed, as
well as methods for on-demand or just-in-time processing of ontology alignment. This
would enable integrated and up-to-date ontologies to be used in biomedical research with-
out additional effort or cost on the part of the researchers. A well-aligned ontology is not
useful if it remains stagnant.
For Parallel Ontology Bridge to be used broadly, an effective approach to gathering
training data is necessary. Bootstrapping techniques for training data and the use of an
incrementally trained SVM, which would allow interactive and continuous training, offer
potential solutions to this problem.
There is future work to incorporate the results and approaches of other ontology align-
ment systems into the architecture described in this thesis. In addition, much more
testing and research is needed to compare the results of this thesis to the Ontology Alignment
Evaluation Initiative, specifically the physiology and scalability tracks. To do this, an approach
to unsupervised learning for equivalence relationships may be necessary. Unsupervised
learning does not require data to be labeled, and labeling data can be a labor-intensive
process.
Future work includes modifying Parallel Ontology Bridge to be a distributed system.
This may require further analysis and design of an approach to sharing data across multiple
processing nodes.
APPENDIX A
Code Listings
A.1 Align Implementation
Listing A.1: align.py
from __future__ import division

import sys

import rdflib
import numpy

from itertools import product, starmap
from sklearn import svm, grid_search, metrics, preprocessing


class Bridge():
    def __init__(self, name, training, features, expected=[]):
        self.relation_name = name
        self.features = features
        self.classifier = Classifier(training)
        self.expected = expected
        self.results = []
        self.new_relations = []

    def f(self, c1, c2):
        feature_vector = [f(c1, c2, self.g1, self.g2) for f in self.features]
        match = self.classifier.classify(feature_vector)
        label_1 = str(self.g1.label(c1))
        label_2 = str(self.g2.label(c2))
        as_expected = (label_1, label_2) in self.expected
        return (c1, c2, feature_vector, int(match), as_expected)

    def align(self, g1, g2):
        self.g1 = g1
        self.g2 = g2
        o1 = get_classes(g1)
        o2 = get_classes(g2)
        self.results = list(starmap(self.f, product(o1, o2)))
        self.new_relations = [(c1, c2)
                              for (c1, c2, _, match, _) in self.results]

    def test(self, test_data):
        tuples, labels = zip(*test_data)
        results = [self.classifier.classify(t) for t in tuples]
        return metrics.precision_recall_fscore_support(labels, results,
                                                       labels=[1, 0])


def progress(i, total_comparisons, found):
    sys.stdout.write('\r')
    sys.stdout.write("{:,}/{:,} comparisons made, {:,} found."
                     .format(i + 1, total_comparisons, found))
    sys.stdout.flush()


class Classifier():
    def __init__(self, data):
        tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                             'C': [1, 10, 100, 1000]},
                            {'kernel': ['linear'],
                             'C': [1, 10, 100, 1000]}]
        self.svm = grid_search.GridSearchCV(svm.SVC(), tuned_parameters)
        self.train(data)

    def classify(self, vector):
        result = self.svm.predict([vector])
        return result

    def train(self, data):
        tuples, labels = zip(*data)
        self.svm.fit(preprocessing.normalize(numpy.array(tuples, dtype=float)),
                     labels)


def get_classes(g):
    return list(g.subjects(rdflib.RDF.type, rdflib.OWL.Class))
A.2 Parallel Implementation
Listing A.2: joblib_align.py
import itertools
import pickle

import rdflib

from datetime import datetime
from rdflib import OWL, RDF
from sklearn.externals import joblib

import align
import import_csv
import primitives

memory = joblib.Memory(cachedir="cache")
import_csv.import_csv = memory.cache(import_csv.import_csv)

features = primitives.members
relation = "hyponymy"
tc_names = ['01_platelet_activation',
            '02_mannose_binding',
            '03_immune_system',
            '04_phenylalanine_conversion',
            '05_bone_modeling',
            '06_bone_marrow',
            '07_osteoblast_differentiation',
            '08_osteoclast_differentiation',
            '09_behavior',
            '10_circadian_rhythm']
cases = [import_csv.import_csv(tc, relation, features) for tc in tc_names]


@memory.cache
def get_training(tc_name):
    other_cases = (case for (tc, case) in zip(tc_names, cases)
                   if tc != tc_name)
    training = sum(other_cases, [])
    return training


def parse_ont(filename):
    g = rdflib.Graph()
    g.parse(filename)
    return g


class OntologySubstitute():
    def __init__(self, filename):
        self.ont = parse_ont(filename)

    def objects(self, subject=None, predicate=None):
        return self.ont.objects(subject, predicate)

    def subjects(self, predicate=None, object=None):
        return self.ont.subjects(predicate, object)

    def label(self, subject, default=''):
        return self.ont.label(subject, default)


def grouper(n, iterable):
    "Collect data into fixed-length chunks or blocks"
    # grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return (itertools.ifilter(None, g)
            for g in itertools.izip_longest(fillvalue=None, *args))


if __name__ == '__main__':
    tc_name = 'all'
    training = get_training(tc_name)
    g1 = OntologySubstitute("test_cases/" + tc_name + "/GOTestCase.owl")
    g2 = OntologySubstitute("test_cases/" + tc_name + "/MPTestCase.owl")
    g1_classes = g1.subjects(RDF.type, OWL.Class)
    g2_classes = g2.subjects(RDF.type, OWL.Class)
    pairs = itertools.product(g1_classes, g2_classes)
    c = align.Classifier(training)

    def f(c1, c2):
        feature_vector = [f(c1, c2, g1, g2) for f in features]
        match = c.classify(feature_vector)
        label_1 = str(g1.label(c1))
        label_2 = str(g2.label(c2))
        return (c1, c2, int(match))

    start = datetime.now()
    jobs = joblib.Parallel(n_jobs=16, verbose=1, pre_dispatch='100*n_jobs')
    for subset in grouper(25000, pairs):
        results = jobs(joblib.delayed(f)(c1, c2)
                       for (c1, c2) in subset)
    end = datetime.now()
    print "time elapsed", str(end - start)
    for a, b, m in results:
        print str(a), str(b), m
A.3 Primitives Implementation
Listing A.3: primitives.py
" " "
E v i d i e n c e p r i m i t i v e e x t r a c t o r s , a l l f o l l o w t h e g e n e r i c form
o f
f ( c1 , c2 , o1 , o2 )
where c1 i s a c l a s s from o n t o l o g y o1 and c2 i s a c l a s s from
o n t o l o g y o2 .
" " "
import i n s p e c t
import s y s
import s h e l v e
import opencyc
62 Appendix A. Code Listings
import r d f l i b
from i t e r t o o l s import p r o d u c t , s t a r m a p
from c o n t e x t l i b import c l o s i n g
from n l t k . c o r p u s import wordne t
from r d f l i b import RDFS
from u t i l s import g e t _ l a b e l s
r d f l i b . p l u g i n . r e g i s t e r ( ’ t e x t / xml ’ , r d f l i b . p l u g i n . P a r s e r ,
’ r d f l i b . p l u g i n s . p a r s e r s . r d f xm l ’ , ’
RDFXMLParser ’ )
opencyc_db = opencyc . OpenCyc ( )
def h a s _ s a m e _ l a b e l ( c1 , c2 , o1 , o2 ) :
re turn o1 . l a b e l ( c1 , d e f a u l t =None ) == o2 . l a b e l ( c2 ,
d e f a u l t =None ) != None
def count_wordnet_synonym ( c1 , c2 , o1 , o2 ) :
c o u n t = 0
f o r ( w1 , w2 ) in g e t _ w o r d _ p a i r s ( c1 , c2 , o1 , o2 ) :
c o u n t += l e n ( s e t ( wordne t . s y n s e t s ( w1 ) ) & s e t ( wordne t .
s y n s e t s ( w2 ) ) )
re turn c o u n t
def coun t_wordne t_hypernyms ( c1 , c2 , o1 , o2 ) :
c o u n t = 0
f o r ( w1 , w2 ) in g e t _ w o r d _ p a i r s ( c1 , c2 , o1 , o2 ) :
A.3. Primitives Implementation 63
f o r s2 in wordne t . s y n s e t s ( w2 ) :
c o u n t += sum ( s1 in hypernyms ( s2 , 5 ) f o r s1 in
wordne t . s y n s e t s ( w1 ) )
re turn c o u n t
def count_wordnet_hyponyms ( c1 , c2 , o1 , o2 ) :
re turn F a l s e
def coun t_opencyc_hypernyms ( c1 , c2 , o1 , o2 ) :
c o u n t = 0
f o r ( w1 , w2 ) in g e t _ w o r d _ p a i r s ( c1 , c2 , o1 , o2 ) :
c o u n t += w1 in a n c e s t o r s ( w2 )
re turn c o u n t
def count_opencyc_hyponyms ( c1 , c2 , o1 , o2 ) :
c o u n t = 0
f o r ( w1 , w2 ) in g e t _ w o r d _ p a i r s ( c1 , c2 , o1 , o2 ) :
c o u n t += w2 in a n c e s t o r s ( w1 )
re turn c o u n t
def h a s _ s a m e _ b e g i n n i n g ( c1 , c2 , o1 , o2 ) :
re turn any ( l 1 . s t a r t s w i t h ( l 2 ) or l 2 . s t a r t s w i t h ( l 2 )
f o r ( l1 , l 2 ) in l a b e l _ p a i r s ( c1 , c2 , o1 , o2 ) )
def has_same_end ing ( c1 , c2 , o1 , o2 ) :
re turn any ( l 1 [ : : � 1 ] . s t a r t s w i t h ( l 2 [ : : � 1 ] ) or l 2 [ : : � 1 ] .
s t a r t s w i t h ( l 2 [ : : � 1 ] )
f o r ( l1 , l 2 ) in l a b e l _ p a i r s ( c1 , c2 , o1 , o2 ) )
def synonym ( w1 , w2 ) :
64 Appendix A. Code Listings
re turn w1 == w2 and w1 in opencyc_db
def count_opencyc_synonyms ( c1 , c2 , o1 , o2 ) :
re turn sum ( synonym ( w1 , w2 ) f o r ( w1 , w2 ) in
g e t _ w o r d _ p a i r s ( c1 , c2 , o1 , o2 ) )
def h a s _ s a m e _ f i r s t _ w o r d ( c1 , c2 , o1 , o2 ) :
re turn any ( l 1 . s p l i t ( ) [ 0 ] == l 2 . s p l i t ( ) [ 0 ]
f o r ( l1 , l 2 ) in l a b e l _ p a i r s ( c1 , c2 , o1 , o2 ) )
def h a s _ s a m e _ l a s t _ w o r d ( c1 , c2 , o1 , o2 ) :
re turn any ( l 1 . s p l i t ( ) [�1] == l 2 . s p l i t ( ) [�1]
f o r ( l1 , l 2 ) in l a b e l _ p a i r s ( c1 , c2 , o1 , o2 ) )
def h a s _ m a t c h i n g _ l a b e l s ( c1 , c2 , o1=None , o2=None ) :
re turn any ( l 1 == l 2 f o r ( l1 , l 2 ) in l a b e l _ p a i r s ( c1 , c2 ,
o1 , o2 ) )
def h a s _ s u b _ p r e f i x ( c1 , c2 , o1 , o2 ) :
re turn any ( s t a r m a p ( s u b _ p r e f i x , g e t _ w o r d _ p a i r s ( c1 , c2 , o1
, o2 ) ) )
def h a s _ s u p e r c l a s s _ 1 ( c1 , c2 , o1 , o2 ) :
re turn boo l ( l i s t ( g e t _ p a r e n t s ( c1 , o1 ) ) )
def h a s _ s u p e r c l a s s _ 2 ( c1 , c2 , o1 , o2 ) :
re turn boo l ( l i s t ( g e t _ p a r e n t s ( c2 , o2 ) ) )
def ha s_o pen cy c_s ubc l a s s_s yno nym ( c1 , c2 , o1 , o2 ) :
f o r s u b c l a s s in s u b c l a s s e s ( c1 , o1 ) :
A.3. Primitives Implementation 65
f o r ( w1 , w2 ) in g e t _ w o r d _ p a i r s ( s u b c l a s s , c2 , o1 , o2 )
:
i f w1 == w2 and w1 in opencyc_db : re turn True
re turn F a l s e
def h a s _ o p e n c y c _ s u p e r c l a s s _ h y p e r n y m ( c1 , c2 , o1 , o2 ) :
f o r s u p e r c l a s s in g e t _ p a r e n t s ( c1 , o1 ) :
f o r ( w1 , w2 ) in g e t _ w o r d _ p a i r s ( s u p e r c l a s s , c2 , o1 ,
o2 ) :
i f w1 in a n c e s t o r s ( w2 ) : re turn True
re turn F a l s e
def h a s _ o p e n c y c _ s u p e r c l a s s _ s y n o n y m ( c1 , c2 , o1 , o2 ) :
f o r s u p e r c l a s s in g e t _ p a r e n t s ( c1 , o1 ) :
f o r ( w1 , w2 ) in g e t _ w o r d _ p a i r s ( s u p e r c l a s s , c2 , o1 ,
o2 ) :
i f w1 == w2 and w1 in opencyc_db : re turn True
re turn F a l s e
def h a s _ o p e n c y c _ s u p e r c l a s s _ h y p o n y m ( c1 , c2 , o1 , o2 ) :
f o r s u p e r c l a s s in g e t _ p a r e n t s ( c1 , o1 ) :
f o r ( w1 , w2 ) in g e t _ w o r d _ p a i r s ( s u p e r c l a s s , c2 , o1 ,
o2 ) :
i f w2 in a n c e s t o r s ( w1 ) : re turn True
re turn F a l s e
def s u b c l a s s e s ( c l s , o n t ) :
re turn o n t . s u b j e c t s (RDFS . subClassOf , c l s )
def has_wordnet_synonym ( c1 , c2 , o1 , o2 ) :
66 Appendix A. Code Listings
f o r ( w1 , w2 ) in g e t _ w o r d _ p a i r s ( c1 , c2 , o1 , o2 ) :
i f s e t ( wordne t . s y n s e t s ( w1 ) ) & s e t ( wordne t . s y n s e t s ( w2
) ) :
re turn True
re turn F a l s e
def has_wordnet_hypernym ( c1 , c2 , o1 , o2 ) :
f o r ( w1 , w2 ) in g e t _ w o r d _ p a i r s ( c1 , c2 , o1 , o2 ) :
f o r s1 in wordne t . s y n s e t s ( w1 ) :
f o r s2 in wordne t . s y n s e t s ( w2 ) :
i f s1 in hypernyms ( s2 , 5 ) : re turn True
re turn F a l s e
def has_wordnet_hyponym ( c1 , c2 , o1 , o2 ) :
f o r ( w1 , w2 ) in g e t _ w o r d _ p a i r s ( c1 , c2 , o1 , o2 ) :
f o r s1 in wordne t . s y n s e t s ( w1 ) :
f o r s2 in wordne t . s y n s e t s ( w2 ) :
i f s2 in hypernyms ( s1 , 5 ) :
re turn True
re turn F a l s e
def h a s _ w o r d n e t _ s u b c l a s s _ s y n o n y m ( c1 , c2 , o1 , o2 ) :
f o r s u b c l a s s in s u b c l a s s e s ( c1 , o1 ) :
f o r ( w1 , w2 ) in g e t _ w o r d _ p a i r s ( s u b c l a s s , c2 , o1 , o2 )
:
i f s e t ( wordne t . s y n s e t s ( w1 ) ) & s e t ( wordne t .
s y n s e t s ( w2 ) ) :
re turn True
re turn F a l s e
A.3. Primitives Implementation 67
def h a s _ w o r d n e t _ s u p e r c l a s s _ s y n o n y m ( c1 , c2 , o1 , o2 ) :
f o r s u p e r c l a s s in g e t _ p a r e n t s ( c1 , o1 ) :
f o r ( w1 , w2 ) in g e t _ w o r d _ p a i r s ( s u p e r c l a s s , c2 , o1 ,
o2 ) :
i f s e t ( wordne t . s y n s e t s ( w1 ) ) & s e t ( wordne t .
s y n s e t s ( w2 ) ) :
re turn True
re turn F a l s e
def h a s _ w o r d n e t _ s u p e r c l a s s _ h y p o n y m ( c1 , c2 , o1 , o2 ) :
f o r s u p e r c l a s s in g e t _ p a r e n t s ( c1 , o1 ) :
f o r ( w1 , w2 ) in g e t _ w o r d _ p a i r s ( s u p e r c l a s s , c2 , o1 ,
o2 ) :
f o r s1 in wordne t . s y n s e t s ( w1 ) :
f o r s2 in wordne t . s y n s e t s ( w2 ) :
i f s2 in hypernyms ( s1 , 5 ) :
re turn True
re turn F a l s e
def h a s _ w o r d n e t _ s u p e r c l a s s _ h y p e r y m ( c1 , c2 , o1 , o2 ) :
f o r s u p e r c l a s s in g e t _ p a r e n t s ( c1 , o1 ) :
f o r ( w1 , w2 ) in g e t _ w o r d _ p a i r s ( s u p e r c l a s s , c2 , o1 ,
o2 ) :
f o r s1 in wordne t . s y n s e t s ( w1 ) :
f o r s2 in wordne t . s y n s e t s ( w2 ) :
i f s1 in hypernyms ( s2 , 5 ) :
re turn True
re turn F a l s e
def hypernyms ( s y n s e t , l e v e l ) :
68 Appendix A. Code Listings
i f l e v e l == 1 :
re turn s y n s e t . hypernyms ( )
hyps = s y n s e t . hypernyms ( )
f o r s in s y n s e t . hypernyms ( ) :
hyps += hypernyms ( s , l e v e l �1)
re turn hyps
def has_opencyc_synonym ( c1 , c2 , o1 , o2 ) :
f o r ( w1 , w2 ) in g e t _ w o r d _ p a i r s ( c1 , c2 , o1 , o2 ) :
i f w1 == w2 and w1 in opencyc_db : re turn True
re turn F a l s e
def i s_opencyc_hypernym ( c1 , c2 , o1 , o2 ) :
f o r ( w1 , w2 ) in g e t _ w o r d _ p a i r s ( c1 , c2 , o1 , o2 ) :
re turn w1 in a n c e s t o r s ( w2 )
def i s_opencyc_hyponym ( c1 , c2 , o1 , o2 ) :
f o r ( w1 , w2 ) in g e t _ w o r d _ p a i r s ( c1 , c2 , o1 , o2 ) :
re turn w2 in a n c e s t o r s ( w1 )
def g e t _ p a r e n t s ( c l s , o n t ) :
re turn o n t . o b j e c t s ( c l s , RDFS . s u b C l a s s O f )
def l a b e l _ p a i r s ( c1 , c2 , o1 , o2 ) :
l a b e l s 1 = l a b e l s _ f o r ( c1 , o1 )
l a b e l s 2 = l a b e l s _ f o r ( c2 , o2 )
re turn p r o d u c t ( l a b e l s 1 , l a b e l s 2 )
def g e t _ w o r d _ p a i r s ( c1 , c2 , o1 , o2 ) :
f o r n1 , n2 in l a b e l _ p a i r s ( c1 , c2 , o1 , o2 ) :
A.3. Primitives Implementation 69
f o r ( w1 , w2 ) in p r o d u c t ( n1 . s p l i t ( ) , n2 . s p l i t ( ) ) :
y i e l d w1 , w2
def l a b e l s _ f o r ( c l s , o n t ) :
l a b e l s = [ c l s . s p l i t ( " # " ) [ �1]] i f " # " in c l s e l s e [ c l s ]
i f o n t and g e t _ l a b e l s ( c l s , o n t ) :
l a b e l s = g e t _ l a b e l s ( c l s , o n t )
re turn l a b e l s
def a n c e s t o r s ( c o n c e p t ) :
i f c o n c e p t not in opencyc_db : re turn [ ]
re turn [ c f o r c in opencyc_db [ c o n c e p t ] i f c != c o n c e p t ]
def s u b _ p r e f i x ( s1 , s2 ) :
re turn s1 == " sub " + s2
def h a s _ s t o i l o s _ s i m i l a r i t y ( c1 , c2 , o1 , o2 ) :
f o r ( l1 , l 2 ) in l a b e l _ p a i r s ( c1 , c2 , o1 , o2 ) :
re turn s t o i l o s _ s i m i l a r i t y ( l1 , l 2 )
def s t o i l o s _ s i m i l a r i t y ( s t 1 , s t 2 ) :
i f s t 1 == None or s t 2 == None : re turn �1
s1 = s t 1 . lower ( )
s2 = s t 2 . lower ( )
s1 = s1 . r e p l a c e ( ’ . ’ , ’ ’ )
s2 = s2 . r e p l a c e ( ’ . ’ , ’ ’ )
s1 = s1 . r e p l a c e ( ’ _ ’ , ’ ’ )
s2 = s2 . r e p l a c e ( ’ _ ’ , ’ ’ )
70 Appendix A. Code Listings
s1 = s1 . r e p l a c e ( ’ ’ , ’ ’ )
s2 = s2 . r e p l a c e ( ’ ’ , ’ ’ )
l 1 = l e n ( s1 )
l 2 = l e n ( s2 )
L1 = l 1
L2 = l 2
i f L1 == 0 and L2 == 0 : re turn 1
i f L1 == 0 or L2 == 0 : re turn 0
common = 0
b e s t = 2
whi le l e n ( s1 ) > 0 and l e n ( s2 ) > 0 and b e s t !=0 :
b e s t = 0
l 1 = l e n ( s1 )
l 2 = l e n ( s2 )
i = 0
j = 0
s t a r t S 2 = 0
endS2 = 0
s t a r t S 1 = 0
endS1 = 0
p=0
A.3. Primitives Implementation 71
f o r i in x r a ng e ( l 1 ) :
i f not ( l 1 � i > b e s t ) : break
j = 0 ;
whi le ( l 2 � j > b e s t ) :
k = i ;
whi le j < l 2 and s1 [ k ] != s2 [ j ] : j += 1
i f j != l 2 :
p = j
j += 1
k += 1
whi le ( j < l 2 ) and ( k < l 1 ) and s1 [ k ] ==
s2 [ j ] :
k += 1
j += 1
i f k�i > b e s t :
b e s t = k�i
s t a r t S 1 = i
endS1 = k
s t a r t S 2 = p
endS2 = j
s1 = s1 [ s t a r t S 1 : endS1 ]
s2 = s2 [ s t a r t S 2 : endS2 ]
commonal i ty = 0
scaledCommon = f l o a t (2⇤common ) / ( L1+L2 )
commonal i ty = scaledCommon ;
w i n k l e r = wink l e r Improvemen t ( s t 1 , s t 2 ,
commonal i ty ) ;
72 Appendix A. Code Listings
d i s s i m i l a r i t y = 0 ;
r e s t 1 = L1 � common ;
r e s t 2 = L2 � common ;
unmatchedS1 = max ( r e s t 1 , 0 )
unmatchedS2 = max ( r e s t 2 , 0 )
unmatchedS1 = r e s t 1 / L1
unmatchedS2 = r e s t 2 / L2
# Hamacher Produc t
suma = unmatchedS1 + unmatchedS2 ;
p r o d u c t = unmatchedS1 ⇤ unmatchedS2 ;
p = 0 . 6 ; # For 1 i t c o i n c i d e s w i t h t h e a l g e b r a i c
p r o d u c t
i f ( ( suma�p r o d u c t ) == 0 ) :
d i s s i m i l a r i t y = 0 ;
e l s e :
d i s s i m i l a r i t y = ( p r o d u c t ) / ( p+(1�p ) ⇤ ( suma�
p r o d u c t ) ) ;
# M o d i f i c a t i o n JE : r e t u r n e d n o r m a l i z a t i o n ( i n s t e a d
o f [�1 1 ] )
r e s u l t = commonal i ty � d i s s i m i l a r i t y + w i n k l e r ;
re turn ( r e s u l t +1) / 2 ;
def wink le r Improvemen t ( s1 , s2 , commonal i ty ) :
n = min ( l e n ( s1 ) , l e n ( s2 ) )
f o r i in x r a n g e ( n ) :
i f s1 [ i ] != s2 [ i ] :
A.4. OpenCyc Implementation 73
break
commonPref ixLength = min ( 4 , i ) ;
w i n k l e r = commonPref ixLength ⇤0.1⇤(1 � commonal i ty )
re turn w i n k l e r
members = [ func f o r ( name , func ) in i n s p e c t . getmembers ( s y s .
modules [ __name__ ] )
i f i n s p e c t . i s f u n c t i o n ( func ) and
( name . s t a r t s w i t h ( ’ has_ ’ ) or name . s t a r t s w i t h ( ’
co un t_ ’ ) or name . s t a r t s w i t h ( ’ i s _ ’ ) ) ]
A.4 OpenCyc Implementation
Listing A.4: opencyc.py
# OpenCyc Shelve
# Open OpenCyc
# Iterate through each concept that has a label
# For each concept that has a label, find all the ancestors up to level N
# Create a dictionary of them of the form
#   {key: [[parents], [grandparents], ...]}
#
# Phase two: Open OpenCyc and get each thing that has a label, create a
# list of the transitive closure over OWL.subClassOf
# Phase three: Open OpenCyc and parse into list of lists based on parents,
# grandparents etc. by doing a breadth-first traversal over OWL.subClassOf
import rdflib
import shelve

from rdflib import RDFS

import cPickle as pickle


class OpenCyc():
    def __init__(self):
        self.shelf = shelve.open("opencyc.shelve")

    def __contains__(self, item):
        return item.encode('UTF-8') in self.shelf

    def __getitem__(self, item):
        return self.shelf[item.encode('UTF-8')]


if __name__ == "__main__":
    g = rdflib.Graph()
    g.parse("ontologies/opencyc-2012-05-10-readable.owl")
    pickle.dump(g, open('opencyc.pickle', 'w'))
    nodes = (n for n in g.all_nodes() if g.label(n))
    s = shelve.open("opencyc.shelve")
    for n in nodes:
        key = g.label(n).encode('UTF-8')
        s[key] = [g.label(n).encode('UTF-8')
                  for n in g.transitive_objects(n, RDFS.subClassOf)
                  if g.label(n)]
    s.close()
Bibliography
[1] S. K. Stoutenburg, “Advanced ontology alignment: New methods for biomedical on-
tology alignment using non-equivalence relations,” Ph.D. dissertation, University of
Colorado at Colorado Springs, 2009.
[2] L. Sweetlove. (2011) Number of species on earth tagged at 8.7 million. Nature.
[Online]. Available: http://www.nature.com/news/2011/110823/full/news.2011.498.html
[3] (2012, September 17th) Wikipedia live statistics page. [Online]. Available:
http://stats.wikimedia.org/EN/TablesWikipediaZZ.htm#distribution
[4] T. G. O. Consortium, “Gene ontology: tool for the unification of biology,” Nature
Genetics, vol. 25, no. 1, pp. 25–29, May 2000.
[5] D. Rubin, N. Shah, and N. Noy, “Biomedical ontologies: a functional perspective,”
Briefings in Bioinformatics, vol. 9, no. 1, pp. 75–90, 2008.
[6] P. Shvaiko and J. Euzenat, “Ontology matching: state of the art and future challenges,”
IEEE Transactions on Knowledge and Data Engineering, vol. 99, 2012.
[7] J. Euzenat and P. Shvaiko, Ontology Matching. Springer, 2007.
[8] C. Smith, C. Goldsmith, J. Eppig et al., “The mammalian phenotype ontology as a
tool for annotating, analyzing and comparing phenotypic information,” Genome Biol,
vol. 6, no. 1, p. R7, 2005.
[9] D. L. McGuinness, F. Van Harmelen et al., “Owl web ontology language overview,”
W3C recommendation, vol. 10, no. 2004-03, p. 10, 2004.
[10] C. Elkan and R. Greiner, “Building large knowledge-based systems: Representation
and inference in the cyc project: Db lenat and rv guha,” Artificial Intelligence, vol. 61,
no. 1, pp. 41–52, 1993.
[11] J. David, F. Guillet, and H. Briand, “Matching directories and owl ontologies with
aroma,” in Conference on Information and Knowledge Management: Proceedings of
the 15 th ACM international conference on Information and knowledge management,
vol. 6, no. 11, 2006, pp. 830–831.
[12] J. Noessner and M. Niepert, “Codi: Combinatorial optimization for data integration–
results for oaei 2010,” Ontology Matching, p. 142, 2010.
[13] M. Hussain and S. Srivatsa, “A study of different ontology matching system,” Inter-
national Journal of Computer Applications (0975–8887) Volume, 2012.
[14] I. F. Cruz, F. P. Antonelli, and C. Stroe, “Agreementmaker: efficient matching for large
real-world schemas and ontologies,” Proceedings of the VLDB Endowment, vol. 2,
no. 2, pp. 1586–1589, 2009.
[15] N. Jian, W. Hu, G. Cheng, and Y. Qu, “Falcon-ao: Aligning ontologies with falcon,”
in In: K-Cap 2005 Workshop on Integrating Ontologies., 2005, pp. 87–93.
[16] E. Rahm, “Towards large-scale schema and ontology matching,” Schema matching
and mapping, pp. 3–27, 2011, Springer.
[17] I. Foster, Designing and Building Parallel Programs: Concepts and Tools for Parallel
Software Engineering. Addison-Wesley, 1995.
[18] J. Dean and S. Ghemawat, “Mapreduce: simplified data processing on large clusters,”
Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
[19] M. A. Hearst, S. Dumais, E. Osman, J. Platt, and B. Scholkopf, “Support vector
machines,” Intelligent Systems and their Applications, IEEE, vol. 13, no. 4, pp. 18–
28, 1998.
[20] J. Euzenat, A. Ferrara, L. Hollink, A. Isaac, C. Joslyn, V. Malaisé, C. Meil-
icke, A. Nikolov, J. Pane, M. Sabou, F. Scharffe, P. Shvaiko, V. Spiliopou-
los, H. Stuckenschmidt, O. Šváb Zamazal, V. Svátek, C. Trojahn, G. Vouros,
and S. Wang, “Results of the ontology alignment evaluation initiative 2009,”
http://eprints.biblio.unitn.it/1807/1/006.pdf.
[21] N. F. Noy, “Semantic integration: a survey of ontology-based approaches,” SIGMOD
record, vol. 33, no. 4, pp. 65–70, 2004.
[22] A. Doan, J. Madhavan, P. Domingos, and A. Halevy, “Ontology matching: A machine
learning approach,” Handbook on Ontologies in Information Systems, pp. 397–416,
2004.
[23] B. Smith, W. Ceusters, B. Klagges, J. Köhler, A. Kumar, J. Lomax, C. Mungall,
F. Neuhaus, A. L. Rector, and C. Rosse, “Relations in biomedical ontologies,”
Genome biology, vol. 6, no. 5, p. R46, 2005.
[24] A. Johnson and C. O’Donnell, “An open access database of genome-wide association
results,” BMC medical genetics, vol. 10, no. 1, p. 6, 2009.
[25] A. Gruzdz, A. Ihnatowicz, J. Siddiqi, and B. Akhgar, “Mining genes relations in mi-
croarray data combined with ontology in colon cancer automated diagnosis system,”
World Academy of Science, Engineering and Technology, vol. 16, no. 26, pp. 140–
144, 2006.
[26] N. F. Noy, N. H. Shah, P. L. Whetzel, B. Dai, M. Dorf, N. Griffith, C. Jonquet, D. L.
Rubin, M.-A. Storey, C. G. Chute et al., “Bioportal: ontologies and integrated data
resources at the click of a mouse,” Nucleic acids research, vol. 37, no. suppl 2, pp.
W170–W173, 2009.
[27] B. Smith, M. Ashburner, C. Rosse, J. Bard, W. Bug, W. Ceusters, L. J. Goldberg,
K. Eilbeck, A. Ireland, C. J. Mungall et al., “The obo foundry: coordinated evolution
of ontologies to support biomedical data integration,” Nature biotechnology, vol. 25,
no. 11, pp. 1251–1255, 2007.
[28] J. Euzenat, C. Meilicke, H. Stuckenschmidt, P. Shvaiko, and C. Trojahn, “Ontology
alignment evaluation initiative: six years of experience,” Journal on Data Semantics
XV, vol. n/a, pp. 158–192, 2011.
[29] J. Aguirre, B. Grau, K. Eckert, J. Euzenat, A. Ferrara, R. van Hague, L. Hollink,
E. Jimenez-Ruiz, C. Meilicke, A. Nikolov et al., “Results of the ontology align-
ment evaluation initiative 2012,” in Proc. 7th International Semantic Web Conference
Workshop on Ontology Matching (OM), Boston, MA, 2012, pp. 73–115.
[30] O. Bodenreider and A. Burgun, “Biomedical ontologies,” Medical Informatics, pp.
211–236, 2005.
[31] G. A. Miller et al., “Wordnet: a lexical database for english,” Communications of the
ACM, vol. 38, no. 11, pp. 39–41, 1995.
[32] T. F. Hayamizu, M. Mangan, J. P. Corradi, J. A. Kadin, M. Ringwald et al., “The adult
mouse anatomical dictionary: a tool for annotating and integrating data,” Genome
biology, vol. 6, no. 3, p. R29, 2005.
[33] N. Sioutos, S. d. Coronado, M. W. Haber, F. W. Hartel, W.-L. Shaiu, and L. W. Wright,
“Nci thesaurus: a semantic model integrating cancer-related clinical and molecular
information,” Journal of biomedical informatics, vol. 40, no. 1, pp. 30–43, 2007.
[34] A. Ghazvinian, N. F. Noy, and M. A. Musen, “Creating mappings for ontologies in
biomedicine: Simple methods work,” in AMIA Annual Symposium Proceedings. San
Francisco, CA: American Medical Informatics Association, 2009, p. 198.
[35] W. Hu, Y. Zhao, D. Li, G. Cheng, H. Wu, and Y. Qu, “Falcon-AO: results for OAEI
2007,” in Proceedings of the International Workshop on Ontology Matching, Busan,
Korea, 2007.
[36] J. David, F. Guillet, and H. Briand, “Association rule ontology matching approach,”
International Journal of Semantic Web Information Systems, vol. 3, no. 2, pp. 27–49,
2007.
[37] R. Agrawal, T. Imielinski, and A. Swami, “Mining association rules between sets of
items in large databases,” in ACM SIGMOD Record, vol. 22, no. 2. ACM, 1993, pp.
207–216.
[38] P. Xu, Y. Wang, L. Cheng, and T. Zang, “Alignment results of SOBOM for OAEI
2010,” Ontology Matching, p. 203, 2010.
[39] H. Zhang, W. Hu, and Y. Qu, “Vdoc+: a virtual document based approach for match-
ing large ontologies using mapreduce,” Journal of Zhejiang University-Science C,
vol. 13, no. 4, pp. 257–267, 2012.
[40] C. Cortes and V. Vapnik, “Support-vector networks,” Machine learning, vol. 20, no. 3,
pp. 273–297, 1995.
[41] A. J. Smola, B. Schölkopf, and K.-R. Müller, “The connection between regularization
operators and support vector kernels,” Neural Networks, vol. 11, no. 4, pp. 637–649,
1998.
[42] J. Euzenat, “Semantic precision and recall for ontology alignment evaluation,” in
Proc. 20th International Joint Conference on Artificial Intelligence (IJCAI), Hyder-
abad, India, 2007, pp. 348–353.
[43] Z. Xuegong, “Introduction to statistical learning theory and support vector machines,”
Acta Automatica Sinica, vol. 26, no. 1, pp. 32–42, 2000.
[44] R. Neches, R. Fikes, T. Finin, T. Gruber, R. Patil, T. Senator, and W. Swartout, “En-
abling technology for knowledge sharing,” AI magazine, vol. 12, no. 3, p. 36, 1991.
[45] V. Sheng, F. Provost, and P. Ipeirotis, “Get another label? improving data quality and
data mining using multiple, noisy labelers,” in Proceeding of the 14th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining. Las Vegas,
NV: ACM, 2008, pp. 614–622.
[46] M. Bernstein, G. Little, R. Miller, B. Hartmann, M. Ackerman, D. Karger, D. Crowell,
and K. Panovich, “Soylent: a word processor with a crowd inside,” in Proceedings of
the 23rd Annual ACM Symposium on User Interface Software and Technology. New
York City, NY: ACM, 2010, pp. 313–322.
[47] P. Wais, S. Lingamneni, D. Cook, J. Fennell, B. Goldenberg, D. Lubarov, D. Marin,
and H. Simons, “Towards building a high-quality workforce with mechanical turk,”
in Proceedings of Computational Social Science and the Wisdom of Crowds (NIPS),
Vancouver, BC, 2010, pp. 1–5.
[48] S. K. Stoutenburg, J. Kalita, K. Ewing, and L. M. Hines, “Scaling alignment of large
ontologies,” International journal of bioinformatics research and applications, vol. 6,
no. 4, pp. 384–401, 2010.
[49] G. Stoilos, G. Stamou, and S. Kollias, “A string metric for ontology alignment,” in
The Semantic Web–International Semantic Web Conference 2005. Galway, Ireland:
Springer, 2005, pp. 624–637.
[50] E. Loper and S. Bird, “Nltk: the natural language toolkit,” in Proceedings of
the ACL-02 Workshop on Effective tools and methodologies for teaching natural
language processing and computational linguistics - Volume 1, ser. ETMTNLP ’02.
Stroudsburg, PA, USA: Association for Computational Linguistics, 2002, pp. 63–70.
[Online]. Available: http://dx.doi.org/10.3115/1118108.1118117
[51] K. Beck, Test-driven development: by example. Addison-Wesley Professional, 2003.
[52] N. Gunther, Guerrilla capacity planning : a tactical approach to planning for highly
scalable applications and services. Berlin London: Springer, 2011.
[53] A. Gross, M. Hartung, T. Kirsten, and E. Rahm, “On matching large life science
ontologies in parallel,” in Data Integration in the Life Sciences. Springer, 2010, pp.
35–49.
[54] O. Chum, J. Philbin, and A. Zisserman, “Near duplicate image detection: min-hash
and tf-idf weighting,” in Proceedings of the British Machine Vision Conference, vol. 3,
2008, p. 4.
[55] A. Doan, J. Madhavan, R. Dhamankar, P. Domingos, and A. Halevy, “Learning to
match ontologies on the semantic web,” The VLDB Journal, vol. 12, no. 4, pp. 303–
319, 2003.