SCALING ONTOLOGY ALIGNMENT
by
RYAN E. FRECKLETON
B.S.C.p.E, University of Colorado Colorado Springs, 2008
A thesis submitted to the Graduate Faculty of the
University of Colorado Colorado Springs
in partial fulfillment of the
requirements for the degree of
Master of Science in Computer Science
Department of Computer Science
2013
© Copyright by Ryan E. Freckleton 2013
All Rights Reserved
This thesis for the Master of Science in Computer Science degree by
Ryan E. Freckleton
has been approved for the
Department of Computer Science
by
Dr. Jugal Kalita, Chair
Dr. Charles Shub
Dr. Lisa Hines
Dr. Suzette Stoutenburg
Date
Freckleton, Ryan E. (M.S.C.S., Computer Science)
Scaling Ontology Alignment
Thesis directed by Professor Dr. Jugal Kalita
Abstract
As ontologies become more prevalent in biomedicine and other fields, effective
ontology alignment is necessary for their economical and practical use. An ontology is a
group of concepts derived from a corpus of knowledge. Ontology alignment determines the
relationships between these concepts across different ontologies. Therefore ontology alignment is an area of active research, especially scaling ontology alignment, as the number and
size of ontologies increase dramatically.
This thesis describes an approach and implementation of ontology alignment called
Parallel Ontology Bridge, which maintains good alignment quality while increasing scala-
bility and speed of ontology alignment by matching linguistic and structural features in a
support vector machine. This approach is based on Ontology Bridge [1] and provides the
same advantages. It is able to handle non-equivalence relationships very effectively and is
a general approach to ontology alignment that can be used across many domains. Parallel
Ontology Bridge increases scalability by using map-reduce, an approach to breaking down
problems and running them in parallel. This thesis describes how this is done. Parallel
Ontology Bridge is almost two orders of magnitude faster than Ontology Bridge and shows
very good scalability while maintaining quality as measured through F-Measure.
The results of Parallel Ontology Bridge are compared against several other scalability
approaches, both with experimental data and theoretical maximum scalability. Parallel
Ontology Bridge is significantly more scalable in the experimental data and maintains this
advantage during theoretical analysis.
To my Montessori teacher, who always knew the joy of learning and understanding.
Acknowledgements
I’d like to acknowledge all the people who have positively affected the creation of this
thesis: my employer, The MITRE Corporation, my coworkers, my advisory committee and
my family.
I’d like to especially thank Dr. Suzette Stoutenburg. Without her help and previous
work in this area I would not have been able to create this thesis.
I’d especially like to thank my mother, Irene Freckleton and father, Grover Freckleton
for their emotional support as well as deep discussions on the concepts of ontology align-
ment and graphical presentation. I’d also like to thank my Aunt Karen, whose excitement
was infectious and whose way with words helped make this thesis succinct and clear.
My friend, Dr. Gregory Plett gave me incomparable help and advice with typesetting.
Tim Flink, my friend and fellow graduate student, saw architectural issues I was blind to.
My friend and colleague Dr. Norman Facas gave unparalleled advice on organization
and the appropriate layout of graphs and data.
Thank you Dr. Lisa Hines, for giving me one-on-one attention to get up to speed on
biology and medicine. Dr. Charlie Shub, thank you for your continued support and fo-
cus. Your mentorship during my undergraduate studies prepared me for this thesis and my
professional career.
Finally, I’d like to thank my advisor Dr. Jugal Kalita. His expertise in artificial intel-
ligence has been unparalleled.
Without their assistance, feedback and support this would not have been possible to complete.
It’s been a long, sometimes stressful journey on this path of knowledge. I appreciate all that
you’ve done for me.
Thank you.
Table of Contents
1 Background on Ontologies and
Ontology Alignment 1
1.1 Purpose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Ontologies Used . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Example Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.5 Scaling Ontology Alignment . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.6 Achievements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.7 Focus of This Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.8 Background and Organization of Thesis . . . . . . . . . . . . . . . . . . . 8
2 Motivation 9
3 Survey of the State of the Art in
Ontology Alignment 12
3.1 Developments in the State of the Art . . . . . . . . . . . . . . . . . . . . . 13
3.2 Approach to Comparison and Analysis . . . . . . . . . . . . . . . . . . . . 13
3.3 Comparing Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.4 Comparison of Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.4.1 LOOM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.4.2 AROMA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.4.3 SOBOM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.4.4 Falcon AO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.4.5 Stoutenburg Ontology Bridge . . . . . . . . . . . . . . . . . . . . 17
3.4.5.1 Branch and Bound Approach . . . . . . . . . . . . . . . 18
3.5 Survey Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4 Definitions 19
4.1 Performance Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.2 Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.3 Function Primitives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.4 Scaling Nomenclature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.5 Ontology Nomenclature . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
5 The Strategic Approach
of Parallel Ontology Bridge 23
5.1 Test Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5.1.1 Platelet Activation . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5.1.2 Mannose Binding . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
5.1.3 Immune System . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
5.1.4 Phenylalanine Conversion . . . . . . . . . . . . . . . . . . . . . . 24
5.1.5 Bone Remodeling . . . . . . . . . . . . . . . . . . . . . . . . . . 24
5.1.6 Bone Marrow . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.1.7 Osteoblast Differentiation . . . . . . . . . . . . . . . . . . . . . . 25
5.1.8 Osteoclast Differentiation . . . . . . . . . . . . . . . . . . . . . . 25
5.1.9 Behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.1.10 Circadian Rhythm . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.2 Attempts to Enhance Alignment . . . . . . . . . . . . . . . . . . . . . . . 25
5.2.1 Parallel Human Computation . . . . . . . . . . . . . . . . . . . . . 26
5.2.2 Information Entropy and Morpheme Based Extraction . . . . . . . 26
5.3 Summary of Parallel Ontology Bridge . . . . . . . . . . . . . . . . . . . . 27
5.3.1 MapReduce . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
5.4 Aligning Ontologies with MapReduce . . . . . . . . . . . . . . . . . . . . 31
5.5 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.6 Comparison to Ontology Bridge . . . . . . . . . . . . . . . . . . . . . . . 33
5.7 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.8 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.8.1 Issues With Java Implementation . . . . . . . . . . . . . . . . . . . 38
5.9 Parallel Ontology Bridge F-Measure . . . . . . . . . . . . . . . . . . . . . 39
5.10 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
6 Scalability Results 41
6.1 Scalability Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
6.2 Scalability Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
6.3 Other Systems Compared . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
6.3.1 Gross’s Approaches . . . . . . . . . . . . . . . . . . . . . . . . . 43
6.3.2 Zhang’s Approach . . . . . . . . . . . . . . . . . . . . . . . . . . 43
6.4 Input . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
6.5 Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
6.6 Comparison of Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . 45
6.7 Summary of Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
7 Conclusion 51
7.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
A Code Listings 54
A.1 Align Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
A.2 Parallel Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
A.3 Primitives Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 61
A.4 OpenCyc Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Bibliography 77
List of Tables
2.1 Biomedical Ontology Sizes . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.1 Results From Literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
5.1 Human Computation Experiment . . . . . . . . . . . . . . . . . . . . . . . 27
5.2 Chunking (segmenting strings based on information entropy) Results . . . . 28
5.3 Example Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . 35
5.4 Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.5 Unit Test Statement Coverage. This shows how much code in each one of
these modules is covered by automated unit tests. . . . . . . . . . . . . . . 38
5.6 Cross-Validation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
6.1 Input Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
6.2 Architecture Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
6.3 Scalability Metrics Comparison . . . . . . . . . . . . . . . . . . . . . . . 49
6.4 Speed of Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
List of Figures
1.1 Gene Ontology. The different shading represents different subdomains. . . 3
1.2 Mammalian Phenotype Ontology . . . . . . . . . . . . . . . . . . . . . . . 4
4.1 Precision and Recall
Public Domain image from WikiMedia . . . . . . . . . . . . . . . . . . . . 20
5.1 Ontology Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.2 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.3 Performance of Ontology Bridge variants at various numbers of ontology
pairs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
6.1 Other Systems Curve Fits . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
6.2 Comparison of scalability of ontology alignment approaches based on data
points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
6.3 Comparison of extrapolated scalability of ontology alignment approaches . 47
CHAPTER 1
Background on Ontologies and
Ontology Alignment
1.1 Purpose
Modern civilization is based on a dynamic and changing foundation of information. There
are 8.7 million species of lifeforms cataloged [2] as well as 19 million articles on Wikipedia
[3] in various languages and an endless number of entries in ontologies. Some of these fast-growing ontologies are in the biomedical field. An ontology is a group of concepts derived
from a corpus of knowledge.
Information about biomedicine is being updated on a daily basis [4]. At this rate
the amount of information continues to increase and it is becoming more difficult for re-
searchers to make sense of it [5]. Computer based reasoning holds great promise of a
solution to this problem of increasing information by allowing inferences and deductions
about information and knowledge. To do this economically and practically, it is necessary
to coordinate the effective development and reuse of ontologies. New tools are needed to
meet these new challenges [5].
One of these new tools is ontology alignment. Ontology alignment relates two existing
ontologies to each other. A general approach to ontology alignment is Ontology Bridge,
described by Stoutenburg in [1]. Ontology Bridge finds multiple relationships between
ontologies. Overlapping relationships and concepts are linked by semantic bridges based
on linguistic features, semantic information and structure. It gives good results, but can
be slow for large ontologies. This thesis builds on Ontology Bridge and increases the
scalability and speed of execution. The implementation of this thesis is called Parallel
Ontology Bridge.
Parallel Ontology Bridge improved upon the performance of Ontology Bridge by a
factor of 19 to 48, depending on the test, while maintaining F-Measure. There are few
approaches to scaling ontology alignment [6]. The scalability of Ontology Bridge was increased in [1] through a branch and bound optimization. In contrast, this
thesis uses parallel execution to increase scalability. The results achieved with this parallelization were good enough that adding branch and bound was judged not to be worthwhile,
since it would reduce the quality of alignment for no significant gain in speed.
There are many purposes for ontology alignment [7]. By finding relationships between
objects in disparate models, ontology alignment can be used for data integration. Agents
that use logic programming can use ontology alignment to learn about new domains and
integrate with systems. Aligned ontologies provide a basis for extracting information from
natural language documents.
Larger ontologies allow for more precise and detailed models of the world, which is
the limiting factor in many of these application areas. Ontology alignment also enables
interoperability of independently developed systems by providing an accurate shared se-
mantic vocabulary [7].
As the size of the ontologies being aligned increases, the number of computations necessary grows
quadratically. Therefore, effective approaches to scalability are paramount. The purpose of
this thesis is to provide a solution to scaling ontology alignment.
1.2 Ontologies Used
The two ontologies used in this thesis are the Gene Ontology (GO) [4] (Figure 1.1) and
Mammalian Phenotype Ontology (MP), [8] (Figure 1.2). As of the publication of this
thesis, it is estimated that the Gene Ontology consists of 32,481 OWL [9] classes. The MP
ontology consists of 6,516 ontology classes. OpenCyc, a “universal” ontology, consists
of 116,822 OWL classes.
Figure 1.1: Gene Ontology. The different shading represents different subdomains.
Figure 1.2: Mammalian Phenotype Ontology
The entirety of OpenCyc1 [10] was used as an upper ontology in Parallel Ontology
Bridge. To reduce execution time, 1,000-class subsets of the Gene Ontology and Mammalian
Phenotype Ontology were used for the majority of testing.
The Gene Ontology is a dynamic, controlled vocabulary for eukaryotic cells. It is
actively updated as daily discoveries are made. By gathering and sharing information on
the common genes and proteins, the hope is to help the grand unification of biology, which
is the understanding of all organisms. The information in GO provides strong inference to
the functions of other organisms. The goal is to enable the annotation of the genomes of all
organisms using a shared system of nomenclature and understanding.
1 http://www.cyc.com/platform/opencyc is an ontology and reasoning engine that aims to cover “common sense”. It defines concepts such as physical, temporal and conceptual entities and the relationships between them.
Individual species and organisms are not represented in GO, nor are specialized or-
gans or body parts. Instead, knowledge from GO is transferred to these specific contexts
through the use of species and anatomical databases. This transfer can be aided by ontology
alignment.
The Mammalian Phenotype Ontology provides a computationally accessible way to
annotate phenotype information to individual genotypes through a shared vocabulary to
describe concepts. Since annotations require more than simple vocabularies, it is imple-
mented as an ontology. Phenotype information tends to be complex and incomplete, so
these constraints are handled directly by MP.
Both these ontologies are continuously updated and exist in complementary, but sepa-
rate domains. Alignment of these two ontologies will enable further and better annotation
of genotype and gene expression information. Alignment will also benefit other uses of
these ontologies, such as hypothesis generation and collaboration.
1.3 Example Applications
Examples of small-scale ontology alignment include database migration and interoperabil-
ity between different systems. For example, a library that has records using the Dewey
Decimal System needs to be properly aligned to a library using a different system, such
as the Library of Congress Classification. One approach of solving this problem would be
to create an ontology describing each system and match them so that concepts could be
translated between the systems.
1.4 Issues
Often ontology alignment occurs at a small scale and is assisted by humans. This approach
is uneconomical for the large ontologies in the biomedical field. There already exist methods for matching ontologies [7]. Some use machine learning, such as AROMA [11], which
finds associations, and Codi [12], which uses Markov logic to detect similarities [13]. Oth-
ers use natural language processing, such as AgreementMaker [14], which finds lexical
similarity and FALCON-AO [15], which uses statistics on virtual documents [13]. Large
ontologies can take hours or days to complete alignment [16]. Doing ontology alignment
by human means is unfeasibly expensive and time consuming.
1.5 Scaling Ontology Alignment
Generally, the number of operations needed to align two ontologies, o1 and o2, grows
O(m · n), where m = |o1| and n = |o2|.
The number of concepts in each ontology has to be compared to concepts in the other
ontology to determine relationships. Using the sizes of ontologies described in Section 1.2,
gives a size on the order of
8 × 10⁶ · 6 × 10⁵ ≈ 5 × 10¹²
comparisons. Even if each comparison took only one microsecond, that would take ap-
proximately 60 days of processing.
Since each concept pair in the alignment can be compared independently, this is
amenable to a divide and conquer approach: first, through traditional branch and bound techniques as described in [1], and second, through standard distributed computing and parallelization
[17]. This thesis focuses on the latter technique, reducing the processing time by executing
the necessary comparisons in parallel. Parallel computation is now especially attractive
because of the large amount of cloud computing resources available. The system used in
this thesis was rented on Amazon Web Services for a total of $512.44 in cost and 172.8
hours of execution time.
1.6 Achievements
The creation of Parallel Ontology Bridge achieved several advances to the state of the art.
First, it is very scalable because it can take advantage of many processing units. The results
described in this thesis use a 16 core machine, but it is theoretically scalable up to dozens
of processing units.
Secondly, it is quite fast. Parallel Ontology Bridge can process 25,000±3000 pairs per
minute on the 16 core processor system. At this speed, it is able to complete the ontology
alignment of the GO and MP ontologies, which has 210 million possible pairs, in 5.8 days.
Due to a fault in the concurrency implementation, this entire run was not completely finished, but
enough results exist to show the scalability and validity of the approach.
The hardware used to run these tests was from Amazon’s new High Performance
Computing2 services, showing that cloud computing is an amenable fit to scaling ontology
alignment. A single instance with 16 cores was created on the Amazon Elastic Compute
Cloud, the software uploaded and installed, and the tests executed.
The method of parallelization used, dividing the problem using MapReduce [18] with
some shared resources for ontology graph lookup, is simple and unique in this domain.
The details of these results are elaborated in Chapter 6.
1.7 Focus of This Thesis
Ontology Bridge, described in [1], used feature extraction, upper ontologies and support
vector machines [19] to align ontologies. Scalability was done through a branch and bound
algorithm which sacrificed recall for speed.
Parallel Ontology Bridge, described in this thesis, uses the same approach to aligning
ontologies. However, instead of a branch and bound technique, Parallel Ontology Bridge
uses parallelism to increase speed of execution without sacrificing alignment quality.
Ontologies may have many types of relationships within them. Some items in ontolo-
gies are stated to be equivalent to each other; these are like synonyms. Non-equivalence
2 http://aws.amazon.com/hpc-applications/
relationships include all others, such as hypernymy, hyponymy, antonymy or relationships
that are ontology specific.
The algorithm described here operates on non-equivalence relationships. It can be
expanded to equivalence relationships with no loss of generality.
Stoutenburg’s work also described matching relationships defined in the ontologies.
For simplicity, these ontology-defined relationships were omitted in this work; only hyponymy relationships were used.
1.8 Background and Organization of Thesis
The work described in this thesis is a novel and practical approach to scalable ontology
alignment in the biomedical domain. First, the motivation for this work will be explained,
then the state of the art will be reviewed. Finally, the novel work will be described. The approach described in this thesis provides a scalable, fast and accurate ontology alignment
technique. This will be illustrated by examples in the biomedical domain.
CHAPTER 2
Motivation
There are many purposes for ontology alignment. According to the seminal source for
this topic, [7], some of the most interesting relate to better reasoning over ontologies and
integrating disparate data sources. Logic programming and artificial intelligence require a
good model of a domain to work effectively. As such, they are limited by the accuracy of the
ontologies they use. Better ontologies make logic programming and artificial intelligence
effective and usable.
More complete ontologies enable other applications. Some examples of these ap-
plications are machine translation, strong artificial intelligence and ontology extraction.
Ontology alignment also enables interoperability of independently developed systems.
By providing a shared semantic vocabulary, ontology alignment allows for the trans-
lation of data and information between systems. A simple example of this is two systems
with different schemas. If the schemas are matched, they can interoperate by translating
the appropriate fields.
Because ontology alignment is generally an O(n²) problem, scaling ontology align-
ment is important [1]. As described in Chapter 5, the alignment process can be easily
broken into parallel pieces. Biomedical ontologies are especially affected by scaling, since
they tend to be large. Some biomedical ontologies and their sizes are given in Table 2.1.
If a scalable approach to ontology alignment is not found, they cannot be aligned in a
reasonable amount of time [20].
Alignments must be high quality to be applied effectively. If mistakes are included
in the alignment, it will cause incorrect conclusions to be reached. Performance, scalability
and precision are some of the key measures of quality for an ontology alignment algorithm.
Performance is important because the ontologies must be aligned in a reasonable amount
of time.
Table 2.1: Biomedical Ontology Sizes

Ontology              Size
Influenza Ontology    1,368
Mammalian Phenotype   130,268
Gene Ontology         878,379
NCI Thesaurus         1,758,354
Specific use cases of ontology alignment include, but are not limited to:
• Catalog integration [7], offering products from different vendors on a single portal,
• Data integration [21], combining data sources into a single view for consumption,
• Data extraction from biomedical texts using natural language processing [5], for in-
stance, building biomedical ontologies from research papers or textbooks,
• Peer-to-peer information sharing [22], such as between online agents which solve
problems autonomously,
• Inference on biomedical information [23], like bioinformatics prediction,
• Data exchange among biomedical applications [24], for example health care databases,
• Computer reasoning with biomedical data [5], such as hypothesis generation,
• Decision support [25], such as automatic diagnostics,
• Federated databases[7], integrating multiple databases from different enterprises, and
• Encyclopedic knowledge [5][7], for example annotated Wikipedia and DBpedia.
The number of biomedical researchers interested in biomedical ontology has been rapidly
expanding and it is difficult to make sense of the new biomedical information available [5].
Practitioners hope that ontology alignment tools will be incorporated into BioPortal [26], a
website with many ontologies, and similar resources.
Ontologies could enable the large majority of data produced by the spectrum of life
sciences to be easily retrieved and understood by those working in these fields [27][5].
These benefits are similar to what happened in chemistry after the introduction of the pe-
riodic table. Scientists used the same symbols and categorizations for the elements so
researchers could understand the experiments. With ontologies, biomedical information
can similarly be understood through a universal model, allowing measurement and prediction
across biomedical sub-domains. Ontology alignment is one of the tools needed to handle
this huge task [5].
CHAPTER 3
Survey of the State of the Art in
Ontology Alignment
3.1 Developments in the State of the Art
In the past few years there have been great advances in ontology alignment. Scalability
has dramatically improved. Various research groups are increasing their efforts to align
biomedical ontologies [20].
Previously, many automatic ontology matchers took hours or days to align larger ontologies; in contrast, modern systems, such as Falcon-AO, take only minutes to complete
[6]. Fifteen different research groups took part in the OAEI1 2012 large scale ontology
tests, more than twice the number of total participants in the 2004 OAEI [28][29].
Only recently have larger scale tests, those with tens or hundreds of thousands of items
for ontology alignment, been created. This is still an area of active research and
development [29].
Over the years both the accuracy and performance of large-scale ontology alignment
have improved [28]. This is crucial because the size and number of ontologies used by
biomedical researchers continues to increase rapidly [28].
3.2 Approach to Comparison and Analysis
Ontology alignment is similar to information retrieval in that the quality is subjectively
measured; there is no perfect mathematical definition of correctness [28]. Because of this,
1 The Ontology Alignment Evaluation Initiative – an annual “competition” for ontology alignment algorithm researchers, described in [28]
there can be multiple correct alignments for two given ontologies. For the purposes of
the OAEI, it is assumed that there exists a unique and ideal reference alignment between
any two ontologies [28]. Ontologies have distinct and sometimes contradictory ways of
classifying data [30]. This makes it a challenge to align ontologies, especially when they
are designed for different purposes. For example, WordNet [31] does not relate the words
“renal” and “kidney” directly, but uses a special relationship called “pertainymy” to connect
them [30].
Various ontology alignment techniques have been tested through the Ontology Align-
ment Evaluation Initiative since 2007. Recently they have started looking at scalability
[28]. Table 3.1 shows a comparison of precision, recall and runtime for some of the algo-
rithms compared below. These algorithms generally ran on 2.0 to 3.1 GHz processors with
2 to 4 GB of RAM. These were run against the Adult Mouse Anatomy (2744 classes) [32]
and the NCI Thesaurus (3304 classes) [33] except when otherwise noted [20].
3.3 Comparing Approaches
The approaches described in Table 3.1 vary from heuristic approaches to machine learning.
Some of them are very well described, while others have details which are opaque in the
published literature [20]. In addition to the variation in approach, the ontologies and rela-
tionships used to test alignment also varied. This table gives a rough understanding of the
diversity of results, performance and approaches in ontology alignment.
Most of the state of the art systems take less than an hour to do equivalence relationship alignments on the OAEI O(3,000 · 3,000) test case. Precision tends to be much higher
than recall, ranging from 0.77 to 0.99. Recall is around 0.52 to 0.77, depending on the algorithm.
3.4 Comparison of Systems
The following subsections describe several ontology alignment projects selected for their
high accuracy, simplicity or uniqueness of approach. They represent a summary from the
state of the art for biomedical ontology alignment.
Table 3.1: Results From Literature

Project                          Precision  Recall  Runtime     Ontology Size
LOOM                             0.99       0.65    ?           O(3,000 · 3,000) [34]
AROMA                            0.775      0.678   ~1 minute   O(3,000 · 3,000) [20]
SOBOM                            0.952      0.777   19 minutes  O(3,000 · 3,000) [20]
Falcon AO                        0.964      0.591   12 minutes  O(3,000 · 3,000) [35]
Stoutenburg (super)              0.84       0.55    96 hours    O(1,000 · 1,000) [1]
Stoutenburg (sub)                0.93       0.54    96 hours    O(1,000 · 1,000) [1]
Stoutenburg (ontology defined)   0.62       0.52    96 hours    O(1,000 · 1,000) [1]
3.4.1 LOOM
Described in [34], LOOM is a simple approach to ontology alignment that uses string
normalization and string comparison, producing highly precise results with good recall.
Remarkably, this seemingly naive approach provides better results than some approaches
that use machine learning. This is likely due to the ontologies selected, the Adult Mouse
Anatomy and a part of the NCI Thesaurus. These ontologies have been going through a
process of label harmonization which increased the correlation of concepts within them2.
Its simplicity and precision make this an interesting and practical approach for align-
ing biomedical ontologies. The LOOM approach compares ontology classes using string
comparison in these two steps:
1. Normalize ontology class titles by removing all delimiters from strings (spaces and
punctuation) and normalize case.
2. Match strings approximately. Allow for a mismatch of no more than one character in
strings with length greater than four and no mismatches for shorter strings.
This heuristic (2) can be replaced with an exact string match to boost precision. In specific
instances precision was much higher than the OAEI reference alignment. Precision is the
strength of this algorithm.
2 http://oaei.ontologymatching.org/2012/anatomy/
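As a rough illustration of this two-step procedure, the following sketch (hypothetical names; it treats a mismatch as a single substituted character, whereas the original heuristic may also tolerate insertions or deletions) normalizes two class labels and compares them:

import re

def normalize(label):
    # Step 1: remove delimiters (spaces and punctuation) and normalize case.
    return re.sub(r"[^a-z0-9]", "", label.lower())

def approx_match(label_a, label_b):
    # Step 2: allow at most one mismatched character for normalized strings
    # longer than four characters; shorter strings must match exactly.
    a, b = normalize(label_a), normalize(label_b)
    if len(a) != len(b):
        return False
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches <= (1 if len(a) > 4 else 0)

print(approx_match("Platelet Activation", "platelet-activation"))  # True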
3.4.2 AROMA
AROMA [36] compares vocabularies used to describe ontologies through statistical analy-
sis. It measures the number of words used to describe concepts. If a concept, A, is described
with a subset of the words used to describe B, that implies that A is a more generic type of
B.
These relations are found through association mining [37], an unsupervised machine
learning algorithm that finds association rules. Association rules are inferences about what
concepts are found together and which concepts imply other concepts. For example, an
association rule of the form
{French fries, shakes} ⇒ Hamburger
states that when people buy French fries and shakes at a restaurant they also buy hamburgers.
These association rules are filtered using implication intensity, a measure of the num-
ber of expected and observed counter-examples [36]. This method is capable of finding
hyponymy and hypernymy relationships.
This method is incredibly fast; however, it does not have outstanding performance
for either precision or recall. It is also one of the few methods besides [38] and [1] that
describes hyponymy and hypernymy matching.
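The core subset-of-descriptions intuition can be sketched as follows; this is an illustrative simplification that omits the association-rule mining and implication-intensity filtering AROMA actually performs:

def more_generic(description_a, description_b):
    # If every word describing concept A also appears in the description of
    # concept B, treat A as the more generic concept (a hypernym candidate).
    return set(description_a.lower().split()) <= set(description_b.lower().split())

print(more_generic("platelet activation", "abnormal platelet activation"))  # True
print(more_generic("abnormal platelet activation", "platelet activation"))  # False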
3.4.3 SOBOM
SOBOM [38] uses a series of steps for alignment. Firstly, “anchor concepts” are found
between the ontologies. These are concepts that have precise equivalence and are not leaf
nodes. Using these anchors as roots, sub-ontologies are segmented and aligned. These
sub-ontologies are matched using a similarity propagation graph.
Secondly, additional semantic information is used to align non-superclass relation-
ships. The details of these non-superclass relationship matches are not clearly explained or
cited.
This method gives impressive results. The details of how this is accomplished are not
elaborated in the paper. Whether this is a domain-specific approach or one that can be used in other
areas is not known. This is one of the few methods besides [36] and [1] that describes
hyponymy and hypernymy matching.
3.4.4 Falcon AO
Falcon AO [15] combines graphical and linguistic methods for matching. First, good alignments are created for some objects; these are then expanded to match other items.
This allows for a partitioning approach to scalability, reducing the number of compar-
isons as large ontologies are aligned.
Some work has been done on using the VDoc algorithm of Falcon AO with a MapRe-
duce framework to increase scalability further [39].
3.4.5 Stoutenburg Ontology Bridge
The Stoutenburg Ontology Bridge algorithm [1] uses a combination of support vector ma-
chines (SVMs) [40], upper ontologies and natural language processing.
Pairs of concepts between the ontologies are enumerated and compared. Approxi-
mately two dozen features are extracted from each pair. These features are compared by
using a radial-basis function SVM [41] which infers what relations exists between the con-
cepts in the pair. The relationships supported are hyponymy, hypernymy and ontology
defined relations.
Using SVMs has several drawbacks. They are relatively slow and there are only a few
implementations available. Also, they require training and parameter tuning. The results
from an SVM cannot be explained. A numerical value is provided and normalized, but it
doesn’t necessarily reflect the quality of matches.
Upper ontologies enable better matching by mapping the meaning of labels to deep
semantic information. This allows for “common sense” reasoning about the ontologies as
they’re being aligned. Finding relations in these upper ontologies is often slow because of
their size and complexity. The OpenCyc project [10] software is especially complex and
memory intensive.
3.4.5.1 Branch and Bound Approach
In addition to the primary approach above, [1] describes a branch and bound algorithm for
scaling ontology alignment. This approach can trade recall for time so it allows for high
precision alignments with reduced execution time. This branch and bound algorithm relies
on the ontologies being well structured in order to select ontology pairs to compare and
align.
3.5 Survey Results
There still are only a few ontology alignment systems that handle non-equivalence relation-
ships, specifically Ontology Bridge, SOBOM and AROMA. The methods used for align-
ment have a large diversity of approaches. Some systems, such as Falcon-AO, combine
multiple approaches, which seems to give better results.
CHAPTER 4
Definitions
This chapter defines the technical terms used in this thesis.
4.1 Performance Measures
Ontology alignment uses precision, recall and F-measure to determine quality [42]. Algo-
rithms produce results which are compared against a “reference alignment”. A reference
alignment is an ontology alignment which has been verified to be correct.
In Figure 4.1, the filled in dots are all relevant pieces of information while the retrieved
information is within the oval. Errors are shown in gray.
Recall Measures how many of the relevant relationships the algorithm found. In Figure
4.1, it is denoted by R, the white oval area divided by the gray area to the left.
Precision Measures whether the alignments found are relevant. In Figure 4.1, it is denoted by
P, the white oval area divided by the gray oval area.
F-Measure Measures overall quality. It is the harmonic mean of precision and recall:
F = 2 · (precision · recall) / (precision + recall).
Figure 4.1: Precision and Recall. Public domain image from WikiMedia.
Defining these mathematically gives
Recall(A, R) = |R ∩ A| / |R|
and
Precision(A, R) = |R ∩ A| / |A|
where A is the set of alignments detected from the ontology alignment system (the dots and
circles within the oval in Figure 4.1) and R is the set of true alignments (the black dots on
the right hand side in Figure 4.1).
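A minimal sketch of these three measures, computed over sets of aligned concept pairs; the pairs shown are purely illustrative and it is assumed that at least one alignment was found and one is correct:

def evaluate(found, reference):
    # found:     set A of alignments produced by the system
    # reference: set R of alignments in the reference alignment
    true_positives = len(found & reference)
    precision = true_positives / len(found)
    recall = true_positives / len(reference)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

found = {("platelet activation", "abnormal platelet activation"),
         ("cell activation", "behavior phenotype")}
reference = {("platelet activation", "abnormal platelet activation")}
print(evaluate(found, reference))  # (0.5, 1.0, 0.666...)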
4.2 Tools
SVM Support Vector Machine [43]. This is a machine learning algorithm that creates a
high dimensional space based on many features. Using training data, it creates a
maximum margin hyperplane between the classes it is to discriminate between. For
this thesis, a two-class support vector machine is used. An SVM reports the distance
from the separating margin between the two classes; in some cases this can be used
to gauge how likely it is to be correct. For this thesis, the distance from the margin is
discarded during classification. The report from the SVM is only used to determine
which side of the margin the class is on.
Upper Ontology An ontology that describes abstract or broad terms that can be used
across contexts.
4.3 Function Primitives
map A function that takes in another function and executes it on all the items of a sequence.
map(f, [a, b, c, ...]) = [f(a), f(b), f(c), ...]
reduce A function that takes in another function and executes it on a sequence, taking the
previous result as the operand. reduce(f, [a, b, c, ...]) = f(a, f(b, f(c, ...)))
product A function that produces the Cartesian product of two sequences.
product([a, b, c, ...], [A, B, C, ...]) = [(A, a), (A, b), (A, c), (B, a), ...]
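In Python, the implementation language of this thesis, these primitives correspond to the built-in map and the standard-library functools.reduce and itertools.product; a brief illustration (note that itertools.product yields pairs in first-argument order, unlike the listing above):

from functools import reduce
from itertools import product

squares = list(map(lambda x: x * x, [1, 2, 3, 4]))  # [1, 4, 9, 16]
total = reduce(lambda a, b: a + b, squares)          # 30
pairs = list(product(["a", "b"], ["A", "B"]))        # [('a', 'A'), ('a', 'B'), ('b', 'A'), ('b', 'B')]
print(squares, total, pairs)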
4.4 Scaling Nomenclature
parallelism Running multiple calculations concurrently on separate processors to reduce
execution time.
parallelization Changing operations to work in parallel.
4.5 Ontology Nomenclature
Ontology An ontology consists of a vocabulary that describes a specific domain and the
definitions of the terms in that vocabulary in a formal manner [7]. Ontologies model
entities, assign their significances and group them based on relationships [44][30].
Ontology Alignment The process of creating a set of correspondences between ontologies
is called ontology alignment [7]. Concepts in each ontology are related to one another
by equivalence, hyponymy, hypernymy or other relations. This process is called
schema matching when it is done with format schemas instead of ontologies [7].
Hyponymy The relationship of being more specific: “Dog is a hyponym of animal.”
Hypernymy The relationship of being more general: “Animal is a hypernym of dog.”
MP Mammalian Phenotype Ontology1. This ontology covers mammalian phenotypes.
Phenotypes are the physical attributes of organisms. Most of its data is based on
experimental results from mice and rats. These experiments are on organisms that
have been selectively bred, genetically engineered or mutated to show certain traits.
GO Gene Ontology2. This ontology covers general information about gene expression,
metabolism, and cellular processes. It can be used to annotate results from bioinfor-
matic experiments. Much of the information in the Gene Ontology has come from
comparing the genomes of various organisms.
1 http://www.informatics.jax.org/searches/MP_form.shtml
2 http://www.geneontology.org/
CHAPTER 5
The Strategic Approach
of Parallel Ontology Bridge
This chapter discusses the approach to ontology alignment used in the work of this thesis.
It covers the experimental data, the algorithms and the implementation. Comparison with
Ontology Bridge from [1] is emphasized.
Parallel Ontology Bridge has the same general form as Ontology Bridge. Features are
extracted, SVMs are used and non-equivalence relations are detected. The main difference
is that Parallel Ontology Bridge executes these steps across multiple, concurrently running
jobs.
5.1 Test Data
The biomedical reference alignments generated for [1] were used in development and test-
ing. They provided the training data used throughout this thesis and were used to determine
the quality of alignment. These test cases are relatively small, with around 10 concepts
from each ontology. The hyponymy reference alignments were used for the testing of this
work. There are 10 biomedical test-cases covering various concepts and the relationships
between them. These concepts are as follows:
5.1.1 Platelet Activation
The conditions under which platelets cohere to one another and related activities, e.g. scab-
bing. This can be triggered by various events, such as the platelet encountering collagen or
other proteins. As a platelet activates, it changes shape to a more amorphous form, adheres
to other platelets and promotes coagulation reactions. This test case consists of 10 concepts
from the GO ontology such as cell activation, blood coagulation and platelet activation and
4 concepts from the MP ontology such as abnormal platelet physiology and hematopoietic
system phenotype.
5.1.2 Mannose Binding
The process of certain proteins binding to the surfaces of pathogens. Deficiency of mannose
binding is associated with higher rates of infection. This test case consists of six concepts
in GO relating to binding and 6 concepts in MP relating to immune system and protein
physiology.
5.1.3 Immune System
The system of the body which fights pathogens and foreign elements. This test had eight
concepts related to immune response in GO while the seven MP concepts relate
to immune system physiology, response and phenotype.
5.1.4 Phenylalanine Conversion
The processes of converting the amino acid phenylalanine into other amino acids, such as
tyrosine. For this test, 18 concepts from GO related to metabolic processes were selected,
while 6 concepts related to activity were selected from MP. These require semantic features
to appropriately match.
5.1.5 Bone Remodeling
The process of mature bone tissue being removed from the skeleton and new tissue being
formed in its place. This is done, respectively, by osteoclasts and osteoblasts. This test had
9 concepts related to regulation, resorption and remodeling in GO and 6 in MP related to
remodeling, physiology and increase/decrease of resorption.
5.1.6 Bone Marrow
The tissue inside of bones which creates various cells and components of blood. This test
case has 5 concepts in GO related to development and morphogenesis and 4 concepts in
MP related to development and morphology.
5.1.7 Osteoblast Differentiation
How the cells that create bone tissue are created. This test has 7 concepts from GO about
osteoblast differentiation, ossification and regulation of osteoblast differentiation. Only one
concept from MP was selected, abnormal osteoblast differentiation.
5.1.8 Osteoclast Differentiation
How the cells that destroy bone tissue are created. This test has 4 concepts from GO about
regulation of osteoclast differentiation and osteoclast differentiation. One concept from MP
was used, abnormal osteoclast differentiation.
5.1.9 Behavior
The actions of an organism in response to stimulus. The concepts in both ontologies were
generic, 4 concepts in GO related to behavior and regulation of behavior and 2 concepts in
MP: abnormal behavior and behavior phenotype.
5.1.10 Circadian Rhythm
Processes that have a daily cycle. In GO these concepts were circadian rhythm, regulation
of circadian rhythm and response to external stimulus. In MP the single concept selected
was abnormal circadian rhythm.
5.2 Attempts to Enhance Alignment
In addition to the primary thrust of increasing scalability, this thesis did some work on
novel ways to enhance alignment quality. These results are described in this section.
5.2.1 Parallel Human Computation
In human computation, sometimes called crowd-sourcing, a large number of people solve a
problem by connecting and collaborating through the Internet. A problem is broken down
into pieces that can be done in small increments by many participants.
This approach was attempted for ontology alignment. The Mechanical Turk1 service
of Amazon Web Services was used. Mechanical Turk offers a cheap crowd-sourcing plat-
form for companies and individuals to manage tasks that can be delegated to users across
the world. The labels and descriptions of potential matches were given to independent
participants who selected what relationships existed between the pairs.
These experiments gave unsatisfactory results. The results were much worse than
random selections of alignments. Only one participant out of twenty may get an alignment
correct. This is likely due to lack of expertise in the biomedical domain for Mechanical
Turk participants. Unlike in simple domains, the population does not converge on a correct
result. The details of this experiment are given in Table 5.1. Participants took less than a
minute on average to determine results, and were given a 1 cent bounty. These tasks were
based on the MP and GO ontology test cases from [1], which had full results to compare
against.
There are several approaches that may help with this. Using confusion matrices [45],
an approach to noisy classifiers, has had good results in other domains. Breaking the prob-
lem into very small tasks, which are used as input to an SVM, may be helpful. This is
similar to the approach used in [46] for crowd-sourced elaborate editing tasks and [47]
which uses crowd-sourcing for training an SVM.
5.2.2 Information Entropy and Morpheme Based Extraction
An information entropy and morpheme based extraction was also attempted. Inspired by
[34], ontology concept labels were compared, weighing the characters by their information
entropy. Based on this entropy, they were grouped into “chunks” that were compared.
Words in titles were broken into chunks based on information entropy. Entropy measures
1 https://www.mturk.com/
Table 5.1: Human Computation Experiment

Variable                  Value
Number of participants    200
Average time per task     47 seconds
Bounty per match          $0.01
Matches per participant   1
Total number of pairs     200
how informative characters or strings of characters are in predicting the rest of the title.
This was not very successful, as can be seen in Table 5.2 on the following page.
The baseline results are F-Measures for the test cases used on the approach described
by Stoutenburg [1]. Top 4 Stems shows the performance of creating a feature vector based
on the top 4 most distinct stems extracted from the titles and descriptions. Chomp shows
extractions of morphemes at various cut-off levels.
? is shown for the cases where either the retrieved documents or relevant documents
retrieved were zero. The baseline is the F-Measure based on the existing features, without
the addition of the morpheme based extraction.
5.3 Summary of Parallel Ontology Bridge
The non-optimized approach of Ontology Bridge described in [48] can be summarized as
follows:
1. For the two ontologies being aligned, the classes were paired from each ontology.
This creates a Cartesian product of classes between the two ontologies.
2. For each pair of classes, features were extracted.
(a) Features include information in upper ontologies, linguistic features of labels
and structural features of ontologies. These features are all numerical in nature.
(b) These features were normalized into feature vectors for use in a radial-basis
function SVM.
Table 5.2: Chunking (segmenting strings based on information entropy) Results

Test Case                   Baseline  Top 4 Stems  Chomp Count 1  Chomp Count 2  Chomp Count 3
Platelet Activation         0.707     ?            0.717          0.513          0.717
Mannose Binding             0.685     ?            ?              ?              ?
Immune System               0.623     0.639        ?              ?              ?
Phenylalanine Conversion    0.424     ?            0.164          0.448          0.448
Bone Remodeling             0.709     ?            ?              ?              ?
Bone Marrow                 0.692     ?            ?              0.513          ?
Osteoblast Differentiation  0.707     ?            ?              ?              0.717
Osteoclast Differentiation  0.707     ?            ?              ?              ?
Behavior                    0.700     ?            0.710          ?              0.710
Circadian Rhythm            0.700     ?            0.513          0.710          0.710
3. Based on these features, pairs that have relationships were selected using an SVM.
The SVM has two classes for each relationship: that the relationship exists or that it does
not exist.
Parallel Ontology Bridge, the implementation of this thesis, re-implements the approach
above with the additional enhancement of running feature extraction and pair selection with
an SVM on multiple processors. This dramatically increases scalability. For additional
performance, some feature calculations are optimized by storing precomputed information
in lookup tables.
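A minimal sketch of the SVM classification step using scikit-learn, the library used in the implementation; the feature vectors and labels below are illustrative, not taken from the actual training data:

from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Each row is the feature vector for one (GO class, MP class) pair;
# each label is 1 if the relationship holds, 0 otherwise (made-up values).
train_features = [[1, 4, 0, 0, 1], [0, 0, 0, 0, 0], [2, 6, 1, 1, 1], [0, 1, 0, 1, 0]]
train_labels = [1, 0, 1, 0]

scaler = MinMaxScaler()              # normalize features into the range [0, 1]
classifier = SVC(kernel="rbf")       # radial-basis function SVM
classifier.fit(scaler.fit_transform(train_features), train_labels)

candidate = [[1, 3, 0, 0, 1]]        # features extracted for a new pair
print(classifier.predict(scaler.transform(candidate)))  # e.g. [1] -> relationship exists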
5.3.1 MapReduce
Parallel Ontology Bridge is an implementation of the Ontology Bridge algorithm which
runs in parallel over multiple processing units. It does this by using MapReduce. MapRe-
duce is described in [18].
The name MapReduce comes from the two concurrency primitives used during exe-
cution. The function map takes two parameters, a function and a sequence. It runs the
function on every item in the sequence. The function map can be executed on multiple
processors with little effort. Input data is partitioned, scheduled and executed on a number
of workers. This is carried out by a master process which manages workers and assigns
tasks. Each worker receives input data and processes it using the function passed into map.
Similarly, reduce also has parameters that are a function and a sequence. But unlike map,
reduce returns a single value by executing the function over the pairs of the sequence.
Here MapReduce is illustrated with an example analysis of a sum of squares expres-
sion. The sum of squares is used when calculating various quantities in statistics. In com-
pact notation it is
S = Σ_i^N x_i².
Calculating this by hand can be done by “unrolling” the summation and doing each
operation in sequence:
Σ_i^N x_i² = x_1² + x_2² + x_3² + ... + x_N².
This expression has a few interesting properties. One, each squaring of the terms is independent of all the others. We could give each x_i² term to a separate processor and then add
them together.
The other property of note is that items can be added in any order. We could divide
the problem into pieces such as
Σ_i^N x_i² = (x_1² + x_2² + ... + x_{⌊N/2⌋}²) + (x_{⌊N/2⌋+1}² + ... + x_N²).
This means the problem can be broken into small pieces and executed concurrently
and reassembled easily with no change in the calculations that have occurred. MapReduce
does this by rewriting the problem in terms of two higher order functions. Higher order
functions are functions which take functions as arguments. For the squaring, rewriting the
sequence [x_1², x_2², x_3², ..., x_N²] with
f(x) = x²
gives
[x_1², x_2², x_3², ..., x_N²] = [f(x_1), f(x_2), f(x_3), ..., f(x_N)]
= map(f, [x_1, x_2, x_3, ..., x_N]),
an expression in terms of map.
Similarly, the addition can be rewritten in terms of reduce. With
g(a, b) = a + b,
x_1 + x_2 + x_3 + ... + x_N = g(x_1, g(x_2, g(x_3, g(..., x_N))))
= reduce(g, [x_1, x_2, x_3, ..., x_N]).
If an algorithm can be written in this form, using map and reduce, then it can
be trivially parallelized across many processors with minimal contention between shared
resources.
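As a concrete illustration, the sum of squares above can be expressed this way in Python, with the map step distributed over a pool of worker processes (a sketch, not code from the thesis implementation):

from functools import reduce
from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == "__main__":
    values = list(range(1, 1001))
    with Pool(processes=4) as pool:
        squared = pool.map(square, values)       # the map step, run in parallel
    total = reduce(lambda a, b: a + b, squared)  # the reduce step
    print(total)                                 # 333833500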
5.4 Aligning Ontologies with MapReduce
Ontology alignment can be thought of in the following manner (see Figure 5.1). Each
ontology is represented as a graph. Each concept is represented as a vertex in a graph. A
vertex in Ontology A is matched to a vertex in Ontology B. When a new relationship exists
between these two vertices, it is represented as an edge.
Figure 5.1: Ontology Alignment
Rewriting Ontology Bridge in pseudo-code gives Algorithm 5.1. Algorithm 5.1 can
be explained as follows:
Step 1 For each pair of classes between the two ontologies,
Step 2 extract features from them.
Step 3 If there is a relationship based on these features, add a match between these
classes.
This approach can be rewritten using higher order functions, such as map (Algorithm
5.2 where align_pair is described in Algorithm 5.3).
Algorithm 5.1 Naive Matching
for (c1, c2) in product(o1, o2):        # Step 1
    features = get_features(c1, c2)     # Step 2
    if has_relation(features):          # Step 3
        add_match(c1, c2)
Algorithm 5.2 Matching with higher-order functions
map(align_pair, product(o1, o2))
Algorithm 5.3 align_pair implementation
def align_pair(pair):
    c1, c2 = pair
    features = get_features(c1, c2)
    if has_relation(features):
        add_match(c1, c2)
In this work, map was implemented as a pool of processes, called “jobs”, running
on separate processors, thereby allowing parallelization of ontology alignment. These jobs
were given batches of 100 ontology class pairs to process at a time. These batches were
collected in a “master” process which pushed the jobs to waiting “worker” processes through an
inter-process communication queue. Results were similarly fed back to the master through an
inter-process communication queue.
get_features and add_match were potential bottlenecks for parallelization be-
cause they made use of shared resources.
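A simplified sketch of this arrangement follows. It stands in for the queue-based master/worker exchange with a process pool, and the get_features and has_relation functions are stubs for illustration only; the actual implementation is listed in Appendix A.

from itertools import product
from multiprocessing import Pool

BATCH_SIZE = 100

def get_features(c1, c2):
    # Stub standing in for the real feature extraction (linguistic, structural
    # and upper ontology lookups); here it only checks for a shared last word.
    return [int(c1.split()[-1] == c2.split()[-1])]

def has_relation(features):
    # Stub standing in for the SVM decision.
    return features[0] == 1

def align_batch(batch):
    # Worker job: extract features and classify each pair in the batch.
    return [(c1, c2) for c1, c2 in batch if has_relation(get_features(c1, c2))]

def align(o1, o2):
    pairs = list(product(o1, o2))
    batches = [pairs[i:i + BATCH_SIZE] for i in range(0, len(pairs), BATCH_SIZE)]
    with Pool(processes=16) as pool:     # one worker job per core
        results = pool.map(align_batch, batches)
    # The master process collects and flattens the workers' results.
    return [match for batch in results for match in batch]

if __name__ == "__main__":
    go = ["platelet activation", "cell activation"]
    mp = ["abnormal platelet activation"]
    print(align(go, mp))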
5.5 Architecture
Figure 5.2 shows the architecture. This was based on [1]. Two input ontologies are given
to the system in OWL XML format, pairs are created from these ontologies and distributed
to the worker jobs. These pairs consist of one class from each ontology. Each worker
job extracts feature primitives and determines the relationships that exist using an SVM.
Figure 5.2: Architecture
Finally, these results from all the workers are combined to create a new alignment. The
features extracted look up information in upper ontologies such as OpenCyc and WordNet.
The code is organized into small modules, simplifying the system architecture and
aiding debugging, analysis and modification.
The code as implemented is in Appendix A. This system was run on a machine with
two Intel Xeon E5-2670 Sandy Bridge processors. Each one of these processors has 8
cores, giving a total of 16 processing units available. During execution, 16 worker jobs
were created, with one “master” process coordinating data between them. This gave the best
performance, likely due to the master node taking advantage of the IO concurrency.
5.6 Comparison to Ontology Bridge
Scalability was evaluated in the same manner as [1]. Subsets of the ontologies being aligned
were selected randomly, sampling 100, 500 and 1,000 classes from each ontology. The
execution time was measured as these tests were run with various numbers of concurrent
jobs. These results were analyzed to determine the scalability of the system in Chapter 6.
The performance (i.e. time taken) of Ontology Bridge, Ontology Bridge with Branch
and Bound and Parallel Ontology Bridge is shown in Figure 5.3. Ontology Bridge’s time
increases quadratically with the number of class pairs, while Ontology Bridge with Branch
Figure 5.3: Performance of Ontology Bridge variants at various numbers of ontology pairs
and Bound does not grow as fast. The curve for Parallel Ontology Bridge is almost
flat; although it also grows quadratically, it grows at a much slower rate.
5.7 Feature Extraction
As illustrated in Figure 5.2, classes are extracted from the ontologies. Features are
extracted from these class entries and fed into a support vector machine. OpenCyc is accessed
through a cache, significantly reducing the time needed to use the OpenCyc ontology when
creating features. Software timers and calculations of alignment metrics are also integrated.
Various graphs and plain-text outputs of results were generated for diagnosis.
Features are turned into a vector that is normalized during SVM processing.
Table 5.3a shows an example ontology pair. The row “Origin Ontology” contains the ontology
the class came from, the row “Class Label” contains the label given to the class, and
                   Class A                                       Class B
Origin Ontology    GO                                            MP
Class Label        negative regulation of platelet activation    abnormal platelet activation
Parent Classes                                                   abnormal platelet physiology

(a) Ontology Pair for Feature Extraction

Primitive Name            Value
count_opencyc_synonyms    1
count_wordnet_synonym     4
has_matching_labels       0
has_same_first_word       0
has_same_last_word        1

(b) Feature Results

Table 5.3: Example Feature Extraction
the row “Parent Classes” is a comma-delimited list of superclasses of the class.
Table 5.3b shows the features extracted for these two classes. An example feature extraction
would take the ontology pair shown in Table 5.3a and turn it into the vector [1, 4, 0, 0, 1]
using the features in Table 5.3b.
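The sketch below illustrates this step with two simplified primitives; the real primitives are those listed in Table 5.4, and the labels come from Table 5.3a.

# Sketch of turning one class pair into a feature vector; the two primitives
# here are simplified stand-ins for the full set in Table 5.4.
def has_same_first_word(label1, label2):
    return int(label1.split()[0] == label2.split()[0])

def has_same_last_word(label1, label2):
    return int(label1.split()[-1] == label2.split()[-1])

label_a = "negative regulation of platelet activation"   # class from GO
label_b = "abnormal platelet activation"                  # class from MP

primitives = [has_same_first_word, has_same_last_word]
vector = [p(label_a, label_b) for p in primitives]
print(vector)   # [0, 1], matching the last two entries of Table 5.3b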
The individual vectors are made up of binary values, integers or real numbers. They
are all normalized during SVM training to be in the range [0, 1]. It is relatively easy to
create additional features. Table 5.4 has the list of features used for this thesis. To allow for
valid comparisons, these features are the same as used in Stoutenburg’s work [1]. This re-
quired porting from Java to the Python programming language and some tuning by trial and
error to find the appropriate normalization schemes. These features were originally created
based on analysis of biomedical ontologies assisted by biology and medical experts. These
features capture patterns of relationships between biomedical ontologies. For example,
labels that end with the same words, such as “platelet activation” and “abnormal platelet
activation”, are likely related by hyponymy, and labels that contain synonyms of one another
are likely to be equivalent or otherwise related. Since these features are consistent with the
heuristics developed by these experts, they are a good choice for this domain.
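As an illustration only (the exact normalization scheme was tuned by trial and error, as noted above), min-max scaling with scikit-learn maps each feature column into the [0, 1] range:

# Illustration only: one way to scale each feature column into [0, 1]; the
# thesis selected its own normalization scheme by trial and error.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1, 4, 0, 0, 1],
              [0, 2, 1, 1, 0],
              [3, 0, 0, 1, 1]], dtype=float)

X_scaled = MinMaxScaler().fit_transform(X)   # every column now lies in [0, 1]
print(X_scaled)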
Ontology Bridge and Parallel Ontology Bridge include linguistic features, such as the
number of words in the labels; structural features, for example how many child concepts a
class has; and features looked up in upper ontologies, such as how many synsets the words in
two labels share in WordNet or how many concepts they share in OpenCyc.
5.8 Implementation
The software was implemented in Python using the scikit-learn2 and joblib3 libraries. The
input ontologies were in Web Ontology Language (OWL) format. This is a W3C standard-
ized format for ontologies that can be serialized into XML and other text formats. Two
approaches were used to access upper ontologies. WordNet was accessed through the
natural language toolkit (nltk4), further described in [50]. OpenCyc, which consists of an
upper ontology in OWL format and a reasoning engine, was accessed through its ontology
file. The ontology was extracted into a graph data structure with references to the necessary
relations put into a fast lookup table. The reasoner was not used in this implementation.
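A minimal sketch of a WordNet lookup through nltk is shown below (the WordNet corpus must be downloaded first, e.g. with nltk.download); the word pair is illustrative.

# Sketch of a WordNet lookup via nltk, as used by the synonym-counting
# primitives; requires the WordNet corpus (nltk.download('wordnet')).
from nltk.corpus import wordnet

def count_shared_synsets(word1, word2):
    # number of WordNet synsets the two words have in common
    return len(set(wordnet.synsets(word1)) & set(wordnet.synsets(word2)))

print(count_shared_synsets("car", "automobile"))   # >= 1: they share car.n.01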
The scikit-learn library provided the necessary interface to libSVM, a well-performing
implementation of SVMs. It includes grid search over the parameters of the SVM kernel,
which was used to determine these constants during training.
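A minimal sketch of this kind of parameter search with scikit-learn follows; the parameter grid mirrors the one in Listing A.1, the toy data is a placeholder, and newer scikit-learn versions expose GridSearchCV from sklearn.model_selection rather than the sklearn.grid_search module used in the listings.

# Sketch of grid search over SVM kernel parameters; X and y are toy placeholders
# and the grid mirrors Listing A.1 (older scikit-learn: sklearn.grid_search).
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4], 'C': [1, 10, 100, 1000]},
              {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

X = [[0, 0], [0, 1], [1, 0], [1, 1], [4, 4], [4, 5], [5, 4], [5, 5]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)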
The joblib library provided the MapReduce-style implementation used in this thesis; the
number of jobs used during a run is a parameter to this method. In addition, joblib's
caching was used to reduce the load times of various files and of the training data.
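The caching is a thin wrapper; a minimal sketch with joblib.Memory is shown below (the function and path are illustrative, and newer joblib versions spell the cache-directory argument location rather than cachedir).

# Sketch of joblib's on-disk memoization used to avoid re-parsing files between
# runs; load_training_file and the cache directory are illustrative.
from joblib import Memory

memory = Memory("cache", verbose=0)   # thesis-era joblib: Memory(cachedir="cache")

@memory.cache
def load_training_file(path):
    # the expensive parse runs only on the first call for a given path
    with open(path) as f:
        return f.read().splitlines()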
The software used for this thesis has progressed through several iterations of develop-
ment. The final version is written in the Python programming language.
Test-driven development, writing automated test cases before production code [51],
has been used whenever possible. As such, a suite of unit tests exists for most of the
functionality of this software. Table 5.5 shows the test coverage of the various modules of
the implementation in this thesis.
2 http://scikit-learn.org/stable/
3 http://packages.python.org/joblib/
4 http://nltk.googlecode.com/svn/trunk/doc/howto/wordnet.html
Feature Primitive Name             Description
count_opencyc_hypernyms            Number of words that are hypernyms through OpenCyc
count_opencyc_hyponyms             Number of words that are hyponyms through OpenCyc
count_opencyc_synonyms             Number of words that are synonyms through OpenCyc
count_wordnet_hypernyms            Number of words that are hypernyms through WordNet
count_wordnet_hyponyms             Number of words that are hyponyms through WordNet
count_wordnet_synonym              Number of words that are synonyms through WordNet
has_matching_labels                Whether any labels match
has_opencyc_subclass_synonym       Whether there is a synonym of a concept subclass through OpenCyc
has_opencyc_superclass_hypernym    Whether there is a hypernym of a concept superclass through OpenCyc
has_opencyc_superclass_hyponym     Whether there is a hyponym of a concept superclass through OpenCyc
has_opencyc_superclass_synonym     Whether there is a synonym of a concept superclass through OpenCyc
has_opencyc_synonym                Whether any OpenCyc synonym exists
has_same_beginning                 Whether there is a shared substring at the start
has_same_ending                    Whether there is a shared substring at the end
has_same_first_word                Whether the first word is the same
has_same_label                     Whether the primary label matches
has_same_last_word                 Whether the last word is the same
has_stoilos_similarity             Stoilos similarity metric [49]
has_sub_prefix                     Starts with "sub"
has_superclass_1                   First concept has superclasses
has_superclass_2                   Second concept has superclasses
has_wordnet_hypernym               Has any hypernym in common in WordNet
has_wordnet_hyponym                Has any hyponym in common in WordNet
has_wordnet_subclass_synonym       Whether there is a synonym of a concept subclass through WordNet
has_wordnet_superclass_hyperym     Whether there is a hypernym of a concept superclass through WordNet
has_wordnet_superclass_hyponym     Whether there is a hyponym of a concept superclass through WordNet
has_wordnet_superclass_synonym     Whether there is a synonym of a concept superclass through WordNet
has_wordnet_synonym                Whether any WordNet synonym exists
is_opencyc_hypernym                Whether any OpenCyc hypernym exists
is_opencyc_hyponym                 Whether any OpenCyc hyponym exists

Table 5.4: Features
Module Name   Statements   Missed   Coverage
align         49           15       69%
import_csv    53           28       47%
opencyc       21           9        57%
primitives    245          149      39%
utils         25           13       48%
Total         393          214      54%

Table 5.5: Unit Test Statement Coverage. This shows how much code in each of these modules is covered by automated unit tests.
There are 32 tests which take 1.85 seconds to run. These tests primarily cover the
functionality of feature primitives, with a few tests for metric calculation, training data
parsing and simple scenarios of alignment.
5.8.1 Issues With Java Implementation
Initially, this software was written in Java, taking advantage of several existing libraries to
deal with ontologies. However, this proved untenable for the following reasons:
Jena This library, the industry standard for dealing with ontologies in Java, was unable to
handle the large number of ontology pairs used in this thesis; its performance was
poor. The Python equivalent used in this thesis, rdflib, easily handled large
ontologies.
WordNet This resource was difficult to incorporate with the rest of the Java software. It
uses external files in non-standard ways, which makes distribution in a jar file very
difficult, and its WordNet functions are not idempotent or reentrant, which made multi-
threading very difficult. The Python library did not run into these difficulties.
OpenCyc The author found the OpenCyc reasoning engine extremely slow to use. The
alternative, the OpenCyc OWL file, was very large and cumbersome. This was mitigated
by caching: the OpenCyc OWL file was loaded into a Python dictionary and
serialized into a b-tree-based file, which allowed very rapid lookup.
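A minimal sketch of such a persistent lookup table, using the standard-library shelve module that Listing A.4 employs, is shown below; the concept and ancestor labels are illustrative.

# Sketch of the persistent OpenCyc lookup cache; keys are concept labels and
# values are ancestor labels. The entries here are illustrative.
import shelve

db = shelve.open("opencyc_cache")          # one-time build step
db["Platelet"] = ["BloodCell", "Cell", "BiologicalLivingObject"]
db.close()

db = shelve.open("opencyc_cache")          # fast lookups at alignment time
if "Platelet" in db:
    print(db["Platelet"])
db.close()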
5.9 Parallel Ontology Bridge F-Measure
As described in [1], cross-validation was used on the reference alignments to determine
the F-measure for the hyponymy relationship (shown in Table 5.6). Ten-way cross-validation
separates the test data into 10 disjoint sets and runs tests on each one, using the remaining
9 sets of data as training. These results are generally consistent with the best results from
[1], which had an average F-Measure of 0.68 for the same test cases. This implies that
the implementation of Parallel Ontology Bridge performs as well as Ontology Bridge for
F-measure.
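As a sketch of this kind of evaluation using scikit-learn's cross-validation helpers (the thesis's own folds are driven by the ten test cases, as in Listing A.2), with placeholder data:

# Sketch of ten-fold cross-validated F-measure for an SVM classifier; the
# feature matrix X and labels y are random placeholders, not thesis data.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.rand(200, 5)                          # 200 class pairs, 5 feature primitives
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)     # stand-in hyponymy labels

scores = cross_val_score(SVC(), X, y, cv=10, scoring='f1')
print(scores.mean())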
Based on a cursory analysis of the data, it seems that more generic sub-domains, such
as Behavior and Immune System, have better results than more specific sub-domains such as
Mannose Binding and Bone Modeling.
Phenylalanine Conversion is the only test case in which several OpenCyc-related features
appear; because of this, no comparable training data is available from the other domains, so it
does significantly more poorly than the others.
Table 5.6: Cross-Validation Results
(a) Parallel Ontology Bridge
Test Case                    F-Measure
Platelet Activation          0.707
Mannose Binding              0.685
Immune System                0.623
Phenylalanine Conversion     0.424
Bone Modeling                0.709
Bone Marrow                  0.692
Osteoblast Differentiation   0.707
Osteoclast Differentiation   0.707
Behavior                     0.7
Circadian Rhythm             0.7
Average                      0.67
(b) Ontology Bridge [1]
                    F-Measure
Biomedical Test 1   0.7
Biomedical Test 2   0.69
Biomedical Test 3   0.58
Average             0.68
5.10 Contributions
This thesis implemented Parallel Ontology Bridge, a parallel method of executing the
Ontology Bridge algorithm. The algorithm is broken into map and reduce steps to run
alignments on individual classes concurrently. This approach shows good F-measure on
hyponymy relationships and is generic enough to be used with other ontology alignment
methods and features.
CHAPTER 6
Scalability Results
The scalability of a system refers to its ability to handle larger problems, sometimes by
adding additional resources. In this thesis, it refers specifically to how a system performs
as more computation resources are added.
6.1 Scalability Metrics
For computer systems, there are two major metrics that affect scalability: contention delay
and coherency delay. Contention delay is caused by sequential execution of computations,
due either to the structure of the problem or to contention over shared resources.
Coherency delay is the time required for caches and memory hierarchies to be updated with
the appropriate data. Coherency delays are always due to the implementation of the system.
These relationships are captured in Gunther's Universal Scalability Law [52], a model
for scalability that is used throughout this thesis. Gunther's Universal Scalability Law
is
\[ C(p) = \frac{p}{1 + \sigma\,(p-1) + \kappa\,p\,(p-1)} \]
where p is the number of processing units, \(\sigma\) is the contention delay and \(\kappa\) is the coherency
delay. To determine the scalability of a system, data points are fit to this model. The
maximum performance of a system is reached at \(p^{*}\) processing units, given by
\[ p^{*} = \left\lfloor \sqrt{\frac{1-\sigma}{\kappa}} \right\rfloor . \]
Adding more processes above p* either leaves performance unchanged or reduces it. Adding
processes beyond p* is counterproductive because the performance of the system falls as the
number of processors grows past p*.
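A small sketch of evaluating this model is given below; the sigma and kappa values used are the ones reported for Parallel Ontology Bridge in Table 6.3.

# Sketch of evaluating the Universal Scalability Law; sigma and kappa are the
# Parallel Ontology Bridge values from Table 6.3.
import math

def usl(p, sigma, kappa):
    return p / (1 + sigma * (p - 1) + kappa * p * (p - 1))

sigma, kappa = 1.01e-9, 0.000272
p_star = int(math.sqrt((1 - sigma) / kappa))   # floor of the square root
print(p_star)                                   # 60 processing units
print(round(usl(p_star, sigma, kappa), 1))      # ~30.6x speedup at p*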
6.2 Scalability Experiments
To analyze the scalability of the Parallel Ontology Bridge system, several experiments
were run with various numbers of computation units available. These data points were
fit to Gunther's Universal Scalability Law using least-squares regression. Least-squares
regression is a technique for fitting curves by minimizing the squared error between the data
points and the model.
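A minimal sketch of such a fit with SciPy is shown below; the measured (jobs, speedup) points are illustrative, not the actual experimental data.

# Sketch of least-squares fitting of the Universal Scalability Law to measured
# speedups; the data points below are illustrative, not the thesis measurements.
import numpy as np
from scipy.optimize import curve_fit

def usl(p, sigma, kappa):
    return p / (1 + sigma * (p - 1) + kappa * p * (p - 1))

jobs = np.array([1, 2, 4, 8, 12, 16], dtype=float)
speedup = np.array([1.0, 1.9, 3.7, 7.0, 10.1, 12.8])

(sigma_fit, kappa_fit), _ = curve_fit(usl, jobs, speedup, p0=[0.01, 0.001])
print(sigma_fit, kappa_fit)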
These data points are shown in Figure 6.1a. Similar data already existed for the other
systems compared; these data are shown in the other sub-figures of Figure 6.1. The Parallel
Ontology Bridge data come from running tests on an Amazon HPC instance, one of the many
cloud computing services offered by Amazon, with 60.5 GB of RAM and two eight-core Intel
Xeon E5-2670 "Sandy Bridge" processors.
In addition to the subsets described below, a full alignment of GO and MP was at-
tempted. This failed due to a rarely occurring contention fault: a mutable data
structure in some of the library code that was used does not have an appropriate mutex.
This full scalability test would have compared 211 million pairs in 5.8 days, but only 84
million pairs were successfully processed. Of these 84 million pairs, 8 million were shown
to have a relationship.
Since the algorithm remains unchanged from the test cases described in Section 5.9,
additional human evaluation is unnecessary. For commercial application, additional tuning
and testing is necessary, but these tests should be sufficient to show the scalability of the
approach.
6.3 Other Systems Compared
6.3.1 Gross’s Approaches
Figure 6.1c shows Gross’s [53] intra-node system performance data points and the least
squares fitted curve, while Figure 6.1b shows the same for Gross’s inter-node system [53].
Intra-node means “within one node”, in this case, one computer, while inter-node means
“between nodes”, in this case, multiple computers. Gross's intra-node approach runs various
“matchers” on ontology pairs in parallel on a single computer; matchers include string
similarity measures, structural comparisons and semantic lookups. Gross's inter-node sys-
tem takes the same approach, except that it runs across multiple machines
in a distributed system. The paper describes only the infrastructure to run the matchers, not
to combine their outputs.
This system ran on the Adult Mouse Anatomy1 (MA) and NCI Thesaurus2 ontologies as its
test data. These ontologies are each approximately 3,000 concepts in size and are used as the
Anatomy track of the OAEI.
6.3.2 Zhang’s Approach
Zhang's approach is a Hadoop-based system [39]. Its performance data points and least-
squares fitted curve are shown in Figure 6.1d. This approach constructs “virtual docu-
ments” and uses a term frequency-inverse document frequency (TF-IDF) [54] metric for
ontology alignment, implemented in map-reduce. TF-IDF is a measure from document
retrieval that balances the frequency of specific words against their rarity across
documents.
This system ran on the FMA3 and GALEN4 ontologies.
1 http://www.obofoundry.org/cgi-bin/detail.cgi?id=adult_mouse_anatomy
2 http://ncit.nci.nih.gov/
3 http://sig.biostr.washington.edu/projects/fm/AboutFM.html
4 http://www.co-ode.org/galen/
Figure 6.1: Curve fits for each system: (a) Parallel Ontology Bridge, (b) Gross inter-node, (c) Gross intra-node, (d) Zhang.
6.4 Input
Table 6.1 shows the ontologies used for these scalability tests and their size. The dif-
ferent input data should not affect the scalability results, since all the values used in the
comparisons are normalized. Systems that use larger input datasets may, however, uncover
coherency delays if the approach to cache invalidation is not optimal.
The input data for Parallel Ontology Bridge was selected at random from all the pos-
sible pairs.
6.5 Hardware
Table 6.2 shows a comparison of the hardware used in each system. Zhang's Hadoop
system had the largest number of processors and the most RAM available. Gross's intra-node had
System                     Ontology                     Size
Parallel Ontology Bridge   Gene Ontology Subset         1,000
                           Mammalian Phenotype Subset   1,000
Gross Inter                GO Molecular Function        9,395
                           GO Biological Process        17,104
Gross Intra                Adult Mouse Anatomy          3,289
                           NCI Thesaurus (Anatomy)      2,737
Zhang                      FMA                          ~40,000
                           GALEN                        ~40,000
Table 6.1: Input Data
the fewest processors and the least RAM. Zhang's approach and Gross's
inter-node were both distributed systems over a network.
System                     CPUs            Cache   Memory       Architecture
Parallel Ontology Bridge   16 x 2.60 GHz   20 MB   60.5 GB      Single Computer
Gross Inter                16 x 2.66 GHz   8 MB    4 x 4 GB     Network
Gross Intra                4 x 2.66 GHz    8 MB    4 GB         Single Computer
Zhang                      40 x 2.4 GHz    12 MB   10 x 24 GB   Network
Table 6.2: Architecture Comparison
6.6 Comparison of Scalability
Figure 6.2 shows the scalability of Parallel Ontology Bridge, Gross’s intra-node, Gross’s
inter-node and Zhang’s Hadoop approaches. The ideal system is one that has no contention
or coherency delay. Parallel Ontology Bridge starts to fall below the ideal possible
scalability at more than 15 concurrent jobs. Since Parallel Ontology Bridge is the
closest to the ideal, it has the best scalability. Gross’s inter-node approach is the next clos-
est, so it shows the second best scalability. Gross’s inter-node approach performs worse
than the ideal at more than 4 concurrent jobs. Gross’s intra-node approach performs better
than Zhang’s Hadoop approach when fewer than 8 jobs are executed in parallel.
This poorer performance may be due to the lower-quality hardware used in these other
systems, or to poorer software architecture that causes more contention.
Figure 6.2: Comparison of scalability of ontology alignment approaches based on data points
Figure 6.3 shows the system behavior extrapolated beyond the experimental data. Par-
allel Ontology Bridge has maximum effectiveness at approximately 60 jobs and a 30 times
speedup. Gross's inter-node approach continues to increase in performance slowly, reaching
maximum effectiveness only with a very large number of jobs executing in parallel. Gross's
intra-node approach declines rapidly after roughly ten jobs, while Zhang's approach starts to
decline slowly at approximately 100 jobs.
The retrograde performance of Parallel Ontology Bridge is due to its coherency delay.
Since Gross's inter-node approach has significantly lower coherency delay but higher con-
tention delay, its performance continues to improve slowly as more processors are added.
Zhang's approach has poor coherency and contention delays compared to the other
approaches, which is why it does not increase in performance early on and then plateaus.
Gross's intra-node approach has better contention delay than coherency delay, which is
why it shows improvement early on but has significant retrograde performance as more
processors are added.
Figure 6.3: Comparison of extrapolated scalability of ontology alignment approaches
Table 6.3 shows the scalability metrics for these systems. Parallel Ontology Bridge
has very small contention delay and small coherency delay. In contrast, Gross’s inter-node
system has moderate contention delays and very low coherency delays. This is why Gross’s
inter-node approach does not scale as well as Parallel Ontology Bridge when running a few
jobs, but continues to scale after Parallel Ontology Bridge shows retrograde performance.
Coherency causes performance to drop as the number of jobs increases.
Gross’s inter-node and Parallel Ontology Bridge had dramatically better scalability
than the other two approaches that were compared. This better performance is evident in
both graphs, both experimentally at a small scale and theoretically at a large scale. This is
due to their better scalability metrics.
Zhang's Hadoop approach has significantly more contention delay than any of the
other systems that were compared. This is possibly due to the use of a single "housekeeping"
node which holds all of the persisted data during execution.
Gross’s intra-node approach has some contention delay and the largest coherency de-
lay. This large coherency delay is what causes it to have retrograde performance so early
on.
Parallel Ontology Bridge has the highest theoretical speedup of any system, a speedup
of 30.6 when 60 processors are used, giving an efficiency of 51%. Gross's inter-node
approach has the second-best speedup, 29.7, but only at 128,309 processors, giving the worst
efficiency of any system at 0.02%.
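Efficiency here is the speedup divided by the number of processing units that produce it:
\[ \text{efficiency}(p^{*}) = \frac{C(p^{*})}{p^{*}}, \qquad \frac{30.6}{60} \approx 51\%, \qquad \frac{29.7}{128{,}309} \approx 0.02\%. \]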
Gross’s intra-node system also has good efficiency. The efficiency of Gross’s intra-
node system and Parallel Ontology Bridge may be due to them communicating over a
system bus instead of the network (see Table 6.2).
Parallel Ontology Bridge is much faster overall than Ontology Bridge.
Table 6.4 shows the execution speed of Ontology Bridge (96 hours) versus that of Parallel
Ontology Bridge (40 minutes).
6.7 Summary of Scalability
This chapter compared results of ontology alignment at the system level. This is the only
appropriate comparison to be made with existing results, since the underlying hardware and
system architecture are different. The causes of scalability performance depend on many
factors: how the problem is broken down, the processor architecture used, how the processors
are connected, the speed of these connections, how contention is handled, and how much and
what type of memory is available.
The Parallel Ontology Bridge system showed better scalability than many similar ap-
proaches that have been published.
System                     Contention (σ)   Coherency (κ)   p*        Speedup at p*   Efficiency at p*
Parallel Ontology Bridge   1.01 × 10^-9     0.000272        60        30.6            51%
Gross Inter                0.0336           5.87 × 10^-11   128,309   29.7            0.02%
Gross Intra                1.93 × 10^-9     0.0123          9         4.77            53%
Zhang                      0.0953           0.000115        88        8.65            9.8%

Table 6.3: Scalability Metrics Comparison
Table 6.4: Speed of Execution
Project                    Runtime      Ontology Size
Ontology Bridge            96 hours     O(1,000 · 1,000) [1]
Parallel Ontology Bridge   40 minutes   O(1,000 · 1,000) [1]
CHAPTER 7
Conclusion
This work describes Parallel Ontology Bridge, an approach to ontology alignment using
support vector machines to find non-equivalence relationships that scales through the use
of parallelization. Parallel Ontology Bridge maintained the alignment quality of the previ-
ous work, Ontology Bridge, at an F-Measure of 0.67, while reducing execution time from
96 hours to 2 hours. In addition, it showed a theoretical scalability factor of 30 with an ef-
ficiency of 51%. This shows that Parallel Ontology Bridge is very scalable. The other sys-
tems compared only had a maximum scalability factor of 29 and efficiency of 9.8%. Like
Ontology Bridge, Parallel Ontology Bridge aligns ontologies by matching linguistic and
structural features in a support vector machine. However, unlike Ontology Bridge, Parallel
Ontology Bridge can be scaled across many processing units. By using MapReduce with
shared memory, Parallel Ontology Bridge offers a simple, domain-independent method of
parallelizing ontology alignment.
This work explored the use of MapReduce, human computation, information entropy
and morpheme-based extraction, and cloud computing for ontology alignment. MapReduce
proved to be a straightforward method of implementing this parallelization. It is a clear
match for the Ontology Bridge system and has been proven in industry: Google uses
MapReduce for its PageRank algorithm, which determines the order of search results in
its search engine. Parallel Ontology Bridge shows that a similar approach of investing in
parallel infrastructure can solve the ontology alignment scalability problem. This parallel
infrastructure can come from cloud providers, such as Amazon1 or RackSpace2.
Scalable ontology alignment is a necessary technology for the Semantic Web [55],
both in its full operation and for integrating with existing systems. Scalable ontology
1 http://aws.amazon.com
2 http://rackspace.com/cloud
alignment is still an unsolved problem in the field. Parallel Ontology Bridge shows a
direction, mimicking some of the same outcomes that occurred in more straightforward
document analysis and search engine development.
Parallel Ontology Bridge combines a modular design, MapReduce distribution of jobs,
caching and cloud computing to provide an effective solution to scaling ontology align-
ment. With appropriate training data, it can automatically tune the support vector machine
to optimally align ontologies, eliminating one form of manual tuning.
There were several challenges with this system. The features from Ontology Bridge
had to be ported and appropriately tuned to match the F-measure that Ontology
Bridge achieved. Tuning primarily consisted of selecting appropriate feature normalization.
The goal of Parallel Ontology Bridge was to match the F-measure of Ontology Bridge.
This thesis is the first work in the ontology alignment community to discuss the
theoretical scalability of systems. None of the other papers compared addresses it. Some
papers describe approaches similar to this one, but none went into the detail of
analyzing theoretical scalability.
Success and failure of this project were judged on two measures: scalability and F-
measure. The project succeeded on both, maintaining F-measure while dramatically in-
creasing scalability.
This thesis contributes an end-to-end parallelization technique to the ontology align-
ment community, as well as the first reported use of cloud computing resources for ontology
alignment.
The work was implemented and empirically shown to scale. The approach shows
promise for scaling ontology alignment because of its good empirical results and simplicity
of approach. There may be additional bottlenecks when using a different hardware system
that cannot be detected from the research in this thesis.
Scaling ontology alignment is still a developing field [28] which has great potential
applications in biomedicine [30].
7.1 Future Work
Ontology alignment is a key component in enhancing the research uses of medical ontolo-
gies. Ontology alignment tools need to be better integrated into the workflows of ontology
researchers. Effective research-assistance and diagnostic tools need to be developed, as
well as methods for on-demand or just-in-time processing of ontology alignment. This
would enable integrated and up-to-date ontologies to be used in biomedical research with-
out additional effort or cost on the part of the researchers. A well-aligned ontology is not
useful if it remains stagnant.
For Parallel Ontology Bridge to be used broadly, an effective approach to gathering
training data is necessary. Bootstrapping techniques for training data and the use of an
incrementally trained SVM, which would allow interactive and continuous training, offer
potential solutions to this problem.
There is future work to incorporate the results and approaches of other ontology align-
ment systems into the architecture described in this thesis. In addition, much more
testing and research is needed to compare the results of this thesis to the Ontology Alignment
Evaluation Initiative, specifically the physiology and scalability tracks. To do this, an approach
to unsupervised learning for equivalence relationships may be necessary. Unsupervised
learning does not require data to be labeled, and labeling data can be a labor-intensive
process.
Future work includes modifying Parallel Ontology Bridge to be a distributed system.
This may require further analysis and design of an approach to sharing data across multiple
processing nodes.
APPENDIX A
Code Listings
A.1 Align Implementation
Listing A.1: align.py
from __future__ import division

import sys

import rdflib
import numpy

from itertools import product, starmap
from sklearn import svm, grid_search, metrics, preprocessing


class Bridge():
    def __init__(self, name, training, features, expected=[]):
        self.relation_name = name
        self.features = features
        self.classifier = Classifier(training)
        self.expected = expected
        self.results = []
        self.new_relations = []

    def f(self, c1, c2):
        feature_vector = [f(c1, c2, self.g1, self.g2) for f in self.features]
        match = self.classifier.classify(feature_vector)
        label_1 = str(self.g1.label(c1))
        label_2 = str(self.g2.label(c2))
        as_expected = (label_1, label_2) in self.expected
        return (c1, c2, feature_vector, int(match), as_expected)

    def align(self, g1, g2):
        self.g1 = g1
        self.g2 = g2
        o1 = get_classes(g1)
        o2 = get_classes(g2)
        self.results = list(starmap(self.f, product(o1, o2)))
        self.new_relations = [(c1, c2)
                              for (c1, c2, _, match, _) in self.results]

    def test(self, test_data):
        tuples, labels = zip(*test_data)
        results = [self.classifier.classify(t) for t in tuples]
        return metrics.precision_recall_fscore_support(labels, results,
                                                       labels=[1, 0])


def progress(i, total_comparisons, found):
    sys.stdout.write('\r')
    sys.stdout.write("{:,}/{:,} comparisons made, {:,} found."
                     .format(i + 1, total_comparisons, found))
    sys.stdout.flush()


class Classifier():
    def __init__(self, data):
        tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                             'C': [1, 10, 100, 1000]},
                            {'kernel': ['linear'],
                             'C': [1, 10, 100, 1000]}]
        self.svm = grid_search.GridSearchCV(svm.SVC(), tuned_parameters)
        self.train(data)

    def classify(self, vector):
        result = self.svm.predict([vector])
        return result

    def train(self, data):
        tuples, labels = zip(*data)
        self.svm.fit(preprocessing.normalize(numpy.array(tuples, dtype=float)),
                     labels)


def get_classes(g):
    return list(g.subjects(rdflib.RDF.type, rdflib.OWL.Class))
A.2 Parallel Implementation
Listing A.2: joblib_align.py
import itertools
import pickle

import rdflib

from datetime import datetime
from rdflib import OWL, RDF
from sklearn.externals import joblib

import align
import import_csv
import primitives

memory = joblib.Memory(cachedir="cache")
import_csv.import_csv = memory.cache(import_csv.import_csv)

features = primitives.members
relation = "hyponymy"
tc_names = ['01_platelet_activation',
            '02_mannose_binding',
            '03_immune_system',
            '04_phenylalanine_conversion',
            '05_bone_modeling',
            '06_bone_marrow',
            '07_osteoblast_differentiation',
            '08_osteoclast_differentiation',
            '09_behavior',
            '10_circadian_rhythm']
cases = [import_csv.import_csv(tc, relation, features) for tc in tc_names]


@memory.cache
def get_training(tc_name):
    other_cases = (case for (tc, case) in zip(tc_names, cases)
                   if tc != tc_name)
    training = sum(other_cases, [])
    return training


def parse_ont(filename):
    g = rdflib.Graph()
    g.parse(filename)
    return g


class OntologySubstitute():
    def __init__(self, filename):
        self.ont = parse_ont(filename)

    def objects(self, subject=None, predicate=None):
        return self.ont.objects(subject, predicate)

    def subjects(self, predicate=None, object=None):
        return self.ont.subjects(predicate, object)

    def label(self, subject, default=''):
        return self.ont.label(subject, default)


def grouper(n, iterable):
    "Collect data into fixed-length chunks or blocks"
    # grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return (itertools.ifilter(None, g)
            for g in itertools.izip_longest(fillvalue=None, *args))


if __name__ == '__main__':
    tc_name = 'all'
    training = get_training(tc_name)
    g1 = OntologySubstitute("test_cases/" + tc_name + "/GOTestCase.owl")
    g2 = OntologySubstitute("test_cases/" + tc_name + "/MPTestCase.owl")
    g1_classes = g1.subjects(RDF.type, OWL.Class)
    g2_classes = g2.subjects(RDF.type, OWL.Class)
    pairs = itertools.product(g1_classes, g2_classes)
    c = align.Classifier(training)

    def f(c1, c2):
        feature_vector = [f(c1, c2, g1, g2) for f in features]
        match = c.classify(feature_vector)
        label_1 = str(g1.label(c1))
        label_2 = str(g2.label(c2))
        return (c1, c2, int(match))

    start = datetime.now()
    jobs = joblib.Parallel(n_jobs=16, verbose=1, pre_dispatch='100*n_jobs')
    for subset in grouper(25000, pairs):
        results = jobs(joblib.delayed(f)(c1, c2)
                       for (c1, c2) in subset)
    end = datetime.now()
    print "time elapsed", str(end - start)
    for a, b, m in results:
        print str(a), str(b), m
A.3 Primitives Implementation
Listing A.3: primitives.py
" " "
E v i d i e n c e p r i m i t i v e e x t r a c t o r s , a l l f o l l o w t h e g e n e r i c form
o f
f ( c1 , c2 , o1 , o2 )
where c1 i s a c l a s s from o n t o l o g y o1 and c2 i s a c l a s s from
o n t o l o g y o2 .
" " "
import i n s p e c t
import s y s
import s h e l v e
import opencyc
62 Appendix A. Code Listings
import r d f l i b
from i t e r t o o l s import p r o d u c t , s t a r m a p
from c o n t e x t l i b import c l o s i n g
from n l t k . c o r p u s import wordne t
from r d f l i b import RDFS
from u t i l s import g e t _ l a b e l s
r d f l i b . p l u g i n . r e g i s t e r ( ’ t e x t / xml ’ , r d f l i b . p l u g i n . P a r s e r ,
’ r d f l i b . p l u g i n s . p a r s e r s . r d f xm l ’ , ’
RDFXMLParser ’ )
opencyc_db = opencyc . OpenCyc ( )
def h a s _ s a m e _ l a b e l ( c1 , c2 , o1 , o2 ) :
re turn o1 . l a b e l ( c1 , d e f a u l t =None ) == o2 . l a b e l ( c2 ,
d e f a u l t =None ) != None
def count_wordnet_synonym ( c1 , c2 , o1 , o2 ) :
c o u n t = 0
f o r ( w1 , w2 ) in g e t _ w o r d _ p a i r s ( c1 , c2 , o1 , o2 ) :
c o u n t += l e n ( s e t ( wordne t . s y n s e t s ( w1 ) ) & s e t ( wordne t .
s y n s e t s ( w2 ) ) )
re turn c o u n t
def coun t_wordne t_hypernyms ( c1 , c2 , o1 , o2 ) :
c o u n t = 0
f o r ( w1 , w2 ) in g e t _ w o r d _ p a i r s ( c1 , c2 , o1 , o2 ) :
A.3. Primitives Implementation 63
f o r s2 in wordne t . s y n s e t s ( w2 ) :
c o u n t += sum ( s1 in hypernyms ( s2 , 5 ) f o r s1 in
wordne t . s y n s e t s ( w1 ) )
re turn c o u n t
def count_wordnet_hyponyms ( c1 , c2 , o1 , o2 ) :
re turn F a l s e
def coun t_opencyc_hypernyms ( c1 , c2 , o1 , o2 ) :
c o u n t = 0
f o r ( w1 , w2 ) in g e t _ w o r d _ p a i r s ( c1 , c2 , o1 , o2 ) :
c o u n t += w1 in a n c e s t o r s ( w2 )
re turn c o u n t
def count_opencyc_hyponyms ( c1 , c2 , o1 , o2 ) :
c o u n t = 0
f o r ( w1 , w2 ) in g e t _ w o r d _ p a i r s ( c1 , c2 , o1 , o2 ) :
c o u n t += w2 in a n c e s t o r s ( w1 )
re turn c o u n t
def h a s _ s a m e _ b e g i n n i n g ( c1 , c2 , o1 , o2 ) :
re turn any ( l 1 . s t a r t s w i t h ( l 2 ) or l 2 . s t a r t s w i t h ( l 2 )
f o r ( l1 , l 2 ) in l a b e l _ p a i r s ( c1 , c2 , o1 , o2 ) )
def has_same_end ing ( c1 , c2 , o1 , o2 ) :
re turn any ( l 1 [ : : � 1 ] . s t a r t s w i t h ( l 2 [ : : � 1 ] ) or l 2 [ : : � 1 ] .
s t a r t s w i t h ( l 2 [ : : � 1 ] )
f o r ( l1 , l 2 ) in l a b e l _ p a i r s ( c1 , c2 , o1 , o2 ) )
def synonym ( w1 , w2 ) :
64 Appendix A. Code Listings
re turn w1 == w2 and w1 in opencyc_db
def count_opencyc_synonyms ( c1 , c2 , o1 , o2 ) :
re turn sum ( synonym ( w1 , w2 ) f o r ( w1 , w2 ) in
g e t _ w o r d _ p a i r s ( c1 , c2 , o1 , o2 ) )
def h a s _ s a m e _ f i r s t _ w o r d ( c1 , c2 , o1 , o2 ) :
re turn any ( l 1 . s p l i t ( ) [ 0 ] == l 2 . s p l i t ( ) [ 0 ]
f o r ( l1 , l 2 ) in l a b e l _ p a i r s ( c1 , c2 , o1 , o2 ) )
def h a s _ s a m e _ l a s t _ w o r d ( c1 , c2 , o1 , o2 ) :
re turn any ( l 1 . s p l i t ( ) [�1] == l 2 . s p l i t ( ) [�1]
f o r ( l1 , l 2 ) in l a b e l _ p a i r s ( c1 , c2 , o1 , o2 ) )
def h a s _ m a t c h i n g _ l a b e l s ( c1 , c2 , o1=None , o2=None ) :
re turn any ( l 1 == l 2 f o r ( l1 , l 2 ) in l a b e l _ p a i r s ( c1 , c2 ,
o1 , o2 ) )
def h a s _ s u b _ p r e f i x ( c1 , c2 , o1 , o2 ) :
re turn any ( s t a r m a p ( s u b _ p r e f i x , g e t _ w o r d _ p a i r s ( c1 , c2 , o1
, o2 ) ) )
def h a s _ s u p e r c l a s s _ 1 ( c1 , c2 , o1 , o2 ) :
re turn boo l ( l i s t ( g e t _ p a r e n t s ( c1 , o1 ) ) )
def h a s _ s u p e r c l a s s _ 2 ( c1 , c2 , o1 , o2 ) :
re turn boo l ( l i s t ( g e t _ p a r e n t s ( c2 , o2 ) ) )
def ha s_o pen cy c_s ubc l a s s_s yno nym ( c1 , c2 , o1 , o2 ) :
f o r s u b c l a s s in s u b c l a s s e s ( c1 , o1 ) :
A.3. Primitives Implementation 65
f o r ( w1 , w2 ) in g e t _ w o r d _ p a i r s ( s u b c l a s s , c2 , o1 , o2 )
:
i f w1 == w2 and w1 in opencyc_db : re turn True
re turn F a l s e
def h a s _ o p e n c y c _ s u p e r c l a s s _ h y p e r n y m ( c1 , c2 , o1 , o2 ) :
f o r s u p e r c l a s s in g e t _ p a r e n t s ( c1 , o1 ) :
f o r ( w1 , w2 ) in g e t _ w o r d _ p a i r s ( s u p e r c l a s s , c2 , o1 ,
o2 ) :
i f w1 in a n c e s t o r s ( w2 ) : re turn True
re turn F a l s e
def h a s _ o p e n c y c _ s u p e r c l a s s _ s y n o n y m ( c1 , c2 , o1 , o2 ) :
f o r s u p e r c l a s s in g e t _ p a r e n t s ( c1 , o1 ) :
f o r ( w1 , w2 ) in g e t _ w o r d _ p a i r s ( s u p e r c l a s s , c2 , o1 ,
o2 ) :
i f w1 == w2 and w1 in opencyc_db : re turn True
re turn F a l s e
def h a s _ o p e n c y c _ s u p e r c l a s s _ h y p o n y m ( c1 , c2 , o1 , o2 ) :
f o r s u p e r c l a s s in g e t _ p a r e n t s ( c1 , o1 ) :
f o r ( w1 , w2 ) in g e t _ w o r d _ p a i r s ( s u p e r c l a s s , c2 , o1 ,
o2 ) :
i f w2 in a n c e s t o r s ( w1 ) : re turn True
re turn F a l s e
def s u b c l a s s e s ( c l s , o n t ) :
re turn o n t . s u b j e c t s (RDFS . subClassOf , c l s )
def has_wordnet_synonym ( c1 , c2 , o1 , o2 ) :
66 Appendix A. Code Listings
f o r ( w1 , w2 ) in g e t _ w o r d _ p a i r s ( c1 , c2 , o1 , o2 ) :
i f s e t ( wordne t . s y n s e t s ( w1 ) ) & s e t ( wordne t . s y n s e t s ( w2
) ) :
re turn True
re turn F a l s e
def has_wordnet_hypernym ( c1 , c2 , o1 , o2 ) :
f o r ( w1 , w2 ) in g e t _ w o r d _ p a i r s ( c1 , c2 , o1 , o2 ) :
f o r s1 in wordne t . s y n s e t s ( w1 ) :
f o r s2 in wordne t . s y n s e t s ( w2 ) :
i f s1 in hypernyms ( s2 , 5 ) : re turn True
re turn F a l s e
def has_wordnet_hyponym ( c1 , c2 , o1 , o2 ) :
f o r ( w1 , w2 ) in g e t _ w o r d _ p a i r s ( c1 , c2 , o1 , o2 ) :
f o r s1 in wordne t . s y n s e t s ( w1 ) :
f o r s2 in wordne t . s y n s e t s ( w2 ) :
i f s2 in hypernyms ( s1 , 5 ) :
re turn True
re turn F a l s e
def h a s _ w o r d n e t _ s u b c l a s s _ s y n o n y m ( c1 , c2 , o1 , o2 ) :
f o r s u b c l a s s in s u b c l a s s e s ( c1 , o1 ) :
f o r ( w1 , w2 ) in g e t _ w o r d _ p a i r s ( s u b c l a s s , c2 , o1 , o2 )
:
i f s e t ( wordne t . s y n s e t s ( w1 ) ) & s e t ( wordne t .
s y n s e t s ( w2 ) ) :
re turn True
re turn F a l s e
A.3. Primitives Implementation 67
def h a s _ w o r d n e t _ s u p e r c l a s s _ s y n o n y m ( c1 , c2 , o1 , o2 ) :
f o r s u p e r c l a s s in g e t _ p a r e n t s ( c1 , o1 ) :
f o r ( w1 , w2 ) in g e t _ w o r d _ p a i r s ( s u p e r c l a s s , c2 , o1 ,
o2 ) :
i f s e t ( wordne t . s y n s e t s ( w1 ) ) & s e t ( wordne t .
s y n s e t s ( w2 ) ) :
re turn True
re turn F a l s e
def h a s _ w o r d n e t _ s u p e r c l a s s _ h y p o n y m ( c1 , c2 , o1 , o2 ) :
f o r s u p e r c l a s s in g e t _ p a r e n t s ( c1 , o1 ) :
f o r ( w1 , w2 ) in g e t _ w o r d _ p a i r s ( s u p e r c l a s s , c2 , o1 ,
o2 ) :
f o r s1 in wordne t . s y n s e t s ( w1 ) :
f o r s2 in wordne t . s y n s e t s ( w2 ) :
i f s2 in hypernyms ( s1 , 5 ) :
re turn True
re turn F a l s e
def h a s _ w o r d n e t _ s u p e r c l a s s _ h y p e r y m ( c1 , c2 , o1 , o2 ) :
f o r s u p e r c l a s s in g e t _ p a r e n t s ( c1 , o1 ) :
f o r ( w1 , w2 ) in g e t _ w o r d _ p a i r s ( s u p e r c l a s s , c2 , o1 ,
o2 ) :
f o r s1 in wordne t . s y n s e t s ( w1 ) :
f o r s2 in wordne t . s y n s e t s ( w2 ) :
i f s1 in hypernyms ( s2 , 5 ) :
re turn True
re turn F a l s e
def hypernyms ( s y n s e t , l e v e l ) :
68 Appendix A. Code Listings
i f l e v e l == 1 :
re turn s y n s e t . hypernyms ( )
hyps = s y n s e t . hypernyms ( )
f o r s in s y n s e t . hypernyms ( ) :
hyps += hypernyms ( s , l e v e l �1)
re turn hyps
def has_opencyc_synonym ( c1 , c2 , o1 , o2 ) :
f o r ( w1 , w2 ) in g e t _ w o r d _ p a i r s ( c1 , c2 , o1 , o2 ) :
i f w1 == w2 and w1 in opencyc_db : re turn True
re turn F a l s e
def i s_opencyc_hypernym ( c1 , c2 , o1 , o2 ) :
f o r ( w1 , w2 ) in g e t _ w o r d _ p a i r s ( c1 , c2 , o1 , o2 ) :
re turn w1 in a n c e s t o r s ( w2 )
def i s_opencyc_hyponym ( c1 , c2 , o1 , o2 ) :
f o r ( w1 , w2 ) in g e t _ w o r d _ p a i r s ( c1 , c2 , o1 , o2 ) :
re turn w2 in a n c e s t o r s ( w1 )
def g e t _ p a r e n t s ( c l s , o n t ) :
re turn o n t . o b j e c t s ( c l s , RDFS . s u b C l a s s O f )
def l a b e l _ p a i r s ( c1 , c2 , o1 , o2 ) :
l a b e l s 1 = l a b e l s _ f o r ( c1 , o1 )
l a b e l s 2 = l a b e l s _ f o r ( c2 , o2 )
re turn p r o d u c t ( l a b e l s 1 , l a b e l s 2 )
def g e t _ w o r d _ p a i r s ( c1 , c2 , o1 , o2 ) :
f o r n1 , n2 in l a b e l _ p a i r s ( c1 , c2 , o1 , o2 ) :
A.3. Primitives Implementation 69
f o r ( w1 , w2 ) in p r o d u c t ( n1 . s p l i t ( ) , n2 . s p l i t ( ) ) :
y i e l d w1 , w2
def l a b e l s _ f o r ( c l s , o n t ) :
l a b e l s = [ c l s . s p l i t ( " # " ) [ �1]] i f " # " in c l s e l s e [ c l s ]
i f o n t and g e t _ l a b e l s ( c l s , o n t ) :
l a b e l s = g e t _ l a b e l s ( c l s , o n t )
re turn l a b e l s
def a n c e s t o r s ( c o n c e p t ) :
i f c o n c e p t not in opencyc_db : re turn [ ]
re turn [ c f o r c in opencyc_db [ c o n c e p t ] i f c != c o n c e p t ]
def s u b _ p r e f i x ( s1 , s2 ) :
re turn s1 == " sub " + s2
def h a s _ s t o i l o s _ s i m i l a r i t y ( c1 , c2 , o1 , o2 ) :
f o r ( l1 , l 2 ) in l a b e l _ p a i r s ( c1 , c2 , o1 , o2 ) :
re turn s t o i l o s _ s i m i l a r i t y ( l1 , l 2 )
def s t o i l o s _ s i m i l a r i t y ( s t 1 , s t 2 ) :
i f s t 1 == None or s t 2 == None : re turn �1
s1 = s t 1 . lower ( )
s2 = s t 2 . lower ( )
s1 = s1 . r e p l a c e ( ’ . ’ , ’ ’ )
s2 = s2 . r e p l a c e ( ’ . ’ , ’ ’ )
s1 = s1 . r e p l a c e ( ’ _ ’ , ’ ’ )
s2 = s2 . r e p l a c e ( ’ _ ’ , ’ ’ )
70 Appendix A. Code Listings
s1 = s1 . r e p l a c e ( ’ ’ , ’ ’ )
s2 = s2 . r e p l a c e ( ’ ’ , ’ ’ )
l 1 = l e n ( s1 )
l 2 = l e n ( s2 )
L1 = l 1
L2 = l 2
i f L1 == 0 and L2 == 0 : re turn 1
i f L1 == 0 or L2 == 0 : re turn 0
common = 0
b e s t = 2
whi le l e n ( s1 ) > 0 and l e n ( s2 ) > 0 and b e s t !=0 :
b e s t = 0
l 1 = l e n ( s1 )
l 2 = l e n ( s2 )
i = 0
j = 0
s t a r t S 2 = 0
endS2 = 0
s t a r t S 1 = 0
endS1 = 0
p=0
A.3. Primitives Implementation 71
f o r i in x r a ng e ( l 1 ) :
i f not ( l 1 � i > b e s t ) : break
j = 0 ;
whi le ( l 2 � j > b e s t ) :
k = i ;
whi le j < l 2 and s1 [ k ] != s2 [ j ] : j += 1
i f j != l 2 :
p = j
j += 1
k += 1
whi le ( j < l 2 ) and ( k < l 1 ) and s1 [ k ] ==
s2 [ j ] :
k += 1
j += 1
i f k�i > b e s t :
b e s t = k�i
s t a r t S 1 = i
endS1 = k
s t a r t S 2 = p
endS2 = j
s1 = s1 [ s t a r t S 1 : endS1 ]
s2 = s2 [ s t a r t S 2 : endS2 ]
commonal i ty = 0
scaledCommon = f l o a t (2⇤common ) / ( L1+L2 )
commonal i ty = scaledCommon ;
w i n k l e r = wink l e r Improvemen t ( s t 1 , s t 2 ,
commonal i ty ) ;
72 Appendix A. Code Listings
d i s s i m i l a r i t y = 0 ;
r e s t 1 = L1 � common ;
r e s t 2 = L2 � common ;
unmatchedS1 = max ( r e s t 1 , 0 )
unmatchedS2 = max ( r e s t 2 , 0 )
unmatchedS1 = r e s t 1 / L1
unmatchedS2 = r e s t 2 / L2
# Hamacher Produc t
suma = unmatchedS1 + unmatchedS2 ;
p r o d u c t = unmatchedS1 ⇤ unmatchedS2 ;
p = 0 . 6 ; # For 1 i t c o i n c i d e s w i t h t h e a l g e b r a i c
p r o d u c t
i f ( ( suma�p r o d u c t ) == 0 ) :
d i s s i m i l a r i t y = 0 ;
e l s e :
d i s s i m i l a r i t y = ( p r o d u c t ) / ( p+(1�p ) ⇤ ( suma�
p r o d u c t ) ) ;
# M o d i f i c a t i o n JE : r e t u r n e d n o r m a l i z a t i o n ( i n s t e a d
o f [�1 1 ] )
r e s u l t = commonal i ty � d i s s i m i l a r i t y + w i n k l e r ;
re turn ( r e s u l t +1) / 2 ;
def wink le r Improvemen t ( s1 , s2 , commonal i ty ) :
n = min ( l e n ( s1 ) , l e n ( s2 ) )
f o r i in x r a n g e ( n ) :
i f s1 [ i ] != s2 [ i ] :
A.4. OpenCyc Implementation 73
break
commonPref ixLength = min ( 4 , i ) ;
w i n k l e r = commonPref ixLength ⇤0.1⇤(1 � commonal i ty )
re turn w i n k l e r
members = [ func f o r ( name , func ) in i n s p e c t . getmembers ( s y s .
modules [ __name__ ] )
i f i n s p e c t . i s f u n c t i o n ( func ) and
( name . s t a r t s w i t h ( ’ has_ ’ ) or name . s t a r t s w i t h ( ’
co un t_ ’ ) or name . s t a r t s w i t h ( ’ i s _ ’ ) ) ]
A.4 OpenCyc Implementation
Listing A.4: opencyc.py
# OpenCyc Shelve
# Open OpenCyc
# Iterate through each concept that has a label
# For each concept that has a label, find all the ancestors up to level N
# Create a dictionary of them of the form
#   {key: [[parents], [grandparents], ...]}
#
# Phase two: Open OpenCyc and get each thing that has a label, create a
# list of the transitive closure over OWL.subClassOf
# Phase three: Open OpenCyc and parse into list of lists based on parents,
# grandparents etc. by doing a breadth-first traversal over OWL.subClassOf
import rdflib
import shelve

from rdflib import RDFS

import cPickle as pickle


class OpenCyc():
    def __init__(self):
        self.shelf = shelve.open("opencyc.shelve")

    def __contains__(self, item):
        return item.encode('UTF-8') in self.shelf

    def __getitem__(self, item):
        return self.shelf[item.encode('UTF-8')]


if __name__ == "__main__":
    g = rdflib.Graph()
    g.parse("ontologies/opencyc-2012-05-10-readable.owl")
    pickle.dump(g, open('opencyc.pickle', 'w'))
    nodes = (n for n in g.all_nodes() if g.label(n))
    s = shelve.open("opencyc.shelve")
    for n in nodes:
        key = g.label(n).encode('UTF-8')
        s[key] = [g.label(n).encode('UTF-8')
                  for n in g.transitive_objects(n, RDFS.subClassOf)
                  if g.label(n)]
    s.close()
Bibliography
[1] S. K. Stoutenburg, “Advanced ontology alignment: New methods for biomedical on-
tology alignment using non-equivalence relations,” Ph.D. dissertation, University of
Colorado at Colorado Springs, 2009.
[2] L. Sweetlove. (2011) Number of species on earth tagged at 8.7 million. Nature.
[Online]. Available: http://www.nature.com/news/2011/110823/full/news.2011.498.html
[3] (2012, September 17th) Wikipedia live statistics page. [Online]. Available:
http://stats.wikimedia.org/EN/TablesWikipediaZZ.htm#distribution
[4] T. G. O. Consortium, “Gene ontology: tool for the unification of biology,” Nature
Genetics, vol. 25, no. 1, pp. 25–29, May 2000.
[5] D. Rubin, N. Shah, and N. Noy, “Biomedical ontologies: a functional perspective,”
Briefings in Bioinformatics, vol. 9, no. 1, pp. 75–90, 2008.
[6] P. Shvaiko and J. Euzenat, “Ontology matching: state of the art and future challenges,”
IEEE Transactions on Knowledge and Data Engineering, vol. 99, 2012.
[7] J. Euzenat and P. Shvaiko, Ontology Matching. Springer, 2007.
[8] C. Smith, C. Goldsmith, J. Eppig et al., “The mammalian phenotype ontology as a
tool for annotating, analyzing and comparing phenotypic information,” Genome Biol,
vol. 6, no. 1, p. R7, 2005.
[9] D. L. McGuinness, F. Van Harmelen et al., “Owl web ontology language overview,”
W3C recommendation, vol. 10, no. 2004-03, p. 10, 2004.
[10] C. Elkan and R. Greiner, “Building large knowledge-based systems: Representation
and inference in the cyc project: Db lenat and rv guha,” Artificial Intelligence, vol. 61,
no. 1, pp. 41–52, 1993.
[11] J. David, F. Guillet, and H. Briand, “Matching directories and owl ontologies with
aroma,” in Conference on Information and Knowledge Management: Proceedings of
the 15 th ACM international conference on Information and knowledge management,
vol. 6, no. 11, 2006, pp. 830–831.
[12] J. Noessner and M. Niepert, “Codi: Combinatorial optimization for data integration–
results for oaei 2010,” Ontology Matching, p. 142, 2010.
[13] M. Hussain and S. Srivatsa, “A study of different ontology matching system,” Inter-
national Journal of Computer Applications (0975–8887) Volume, 2012.
[14] I. F. Cruz, F. P. Antonelli, and C. Stroe, “Agreementmaker: efficient matching for large
real-world schemas and ontologies,” Proceedings of the VLDB Endowment, vol. 2,
no. 2, pp. 1586–1589, 2009.
[15] N. Jian, W. Hu, G. Cheng, and Y. Qu, “Falcon-ao: Aligning ontologies with falcon,”
in In: K-Cap 2005 Workshop on Integrating Ontologies., 2005, pp. 87–93.
[16] E. Rahm, “Towards large-scale schema and ontology matching,” Schema matching
and mapping, pp. 3–27, 2011, Springer.
[17] I. Foster, Designing and Building Parallel Programs: Concepts and Tools for Parallel
Software Engineering. Addison-Wesley, 1995.
[18] J. Dean and S. Ghemawat, “Mapreduce: simplified data processing on large clusters,”
Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
[19] M. A. Hearst, S. Dumais, E. Osman, J. Platt, and B. Scholkopf, “Support vector
machines,” Intelligent Systems and their Applications, IEEE, vol. 13, no. 4, pp. 18–
28, 1998.
[20] J. Euzenat, A. Ferrara, L. Hollink, A. Isaac, C. Joslyn, V. Malaisé, C. Meil-
icke, A. Nikolov, J. Pane, M. Sabou, F. Scharffe, P. Shvaiko, V. Spiliopou-
los, H. Stuckenschmidt, O. Šváb Zamazal, V. Svátek, C. Trojahn, G. Vouros,
and S. Wang, “Results of the ontology alignment evaluation initiative 2009,”
http://eprints.biblio.unitn.it/1807/1/006.pdf.
[21] N. F. Noy, “Semantic integration: a survey of ontology-based approaches,” SIGMOD
record, vol. 33, no. 4, pp. 65–70, 2004.
[22] A. Doan, J. Madhavan, P. Domingos, and A. Halevy, “Ontology matching: A machine
learning approach,” Handbook on Ontologies in Information Systems, pp. 397–416,
2004.
[23] B. Smith, W. Ceusters, B. Klagges, J. Köhler, A. Kumar, J. Lomax, C. Mungall,
F. Neuhaus, A. L. Rector, and C. Rosse, “Relations in biomedical ontologies,”
Genome biology, vol. 6, no. 5, p. R46, 2005.
[24] A. Johnson and C. O’Donnell, “An open access database of genome-wide association
results,” BMC medical genetics, vol. 10, no. 1, p. 6, 2009.
[25] A. Gruzdz, A. Ihnatowicz, J. Siddiqi, and B. Akhgar, “Mining genes relations in mi-
croarray data combined with ontology in colon cancer automated diagnosis system,”
World Academy of Science, Engineering and Technology, vol. 16, no. 26, pp. 140–
144, 2006.
[26] N. F. Noy, N. H. Shah, P. L. Whetzel, B. Dai, M. Dorf, N. Griffith, C. Jonquet, D. L.
Rubin, M.-A. Storey, C. G. Chute et al., “Bioportal: ontologies and integrated data
resources at the click of a mouse,” Nucleic acids research, vol. 37, no. suppl 2, pp.
W170–W173, 2009.
[27] B. Smith, M. Ashburner, C. Rosse, J. Bard, W. Bug, W. Ceusters, L. J. Goldberg,
K. Eilbeck, A. Ireland, C. J. Mungall et al., “The obo foundry: coordinated evolution
of ontologies to support biomedical data integration,” Nature biotechnology, vol. 25,
no. 11, pp. 1251–1255, 2007.
[28] J. Euzenat, C. Meilicke, H. Stuckenschmidt, P. Shvaiko, and C. Trojahn, “Ontology
alignment evaluation initiative: six years of experience,” Journal on Data Semantics
XV, vol. n/a, pp. 158–192, 2011.
[29] J. Aguirre, B. Grau, K. Eckert, J. Euzenat, A. Ferrara, R. van Hague, L. Hollink,
E. Jimenez-Ruiz, C. Meilicke, A. Nikolov et al., “Results of the ontology align-
ment evaluation initiative 2012,” in Proc. 7th International Semantic Web Conference
Workshop on Ontology Matching (OM), Boston, MA, 2012, pp. 73–115.
[30] O. Bodenreider and A. Burgun, “Biomedical ontologies,” Medical Informatics, pp.
211–236, 2005.
[31] G. A. Miller et al., “Wordnet: a lexical database for english,” Communications of the
ACM, vol. 38, no. 11, pp. 39–41, 1995.
[32] T. F. Hayamizu, M. Mangan, J. P. Corradi, J. A. Kadin, M. Ringwald et al., “The adult
mouse anatomical dictionary: a tool for annotating and integrating data,” Genome
biology, vol. 6, no. 3, p. R29, 2005.
[33] N. Sioutos, S. d. Coronado, M. W. Haber, F. W. Hartel, W.-L. Shaiu, and L. W. Wright,
“Nci thesaurus: a semantic model integrating cancer-related clinical and molecular
information,” Journal of biomedical informatics, vol. 40, no. 1, pp. 30–43, 2007.
[34] A. Ghazvinian, N. F. Noy, and M. A. Musen, “Creating mappings for ontologies in
biomedicine: Simple methods work,” in AMIA Annual Symposium Proceedings. San
Francisco, CA: American Medical Informatics Association, 2009, p. 198.
[35] W. Hu, Y. Zhao, D. Li, G. Cheng, H. Wu, and Y. Qu, “Falcon-AO: results for OAEI
2007,” in Proceedings of the International Workshop on Ontology Matching, Busan,
Korea, 2007.
[36] J. David, F. Guillet, and H. Briand, “Association rule ontology matching approach,”
International Journal of Semantic Web Information Systems, vol. 3, no. 2, pp. 27–49,
2007.
[37] R. Agrawal, T. Imielinski, and A. Swami, “Mining association rules between sets of
items in large databases,” in ACM SIGMOD Record, vol. 22, no. 2. ACM, 1993, pp.
207–216.
[38] P. Xu, Y. Wang, L. Cheng, and T. Zang, “Alignment results of SOBOM for OAEI
2010,” Ontology Matching, p. 203, 2010.
[39] H. Zhang, W. Hu, and Y. Qu, “Vdoc+: a virtual document based approach for match-
ing large ontologies using mapreduce,” Journal of Zhejiang University-Science C,
vol. 13, no. 4, pp. 257–267, 2012.
[40] C. Cortes and V. Vapnik, “Support-vector networks,” Machine learning, vol. 20, no. 3,
pp. 273–297, 1995.
[41] A. J. Smola, B. Schölkopf, and K.-R. Müller, “The connection between regularization
operators and support vector kernels,” Neural Networks, vol. 11, no. 4, pp. 637–649,
1998.
[42] J. Euzenat, “Semantic precision and recall for ontology alignment evaluation,” in
Proc. 20th International Joint Conference on Artificial Intelligence (IJCAI), Hyder-
abad, India, 2007, pp. 348–353.
[43] Z. Xuegong, “Introduction to statistical learning theory and support vector machines,”
Acta Automatica Sinica, vol. 26, no. 1, pp. 32–42, 2000.
[44] R. Neches, R. Fikes, T. Finin, T. Gruber, R. Patil, T. Senator, and W. Swartout, “En-
abling technology for knowledge sharing,” AI magazine, vol. 12, no. 3, p. 36, 1991.
[45] V. Sheng, F. Provost, and P. Ipeirotis, “Get another label? improving data quality and
data mining using multiple, noisy labelers,” in Proceeding of the 14th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining. Las Vegas,
NV: ACM, 2008, pp. 614–622.
[46] M. Bernstein, G. Little, R. Miller, B. Hartmann, M. Ackerman, D. Karger, D. Crowell,
and K. Panovich, “Soylent: a word processor with a crowd inside,” in Proceedings of
the 23rd Annual ACM Symposium on User Interface Software and Technology. New
York City, NY: ACM, 2010, pp. 313–322.
[47] P. Wais, S. Lingamneni, D. Cook, J. Fennell, B. Goldenberg, D. Lubarov, D. Marin,
and H. Simons, “Towards building a high-quality workforce with mechanical turk,”
in Proceedings of Computational Social Science and the Wisdom of Crowds (NIPS),
Vancouver, BC, 2010, pp. 1–5.
[48] S. K. Stoutenburg, J. Kalita, K. Ewing, and L. M. Hines, “Scaling alignment of large
ontologies,” International journal of bioinformatics research and applications, vol. 6,
no. 4, pp. 384–401, 2010.
[49] G. Stoilos, G. Stamou, and S. Kollias, “A string metric for ontology alignment,” in
The Semantic Web–International Semantic Web Conference 2005. Galway, Ireland:
Springer, 2005, pp. 624–637.
[50] E. Loper and S. Bird, “Nltk: the natural language toolkit,” in Proceedings of
the ACL-02 Workshop on Effective tools and methodologies for teaching natural
language processing and computational linguistics - Volume 1, ser. ETMTNLP ’02.
Stroudsburg, PA, USA: Association for Computational Linguistics, 2002, pp. 63–70.
[Online]. Available: http://dx.doi.org/10.3115/1118108.1118117
[51] K. Beck, Test-driven development: by example. Addison-Wesley Professional, 2003.
[52] N. Gunther, Guerrilla capacity planning : a tactical approach to planning for highly
scalable applications and services. Berlin London: Springer, 2011.
[53] A. Gross, M. Hartung, T. Kirsten, and E. Rahm, “On matching large life science
ontologies in parallel,” in Data Integration in the Life Sciences. Springer, 2010, pp.
35–49.
[54] O. Chum, J. Philbin, and A. Zisserman, “Near duplicate image detection: min-hash
and tf-idf weighting,” in Proceedings of the British Machine Vision Conference, vol. 3,
2008, p. 4.
[55] A. Doan, J. Madhavan, R. Dhamankar, P. Domingos, and A. Halevy, “Learning to
match ontologies on the semantic web,” The VLDB Journal, vol. 12, no. 4, pp. 303–
319, 2003.