University of Manchester
School of Computer Science
Degree Programme of Advanced Computer Science
Enhanced Ontological Searching of
Medical Scientific Information
Christos Karaiskos
A dissertation submitted to The University of Manchester for the degree of
Master of Science in the Faculty of Engineering and Physical Sciences
Master’s Thesis
2013
Contents
Abstract 7
Declaration 9
Intellectual Property Statement 11
Acknowledgements 13
List of Abbreviations 15
List of Tables 17
List of Figures 19
1 Introduction 25
1.1 Problem Context . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
1.3 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
1.4 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2 Ontologies 31
2.1 Modern Ontology Definition . . . . . . . . . . . . . . . . . . . . . 31
2.2 Ontology vs. Terminology . . . . . . . . . . . . . . . . . . . . . . 33
2.3 Notable Biomedical Ontologies and Terminologies . . . . . . . . . 34
2.3.1 SNOMED CT . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.3.2 NDF-RT . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.3.3 ICD-10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.3.4 MedDRA . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.3.5 NCI Thesaurus . . . . . . . . . . . . . . . . . . . . . . . . 38
3 Similarity Metrics 39
3.1 Similarity Metric vs. Distance Metric . . . . . . . . . . . . . . . . 39
3.2 Lexical Similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2.1 Character-based Similarity Measures . . . . . . . . . . . . 41
Longest Common Substring . . . . . . . . . . . . . . . . . 41
Hamming Similarity . . . . . . . . . . . . . . . . . . . . . 41
Levenshtein Similarity . . . . . . . . . . . . . . . . . . . . 41
Jaro Similarity . . . . . . . . . . . . . . . . . . . . . . . . 42
Jaro-Winkler Similarity . . . . . . . . . . . . . . . . . . . 42
N-gram Similarity . . . . . . . . . . . . . . . . . . . . . . . 43
3.2.2 Word-based Similarity Measures . . . . . . . . . . . . . . . 43
Dice Similarity . . . . . . . . . . . . . . . . . . . . . . . . 43
Jaccard Similarity . . . . . . . . . . . . . . . . . . . . . . . 44
Cosine Similarity . . . . . . . . . . . . . . . . . . . . . . . 44
Manhattan Similarity . . . . . . . . . . . . . . . . . . . . . 44
Euclidean Similarity . . . . . . . . . . . . . . . . . . . . . 45
3.3 Ontological Semantic Similarity . . . . . . . . . . . . . . . . . . . 45
3.3.1 Intra-ontology Semantic Similarity . . . . . . . . . . . . . 45
Distance-based Metrics . . . . . . . . . . . . . . . . . . . . 45
Information-Based Metrics . . . . . . . . . . . . . . . . . . 48
Feature-Based Measures . . . . . . . . . . . . . . . . . . . 52
3.3.2 Inter-ontology Semantic Similarity . . . . . . . . . . . . . 52
4 Search Interfaces 55
4.1 Information Seeking Models . . . . . . . . . . . . . . . . . . . . . 55
4.2 Query Specification . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.3 Presentation of Search Results . . . . . . . . . . . . . . . . . . . . 60
4.4 Query Reformulation . . . . . . . . . . . . . . . . . . . . . . . . . 62
5 Requirements 65
5.1 Feature Specification . . . . . . . . . . . . . . . . . . . . . . . . . 65
6 Design 69
6.1 Stage I: Access to Medical Ontologies . . . . . . . . . . . . . . . . 69
6.1.1 Database and Table Creation . . . . . . . . . . . . . . . . 70
6.1.2 Populating the Database Tables . . . . . . . . . . . . . . . 72
6.2 Stage II: Computation of Semantic Similarity . . . . . . . . . . . 76
6.2.1 Term Neighborhoods . . . . . . . . . . . . . . . . . . . . . 76
6.2.2 Semantic Similarity Calculation . . . . . . . . . . . . . . . 77
6.3 Stage III: Interface Design Data Presentation . . . . . . . . . . . 79
6.4 Summary of Technology Choices . . . . . . . . . . . . . . . . . . . 80
7 Implementation 83
7.1 Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
7.2 Search Entry Form . . . . . . . . . . . . . . . . . . . . . . . . . . 83
7.3 Handling the Input Query . . . . . . . . . . . . . . . . . . . . . . 88
7.3.1 Typing Speed . . . . . . . . . . . . . . . . . . . . . . . . . 88
7.3.2 Querying the Database . . . . . . . . . . . . . . . . . . . . 88
7.3.3 Ranking and Grouping of Search Results . . . . . . . . . . 89
7.3.4 Return-key or Mouse-click Search . . . . . . . . . . . . . . 91
7.3.5 Auto-completion Search . . . . . . . . . . . . . . . . . . . 91
7.4 Error Correction . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
7.5 Term Information Presentation . . . . . . . . . . . . . . . . . . . 96
7.6 Navigation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
8 Evaluation 103
8.1 Testing the Failed Queries . . . . . . . . . . . . . . . . . . . . . . 103
8.2 Comparison to BioPortal Search Services . . . . . . . . . . . . . . 109
8.2.1 Auto-completion . . . . . . . . . . . . . . . . . . . . . . . 109
8.2.2 Results Ranking . . . . . . . . . . . . . . . . . . . . . . . . 111
8.2.3 Error Correction . . . . . . . . . . . . . . . . . . . . . . . 113
8.2.4 Visualization . . . . . . . . . . . . . . . . . . . . . . . . . 114
8.3 Comments from an AstraZeneca Search Specialist . . . . . . . . . 117
9 Conclusions and Future Work 121
9.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
9.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
Bibliography 123
Number of Words in the Document: 25648
University of Manchester
School of Computer Science
Degree Programme of Advanced Computer Science
ABSTRACT OF
MASTER’S THESIS
Author: Christos Karaiskos
Title: Enhanced Ontological Searching of Medical Scientific Information
Supervisors: Prof. Andrew Brass (University of Manchester)
Dr. Jennifer Bradford (AstraZeneca)
Abstract: An enormous amount of biomedical knowledge is encoded in narrative
textual format. In an attempt to discover new or hidden knowledge, extensive
research is being conducted to extract and exploit term relationships from plain
text, with the aid of technology. A common approach for the identification
of biomedical entities in plain text involves the use of ontologies, i.e., knowledge
bases which provide formal machine-understandable representations of domains
of variable specificity. In addition to term extraction, ontologies may be used
as controlled vocabularies or as a means for automatic knowledge acquisition
through their inherent inference capabilities. Visualization of the content of
ontologies is thus very important for researchers in the biomedical domain.
Unfortunately, many of these researchers find it difficult to deal with formal logic
and would prefer that ontology search interfaces completely hide any structural
or functional references to ontologies. This thesis proposes a strategy for building
a web-based ontology search application that exploits ontologies behind the
scenes, transparently to the end user, and presents relevant concept information
in such a way that searchers can quickly and successfully find what they are
looking for. The proposed search interface features various search tools for
enhanced ontological searching, including term auto-completion, error correction,
clever results ranking, and similar term visualizations based on semantic
similarity metrics. Evaluation of the developed application shows that its features
can improve enterprise-strength ontology search applications, such as BioPortal.
Keywords: search interface design, ontology hiding, biomedical ontology,
semantic similarity, usability, data integration
Declaration
No portion of the work referred to in the dissertation has been submitted in
support of an application for another degree or qualification of this or any other
university or other institute of learning.
Intellectual Property Statement
i. The author of this dissertation (including any appendices and/or schedules
to this dissertation) owns certain copyright or related rights in it (the
‘Copyright’) and he has given The University of Manchester certain rights to use
such Copyright, including for administrative purposes.
ii. Copies of this dissertation, either in full or in extracts and whether in hard
or electronic copy, may be made only in accordance with the Copyright,
Designs and Patents Act 1988 (as amended) and regulations issued under
it or, where appropriate, in accordance with licensing agreements which the
University has entered into. This page must form part of any such copies
made.
iii. The ownership of certain Copyright, patents, designs, trade marks and other
intellectual property (the ‘Intellectual Property’) and any reproductions of
copyright works in the dissertation, for example graphs and tables
(‘Reproductions’), which may be described in this dissertation, may not be owned by
the author and may be owned by third parties. Such Intellectual Property
and Reproductions cannot and must not be made available for use
without the prior written permission of the owner(s) of the relevant Intellectual
Property and/or Reproductions.
iv. Further information on the conditions under which disclosure, publication
and commercialisation of this dissertation, the Copyright and any
Intellectual Property and/or Reproductions described in it may take place is
available in the University IP Policy (see
http://documents.manchester.ac.uk/display.aspx?DocID=487), in any relevant
Dissertation restriction declarations deposited in the University Library, The
University Library’s regulations (see
http://www.manchester.ac.uk/library/aboutus/regulations) and in The
University’s Guidance for the Presentation of Dissertations.
Acknowledgements
I am deeply grateful to my supervisors, Prof. Andrew Brass (University of
Manchester) and Dr. Jennifer Bradford (AstraZeneca), for their invaluable guidance and
support throughout the duration of this project. I have greatly benefited from
experiencing the different perspectives of academia and industry, which have both
contributed to shaping the final outcome of this project.
I would like to thank Sebastian Philipp Brandt (University of Manchester),
for his suggestions on making the search application even better. Also, I would
like to express my gratitude to Julie Mitchell (AstraZeneca), for taking the time
to evaluate the application, and Paul Metcalfe (AstraZeneca), for his advice on
improving the performance and security of the application.
Finally, I would like to thank Matina for her patience and love, and my
parents, Ioannis and Stavroula, for always being there.
List of Abbreviations
AI Artificial Intelligence
AJAX Asynchronous JavaScript and XML
API Application Programming Interface
CSS Cascading Style Sheets
DAG Directed Acyclic Graph
HLGT High Level Group Term
HLT High Level Term
HTTP Hypertext Transfer Protocol
IC Information Content
ICD International Classification of Diseases
JDBC Java Database Connectivity
JSON JavaScript Object Notation
LCS Least Common Subsumer
MedDRA Medical Dictionary for Regulatory Activities
NCIT National Cancer Institute Thesaurus
NDF-RT National Drug File Reference Terminology
NHS UK National Health Service
NLP Natural Language Processing
OBO Open Biomedical Ontologies
OWL Web Ontology Language
PHP PHP: Hypertext Preprocessor
PT Preferred Term
RDF Resource Description Framework
RDF-S Resource Description Framework Schema
REST Representational State Transfer
RF2 Release Format 2
SNOMED CT Systematized Nomenclature of Medicine Clinical Terms
SNOMED RT Systematized Nomenclature of Medicine Reference
Terminology
SOC System Organ Class
UMLS Unified Medical Language System
URI Uniform Resource Identifier
URL Uniform Resource Locator
UX User Experience
VA U.S. Department of Veterans Affairs
WHO World Health Organization
XHTML Extensible HyperText Markup Language
XML Extensible Markup Language
List of Tables
5.1 Documented failed queries and suggested reasons for failure. . . . 66
5.2 Documented failed queries and suggested reasons for failure (cont.). 67
6.1 ‘Ontologies’ database table structure . . . . . . . . . . . . . . . . 71
6.2 Examples of URI formats for BioPortal RESTful services. . . . . . 73
6.3 Technology choices for the project. . . . . . . . . . . . . . . . . . 81
7.1 PHP files used in the search application. . . . . . . . . . . . . . . 85
7.2 XHTML files used in the search application. . . . . . . . . . . . . 85
7.3 CSS files used in the search application. . . . . . . . . . . . . . . 86
7.4 JavaScript files used in the search application. . . . . . . . . . . . 86
8.1 Testing previously failed queries. . . . . . . . . . . . . . . . . . . . 105
List of Figures
2.1 The structure of the MedDRA terminology comprises a fixed-depth
hierarchy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.1 The Google search engine entry form. . . . . . . . . . . . . . . 57
4.2 Facebook uses grayed-out descriptive text to help in the formula-
tion of user queries. . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.3 Bing’s search interface features a powerful dynamic search sugges-
tion, where prefixes are highlighted with grayed-out font and the
remaining text is in bold. . . . . . . . . . . . . . . . . . . . . . . 58
4.4 The Safari browser’s embedded search interface explicitly states
which queries are suggestions and which belong to the user’s recent
search history. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.5 The Firefox browser’s embedded search interface contains recent
queries on top, and separates them from suggestions using a solid
line. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.6 Google’s search results page is a typical scrollable vertical list of
captions. Metadata facets that restrict results to a particular
type of information are also present in the interface (e.g. ‘Images’
tab). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.7 Amazon’s search interface provides facets as a left panel to the
results page, helping the user dynamically refine the initial search. 62
4.8 Pubmed’s results page includes term expansion in two ways. On
the right of the screen, there is a ‘Related searches’ panel that pre-
serves the initial query and adds a new related term to it. Also,
right below the entry form there is a ‘See also’ feature which sug-
gests complete or partial modifications in the initial query. . . . . 64
6.1 A part of the XML response for the ‘get all terms’ query of Table
6.2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
6.2 The provided methods of the ontoCAT API Adamusiak et al. (2011). 75
6.3 Populating the ‘Ontologies’ database is performed with the help of
the ontoCAT API. . . . . . . . . . . . . . . . . . . . . . . . . . . 75
7.1 The organization of the files that comprise the web application.
These files are responsible for the presentation, styling and inter-
active behavior of the web application. . . . . . . . . . . . . . . . 84
7.2 The main window of the search application. The search box is
placed at the top of the screen, with central horizontal alignment.
A submit button labeled ‘Search’ is also provided, to assist users
that prefer mouse-clicking. . . . . . . . . . . . . . . . . . . . . . . 87
7.3 Once the user clicks inside the search box, the grey help message
disappears and a blinking cursor takes its place. . . . . . . . . . . 87
7.4 Terms that would otherwise appear on their own table row are
grouped under a term that better lexically matches the query, when
their semantic similarity to that term exceeds a threshold. . . . . 90
7.5 Pressing the ‘Return’ key or clicking the ‘Search’ button submits
the query to index.php and a table of search results is added to the
interface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
7.6 Part of the JSON response from performQuery.php, for the input
query ‘rash’. Each JSON object represents a term matching the
query, and contains information that can be used for its presentation. 93
7.7 Pressing any other key except ‘Return’ submits the query through
AJAX to performQuery.php and an auto-completion pop-up menu
is created from the JSON response. . . . . . . . . . . . . . . . . . 93
7.8 Error correction when input query is ‘lyng’. The closest term is
suggested, as a clickable link. . . . . . . . . . . . . . . . . . . . . 95
7.9 When the user places the mouse cursor on a circle, a tooltip imme-
diately appears, containing the full term name and the semantic
similarity score with the viewed term. . . . . . . . . . . . . . . . . 97
7.10 Presentation page for the NCIT term ‘Recurrent NSCLC’. On the
left side, the basic term information is shown, along with an XML
representation of highly similar terms. On the right side, a visual-
ization of highly similar terms is provided, using the D3 JavaScript
library. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
7.11 Presentation page for the MedDRA term ‘Rash’. The term has
very close relations with terms that are not in the hierarchy. This
is illustrated using blue color. . . . . . . . . . . . . . . . . . . . . 100
7.12 The XML representation of a term. It includes basic term infor-
mation and highly similar terms. . . . . . . . . . . . . . . . . . . 101
7.13 Help is provided through tooltips that activate on mouse-over. . . 101
8.1 The term ‘DIHS’ is not found, but this is normal, since it is not
part of any of the supported ontologies. Instead, the term ‘DIOS’
is proposed, in case the user had misspelt the query. . . . . . . . . 106
8.2 The term ‘NMDA Antagonist’ is not found, but this is normal,
since it is not part of any of the supported ontologies. No soundex
match is found, so no error corrections are suggested. . . . . . . . 106
8.3 The term ‘Hepatotoxicity’ is shown in the auto-completion dialogue. 106
8.4 The term ‘NSCLC’ is shown in the auto-completion dialogue. . . . 106
8.5 The term ‘DRESS syndrome’ is shown in the auto-completion di-
alogue. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
8.6 The query ‘LHRH’ produces two different 100%-matching results.
Unlike in the previous search application, the user can now see that
‘Gonadotropin Releasing Hormone’ is a preferred term for ‘LHRH’. 107
8.7 The results for the query ‘VEGFR’ illustrate a semantic grouping
of 4 similar terms, namely ‘VEGFR’, ‘Vascular Endothelial Growth
Factor Receptor 1’, ‘Vascular Endothelial Growth Factor Receptor
2’, ‘Vascular Endothelial Growth Factor Receptor 3’. The latter
three are grouped under the parent term. . . . . . . . . . . . . . . 108
8.8 The BioPortal interface is a simple text box, similar to this project’s
main page. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
8.9 BioPortal also offers advanced options to improve the search results. 110
8.10 Only NCIT, MedDRA and ICD9CM are chosen for searching, out
of the 353 ontologies offered by BioPortal, so that comparisons to
this project’s work are achievable. . . . . . . . . . . . . . . . . . . 111
8.11 Auto-completion pop-up menu of BioPortal NCIT widget when
the user has typed ‘nsc’. Only preferred terms are shown. The
user might be confused when seeing the term ‘Becatecarin’ in the
results, since it does not contain ‘nsc’. . . . . . . . . . . . . . . . . 112
8.12 Auto-completion pop-up menu of this project’s search application
when the user has typed ‘nsc’. . . . . . . . . . . . . . . . . . . . . 112
8.13 Searching for ‘Denatonium Benzoate’ through its preferred term
name. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
8.14 Searching for ‘Denatonium Benzoate’ through its synonym ‘THS-
839’. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
8.15 Searching for ‘Denatonium Benzoate’ through its synonym ‘WIN
16568’. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
8.16 BioPortal search results rankings for ‘nsclc’. All terms are grouped
according to the ontology they belong to, under the preferred name
of the most lexically-relevant term to the query. . . . . . . . . . . 114
8.17 This project’s search results rankings for ‘nsclc’. Terms in the re-
sults are rearranged into groups that show high semantic similarity.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
8.18 BioPortal returns no search results for the erroneously spelt term
‘nsclca’. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
8.19 BioPortal returns no search results for the erroneously spelt term
‘caancer’. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
8.20 This project’s search application returns a search suggestion of
‘nsclc’ for the erroneously spelt term ‘nsclca’. . . . . . . . . . . . 116
8.21 This project’s search application returns a search suggestion of
‘cancer’ for the erroneously spelt term ‘caancer’. . . . . . . . . . 116
8.22 BioPortal uses a graph to visualize hierarchical relations. Edges
are annotated with a description of the relationship between the
connected nodes (e.g. subclassOf). . . . . . . . . . . . . . . . . . 116
8.23 This project’s application focuses on inexperienced users and at-
tempts to completely hide any formal-logic relationships that might
confuse the user. . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
8.24 Search results depicting causal associations between smoking and
cancer, as presented by the I2E text mining application. . . . . . 118
8.25 Search results for the term ‘MEK inhibitor’ in NCIT, when the
I2E application is used. . . . . . . . . . . . . . . . . . . . . . . . . 119
Chapter 1
Introduction
Ontologies are knowledge bases which provide formal machine-understandable
representations of domains of variable specificity. Given a domain of discourse,
concepts that belong to the domain are well documented in formal logic, along
with their inter-relations. Ontologies, as representations, cannot perfectly capture
the part of the world that they attempt to describe Davis et al. (1993). They
are based on the open world assumption, which states that the absence of
something from a knowledge base does not imply that it does not exist in the
real world Hustadt et al. (1994). As our knowledge about a domain increases,
ontologies are updated and they become more complex. This has become evident
in the biomedical domain, where ontologies have already attained a high degree of
specificity, and has led to their quick adoption for data integration and knowledge
discovery purposes.
1.1 Problem Context
Within biomedicine, ontologies can help researchers communicate by promoting
consistent use of biomedical terms and concepts. The construction of an
ontology itself involves mediating across multiple views and requires that a number
of domain experts reach a consensus that reflects the diverse viewpoints of the
community. Ontologies are viewed as tools that provide opportunities for new
knowledge acquisition, due to the complex semantic relations that they model.
Inferences in a huge ontology may reveal connections that the human eye would
miss. This is especially important in the pharmaceutical sector, where the drug
discovery process has slowed down significantly, and in the biological sector,
where attempts to decipher genomic patterns associated with disease are still
at an early stage. Another common use for ontologies in the biomedical domain
is as controlled vocabularies that feed filtered terms into computer applications.
Finally, ontologies may be used to connect terms found in plain text to their
semantic representations. Term extraction with the help of ontologies is a hot
topic in biomedicine, due to the vast amounts of medical information stored in
plain text. Due to the importance of ontologies, it is usual for researchers in the
biomedical field to require access to their content.
1.2 Motivation
In the past, AstraZeneca employees were provided with a web-based search form
that enabled them to look for concepts in one or more biomedical ontologies and
select the most suitable from a list of search results. The chosen concepts were, in
turn, conveyed to a text mining application. Understanding the results required
the user to be familiar with the content and structure of the ontology from which
the terms were retrieved. Unfortunately, most users did not feel comfortable
with the idea of ontologies and struggled, or even refused, to use the provided
interfaces, even though no logic-based content was there to confuse them.
In many cases, though, this was not solely the fault of the users. The interface
gave the users freedom to select the ontologies to be searched for the specified
query. Inexperienced users usually did not know or care about which ontology
contains the desired query term. For example, a user wished to search for ‘Non-
small cell lung carcinoma’, by its abbreviation ‘NSCLC’. Querying ‘NSCLC’ in
the MedDRA terminology returned no results, since the concept is not present
in the terminology (the difference between a terminology and an ontology is
described in Section 2.2). Although this behavior is correct, it seems wrong to
the inexperienced user and may lead to loss of trust in the system.
But even if the term is present in the ontology, the user should not be forced
to know its exact spelling. For example, querying for ‘NSCLC’ in the NCIT
thesaurus also returned no results, despite the fact that the actual concept exists
in the ontology. The searcher needed to know that the preferred term for the
‘NSCLC’ concept is ‘Non-small cell lung carcinoma’. Abbreviations and dissimilar
synonyms are common in the biomedical field, so expecting the user to know the
preferred term for each concept is considered problematic.
In addition to the above, the presentation of results was not always
straightforward. Terms that demonstrated a strong semantic relation to each
other were presented as stand-alone terms in the search results, subconsciously
misleading users into deducing that the terms were independent. It was up to
the user to judge the relevance of results to the query. For example, the results
for ‘Non-small cell lung carcinoma’ in NCIT included, among others, the terms
‘Non-small cell lung carcinoma’ and ‘Stage I non-small cell lung carcinoma’
equally spaced, in a way that users could not infer the connections between
them. In fact, the latter term is a specialization of the former. In practice,
users chose all the terms, even though they were looking for the broad term,
because they became confused and did not want to risk selecting only one.
This breakdown at the human-computer interface has motivated AstraZeneca
to try to build tools that take advantage of the ontology structure and, at the
same time, completely hide it from the user in order to facilitate the search
procedure.
1.3 Contribution
The outcome of this thesis is the development of a user-friendly search
application that allows users to find information about concepts present in a
medical ontology, without requiring the users to understand the underlying structure of
the ontology. Information about a concept includes its accession code within the
given ontology, the term for its preferred name, its definition and all available
synonym terms. In order to facilitate the search procedure and enhance User
Experience (UX), the search application includes features such as dynamic term
suggestion, spelling correction and similar term visualization tools.
The main challenge lies in the presentation of results; as stated in Section 1.2,
users are usually not sure which term(s) to choose when multiple similarly-spelt
terms appear. Ranking of terms is performed with the aid of both lexical
and semantic similarity. The former screens those terms that best match the user
query and ranks them according to a string relevance metric. These results are
processed by the latter, so that terms showing a strong semantic connection are
grouped together.
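The two-stage process described above, lexical screening followed by semantic grouping, can be sketched as follows. This is an illustrative sketch, not the thesis's actual implementation: `lexical_score` (based on Python's `difflib`) stands in for the string relevance metric, and `token_overlap` is a crude stand-in for a real ontological semantic similarity measure.

```python
from difflib import SequenceMatcher

def lexical_score(query, term):
    # String relevance in [0, 1]; difflib is only a stand-in for the
    # lexical similarity metric actually used by the application.
    return SequenceMatcher(None, query.lower(), term.lower()).ratio()

def token_overlap(a, b):
    # Placeholder for an ontological semantic similarity metric
    # (e.g., a distance- or information-content-based measure).
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def rank_and_group(query, terms, threshold=0.6):
    # Stage 1: lexical screening -- rank candidates by string relevance.
    ranked = sorted(terms, key=lambda t: lexical_score(query, t), reverse=True)
    # Stage 2: semantic grouping -- attach each term to the first
    # (better-ranked) group whose representative it is similar enough to.
    groups = []
    for term in ranked:
        for group in groups:
            if token_overlap(group[0], term) >= threshold:
                group.append(term)
                break
        else:
            groups.append([term])
    return groups
```

For instance, at a threshold of 0.65 the query ‘non-small cell lung carcinoma’ groups the exact match together with ‘Stage I non-small cell lung carcinoma’, while ‘Small cell lung carcinoma’ remains a separate group.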
Ideally, the search application should bridge across terms from multiple
ontologies. Due to the diversity in the format and annotation of different
ontologies, this is not a straightforward generalization. Most importantly, within
the biomedical community, the term ‘ontology’ is often used erroneously to
describe plain terminologies that, in fact, violate basic ontological principles (in
MedDRA, for instance, the synonym of a term may be a child node of the term
itself). Therefore, ontology-specific difficulties are expected to arise if semantic
similarity measures are to be deployed.
In summary, the goals of this thesis are the following:
1. To develop user-friendly search tools that allow users to build search queries
based on the terms present in a medical ontology, without requiring the users
to understand the actual structure of the ontology.
2. To exploit the semantic annotations of the underlying ontology in order to
enhance the quality and presentation of results.
3. To intermix results originating from different ontologies.
1.4 Thesis Organization
The thesis is organized into nine chapters. Chapter 2 includes an introduction
to ontologies and a brief description of some notable biomedical ontologies.
Chapter 3 presents the background needed for understanding the different
measures of lexical and semantic similarity. Chapter 4 discusses interface design
principles for user-centered search applications. In Chapter 5, the requirements
and feature specifications for the final search application are addressed. Chapter 6
describes the design considerations that were taken into account for the
ontological search application, while Chapter 7 presents the final implementation.
Chapter 8 includes the evaluation of the search application. Finally, conclusions
are drawn in Chapter 9, along with possible future directions.
Chapter 2
Ontologies
The term ‘ontology’ is an uncountable noun that originates in philosophy, with
roots tracing back to the ancient Greek philosophers Guarino (1998). It denotes
the study of the nature of existence at a fairly abstract level. In the world of
computer science, the word
‘ontology’ refers to the encoding of human knowledge in a format that allows
for computational use. This chapter includes an introduction to the modern
definition of ontology, along with a brief description of some of the most notable
biomedical ontologies.
2.1 Modern Ontology Definition
In Artificial Intelligence (AI), an ontology is commonly defined as a specification
of a (shared) conceptualization Gruber et al. (1995). A conceptualization refers
to an individual’s knowledge about a specific domain, acquired through
‘experience, observation or introspection’ Huang et al. (2010). Ontologies are shared
conceptualizations, meaning that multiple participants, usually domain experts,
contribute to their construction, maintenance and expansion. Conflicts are
certain to arise among the different participants, so an important aspect of
ontology design is to bridge across multiple views of the desired domain into a
single concrete representation. On the other hand, a specification is a transformation of
this shared conceptualization into a formal representation language.
The outcome of a formal representation of a domain is a collection of entities,
expressions and axioms. Entities include:
• concepts or classes, which are sets of individuals (e.g., ‘Country’, which
contains all countries),
• individuals, which are specific instances of classes (e.g., ‘Greece’ as an
instance of ‘Country’),
• data types (e.g. string, integer),
• literals, which are specific values of a given data type (e.g. 1, 2, 3, or
string values),
• properties (e.g. hasDisease, hasAge).
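A minimal sketch of how these entity kinds might be held in memory is given below. All names (‘Country’, ‘Greece’, `hasCapital`) are illustrative examples, not drawn from any real ontology, and a plain dictionary stands in for a proper ontology data model.

```python
# Illustrative in-memory store mirroring the entity kinds listed above.
ontology = {
    # concepts/classes: a class name maps to the set of its individuals
    "classes": {"Country": {"Greece"}},
    # individuals: an instance maps to its asserted class
    "individuals": {"Greece": "Country"},
    # properties, with an illustrative (domain, range datatype) signature
    "properties": {"hasCapital": ("Country", "string")},
    # literal values attached to individuals through properties
    "assertions": [("Greece", "hasCapital", "Athens")],
}

def instances_of(onto, cls):
    """Return all individuals asserted to be instances of a class."""
    return {ind for ind, c in onto["individuals"].items() if c == cls}
```

A real system would instead use an RDF/OWL toolkit, but the sketch makes the class/individual/property/literal distinction concrete.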
Expressions refer to descriptions of entities in a formal representation language.
The standardized family of languages for formal ontology representation is the
Web Ontology Language (OWL), which builds on the Extensible Markup Lan-
guage (XML), Resource Description Framework (RDF) and RDF-Schema (RDF-
S) standards to provide a highly expressive means for representing knowledge
McGuinness et al. (2004). The underlying format of the resulting OWL docu-
ment can vary among several types, with the most common being RDF/XML.
Finally, axioms relate entities/expressions. This connection can be made
class-to-class (i.e. SubClassOf), individual-to-class (i.e. ClassAssertion), property-
to-property (i.e. SubPropertyOf), among others. These relations can be asserted
explicitly or inferred by a reasoner. Inferences are made based on the logical relations between concepts. As an example of a simple inference, a concept's ancestors can be inferred automatically once its parent concept is specified.
An ontology may be visualized as a graph, in which concepts are nodes and
relations are edges between nodes. Furthermore, if transitive hierarchical rela-
tions are isolated (e.g. subsumption, also known as ‘is-a’ relation or hyponymy),
the ontology can be viewed as a taxonomy. The geometrical visualization of an
ontology will be presented in more detail in chapter 3.
2.2 Ontology vs. Terminology
A terminology is a collection of term names that are associated with a given
domain. A term is a mapping of a concrete concept to natural language. This
term-to-concept mapping is usually not one-to-one, especially in the biomedical
domain where term variation and term ambiguities arise Ananiadou and Mc-
Naught (2006). Term variation is a result of the richness of natural language and
refers to the existence of multiple terms for the description of the same concept.
For example, the terms ‘Transmembrane 4 Superfamily Member 1’, ‘TM4SF1’ and ‘L6 Antigen’ all point to the same protein. Term ambiguity occurs when a term is
mapped to more than one distinct concept. This is common when new abbrevia-
tions are introduced Liu et al. (2002). As an example, some of the concepts that
the acronym ‘CTX’ may map to are ‘Cardiac Transplantation’, ‘Clinical Trial
exemption’ and ‘Conotoxin’. Their disambiguation is a matter of context.
A terminology is not constrained to being a simple list of terms. In fact,
most terminologies feature some kind of structure, where terms that map to the
same concept are grouped together and semantic relationships between concepts
are explicitly or implicitly stated. Semantic relationships between terms include
synonymy and antonymy, while semantic relationships between concepts include
hyponymy, hypernymy, meronymy and holonymy Jurafsky and Martin (2000).
Synonymy exists when two terms are interchangeable, while antonymy denotes
that two terms have opposite meaning. Hyponymy introduces a parent-child, or
‘is-a’ relation between concepts. A concept is a hyponym of another concept,
if the former is subsumed by the latter and represents a more granular (more specific) concept.
Hyponymy is transitive: if concept a is a hyponym of concept b, and concept b is a hyponym of concept c, then a is also a hyponym of c. Hypernymy is the reverse relation
of hyponymy. Meronymy exists when a concept represents a part of another
concept. Holonymy is the opposite relation, where a concept has as parts some other concept(s).
The difference between a terminology and an ontology is not always clear, as
terminologies continue to improve their state of organization in a way that resem-
bles ontologies. The initial scope and aim of the two, though, are clearly different: the purpose of a terminology was, as the name implies, to collect all terms associated with a specified domain. On the other hand, the target of
an ontology has, from the start, been to provide a machine-readable specification
of a shared conceptualization. Despite their many common characteristics, ter-
minologies are not necessarily ontologies. If treated as ontologies, they may lead
to inconsistencies or wrong inferencing mechanisms Ananiadou and McNaught
(2006). An illustrative example is the case of MedDRA, which will be discussed
in Section 2.3.4.
2.3 Notable Biomedical Ontologies and Terminologies
Hundreds of biomedical ontologies and terminologies have been published online. According to BioPortal1 statistics, the top five most viewed ontologies or terminologies are SNOMED Clinical Terms, the National Drug File, the International Classification of Diseases, MedDRA and the NCI Thesaurus. This section gives a brief introduction to each of them.
1BioPortal is a biomedical ontology/terminology repository which provides online ontology presentation and manipulation tools (http://bioportal.bioontology.org/).
2.3.1 SNOMED CT
The Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT) is a biomedical terminology which covers most areas within medicine, such as drugs, diseases, operations, medical devices and symptoms. It may be used for the coding, retrieval and processing of clinical data. SNOMED CT is written purely in a formal logic-based syntax (the so-called Release Format 2, or RF2) and is organized into multiple independent hierarchies. It is the result of the merger of the UK National Health Service's (NHS) Read Codes and the SNOMED Reference Terminology (SNOMED RT), developed by the College of American Pathologists. The basic hierarchies, or axes, are ‘Clinical Finding’ and ‘Procedure’. The latest version contains more than 400,000 concepts and over 1,000,000 relationships, rendering SNOMED CT the most complete terminology in the medical domain. Only a few definitions are present in the terminology. Each concept contains a unique identifier and numerous synonymous terms that account
for term variation. Also, each concept is part of at least one hierarchy and may
have multiple ‘is-a’ relationships with higher level nodes. SNOMED CT is part
of the Unified Medical Language System (UMLS), a biomedical ontology and
terminology integration attempt which comprises hundreds of resources.
2.3.2 NDF-RT
The National Drug File Reference Terminology (NDF-RT) was introduced by the
U.S. Department of Veterans Affairs (VA) as a formalized representation for a
medication terminology, written in description logic syntax VHA (2012). The
terminology is organized into concept hierarchies, where each concept is a node
comprising a list of term synonyms and a unique identifier. As expected, top-level
concepts are more general than lower-level ones. The central hierarchy is named
DRUG KIND and indicates the types of medications, the preparations used in
them and clinical VA drug products. Other hierarchies include
• DISEASE KIND,
• INGREDIENT KIND,
• MECHANISM OF ACTION KIND,
• PHARMACOKINETICS KIND,
• PHYSIOLOGIC EFFECT KIND,
• THERAPEUTIC CATEGORY KIND,
• DOSE FORM and
• DRUG INTERACTION KIND.
Roles exist between different concepts and are specified only with existential restrictions (i.e. the OWL someValuesFrom construct). Mappings to other terminologies are also available. Currently, NDF-RT contains more than 45,000 concepts, organized in hierarchies of maximum depth 12.
2.3.3 ICD-10
The International Statistical Classification of Diseases and Related Health Prob-
lems (ICD) is a terminology which attempts to classify signs, symptoms and
causes of disease and morbidity WHO (1992). It appeared in the mid-19th cen-
tury and is now maintained by the World Health Organization (WHO). Currently
it is available in its 10th revision, although the 11th version is claimed to be at
the final stage before release. As a taxonomy, it has relatively small maximum
depth, equal to 6. Codes assigned to each concept tie it to a specific place in the
taxonomy, with each code having only a single parent. It is thus not a proper application of ontological principles2, since, in reality, it is not unusual for concepts to belong to more than one subsumer, and this is not modeled. In addition to
that, there exist categories such as ‘Not otherwise specified’ or ‘Other’, which are
not needed in an ontology; the open world assumption already covers the fact
that every ontology is incomplete, so stating it explicitly is redundant and may
interfere with the evolution of the ontology, as new terms are not classified under
their closest match.
2Nor was it meant to be; its intent is classification.
Figure 2.1: The structure of the MedDRA terminology comprises a fixed-depth hierarchy.
2.3.4 MedDRA
The Medical Dictionary for Regulatory Activities (MedDRA) is a terminology
that is concerned with biopharmaceutical regulatory processes. It contains terms
associated with all phases of the drug development cycle. MedDRA is organized
in a hierarchical structure of fixed depth, as seen in Fig. 2.1. System Organ
Classes (SOCs) represent the 26 predefined overlapping hierarchies to which terms belong. High Level Group Terms (HLGTs) and High Level Terms (HLTs) are
general term groupings, denoting disorders or complications. Preferred Terms
(PTs) denote the preferred name for a concept, while Lowest Level Terms (LLTs)
include terms of maximum specificity. LLTs may be connected with hyponymy,
meronymy or synonymy relationships to their PTs. This is the main problem in
trying to view MedDRA as an ontology. In a formal ontology, a concept cannot
be a child of itself. In MedDRA, this clearly happens when a PT and its LLTs
share a synonymy relation.
2.3.5 NCI Thesaurus
The National Cancer Institute Thesaurus (NCIT) is a controlled terminology
for cancer research. The thesaurus has been converted to formal OWL syntax
and is updated at fixed intervals. The conversion was not an easy one; many
inconsistencies and modeling dead-ends that were encountered in the conversion
procedure have been documented Ceusters et al. (2005), along with some clear
violations of ontological principles Schulz et al. (2010). The NCIT provides almost
100,000 concepts, with approximately 65% containing a definition.
Chapter 3
Similarity Metrics
Similarity metrics aim at measuring the lexical or semantic similarity between
terms. Lexical similarity focuses on terms that contain similar character or word
sequences, while semantic similarity tries to determine how close in meaning the
terms are. Lexical similarity is simpler to calculate, since string-based algorithms
only require plain text to function. On the other hand, semantic similarity re-
quires extra information about the terms present in plain text. This extra in-
formation is usually acquired with the help of a knowledge base (e.g. ontology,
terminology, etc.) or through statistical analysis of corpora, i.e., large collections
of text documents that resemble real-world usage of words.
3.1 Similarity Metric vs. Distance Metric
It is common in the literature to come across the term ‘semantic distance’ instead
of ‘semantic similarity’. A distance metric d(a, b), that compares entities a and
b, must satisfy the following properties:
1. d(a, b) = 0 if and only if a = b (zero property),
2. d(a, b) = d(b, a) (symmetric property),
3. d(a, b) ≥ 0 (non-negativity property),
4. d(a, c) ≤ d(a, b) + d(b, c) (triangle inequality).
On the other hand, the requirements for a similarity metric were formally intro-
duced not long ago Chen et al. (2009). The definition states that a similarity
metric s(a, b) must satisfy the following properties:
1. s(a, a) ≥ 0,
2. s(a, b) = s(b, a),
3. s(a, a) ≥ s(a, b),
4. s(a, b) + s(b, c) ≤ s(a, c) + s(b, b),
5. s(a, a) = s(b, b) = s(a, b) if and only if a = b.
The counter-intuitive 4th property can be proven using set theory. More specifically, if $|a \cap b|$ denotes the cardinality of common characteristics between $a$ and $b$, and $\bar{c}$ denotes the complement of $c$, the following equality holds:
\[
|a \cap b| = |a \cap b \cap c| + |a \cap b \cap \bar{c}|. \tag{3.1}
\]
Then,
\[
|a \cap b| + |b \cap c| = |a \cap b \cap c| + |a \cap b \cap \bar{c}| + |a \cap b \cap c| + |\bar{a} \cap b \cap c| \leq |a \cap c| + |b|, \tag{3.2}
\]
since $|a \cap b \cap c| \leq |a \cap c|$ and $|a \cap b \cap c| + |a \cap b \cap \bar{c}| + |\bar{a} \cap b \cap c| \leq |b|$. Deduction of
similarity from distance is a common procedure that requires simple operations.
Similarity is, intuitively, a decreasing function of distance. Conversion between
the two can take many forms Chen et al. (2009). In this thesis, all formulas will
be presented as similarity measures.
3.2 Lexical Similarity
String-based methods that calculate lexical similarity can be divided into character-
based and word-based. In this section, some of the most popular metrics are
presented. For a more complete survey of lexical similarity measures see Navarro
(2001) and Gomaa and Fahmy (2013).
3.2.1 Character-based Similarity Measures
In character-based similarity, strings are viewed as character sequences and at-
tempts are made to discover character relevance.
Longest Common Substring
The Longest Common Substring algorithm Gusfield (1997) tries to find the max-
imum number of consecutive characters that two strings share. It may be imple-
mented using a suffix tree or dynamic programming.
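As an illustrative sketch (not taken from any particular library), the dynamic-programming variant can be written in Python as follows:

```python
def longest_common_substring(a, b):
    """Length of the longest run of consecutive characters shared by a and b."""
    # dp[i][j] = length of the common suffix of a[:i] and b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
                best = max(best, dp[i][j])
    return best
```

For example, `longest_common_substring("biomedical", "medical")` returns 7, since ‘medical’ occurs verbatim inside ‘biomedical’.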
Hamming Similarity
Hamming similarity is a metric that can be applied to strings of equal length. It
is a simple metric that measures the number of common characters between two
strings. Given strings a and b, the formula for string similarity can be constructed
as follows:
\[
\mathrm{sim}_{ham}(a, b) = \frac{\sum_{i} \mathbf{1}(a_i = b_i)}{|a|}, \tag{3.3}
\]
where 1(·) is the indicator function and | · | denotes string length, measured in
characters.
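Eq. (3.3) translates directly into code; a minimal Python sketch (the function name is illustrative):

```python
def sim_ham(a, b):
    """Hamming similarity (Eq. 3.3): fraction of positions where a and b agree."""
    if len(a) != len(b):
        raise ValueError("Hamming similarity is defined for equal-length strings")
    return sum(x == y for x, y in zip(a, b)) / len(a)
```

For instance, ‘karolin’ and ‘kathrin’ agree in 4 of their 7 positions, giving a similarity of 4/7.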
Levenshtein Similarity
Levenshtein distance counts the number of character alterations that need to
be made in order to transform one string to another Levenshtein (1966). This
number is bounded by the length of the longer string, which is commonly used as a
normalizing measure that restrains the value of distance to [0, 1]. Mathematically,
normalized Levenshtein distance of terms a and b is computed using the following
formula:
\[
d_{lev}(a, b) = \frac{lev_{a,b}(|a|, |b|)}{\max\{|a|, |b|\}}, \tag{3.4}
\]
where $|\cdot|$ denotes string length in number of characters,
\[
lev_{a,b}(i, j) =
\begin{cases}
\max\{i, j\}, & \text{if } \min\{i, j\} = 0, \\
\min \left\{
\begin{array}{l}
lev_{a,b}(i-1, j) + 1, \\
lev_{a,b}(i, j-1) + 1, \\
lev_{a,b}(i-1, j-1) + \mathbf{1}(a_i \neq b_j)
\end{array}
\right\}, & \text{otherwise,}
\end{cases} \tag{3.5}
\]
and $\max\{\cdot\}$, $\min\{\cdot\}$ denote the maximum and minimum functions, respectively.
Converting normalized distance to similarity can be done as follows:
\[
\mathrm{sim}_{lev}(a, b) = 1 - d_{lev}(a, b). \tag{3.6}
\]
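A minimal Python sketch of Eqs. (3.4)-(3.6), using the standard two-row dynamic-programming table (names illustrative):

```python
def sim_lev(a, b):
    """Normalized Levenshtein similarity (Eqs. 3.4-3.6)."""
    if len(a) == 0 or len(b) == 0:
        return 1.0 if a == b else 0.0
    prev = list(range(len(b) + 1))  # row for lev(i-1, .); lev(0, j) = j
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)    # lev(i, 0) = i
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution
        prev = cur
    return 1.0 - prev[len(b)] / max(len(a), len(b))
```

The classic pair ‘kitten’/‘sitting’ has Levenshtein distance 3, so the normalized similarity is 1 − 3/7.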
Jaro Similarity
Jaro similarity Jaro (1989, 1995) takes into account both the number and sequence
of common characters present in the two strings. Let us consider strings $a = a_1 \ldots a_K$ and $b = b_1 \ldots b_L$. A character $a_i$ is said to be common with $b$ if the same character exists in $b$ within a window of $\min\{|a|, |b|\}/2$ positions from $b_i$. Let $a' = a'_1 \ldots a'_{K'}$ be those characters in $a$ that are common with $b$, and $b' = b'_1 \ldots b'_{L'}$ those characters in $b$ that are common with $a$. A transposition for $a', b'$ is a position $i$ in the strings $a', b'$ at which $a'_i \neq b'_i$. The number of transpositions for $a', b'$ divided by two is denoted as $T_{a',b'}$. Then, Jaro's formula for similarity is given by:
\[
\mathrm{sim}_{jaro}(a, b) = \frac{1}{3}\left(\frac{|a'|}{|a|} + \frac{|b'|}{|b|} + \frac{|a'| - T_{a',b'}}{|a'|}\right). \tag{3.7}
\]
It should be noted that Jaro similarity violates the symmetry property of the similarity metric definition in Section 3.1, therefore it is not a true similarity metric according to that definition.
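The definition above can be sketched in Python as follows. Note that the matching window here follows the min-based definition given in the text; many common implementations instead derive the window from the longer string:

```python
def sim_jaro(a, b):
    """Jaro similarity (Eq. 3.7), window of min(|a|,|b|)/2 around each position."""
    if not a or not b:
        return 0.0
    window = min(len(a), len(b)) // 2
    b_used = [False] * len(b)
    a_common = []
    # greedily pair each character of a with an unused equal character of b
    # lying inside the window around the same position
    for i, ch in enumerate(a):
        lo, hi = max(0, i - window), min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not b_used[j] and b[j] == ch:
                b_used[j] = True
                a_common.append(ch)
                break
    if not a_common:
        return 0.0
    b_common = [b[j] for j in range(len(b)) if b_used[j]]
    # transpositions: half the positions at which the common sequences differ
    t = sum(x != y for x, y in zip(a_common, b_common)) / 2
    m = len(a_common)
    return (m / len(a) + m / len(b) + (m - t) / m) / 3
```

For the pair ‘MARTHA’/‘MARHTA’, all six characters are common and there is one transposition, giving 17/18 ≈ 0.944.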
Jaro-Winkler Similarity
Jaro-Winkler similarity Winkler (1999) is a variation of Jaro similarity which
promotes strings with long common prefixes. The length of the longest prefix
common to both strings $a$ and $b$ is denoted as $P$. Then, with $P' = \min(P, 4)$, Jaro-Winkler similarity is given by:
\[
\mathrm{sim}_{j\&w}(a, b) = \mathrm{sim}_{jaro}(a, b) + \frac{P'}{10}\left(1 - \mathrm{sim}_{jaro}(a, b)\right). \tag{3.8}
\]
N-gram Similarity
A string can be split into n-grams, i.e. all possible consecutive character sequences
of length n in the string. As an example, the word ‘protein’ can be split into the 3-
grams ‘pro’, ‘rot’, ‘ote’, ‘tei’ and ‘ein’. When comparing two strings, the number
of common n-grams is computed and normalized by the maximum number of
n-grams. More specifically, given strings a and b, similarity is given by:
\[
\mathrm{sim}_{ngram}(a, b) = \frac{N_{com}}{N_{max}}, \tag{3.9}
\]
where Ncom denotes the number of common n-grams and Nmax denotes the max-
imum number of n-grams in either of the two strings.
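A Python sketch of Eq. (3.9); here n-grams are collected as sets, so repeated n-grams within a string count once (an assumption, since the text does not specify how duplicates are handled):

```python
def sim_ngram(a, b, n=3):
    """N-gram similarity (Eq. 3.9) with n-grams collected as sets."""
    grams_a = {a[i:i + n] for i in range(len(a) - n + 1)}
    grams_b = {b[i:i + n] for i in range(len(b) - n + 1)}
    if not grams_a or not grams_b:
        return 0.0
    n_com = len(grams_a & grams_b)           # common n-grams
    n_max = max(len(grams_a), len(grams_b))  # maximum number of n-grams
    return n_com / n_max
```

For example, ‘protein’ and ‘proteins’ share all five 3-grams of ‘protein’, while ‘proteins’ has six, yielding 5/6.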
3.2.2 Word-based Similarity Measures
As the name implies, word-based measures view the string as a collection of words.
Similarity measures dictate how similar two terms are word-wise, and no weight
is given on character similarity.
Dice Similarity
Dice similarity considers input strings a and b as sets of words A and B respec-
tively, and calculates similarity as follows:
\[
\mathrm{sim}_{dice}(a, b) = \frac{2|A \cap B|}{|A| + |B|}, \tag{3.10}
\]
where | · | denotes set cardinality in number of words.
Jaccard Similarity
Jaccard similarity counts the number of common words of the compared strings
and divides it by the number of distinct words in both strings, i.e.
\[
\mathrm{sim}_{jacc}(a, b) = \frac{|A \cap B|}{|A \cup B|}. \tag{3.11}
\]
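Both word-based measures reduce to simple set operations; a Python sketch of Eqs. (3.10) and (3.11), splitting strings on whitespace:

```python
def sim_dice(a, b):
    """Dice similarity (Eq. 3.10) over the word sets of the two strings."""
    A, B = set(a.split()), set(b.split())
    return 2 * len(A & B) / (len(A) + len(B))

def sim_jacc(a, b):
    """Jaccard similarity (Eq. 3.11): shared words over distinct words."""
    A, B = set(a.split()), set(b.split())
    return len(A & B) / len(A | B)
```

For ‘heart attack’ vs. ‘heart failure’, the one shared word out of three distinct words gives a Dice score of 0.5 and a Jaccard score of 1/3.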
Cosine Similarity
In order to compute cosine similarity, the compared strings should be converted to
vectors. The dimension of the resulting vectors will be equal to the total number
of distinct words present in both. Therefore, each element in the vector represents
one word. The vector values for each string are computed as follows: A vector
contains unitary values in positions that correspond to words that are contained
in the respective string. Similarly, a vector contains zero values in all positions
that correspond to words that are not present in the respective string. Given
strings a and b, the respective vectors a and b are computed. Cosine similarity
is then given by:
\[
\mathrm{sim}_{cos}(a, b) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\|\, \|\mathbf{b}\|}, \tag{3.12}
\]
where || · || denotes the Euclidean norm function.
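With the binary word vectors described above, Eq. (3.12) simplifies considerably: the dot product equals the number of shared words and each norm is the square root of the string's vocabulary size. A Python sketch:

```python
import math

def sim_cos(a, b):
    """Cosine similarity (Eq. 3.12) over binary word-occurrence vectors."""
    A, B = set(a.split()), set(b.split())
    # dot product = number of shared words; each norm = sqrt(|word set|)
    return len(A & B) / (math.sqrt(len(A)) * math.sqrt(len(B)))
```

For ‘heart attack’ vs. ‘heart failure’ this yields 1/(√2·√2) = 0.5.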
Manhattan Similarity
Taxicab geometry considers that distance between two points in a grid is given
by the sum of the absolute differences of their respective coordinates. The grid
resembles a uniform city road map, where diagonal movements are not permitted.
This is the reason why the distance metric in this space is often called Manhattan
distance or city block distance. Considering $N$-dimensional string vectors $\mathbf{a}$ and $\mathbf{b}$, Manhattan similarity can be computed as:
\[
\mathrm{sim}_{manh}(a, b) = 1 - \frac{\sum_{i=1}^{N} |a_i - b_i|}{N}, \tag{3.13}
\]
where N is a normalizing constant that represents the dimension of a and b.
Euclidean Similarity
Euclidean similarity also considers strings as vectors, and computes similarity as:
\[
\mathrm{sim}_{eucl}(a, b) = 1 - \sqrt{\frac{\sum_{i=1}^{N} |a_i - b_i|^2}{N}}. \tag{3.14}
\]
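A Python sketch of Eqs. (3.13) and (3.14) over the same binary word vectors used for cosine similarity (helper name illustrative):

```python
import math

def binary_vectors(a, b):
    """Binary word-occurrence vectors over the joint vocabulary of a and b."""
    words_a, words_b = set(a.split()), set(b.split())
    vocab = sorted(words_a | words_b)
    va = [1 if w in words_a else 0 for w in vocab]
    vb = [1 if w in words_b else 0 for w in vocab]
    return va, vb

def sim_manh(a, b):
    """Manhattan similarity (Eq. 3.13)."""
    va, vb = binary_vectors(a, b)
    return 1 - sum(abs(x - y) for x, y in zip(va, vb)) / len(va)

def sim_eucl(a, b):
    """Euclidean similarity (Eq. 3.14)."""
    va, vb = binary_vectors(a, b)
    return 1 - math.sqrt(sum((x - y) ** 2 for x, y in zip(va, vb)) / len(va))
```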
3.3 Ontological Semantic Similarity
An ontology is a collection of concepts and their inter-relationships. It may be
visualized as a graph, in which nodes represent concepts and edges represent the
relations between them. Usually, ontologies are viewed as taxonomies, where ‘is-
a’ and ‘part-of’ relations play the most important role. Viewing the ontology as a
taxonomy, one can apply semantic similarity metrics that exploit the hierarchical
structure. Probably the most famous object of semantic similarity tests is the
computational lexicon WordNet Miller (1995). In WordNet, closely related terms
are grouped together to form synsets. These synsets, in turn, form semantic rela-
tions with other synsets. WordNet is commonly referred to as a lexical ontology,
due to an obvious mapping of lexical hyponymy to ontological subsumption.
3.3.1 Intra-ontology Semantic Similarity
Intra-ontology semantic similarity metrics are meant to measure similarity be-
tween concepts that reside within the same ontology. These metrics can be
roughly divided into distance-based, information-based and feature-based.
Distance-based Metrics
Distance-based metrics take advantage of the ontological topology to compute
the similarity between concepts. This method requires viewing the ontology as
a rooted Directed Acyclic Graph (DAG), in which nodes are concepts and edges
among them are restricted to hierarchical relationships, with the most usual type
being ‘is-a’ relationships. At the top, there is a single concept, the root. The graph
is directed, starting from a low-level concept and directed towards its ancestors
through transitive relationships. The graph is also acyclic, since a finite path
from a source node to a destination node cannot return to the source node. In
other words, a node can never be a child of one of its children.
A simple look at an ontology from a geometric perspective may reveal im-
portant information about the similarity of concepts. As depth in the DAG
increases, concepts become increasingly specific, thus similarity is expected to
increase. Another important characteristic of the ontology DAG is that the path
between concepts is not always unique, therefore distance-based similarity will
depend on which path is chosen. Finally, the density of nodes is a good indicator
of similarity; as density increases, concepts approach each other and similarity
increases.
The accuracy of distance-based methods depends on the level of detail that
the ontology captures. A poorly structured ontology with many omissions might
yield misleading similarity results. Fortunately, a lot of effort has been made to
make biomedical ontologies as complete as possible, therefore network density in
biomedical ontologies is usually high.
The most straightforward way to measure the similarity of concept nodes is
given in Rada et al. (1989). In that work by Rada et al., all edges are assigned
a unitary weight and the distance between two concepts is equal to the number
of edges that are present in their shortest path. Let us consider two distinct
concepts c1 and c2 in the hierarchy. Each path i that connects these two concept
nodes may be represented as a set which includes all edges ek present in the path,
i.e.
\[
path_i(c_1, c_2) = \{e_1, e_2, \ldots, e_K\}, \tag{3.15}
\]
with cardinality $|path_i(c_1, c_2)| = K$. The distance between concepts c1 and c2 is,
then, equal to the shortest path that connects them, i.e.,
\[
d_{rada}(c_1, c_2) = \min_{\forall i} |path_i(c_1, c_2)|. \tag{3.16}
\]
Note that in the literature, there are cases (e.g. Al-Mubaid and Nguyen (2006)) where
Rada’s measure is used with node counting, instead of edge counting. In those
cases, each path is represented as a set of the nodes that compose it, including
the end nodes. The minimum distance can be converted into a similarity metric,
as in Resnik (1995):
\[
\mathrm{sim}_{rada}(c_1, c_2) = 2D - d_{rada}(c_1, c_2), \tag{3.17}
\]
where D is the maximum depth of the taxonomy. This method fails to capture
the intuition that concept nodes, which reside at the lower part of the hierarchy
and are separated by distance d, are more similar than higher-level nodes with the
same distance separation d. Also, its success highly depends on the uniformity of
edge distribution within the ontology. For these reasons, other approaches have
been proposed in order to achieve a more representative score of similarity.
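Edge counting via Eqs. (3.15)-(3.17) amounts to a breadth-first search over the undirected ‘is-a’ edges; a Python sketch on a small, purely illustrative taxonomy:

```python
from collections import deque

# toy 'is-a' taxonomy (purely illustrative): child -> parent
parents = {
    "disorder": "root",
    "heart_disease": "disorder",
    "lung_disease": "disorder",
    "arrhythmia": "heart_disease",
    "asthma": "lung_disease",
}

def neighbours(node):
    """Treat the hierarchical edges as undirected for path finding."""
    nbrs = [parents[node]] if node in parents else []
    nbrs += [child for child, parent in parents.items() if parent == node]
    return nbrs

def d_rada(c1, c2):
    """Shortest path between two concepts, in number of edges (Eq. 3.16)."""
    queue, seen = deque([(c1, 0)]), {c1}
    while queue:
        node, dist = queue.popleft()
        if node == c2:
            return dist
        for nb in neighbours(node):
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, dist + 1))
    return None  # concepts not connected

D = 3  # maximum depth of this toy taxonomy, in edges

def sim_rada(c1, c2):
    """Eq. 3.17: distance converted to similarity via the maximum depth."""
    return 2 * D - d_rada(c1, c2)
```

Here ‘arrhythmia’ and ‘asthma’ are four edges apart (up through ‘disorder’ and back down), so their similarity is 2·3 − 4 = 2.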
In Wu and Palmer (1994), the relative depth of the compared concepts in the
hierarchy is considered. In that work, Wu and Palmer introduce the Least Com-
mon Subsumer (LCS) of the compared concepts. The LCS is the hierarchically
deepest common ancestor of the compared concepts. Similarity for concepts c1
and c2 is then given as:
\[
\mathrm{sim}_{w\&p}(c_1, c_2) = \frac{2h}{N_1 + N_2 + 2h}, \tag{3.18}
\]
where N1 is the number of nodes in the path between concept c1 and the LCS,
N2 is the number of nodes between concept c2 and the LCS, and h is the depth
of the LCS, measured again in number of nodes.
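A Python sketch of Eq. (3.18) on a small illustrative taxonomy; path lengths are counted here in edges rather than nodes, which shifts each term by a constant but preserves the structure of the formula:

```python
# toy 'is-a' taxonomy (purely illustrative): child -> parent
parents = {
    "disorder": "root",
    "heart_disease": "disorder",
    "lung_disease": "disorder",
    "arrhythmia": "heart_disease",
    "asthma": "lung_disease",
}

def chain_to_root(c):
    """The concept itself followed by all of its ancestors up to the root."""
    chain = [c]
    while chain[-1] in parents:
        chain.append(parents[chain[-1]])
    return chain

def sim_wp(c1, c2):
    """Wu-Palmer similarity (Eq. 3.18), with path lengths counted in edges."""
    chain1, chain2 = chain_to_root(c1), chain_to_root(c2)
    # first element of chain1 also present in chain2 = deepest common ancestor
    lcs = next(node for node in chain1 if node in chain2)
    n1 = chain1.index(lcs)           # edges from c1 up to the LCS
    n2 = chain2.index(lcs)           # edges from c2 up to the LCS
    h = len(chain_to_root(lcs)) - 1  # depth of the LCS below the root
    return 2 * h / (n1 + n2 + 2 * h)
```

In this toy hierarchy, ‘arrhythmia’ and ‘asthma’ meet at ‘disorder’ (depth 1), giving 2/(2 + 2 + 2) = 1/3, whereas ‘arrhythmia’ and its parent ‘heart_disease’ score 0.8.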
In Li et al. (2003), the authors followed various strategies in their attempt
to calculate similarity as a function of the shortest path between the compared
concepts, the depth of their LCS and the local density of the ontology. They
found that the best performance was obtained when they used the following
non-linear function:
\[
\mathrm{sim}_{li}(c_1, c_2) = e^{-\alpha\, d_{rada}(c_1, c_2)} \cdot \frac{e^{\beta h} - e^{-\beta h}}{e^{\beta h} + e^{-\beta h}}, \tag{3.19}
\]
where $\alpha, \beta$ are non-negative parameters and $h = d_{rada}(LCS(c_1, c_2), root)$ denotes the minimum depth of the LCS. Distances are measured in number of edges.
Al-Mubaid and Nguyen attempt to combine path length and node depth in one
measure. In Al-Mubaid and Nguyen (2006), they view the DAG as a composition
of clusters, with each cluster having as root a child of the ontology root. The
usage of clusters aims to exploit local characteristics of different branches. Given
concepts c1 and c2, they first compute their so-called common specificity:
\[
CSpec(c_1, c_2) = D_c - h, \tag{3.20}
\]
where Dc denotes the depth of the specific cluster and h refers to the depth of the
LCS in the ontology, with both quantities measured in number of nodes. Then
similarity is computed as:
\[
\mathrm{sim}_{a\&n}(c_1, c_2) = \log\left((Path - 1)^{\alpha} \times (CSpec)^{\beta} + k\right), \tag{3.21}
\]
where Path is a modified version of Rada’s distance measure which is adapted
according to the largest cluster, and α, β, k are constants, whose default values
are unitary.
Information-Based Metrics
One of the first attempts to focus on nodes in the similarity formula is that
of Leacock and Chodorow Leacock and Chodorow (1998). This method uses
negative log likelihood in a way that resembles the formula of self-information
Cover and Thomas (2012), but does not really involve a valid probability. Instead,
a normalized form of the path length between the concepts is used:
\[
\mathrm{sim}_{l\&c}(c_1, c_2) = -\log\left(N_p / 2D\right), \tag{3.22}
\]
where Np is the number of nodes in the shortest path between concepts c1 and
c2. This variable also includes the end nodes.
Resnik, in Resnik (1995), continues down this path by replacing the normal-
ized path length with a probability measure P(·) to calculate the information
content (IC) of a concept. He considers all common subsumers CSi of concepts
c1 and c2 and calculates similarity as:
\[
\mathrm{sim}_{resn}(c_1, c_2) = \max_{\forall i}\left[-\log(P(CS_i))\right], \tag{3.23}
\]
or, equivalently,
\[
\mathrm{sim}_{resn}(c_1, c_2) = -\log(P(LCS)). \tag{3.24}
\]
Considering that the IC of a concept c is defined as the negative logarithm of its
probability, i.e. $IC(c) = -\log(P(c))$, equation (3.24) can also be written as:
\[
\mathrm{sim}_{resn}(c_1, c_2) = IC(LCS(c_1, c_2)). \tag{3.25}
\]
Probabilities are estimated with the help of a text corpus, i.e. a collection of
natural language excerpts, specifically chosen to provide a good representation of
actual term usage. When dealing with biomedical ontology concepts, collections
of Pubmed1 abstracts are commonly used as corpora to determine the probability
of each concept.
Given a corpus, the occurrence of a term which corresponds to concept c
essentially implies the occurrence of each and every concept that subsumes c
within the ontological structure. Conversely, the number of occurrences of a
concept c depends not only on the number of appearances of c itself in the corpus,
but also on every occurrence of its descendants in the hierarchy. Thus, the number
of occurrences of concept c is given by:
\[
occ(c) = \sum_{n \,\in\, subsumed(c)} count(n), \tag{3.26}
\]
where subsumed(c) represents c and its descendant concept nodes, and count(·)
denotes the number of occurrences of the specific concept within the given corpus.
Converting occurrences to probability can be done using:
\[
P(c) = \frac{occ(c)}{N}, \tag{3.27}
\]
where N is the total number of occurrences of ontology terms in the corpus.
This method results in higher probabilities for concepts residing at the top part
1http://www.ncbi.nlm.nih.gov/pubmed
of the hierarchy, with the root having unitary probability. Therefore, concepts
whose LCS lies lower in the hierarchy are more similar, since their LCS has low
probability (i.e., high IC).
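The occurrence propagation of Eqs. (3.26) and (3.27) can be sketched in Python; the taxonomy and the corpus counts below are made up purely for illustration:

```python
import math

# toy taxonomy (concept -> direct children) and toy corpus counts
children = {
    "root": ["disorder"],
    "disorder": ["heart_disease", "lung_disease"],
    "heart_disease": [],
    "lung_disease": [],
}
count = {"root": 0, "disorder": 2, "heart_disease": 5, "lung_disease": 3}

def occ(c):
    """Eq. 3.26: occurrences of c plus those of everything it subsumes."""
    return count[c] + sum(occ(child) for child in children[c])

N = occ("root")  # total occurrences of ontology terms in the corpus

def ic(c):
    """IC(c) = -log P(c), with P(c) as in Eq. 3.27."""
    return -math.log(occ(c) / N)
```

As expected, the root accumulates every occurrence (probability 1, IC 0), while ‘heart_disease’, seen in half of the occurrences, receives IC = log 2.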
A possible drawback of this method is that probabilities are tied to the choice
of corpus. So far, in the biomedical domain, there is no widely accepted corpus
that covers the domain needs Al-Mubaid and Nguyen (2006). This is due to the
fact that thousands of new terms and abbreviations appear in the literature every
year, thus a stable corpus might not function well. Since extensions of the corpus
would need to be considered at fixed intervals, it might not serve as a useful
benchmark.
Alternatively, computation of IC can be performed without the use of a corpus,
by solely relying on the structure of the ontology DAG. Intrinsic computation of
IC involves approximating the occurrence probability of a concept as a function
of multiple variables, such as number of descendant nodes, number of subsumers
or number of descendant nodes which are leaves in the ontology. In Seco et al.
(2004), the IC of a concept c is given by:
\[
IC_{seco}(c) = 1 - \frac{\log(descendants(c) + 1)}{\log(allConcepts)}, \tag{3.28}
\]
where descendants(c) returns the number of nodes that concept c subsumes, and
allConcepts denotes the number of all the available concepts in the ontology.
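A Python sketch of Eq. (3.28) on an illustrative taxonomy:

```python
import math

# toy taxonomy (concept -> direct children), purely illustrative
children = {
    "root": ["disorder"],
    "disorder": ["heart_disease", "lung_disease"],
    "heart_disease": ["arrhythmia"],
    "lung_disease": [],
    "arrhythmia": [],
}
all_concepts = len(children)

def descendants(c):
    """Number of concepts subsumed by c, excluding c itself."""
    return sum(1 + descendants(child) for child in children[c])

def ic_seco(c):
    """Eq. 3.28: intrinsic IC; leaves score 1, the root scores 0."""
    return 1 - math.log(descendants(c) + 1) / math.log(all_concepts)
```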
The IC function introduced by Seco et al. has the drawback that it assigns an IC
equal to one for every leaf node in the ontology, and also that concepts containing
the same number of descendant nodes are again given the same IC. An attempt to
distinguish the IC between leaf concepts was made in Zhou et al. (2008), by also
including the depth of the node in the calculation, normalized by the maximum
depth of the ontology. The proposed IC formula is given by:
\[
IC_{zhou}(c) = k \, IC_{seco}(c) + (1 - k)\,\frac{\log(depth(c) + 1)}{\log(maxDepth)}, \tag{3.29}
\]
where depth(c) represents the depth of the concept c in the hierarchy, maxDepth
is the maximum depth of the ontology, measured in number of nodes, and k is a weighting constant.
The authors in Sanchez et al. (2011) further improve the modeling of the IC
function. In that work, the IC function can also distinguish concepts that contain
the same number of descendants, due to the fact that the number of subsumers
of a concept is also used. The IC is given as:
\[
IC_{san}(c) = -\log\left(\frac{\frac{leaves(c)}{ancestors(c)} + 1}{allLeaves}\right), \tag{3.30}
\]
where leaves(c) is the number of nodes that are descendants of c and have no
children, ancestors(c) refers to the number of concepts which subsume c and
allLeaves denotes the total number of leaf nodes in the ontology. The IC func-
tions of equations (3.28), (3.29) and (3.30) can be used in equation (3.25) to
compute the similarity between two concepts without using a corpus.
Lin et al. use IC in an alteration of the similarity metric of Wu and Palmer
(1994). More specifically,
\[
\mathrm{sim}_{lin}(c_1, c_2) = \frac{2\,\mathrm{sim}_{resn}(c_1, c_2)}{IC(c_1) + IC(c_2)}. \tag{3.31}
\]
This approach aims to include the individual characteristics of the compared
nodes that Resnik’s approach neglected. Indeed, in Resnik’s measure, any two
pairs of nodes that have the same LCS produce the same similarity.
Jiang and Conrath follow a similar approach with Wu and Palmer (1994),
but avoid the scaling of similarity Jiang and Conrath (1997). Instead, they use a
distance metric as follows:
\[
d_{j\&c}(c_1, c_2) = IC(c_1) + IC(c_2) - 2\,\mathrm{sim}_{resn}(c_1, c_2). \tag{3.32}
\]
Various transformations have been applied to convert this distance to similarity.
Among these, the authors in Seco et al. (2004) consider a linear transformation
and present the following formula of similarity normalized in the interval [0,1]:
\[
\mathrm{sim}_{j\&c}(c_1, c_2) = 1 - \frac{d_{j\&c}(c_1, c_2)}{2}. \tag{3.33}
\]
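Given precomputed IC values for the two concepts and their LCS, the measures of Eqs. (3.25), (3.31), (3.32) and (3.33) differ only in how these three quantities are combined; a minimal Python sketch (the IC values in the example are made up):

```python
def sim_resnik(ic_lcs):
    """Eq. 3.25: similarity is simply the IC of the least common subsumer."""
    return ic_lcs

def sim_lin(ic1, ic2, ic_lcs):
    """Eq. 3.31: Resnik's value scaled by the ICs of the compared concepts."""
    return 2 * ic_lcs / (ic1 + ic2)

def sim_jc(ic1, ic2, ic_lcs):
    """Eqs. 3.32-3.33: Jiang-Conrath distance mapped linearly to a similarity."""
    d = ic1 + ic2 - 2 * ic_lcs
    return 1 - d / 2
```

For instance, with IC(c1) = 0.9, IC(c2) = 0.8 and IC(LCS) = 0.6, Lin's measure gives 1.2/1.7 ≈ 0.71 and the linearized Jiang-Conrath measure gives 0.75.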
Another example can be found in Zhu et al. (2009), in which an exponential
function is used for the similarity formula, along with a constant λ that accounts
for curve steepness:
\[
\mathrm{sim}_{j\&c}(c_1, c_2) = e^{-\frac{d_{j\&c}(c_1, c_2)}{\lambda}}. \tag{3.34}
\]
Feature-Based Measures
Feature-based measures do not necessarily conform to the similarity metric rules
of Chen et al. (2009), as they allow for similarity asymmetry. In feature-based
techniques, the two compared concepts are viewed as sets of features, in contrast
to the geometric view presented in previous sections. To calculate similarity, not
only the common features of the concepts are taken into account, but also the
differences between them. That way, common features improve similarity, while
different features penalize its value Tversky et al. (1977). Given concepts c1 and
c2, let C1 and C2 denote the sets that contain their features. Then, similarity
between the two can be given as:
simtve(c1, c2) = |C1 ∩ C2| / (|C1 ∩ C2| + µ|C1 − C2| + (1 − µ)|C2 − C1|), (3.35)
where µ is a weight which takes values in [0,1]. In Rodríguez et al. (1999), the µ
parameter is computed as follows:
µ = d(c1, LCS)/d(c1, c2) if d(c1, LCS) ≤ d(c2, LCS), and
µ = 1 − d(c1, LCS)/d(c1, c2) otherwise. (3.36)
This asymmetric function stems from Tversky’s observation that similarity might
not be symmetric. In one of Tversky’s examples, North Korea was said to be more
similar to Red China than the reverse.
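Equation (3.35) is straightforward to implement with set operations. The following is a minimal Java sketch; the class name and the representation of features as strings are illustrative:

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch of Tversky's feature-based similarity, equation (3.35).
// With mu != 0.5 the measure is asymmetric: similarity(c1, c2) generally
// differs from similarity(c2, c1).
public class TverskySimilarity {

    public static double similarity(Set<String> c1, Set<String> c2, double mu) {
        Set<String> common = new HashSet<>(c1);
        common.retainAll(c2);                  // C1 ∩ C2
        Set<String> onlyC1 = new HashSet<>(c1);
        onlyC1.removeAll(c2);                  // C1 − C2
        Set<String> onlyC2 = new HashSet<>(c2);
        onlyC2.removeAll(c1);                  // C2 − C1
        return common.size()
                / (common.size() + mu * onlyC1.size() + (1 - mu) * onlyC2.size());
    }
}
```

For example, with µ = 0.9 a concept with many distinctive features scores lower against a sparse concept than the reverse, mirroring Tversky's North Korea / Red China observation.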
3.3.2 Inter-ontology Semantic Similarity
Inter-ontology semantic similarity measures try to quantify the similarity between
concepts that belong to different ontologies. Fairly little research has been doc-
umented in this area, due to the inherent difficulty of comparing heterogeneous
structures. A common approach is to combine the different ontologies into a
single ontology through detailed concept mappings Gangemi et al. (1998). It is
clear that this is very challenging and requires the help of a domain expert, as
well as plenty of time and effort. Furthermore, not all biomedical terminologies
are consistent and their lack of homogeneity is a major problem. Simpler ap-
proaches have been proposed in the literature. A usual first step is to merge the
different ontologies under a dummy root. This approach is found in Rodríguez
and Egenhofer (2003), where the authors use a weighted version of Tversky’s
similarity which also takes into account geometrical features of the ontologies.
A similar route is followed by Petrakis et al. (2006), where the authors substi-
tute Tversky’s similarity with a form of Jaccard similarity. The drawback of
these cross-similarity metrics is that they do not consider term overlap in both
ontologies. Other methods rely on extensions of single ontology similarity met-
rics. Examples of such work can be found in Al-Mubaid and Nguyen (2006) and
Sanchez et al. (2012).
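The dummy-root merging step mentioned above can be sketched as follows; the data structures and names are illustrative and are not taken from any of the cited systems. Concept codes in the two ontologies are assumed to be disjoint:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative dummy-root merge: the roots of two ontologies are joined under
// an imaginary node, so that cross-ontology path-based measures (e.g. Rada's
// shortest path) become defined over the merged structure.
public class DummyRootMerge {

    // parentsA/parentsB map each concept code to the codes of its parents;
    // rootsA/rootsB list the top-level concepts of each ontology.
    public static Map<String, List<String>> merge(Map<String, List<String>> parentsA,
                                                  Map<String, List<String>> parentsB,
                                                  List<String> rootsA,
                                                  List<String> rootsB,
                                                  String dummyRoot) {
        Map<String, List<String>> merged = new HashMap<>(parentsA);
        merged.putAll(parentsB);
        // every former root now has exactly one parent: the imaginary node
        for (String root : rootsA) merged.put(root, List.of(dummyRoot));
        for (String root : rootsB) merged.put(root, List.of(dummyRoot));
        return merged;
    }
}
```

Any path between concepts of different ontologies in the merged hierarchy necessarily passes through the dummy root, which is exactly why such cross-similarities tend to be coarse.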
Chapter 4
Search Interfaces
Search has become one of the most commonly used tools for computer users.
It can be found everywhere, from stand-alone web-based search engines to em-
bedded search forms that appear in desktop applications and websites. To a large
extent, success of the search procedure depends on the users’ ability to formulate
their information needs, transforming them into queries that are highly likely to
produce desired results. For this reason, a lot of effort has been spent on improv-
ing the search interfaces and providing tools that will enhance user experience.
In this chapter, the basic characteristics of successful search interface design are
presented, with main focus on web-search interfaces.
4.1 Information Seeking Models
Information seeking models attempt to recognize and describe the strategies fol-
lowed by humans from the moment they sense a search need until the moment
they acquire desired results. The search procedure may be viewed as a repetition
of actions. In Sutcliffe and Ennis (1998), the authors identify the following four
actions in what is considered the standard model of information seeking:
1. Problem Identification
2. Articulation of Need
3. Query Formulation
4. Evaluation of Results
The first step refers to conceptualization of the search need, while the second step
involves expressing this need in words. The third step requires the user to trans-
form the articulated need into a format that will be accepted by the underlying
search system. Finally, the fourth step refers to the procedure of judging the
results critically, exploiting any relevant domain knowledge and deciding whether
the need is satisfied. A search may be characterized as ‘ok’, ‘failed’ or ‘unsatis-
factory’. An ‘ok’ search ends the cycle successfully. An ‘unsatisfactory’ search
may lead to reformulation of the query or re-articulation of the need, while a
completely ‘failed’ search might require re-identification of the problem.
Sutcliffe and Ennis’s model assumes that the need does not change, unless
results are disappointing. It does not capture the fact that users learn as they
search. This dynamic aspect of information seeking was captured in an earlier
work by Bates (1989). In that study, the user’s needs are assumed to change
as the process advances. Furthermore, Bates claims that the success of the search
procedure does not only depend on the final list of results, but on the selections
made along the way. This model is referred to as the berry-picking model, to
denote that it does not result in a single set of results. A simple example of
berry-picking is a user who issues a broad query such as ‘String similarity
algorithms’ and then refines it to ‘Jaro similarity’ after spotting that term in
the initial result list.
4.2 Query Specification
Queries are usually specified through rectangular entry forms, as in Fig. 4.1. The
width of these forms varies in size, with studies showing that wider forms promote
formulation of longer queries Franzen and Karlgren (2000); Belkin et al. (2003).
It has been observed that around 88% of search queries are composed of 1 to 4
Figure 4.1: The Google search engine entry form.
Figure 4.2: Facebook uses grayed-out descriptive text to help in the formulation of user
queries.
words, with mean length equal to 2.8 words per query Jansen et al. (2007). The
actual search is executed by pressing the return key or mouse-clicking a specified
button (e.g. magnifying glass in Bing). In some cases, entry forms decorate their
background with descriptive text that provides guidance for the user. An example
is Facebook’s search form, as seen in Fig. 4.2. The text disappears once the user
clicks inside the form. This usually helps to narrow down the search domain.
After query submission, processing of the query takes place before any attempt
to retrieve results. This process may include removal of stopwords (i.e. words
with high appearance probability such as ‘the’, ‘a’), normalization of words (e.g.
plural to singular) and permutation of word order. Boolean logic may also be used
in the case of multiple words per query. Returning results that contain all query
words (i.e. Boolean AND operator) seems more intuitive, although this might
sometimes lead to overly specific queries that return no results. The actual types
of processing are often hidden from the users, in an attempt to avoid confusion
and promote transparency Muramatsu and Pratt (2001).
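The preprocessing steps described above can be sketched as follows; the stopword list and the crude plural-stripping rule are illustrative placeholders for real linguistic resources such as full stopword lists and stemmers:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative query preprocessing: lowercasing, stopword removal and a
// naive plural-to-singular normalization.
public class QueryPreprocessor {

    // tiny illustrative stopword list; real systems use far larger ones
    private static final Set<String> STOPWORDS = Set.of("the", "a", "an", "of", "in");

    public static List<String> preprocess(String query) {
        List<String> terms = new ArrayList<>();
        for (String token : query.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.isEmpty() || STOPWORDS.contains(token)) {
                continue;                       // drop stopwords
            }
            if (token.endsWith("s") && token.length() > 3) {
                token = token.substring(0, token.length() - 1); // crude singularization
            }
            terms.add(token);
        }
        return terms;
    }
}
```

The resulting term list can then be combined with Boolean AND or OR semantics, as discussed above.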
Most modern search interfaces are equipped with dynamic search suggestion,
also known as auto-completion (See Fig. 4.3). As the user starts typing, a list of
Figure 4.3: Bing’s search interface features a powerful dynamic search suggestion, where
prefixes are highlighted with grayed-out font and the remaining text is in bold.
term suggestions appears under the entry form. The suggestions contained in the
list are usually queries whose prefix matches what has been typed so far, although
there are cases where interior matches are also included. The user can then mouse-
click the most relevant query or navigate through the list, using keyboard arrows.
Studies have shown that approximately one third of all search attempts in the
Yahoo Search Assist were performed through a dynamically suggested query
Anick and Kantamneni (2008). The dynamic search suggestion technique attempts
to minimize unneeded typing from the user side and can alleviate spelling errors
early. Most importantly, though, it reassures the user that results are available,
so there is no frustration from empty result pages.
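A minimal prefix-based suggester of the kind described above can be built on a sorted set, since all entries sharing a prefix form a contiguous range. This sketch is illustrative only: it ignores ranking by query popularity and the interior-match case:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Illustrative prefix-matching suggester: in a sorted set, all entries that
// share a given prefix are contiguous and reachable through tailSet().
public class PrefixSuggester {

    private final TreeSet<String> queries = new TreeSet<>();

    public void add(String query) {
        queries.add(query.toLowerCase());
    }

    public List<String> suggest(String prefix, int limit) {
        String p = prefix.toLowerCase();
        List<String> result = new ArrayList<>();
        for (String candidate : queries.tailSet(p)) {
            if (!candidate.startsWith(p) || result.size() >= limit) {
                break;              // left the prefix range, or got enough results
            }
            result.add(candidate);
        }
        return result;
    }
}
```

In a real interface the candidate pool would be past queries or ontology term labels, re-ranked by frequency before display.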
An important point to consider is that searchers often return to their previously
accessed information. In the empirical study undertaken by Tauscher and
Greenberg (1997), it was found that there is a 58%
chance that the next web page to be visited had been visited before. A more
recent study Zhang and Zhao (2011) about tabbed browsing, conducted in 2010,
also finds page revisitation to be around the same levels, at 59.3%. Various tools
Figure 4.4: The Safari browser’s embedded search interface explicitly states which queries are
suggestions and which belong to the user’s recent search history.
Figure 4.5: The Firefox browser’s embedded search interface contains recent queries on top,
and separates them from suggestions using a solid line.
exist to help users find their intended pages, including Uniform Resource Locator
(URL) history, bookmarking of pages, basic navigation buttons (e.g. ‘Back’ but-
ton for short term page revisit) and change of URL font color if page has already
been visited. Among other methods documented, users may save whole webpages
to their local disk or keep URLs in text documents, after enriching them with
comments Jones et al. (2002). Interestingly, a common approach to revisiting
documents is actually re-searching for them Obendorf et al. (2007). Users who
Figure 4.6: Google’s search results page is a typical scrollable vertical list of captions.
Metadata facets, which restrict results to a particular type of information, are also present in
the interface (e.g. ‘Images’ tab).
adopt this strategy attempt to re-create the conditions of their previous search, by
trying to formulate the exact same query. Another strategy requires past search
queries to appear as the user types, along with regular dynamic term sugges-
tion. Separation between suggested queries and previously generated ones varies
among interfaces, as can be seen in Figures 4.4 and 4.5.
4.3 Presentation of Search Results
Search applications usually present results as a vertical list of captions, distributed
along multiple pages (see Fig. 4.6). Each caption is a clickable entity which, as a
minimum requirement, comprises a title and an excerpt of the target document
Clarke et al. (2007). Usually, the excerpt includes some or all of the query terms,
as highlighted text. In most cases, highlighting is performed using bold font or
colored term background. Many search applications tend to group similar results,
that originate from the same source, into the same caption. That way, result
‘pollution’ from a few sources is avoided and diversity is promoted. The relevance
of search results is reflected in their order of appearance. Although relevance
scores were formerly used to grade the fit of the result to the query, they are
usually not present anymore in modern search applications. The reasons behind
their omission might be to avoid reverse-engineering of the ranking algorithms and
to reduce redundancy, since the ranking itself already reflects the importance of
results Hearst (2009).
It has been observed that users tend to click on the uppermost captions
Joachims et al. (2005). In the same study, it was found that the first caption
received more attention than its successors, even if its relevance was actually
lower. Furthermore, the majority of users often remain on the first page of re-
sults. The authors in Jansen et al. (2007) observed that only 30% continued to
look for relevant results in the second page of the results, and only 15% looked
even further. Usually, the patience of a user is a function of his/her experience
in using the system. More experienced users tend to be more patient than users
who are not accustomed to the search procedure. Inexperienced users, on the
other hand, often prefer to refine their query or simply accept that what they
search for cannot be found by the search application Hearst (2009).
Apart from plain lists of results, further organization of captions may be per-
formed, using some form of faceted browsing. Facets attempt to refine search
results, according to their characteristics. As an example, Amazon’s search in-
terface provides facets that correspond to the different departments that might
contain the desired item (see Fig. 4.7).
Figure 4.7: Amazon’s search interface provides facets as a left panel to the results page,
helping the user dynamically refine the initial search.
4.4 Query Reformulation
It is common that desired search results are not discovered with the first try.
Query reformulation is the procedure which attempts to transform the original
query to a format that will match the information retrieval system’s vocabulary.
Studies using query logs have shown that the number of reformulated queries may
reach up to 52% of all queries Jansen et al. (2005). It has been observed that,
if no help for query reformulation is given explicitly by the search application,
users tend to provide simple alterations of the initial query Hertzum and Frøkjær
(1996). This bias towards initial queries is referred to as anchoring, a term coined
by psychologists Tversky and Kahneman (1975).
One of the most common sources of search failure is query mistyping Cucerzan
and Brill (2004). A common approach, which aims to correct typographical errors,
is using a dictionary and finding the most similar term to the erroneous query
Kukich (1992). Among other techniques mentioned in that work are heuristic
rule-based corrections, probabilistic approaches that determine how often specific
sequences of characters are spelt wrong, and neural network models that train
the system to automatically identify errors. The outcome of the reformulation
procedure may be shown explicitly on the interface as a suggested query (e.g.
Google’s ‘Did you mean’), or be implicitly shown in the results. The former
approach is preferred, since it gives users freedom to decide whether their intent
is actually captured in the proposed correction. More recently, distributional
approaches that take advantage of user query logs are preferred, especially by
web-based search engines Li et al. (2006).
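The dictionary-based correction idea can be sketched with the Levenshtein distance of Section 3.2.1. This is an illustrative brute-force version; production systems index the dictionary (e.g. with tries or n-gram filters) rather than scanning it linearly:

```java
import java.util.List;

// Illustrative dictionary-based corrector: pick the dictionary term with the
// smallest Levenshtein (edit) distance to the mistyped query.
public class SpellCorrector {

    // classic dynamic-programming edit distance
    public static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    public static String correct(String query, List<String> dictionary) {
        String best = query;
        int bestDistance = Integer.MAX_VALUE;
        for (String term : dictionary) {
            int distance = levenshtein(query.toLowerCase(), term.toLowerCase());
            if (distance < bestDistance) {
                bestDistance = distance;
                best = term;
            }
        }
        return best;
    }
}
```

The chosen candidate would then be shown explicitly, in the style of Google's 'Did you mean', rather than substituted silently.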
Another dimension of query reformulation is term expansion. Term expansion
refers to the suggestion of queries that relate to the initial one in some way.
Choice of related queries might take the form of thesaurus-based term substitution
Dennis et al. (1998) or attempt to extend the present query, usually by adding
single words (see Fig. 4.8). Query suggestion might also be fetched from sessions
of users who previously searched for the same information. It has also been
proposed that search applications ask the user to provide relevance feedback
Ruthven and Lalmas (2003). Although theoretical studies approve of this feature,
its appearance in commercial applications is rare.
Figure 4.8: PubMed’s results page includes term expansion in two ways. On the right of
the screen, there is a ‘Related searches’ panel that preserves the initial query and adds a new
related term to it. Also, right below the entry form there is a ‘See also’ feature which suggests
complete or partial modifications in the initial query.
Chapter 5
Requirements
This chapter describes the objective of the project and the required functionality
for the application, as stated by the AstraZeneca side.
5.1 Feature Specification
The objective of this project is to deliver a search application that allows re-
searchers to quickly perform queries for terms included in medical ontologies and
gain access to information about the chosen terms in intuitive ways. The appli-
cation should not rely on the searcher’s knowledge about the structure of specific
ontologies. The interface should be enhanced with interactive tools that guide the
user towards the desired term; this includes term auto-completion, input query
error correction, suggestion of similar terms, clever ranking and grouping of search
results. The deliverable should be straightforward to use and easy to distribute
to users, independent of the different operating systems that they might use.
Furthermore, it should include the terminology MedDRA, which is widely used
by researchers within the company.
The previous search application used within AstraZeneca did not manage to
meet the users’ requirements and was abandoned, as users had to refer to external
sources (e.g. Google) to refine their searches, when the application presented un-
Table 5.1: Documented failed queries and suggested reasons for failure.

Query: Hepatotoxicity
Comments: Searcher did not find the term and decided to search online to find a synonym for it and reformulate the query as ‘Liver Disease’.
Suggested reason for failure: Wrong ontology choice by the user. The term is clearly in MedDRA. It is also not an LLT, so the application would find it.

Query: NSCLC
Comments: The acronym refers to ‘Non-Small Cell Lung Carcinoma’, a concept which is listed in NCIT. Search returned no results.
Suggested reason for failure: Although the abbreviation ‘NSCLC’ is documented in NCIT, it is not a preferred name, so it was bypassed by the program.

Query: DIHS
Comments: Searcher expected the concept ‘Drug-induced hypersensitivity syndrome’ in MedDRA. No results were returned.
Suggested reason for failure: DIHS does not appear as an abbreviation in MedDRA, so this behavior was normal. The searcher needed to explicitly specify the preferred name, which is ‘Drug-induced hypersensitivity syndrome’.

Query: DRESS Syndrome
Comments: Refers to the same concept as DIHS. It was not found.
Suggested reason for failure: The term exists as an LLT in MedDRA. The application did not search for LLTs.
wanted results. The users’ lack of knowledge around formal logic and ontological
structure played an important role towards this result. To quote the AstraZeneca
side, “Many of our users do not understand the concept of an ontology and, as
a result, at best, struggle to use such an interface and, at worst, refuse to use
the tool (e.g. they don’t understand the concept of parent/child or if there are
multiple terms which should they choose). What users are more familiar with is a
google-like interface whereby they are able to type in their search terms without
knowledge of an ontology or what that means for them.”
Although no log file containing extensive lists of query failures is available
Table 5.2: Documented failed queries and suggested reasons for failure (cont.).

Query: VEGFR
Comments: Searcher came across multiple returned terms and did not know which one(s) to choose. Therefore, all were chosen.
Suggested reason for failure: The application does not help the user visualize possible relationships among results. Also, NCIT lists VEGFR as a synonym for both ‘Vascular Endothelial Growth Factor Receptor’ and ‘Vascular Endothelial Growth Factor Receptor 1 (VEGFR-1)’, so it is up to the searcher to decide which one is needed.

Query: LHRH
Comments: Most relevant result was ‘Gonadotropin Releasing Hormone’. The searcher did not know that term, and did not understand why the results did not contain the query.
Suggested reason for failure: The preferred term for ‘LHRH’ is ‘Gonadotropin Releasing Hormone’.

Query: NMDA Antagonist
Comments: The searcher wanted to find a list of the different NMDA antagonists. No results were found in NCIT, MedDRA or ICD.
Suggested reason for failure: This is an ontology organization characteristic. For example, in NCIT, antagonists do not all reside under a general term ‘NMDA antagonist’. The NMDA antagonist ‘Ketamine’ is listed in NCIT as a subclass of ‘Anesthetic Substance’, while ‘Aptiganel’ is listed as a subclass of ‘Neuroprotective Agent’.
for AstraZeneca’s search application, examples of failed queries have been given.
The reasons behind query failure are diverse; Tables 5.1 and 5.2 list some of the
most characteristic failed queries, along with given or deduced justifications for the
reason of failure. It is clear that the failure of some queries was due to the content
of the ontologies, and therefore inevitable. Other causes of failure included a
wrong ontology choice by the user, incomplete term coverage by the search
application, and lack of help and guidance from the system (e.g., relevance
feedback or result visualization). These application-level failures should be
targeted and alleviated.
Chapter 6
Design
This chapter addresses the design considerations for each stage of the project. In
particular, three distinct stages can be identified: the first involves gaining access
to ontologies, the second is concerned with semantic similarity calculations, while
the third covers data presentation and interface design.
6.1 Stage I: Access to Medical Ontologies
The first design stage involves gaining access to medical ontologies and terminolo-
gies. It might be argued that ontologies should be exploited in a formal ontology
language representation, such as OWL. This was abandoned for the following
reasons:
1. Not all medical terminologies respect ontological principles, thus they are
not all representable in a formal ontology language.
2. Access to the original format of some structured vocabularies (e.g. Med-
DRA) is neither public, nor free.
3. Currently, using the Java OWL1 Application Programming Interface (API),
large OWL ontologies need to be kept in main memory for the whole
duration of program execution, a fact which would degrade performance
in the case of multiple ontologies.
1http://owlapi.sourceforge.net/
Fortunately, BioPortal2 has already represented hundreds of ontologies and ter-
minologies in a common format, which is publicly accessible through the web Noy
et al. (2009).
As a result of the above observations, it was decided that the best design
choice would be to maintain a local MySQL database with ontology terms. For
demonstration purposes, three different structured vocabularies are used in this
project:
• NCIT
• MedDRA
• ICDv9
They are downloaded from BioPortal and saved locally. Of these, only NCIT
is frequently updated, at approximately monthly intervals. The versions of
NCIT, MedDRA and ICDv9 used here contain 97946, 69389, and 22400 concepts,
respectively.
6.1.1 Database and Table Creation
Initially, a MySQL database named ‘Ontologies’ is created locally. The database
holds a total of seven tables, having the following names:
• CONCEPTS
• DEFINITIONS
• SYNONYMS
• ROOTS
2http://bioportal.bioontology.org/
• PARENTS
• SIMILARITY
• MDR RELATED
Table 6.1: ‘Ontologies’ database table structure

Table          Columns (name : type)
CONCEPTS       code : varchar(20), preferredName : text, ontology : varchar(15)
DEFINITIONS    code : varchar(20), definition : text
SYNONYMS       code : varchar(20), synonym : text
ROOTS          code : varchar(20), ontology : varchar(15)
PARENTS        code : varchar(20), parentCode : varchar(20)
SIMILARITY     termcode1 : varchar(20), termcode2 : varchar(20), rada : double, wu : double, resnik : double, li : double
MDR RELATED    code : varchar(20), relatedCode : varchar(20)
The ‘CONCEPTS’ table will hold basic information about the concepts that
are present in an ontology. More specifically, for each concept, a record which
contains its preferred name, code, and ontology will be inserted to the table. Due
to the fact that multiple definitions and synonyms might exist for a single con-
cept, these will be held in separate tables, ‘DEFINITIONS’ and ‘SYNONYMS’,
respectively. The ‘ROOTS’ table will contain all the top level terms of the on-
tology/terminology. Usually, multiple independent hierarchies exist, therefore
multiple ‘roots’ can be found. For example, MedDRA contains 26 parallel hi-
erarchical structures. These so-called ‘roots’ can be joined under a top-level
universal imaginary node, that guarantees the presence of a single root in the
ontology/terminology. The table ‘PARENTS’ will contain hierarchical informa-
tion about the terms. For each concept, all of its parents will be listed. This
table can be exploited to compute semantic similarity at the next stage. The
‘SIMILARITY’ table will hold semantic similarity scores between pairs of con-
cepts that belong to the same ontology. The similarity metrics used are those
of Rada, Wu-Palmer, Resnik and Li. Finally, the ‘MDR RELATED’ table will
contain MedDRA-specific concepts that do not clearly belong to any hierarchy
themselves, but are considered very close to terms that do. The detailed struc-
ture of the tables is shown in Table 6.1. All tables except ‘SIMILARITY’ will be
populated at this stage.
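The schema of Table 6.1 translates directly into DDL statements. The following sketch lists them as Java strings, with the JDBC execution indicated in a comment; the exact column types mirror the table, while the underscore in MDR_RELATED is an assumption here, since a space in an SQL identifier would need quoting:

```java
// Illustrative DDL for the 'Ontologies' tables of Table 6.1.
public class OntologiesSchema {

    public static final String[] CREATE_STATEMENTS = {
        "CREATE TABLE CONCEPTS (code VARCHAR(20), preferredName TEXT, ontology VARCHAR(15))",
        "CREATE TABLE DEFINITIONS (code VARCHAR(20), definition TEXT)",
        "CREATE TABLE SYNONYMS (code VARCHAR(20), synonym TEXT)",
        "CREATE TABLE ROOTS (code VARCHAR(20), ontology VARCHAR(15))",
        "CREATE TABLE PARENTS (code VARCHAR(20), parentCode VARCHAR(20))",
        "CREATE TABLE SIMILARITY (termcode1 VARCHAR(20), termcode2 VARCHAR(20), "
            + "rada DOUBLE, wu DOUBLE, resnik DOUBLE, li DOUBLE)",
        "CREATE TABLE MDR_RELATED (code VARCHAR(20), relatedCode VARCHAR(20))"
    };

    // Sketch of execution through JDBC (connection URL and credentials are
    // placeholders, not run here):
    //   try (Connection conn = DriverManager.getConnection(url, user, password);
    //        Statement st = conn.createStatement()) {
    //       for (String ddl : CREATE_STATEMENTS) st.executeUpdate(ddl);
    //   }
}
```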
6.1.2 Populating the Database Tables
The procedure for downloading the chosen ontologies and populating the database
tables relies on the BioPortal Representational State Transfer (RESTful) ser-
vices3. These services allow the transfer of medical ontology information, from
BioPortal servers to end user systems, through the Hypertext Transfer Protocol
(HTTP). The response is, by default, in XML format, with limited support for
JavaScript Object Notation (JSON) format. Complete support for JSON output
is scheduled for the next release. Accessing the BioPortal RESTful services is
performed through the usage of intuitive Uniform Resource Identifiers (URIs) of
predefined structure. All that is required for gaining access to the RESTful ser-
vices is a user-specific API key, which is immediately given when a free account is
created on the BioPortal website. Some examples of the types of available term
3http://www.bioontology.org/wiki/index.php/BioPortal_REST_services
services are given in Table 6.2. Quantities in brackets are user-defined. As an
example request, consider the ‘get all terms’ service for NCIT:
http://rest.bioontology.org/bioportal/virtual/ontology/1032/all?pagesize=50&pagenum=1&apikey=c6ae1b27-9f86-4e3c-9dcf-087e1156eabe. The virtual on-
tology id 1032 refers to NCIT. As stated before, the API key is a string identifier
which is received upon free registration to BioPortal. The response includes the
first 50 terms of the NCIT ontology. A (part of the) XML response is shown in
Fig. 6.1. It should be observed that the ‘get all terms’ service does not actually
return all terms from a specific ontology at once; for each request, the user must
provide a ‘terms-per-page’ number, and the particular page that he/she wishes to
view. All pages can be returned if the user continues issuing page requests with
increasing pagenum, provided that the user knows the number of concepts that
the ontology includes.
Table 6.2: Examples of URI formats for BioPortal RESTful services.

Get all terms:
http://rest.bioontology.org/bioportal/virtual/ontology/{ontologyid}/all?pagesize={pagesize}&pagenum={pagenum}&apikey={YourAPIKey}
Returns all terms of an ontology, page by page.

Get concept info:
http://rest.bioontology.org/bioportal/virtual/ontology/{ontologyid}/{conceptid}&apikey={YourAPIKey}
Returns information about a specific term, such as synonyms and definitions.

Get latest ontology version:
http://rest.bioontology.org/bioportal/virtual/ontology/{ontologyid}?apikey={YourAPIKey}
Returns the currently used version id of an ontology.
Figure 6.1: A part of the XML response for the ‘get all terms’ query of Table 6.2.
Access to BioPortal RESTful services can be achieved programmatically in a
simpler and automated manner, using the ontoCAT4 Java API. This API pro-
vides classes and methods tailored to the BioPortal services. It provides a high
level abstraction, that handles queries and XML responses behind the scenes and
returns lists of Java objects that contain the information needed to populate the
database tables. The provided methods are shown in Fig. 6.2.
The ontoCAT API method ‘getAllTerms()’ returns a list of all terms in the
ontology, which is what is needed in this project. Its drawback is that it keeps all
ontology terms in memory, causing a heavy memory burden which may lead to
‘out of memory’ exceptions when further processing is needed. For this reason,
I introduced a new function ‘getAllTermsPageByPage()’, which allows retrieving
and processing terms page by page in a loop. Then, memory can be released
after each iteration. In order to save information to the database tables, the
4http://www.ontocat.org/
Figure 6.2: The provided methods of the ontoCAT API Adamusiak et al. (2011).
‘getAllTermsPageByPage()’ method is called. It is chosen that pagesize=1, so
that only one concept per page is returned. Then, for each concept returned
by ontoCAT, the required information is saved to the appropriate table in the
‘Ontologies’ database. The procedure is shown in Fig. 6.3.
Figure 6.3: Populating the ‘Ontologies’ database is performed with the help of the ontoCAT API.
The Java application, which was developed for this project, requests all concepts of a BioPortal
ontology, page by page, using ontoCAT methods. OntoCAT acts as an inter-
mediary, responsible for accessing the RESTful services of BioPortal. It returns
Java object(s) back to the Java application, after processing the XML response
of BioPortal. Once the Java application receives information about a term, all
that is left is to choose the appropriate table(s) in the ‘Ontologies’ database and,
through the Java Database Connectivity (JDBC) API, insert record(s) of MySQL
format. Once all pages are processed, the Java application finishes execution and
all tables, except SIMILARITY, are populated.
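The page-by-page population loop can be sketched as follows. PageFetcher is a hypothetical stand-in for the ontoCAT call (the real getAllTermsPageByPage() talks to the BioPortal RESTful services), so that the control flow can be shown without network access:

```java
import java.util.List;

// Illustrative skeleton of the page-by-page population loop. The sink list
// stands where the JDBC inserts into the 'Ontologies' tables would go.
public class PageByPageLoader {

    // Hypothetical stand-in for the ontoCAT call: returns one page of term
    // names, or an empty list when pagenum is past the last page.
    public interface PageFetcher {
        List<String> fetchPage(int pagenum, int pagesize);
    }

    public static int loadAll(PageFetcher fetcher, int pagesize, List<String> sink) {
        int pagenum = 1;
        int loaded = 0;
        while (true) {
            List<String> page = fetcher.fetchPage(pagenum, pagesize);
            if (page.isEmpty()) break;   // no more pages
            sink.addAll(page);           // here: insert record(s) via JDBC
            loaded += page.size();
            pagenum++;                   // memory for this page can now be freed
        }
        return loaded;
    }
}
```

Because each page is processed and released before the next request, memory consumption stays bounded regardless of ontology size, which is the motivation given above for getAllTermsPageByPage().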
6.2 Stage II: Computation of Semantic Similarity
This stage deals with the calculation of semantic similarity scores between pairs
of concepts that reside in the same ontology. Semantic similarity scores will be
saved in the SIMILARITY table and will later be used in the search application
for the semantic grouping of search results and the suggestion of highly similar
terms to a term chosen by the user. To populate the SIMILARITY table, the
already populated tables CONCEPTS, PARENTS and ROOTS will be used.
6.2.1 Term Neighborhoods
Computing semantic similarity between all concept pairs in an ontology is a te-
dious task which requires a lot of computational and storage resources. Let us
consider NCIT as an example: there are 97946 concepts, yielding 97946² pairs5,
whose semantic similarity must be calculated. This is not the only burden; seman-
tic similarity calculation of a single pair is, by itself, a time-consuming process.
For example, even for the simple Rada edge-counting measure, all connecting
paths between two concepts must first be computed (i.e. a recursive process)
and, finally, the shortest one chosen. In large ontologies, it is not unusual that
5actually, due to the symmetric property of similarity, there is no need to calculate all 97946²
pairs. Also, self similarities can be avoided, depending on the similarity metric used. Still, the
numbers are huge.
multiple paths of variable length exist between two concepts, so finding the min-
imum path is not as trivial as it may seem.
In the final search application, semantic similarity will be used for suggesting
highly similar terms to the query or grouping highly similar terms. Therefore,
term pairs whose semantic similarity is low will never be needed. For example,
there is no point in storing or even computing the similarity between the NCIT
concept ‘Greece’ and the concept ‘Lung’, since the resulting very low score will
never be used in the search application itself. The term ‘Greece’ will never be
suggested as a highly similar term of ‘Lung’, and vice versa.
For the above reasons, the design choice for this project is to exploit the ge-
ometrical structure of ontologies/terminologies and, for each concept, calculate
semantic similarity only with concepts that are placed within a certain neighbor-
hood from it. Given a concept c, its neighborhood is chosen to contain:
• All concepts that are descendants of c at most two levels down in the hier-
archy.
• All concepts that are siblings of c.
• All concepts that are ancestors of c, at most two levels up in the hierarchy.
This choice greatly reduces the computational burden associated with semantic similarity computation in huge ontologies, without threatening the performance of the search application. Furthermore, valuable MySQL storage is not wasted.
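The neighborhood definition above can be sketched as follows, with hypothetical in-memory parent/child maps standing in for the PARENTS table; this is only an illustration of the rule, not the project's database-backed implementation.

```java
import java.util.*;

public class Neighborhood {
    // Hypothetical in-memory relations standing in for the PARENTS table.
    static Map<String, List<String>> parents = new HashMap<>();
    static Map<String, List<String>> children = new HashMap<>();

    // Concepts reachable within 'levels' steps by following the given relation.
    static Set<String> within(Map<String, List<String>> rel, String c, int levels) {
        Set<String> out = new HashSet<>();
        List<String> frontier = List.of(c);
        for (int i = 0; i < levels; i++) {
            List<String> next = new ArrayList<>();
            for (String x : frontier)
                next.addAll(rel.getOrDefault(x, List.of()));
            out.addAll(next);
            frontier = next;
        }
        return out;
    }

    // Neighborhood of c: descendants and ancestors at most two levels away,
    // plus siblings (children of c's parents, excluding c itself).
    static Set<String> neighborhood(String c) {
        Set<String> n = new HashSet<>();
        n.addAll(within(children, c, 2));
        n.addAll(within(parents, c, 2));
        for (String p : parents.getOrDefault(c, List.of()))
            for (String s : children.getOrDefault(p, List.of()))
                if (!s.equals(c)) n.add(s);
        return n;
    }
}
```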
6.2.2 Semantic Similarity Calculation
In this project, four different semantic similarity metrics have been chosen: Rada, Wu and Palmer, Resnik, and Li. Due to the lack of a specific corpus for Resnik similarity, Seco's formula is used, as presented in Chapter 3 (see 3.28). For the calculation of semantic similarity, I developed a Java application, which contains the following basic methods (method parameters and other utility methods are not shown, for simplicity):
• getAllPathsToRootDB()
• getMinimumPathToRootDB()
• getAllPathsBetweenTwoConceptsDB()
• getMinimumPathBetweenConceptsDB()
• computeLocalSimilarities()
• NormalizedRadaSimilarity()
• WuPalmerSimilarity()
• LiSimilarity()
• ResnikSimilarity()
The method getAllPathsToRootDB() uses the PARENTS table to recursively
build all paths between a concept and any of the roots of an ontology. Recursion
stops every time a concept which belongs to the ROOTS table is encountered.
The method getMinimumPathToRootDB() simply calls getAllPathsToRootDB()
and chooses the minimum path out of the returned ones. The method getAll-
PathsBetweenTwoConceptsDB() first computes each term’s paths to the root
separately, using the getAllPathsToRootDB() method. Then, it compares each
of the first term’s paths to root to each of the second term’s paths to root; if any
two paths have common nodes, it means that a common path (that passes through
their LCS) can be defined between the nodes; if no common nodes are present,
a common path only exists through the single (imaginary) root of the ontology.
The method getMinimumPathBetweenConceptsDB() simply calls getAllPathsBetweenTwoConceptsDB() and selects the shortest path.
The methods NormalizedRadaSimilarity(), WuPalmerSimilarity(), LiSimilar-
ity(), and ResnikSimilarity() call the previously mentioned path building methods
with two concepts as arguments, and produce a numerical value that corresponds
to the particular similarity metric. The method computeLocalSimilarities() is the one that is called from main(). This method is responsible for computing the neighborhood of a term, calling NormalizedRadaSimilarity(), WuPalmerSimilarity(), LiSimilarity(), and ResnikSimilarity() on each pair of concepts, and saving the results to the SIMILARITY table.
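Once path lengths and depths are available, the four metrics reduce to simple formulas. The sketch below uses the standard textbook forms, with the usual Li parameters (alpha = 0.2, beta = 0.6) and one common way of normalizing the Rada distance into a [0, 1] similarity; the exact constants and normalization used in the project may differ.

```java
public class Metrics {
    // One common normalization of Rada distance: similarity decreases
    // linearly with path length, relative to twice the maximum depth.
    static double normalizedRada(int pathLength, int maxDepth) {
        return 1.0 - (double) pathLength / (2.0 * maxDepth);
    }

    // Wu-Palmer: twice the depth of the least common subsumer over the
    // sum of the two concept depths.
    static double wuPalmer(int depthLcs, int depthA, int depthB) {
        return 2.0 * depthLcs / (depthA + depthB);
    }

    // Li et al.: combines shortest path length l and subsumer depth h;
    // tanh(b*h) equals (e^{bh} - e^{-bh}) / (e^{bh} + e^{-bh}).
    static double li(int l, int h) {
        double a = 0.2, b = 0.6;
        return Math.exp(-a * l) * Math.tanh(b * h);
    }

    // Seco's intrinsic information content, from the number of hyponyms
    // (descendants) of a concept and the total ontology size; Resnik
    // similarity is then the IC of the least common subsumer.
    static double secoIC(int hyponyms, int totalConcepts) {
        return 1.0 - Math.log(hyponyms + 1) / Math.log(totalConcepts);
    }
}
```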
6.3 Stage III: Interface Design and Data Presentation
At the end of stage II, the ‘Ontologies’ database is complete and does not need
further changes. The third stage deals with querying the available data and
presenting it to the end user. It has been chosen to utilize web technologies
for developing the search application. Building the search application in a web
environment presents, among others, the following advantages:
• The files reside on a central server, and not on each of the clients’ machines
individually. Updates may be done transparently.
• Access to the application by client systems is independent of their operating
system.
• The application can benefit from the browsers’ built-in functionality (e.g.
no need to provide separate back-forward buttons).
• The application can benefit from the huge variety of interactivity tools that
have been designed for webpages.
The information to be presented is fetched from the populated MySQL tables using the server-side scripting language PHP (PHP: Hypertext Preprocessor).
Presentation and styling are achieved using the Extensible HyperText Markup
Language (XHTML) and Cascading Style Sheets (CSS), respectively. Auto-
completion is performed using Asynchronous JavaScript and XML (AJAX) which
returns data in JSON format, to be fed to the Twitter Typeahead jQuery plu-
gin7. To further favor interactivity, various jQuery plugins are selected, includ-
ing Tipsy8 and Throttle/Debounce9. Finally, for visualization purposes, the D3
framework10 is used. The major advantage of all the above technology choices
is that they are widely used, cross-platform and open-source, meaning that they
are actively maintained, highly portable and modifiable. More details about their
usage will be presented in chapter 7.
6.4 Summary of Technology Choices
A summary of the technology choices for the project is shown in Table 6.3. The
table is divided into sections that refer to the three stages described previously.
The technologies, languages, frameworks and APIs used at each particular stage
are mentioned.
[7] https://github.com/twitter/typeahead.js
[8] https://github.com/jaz303/tipsy
[9] http://benalman.com/projects/jquery-throttle-debounce-plugin/
[10] http://d3js.org/
Table 6.3: Technology choices for the project.

Stage I (Access to Medical Ontologies/Terminologies): Java; BioPortal RESTful Web API; ontoCAT Java API; JDBC API; MySQL
Stage II (Computation of Semantic Similarity): Java; JDBC API; MySQL
Stage III (Interface Design and Data Presentation): PHP; MySQL; AJAX; XHTML; CSS; JavaScript; D3; jQuery Twitter Typeahead; jQuery Tipsy; jQuery Throttle/Debounce; JSON
Chapter 7
Implementation
This chapter provides a thorough description of the features that are present in
the final search application. It introduces the visual interface, which is respon-
sible for interaction with the end user. Furthermore, it familiarizes the reader
with the functionality of the individual components that are responsible for the
presentation, styling and interactive behavior of the application.
7.1 Structure
The organization of the files used for building the web application is listed in
Fig. 7.1. The functionality of each file is briefly described in Tables 7.1, 7.2, 7.3
and 7.4.
7.2 Search Entry Form
As mentioned in section 4.2, queries are usually less than or equal to 4 words.
That result reflects query specification in web-based search engines, where users can search for any topic they wish. In the more granular biomedical domain, users usually attempt more targeted searches. Furthermore, the application
to be deployed in this project is aimed at term searching, instead of document
searching. Thus, users are aware that they are searching for short-length terms
instead of multi-page documents, and it is likely that queries are even shorter
than the average 2.8 words. Indeed, the example queries given by AstraZeneca
are comprised of at most two words. Also, due to the auto-completion feature,
lengthy terms will not need to be typed, but simply chosen from a dynamic list.
Despite the fact that short queries are expected, a wide entry form is chosen, to resemble a ‘Google-like’ experience and provide better visibility for the auto-completion features.
Figure 7.1: The organization of the files that comprise the web application. These files are
responsible for the presentation, styling and interactive behavior of the web application.
Table 7.1: PHP files used in the search application.
File Description
mysqli_connect.php Script which establishes a connection to the MySQL
‘Ontologies’ database. This script should not be pub-
licly accessible, for security reasons.
index.php The main page. It also handles enter-key or mouse-
click searches, by querying the ‘Ontologies’ database
and presenting the search results table.
performQuery.php Script which queries the ‘Ontologies’ database and
echoes a JSON array of the results.
terminfo.php Presents information about a specific term, including
its code, definitions, and synonyms. A visualization
of highly similar terms is shown, using d3.v3.min.js
and jquery.tipsy.js. Also, an XML version of the vi-
sualization is shown.
Combinatorics.php Performs permutations of a set of items (e.g. words
of the query).
JaccardSimilarity.php Computes the Jaccard lexical similarity between two
strings.
Table 7.2: XHTML files used in the search application.
File Description
header.xhtml Contains the shared header information among all
web pages. This includes the search box.
footer.xhtml Contains the shared footer information among all
web pages.
The search box can be seen in Fig. 7.2, inside the main window of the search
application (index.php). The search box is placed at the top-central part of the
interface. It is visible on every page that a user visits, so that new queries can be
performed anytime the user wishes. The box is characterized by rounded corner
Table 7.3: CSS files used in the search application.
File Description
contentStyle.css Defines styles for the web application interface.
tipsy.css Defines styles for building interactive tooltips.
type.css Defines styles for the auto-completion function.
Table 7.4: JavaScript files used in the search application.
File Description
d3.v3.min.js A JavaScript library that allows binding arbitrary
objects to the DOM. It facilitates the development
of visualization tools.
hogan-2.0.0.js A JavaScript library that allows the sharing of tem-
plates between client and server.
jquery-1.10.1.js A JavaScript library which facilitates DOM manipu-
lation, event handling, animation and AJAX.
jquery.ba-throttle-debounce.js A plug-in for throttle and debounce. Throttle limits
the rate of execution of handlers. Debouncing en-
sures that a function is executed only once within a
certain time period.
jquery.tipsy.js A jQuery plugin for creating Facebook-like tooltips.
typeahead.js A jQuery plug-in for auto-completion, developed by
Twitter. It may receive an array of JSON objects to
build the auto-completion pop-up menu.
performAsynchronousQuery.js A script which calls performQuery.php and feeds the
returned JSON object array to typeahead.js.
edges, a CSS3 feature. Also, a helpful message is set as a placeholder when the
search box is out of focus. This message informs the user of the type of query that
should be input. Once the user clicks inside the box, the grey message disappears
and a blinking cursor appears (see Fig. 7.3). If the user clicks anywhere else
within the page, the message reappears.
Figure 7.2: The main window of the search application. The search box is placed at the
top of the screen, with central horizontal alignment. A submit button labeled ‘Search’ is also
provided, to assist users that prefer mouse-clicking.
Figure 7.3: Once the user clicks inside the search box, the grey help message disappears and
a blinking cursor takes its place.
7.3 Handling the Input Query
The user may input a multi-word query in the provided search box. Handling the input query depends on the speed at which the user is typing, and on the keys or buttons that are pressed or clicked. To trigger the search, the user can choose among pressing the Return key, selecting a term from the pop-up auto-completion menu, or clicking the button labeled ‘Search’, which is placed on the right side of the search input form.
7.3.1 Typing Speed
If a user presses keys at a fast pace, there is no need to burden the server with consecutive requests, since only the last response will be examined by the user. To achieve such functionality, a debounce function is used (defined in jquery.ba-throttle-debounce.js), which ensures that only the last event within a certain millisecond time period is taken into account. Thus, unintended requests are avoided and the application's performance is maintained at high levels.
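A debounce function of the kind provided by jquery.ba-throttle-debounce.js can be illustrated in Java with a scheduled executor; this is only an analogous sketch, not the plug-in's actual implementation.

```java
import java.util.concurrent.*;

public class Debouncer {
    private final ScheduledExecutorService exec =
        Executors.newSingleThreadScheduledExecutor();
    private ScheduledFuture<?> pending;
    private final long delayMs;

    Debouncer(long delayMs) { this.delayMs = delayMs; }

    // Each call cancels the previously scheduled task, so only the last
    // call within the delay window actually runs.
    synchronized void call(Runnable task) {
        if (pending != null) pending.cancel(false);
        pending = exec.schedule(task, delayMs, TimeUnit.MILLISECONDS);
    }

    void shutdown() { exec.shutdown(); }
}
```

A burst of calls closer together than the delay therefore triggers exactly one request, which is the behaviour relied on while the user is typing.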
7.3.2 Querying the Database
Once a query has been approved for processing, it is sanitized, i.e. it is ensured
that its format is appropriate for insertion into a formal MySQL query and that
SQL injections are avoided. The formed MySQL query searches for terms that
contain the input words as prefixes, in the CONCEPTS and SYNONYMS tables
of the ‘Ontologies’ database. For example, an input query ‘can lun’ will return,
among others, the terms ‘lung cancer’ and ‘cancer of lung’, since all input
words are found as prefixes of words included in the terms. On the other hand,
the query ‘carc lun’ will not return the above two terms, since the ‘carc’ term
is not matched. It should be noted that the order of the input query words is not important. Also, mid-word matches are not supported, so a query ‘ance’ will not return the term ‘cancer’.
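In the application this rule is expressed as a MySQL query, but the matching behaviour described above can be mirrored as a standalone predicate, purely for illustration:

```java
public class PrefixMatch {
    // True if every query word is a prefix of at least one word of the
    // term, regardless of word order; mid-word matches are rejected.
    static boolean matches(String query, String term) {
        String[] termWords = term.toLowerCase().split("\\s+");
        for (String q : query.toLowerCase().split("\\s+")) {
            boolean found = false;
            for (String w : termWords)
                if (w.startsWith(q)) { found = true; break; }
            if (!found) return false;
        }
        return true;
    }
}
```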
Finally, it has been chosen that only a single result is returned per concept; a
single concept might have multiple synonyms that match the same query. For
example, the query ‘lung ca’ returns both ‘lung cancer’ and ‘lung carcinoma’,
terms which correspond to the same concept. Presenting both terms in the results
would be redundant, so only the lexically closest term to the query is presented
(i.e., ‘lung cancer’). Thus, a term appearing in the results is not always the
preferred term for a concept, but the term that best matches the given input
query.
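The per-concept de-duplication can be sketched as follows; the score function and the {code, term} pair representation are illustrative stand-ins for the application's actual lexical scoring and database rows.

```java
import java.util.*;
import java.util.function.ToDoubleFunction;

public class Dedup {
    // Keep, for each concept code, only the matched term name with the
    // highest lexical score against the query.
    static Map<String, String> bestPerConcept(List<String[]> hits,
            ToDoubleFunction<String> score) {
        Map<String, String> best = new HashMap<>();
        for (String[] h : hits) {
            String code = h[0], term = h[1];
            String current = best.get(code);
            if (current == null
                    || score.applyAsDouble(term) > score.applyAsDouble(current))
                best.put(code, term);
        }
        return best;
    }
}
```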
7.3.3 Ranking and Grouping of Search Results
Lexical similarity determines the ranking of search results, independent of how
the search is triggered. For each term returned from the database query, the
lexical similarity of its term name is computed against the input query. The final
score is the maximum of a character-based and a word-based lexical similarity. In
this project, Levenshtein and Jaccard similarities are used, implemented as PHP
functions. The similarity takes a value in [0, 1] and is converted to a percentage
for visual purposes.
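A plain-Java sketch of this scoring follows (the application implements it as PHP functions; tokenization and case handling here are assumptions):

```java
import java.util.*;

public class LexicalScore {
    // Character-based: Levenshtein similarity, i.e. one minus the edit
    // distance divided by the length of the longer string.
    static double levenshteinSim(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                d[i][j] = Math.min(
                    Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                    d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 1.0
            : 1.0 - (double) d[a.length()][b.length()] / longest;
    }

    // Word-based: Jaccard similarity of the two word sets.
    static double jaccardSim(String a, String b) {
        Set<String> sa = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\s+")));
        Set<String> sb = new HashSet<>(Arrays.asList(b.toLowerCase().split("\\s+")));
        Set<String> inter = new HashSet<>(sa);
        inter.retainAll(sb);
        Set<String> union = new HashSet<>(sa);
        union.addAll(sb);
        return union.isEmpty() ? 1.0 : (double) inter.size() / union.size();
    }

    // Final lexical score: the maximum of the two similarities.
    static double score(String query, String term) {
        return Math.max(levenshteinSim(query, term), jaccardSim(query, term));
    }
}
```

Taking the maximum lets a word-order change (caught by Jaccard) or a small typo (caught by Levenshtein) still score highly.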
Semantic similarity determines the grouping of search results. For each term in the results, its semantic similarity to all the remaining result terms that reside lower in the table is retrieved. This is achieved through MySQL queries
to the SIMILARITY table. Highly similar terms (i.e., whose semantic similarity
score is larger than a threshold, 0.75 or 75% in this project), are grouped together.
From the semantic group, the term with highest lexical similarity to the query
acts as the main concept in the table row, and similar terms appear indented.
This choice preserves the lexical ranking. As an example, a search for ‘Lung’ is
shown in Fig. 7.4. The terms ‘Right Lung’ and ‘Left Lung’ are highly similar
to ‘Lung’, so they are presented in the same row. The main term which shelters the
rest is ‘Lung’, since it is lexically identical to the input query. Semantic grouping
is performed only in the return-key or mouse-click search cases, and not in the
Figure 7.4: Terms that would appear on their own table row are grouped under a more lexically-matching term to the query, when their semantic similarity to that term is higher than a threshold.
auto-completion menu.
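The grouping step can be sketched as follows; the nested similarity map is an illustrative stand-in for lookups against the SIMILARITY table.

```java
import java.util.*;

public class SemanticGrouping {
    // Group a lexically-ranked result list: each term either joins the
    // first earlier group whose head term it resembles above the threshold
    // (0.75 in this project), or starts a new group of its own. Group
    // heads keep their lexical rank, so the overall ranking is preserved.
    static List<List<String>> group(List<String> ranked,
            Map<String, Map<String, Double>> sim, double threshold) {
        List<List<String>> groups = new ArrayList<>();
        for (String term : ranked) {
            boolean placed = false;
            for (List<String> g : groups) {
                double s = sim.getOrDefault(g.get(0), Map.of())
                             .getOrDefault(term, 0.0);
                if (s >= threshold) { g.add(term); placed = true; break; }
            }
            if (!placed) groups.add(new ArrayList<>(List.of(term)));
        }
        return groups;
    }
}
```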
7.3.4 Return-key or Mouse-click Search
If the user presses the Return key or clicks on the ‘Search’ button, the query is
processed by index.php. The form is submitted using the HTTP GET method, as
can be seen from the URL of Fig. 7.5. The index.php script receives the ‘query’ string through the predefined $_GET variable in PHP. After the MySQL database is
queried, results are presented in an array with clickable entries that redirect
to the specific term information page. Lexical ranking and semantic grouping
are performed. Each array entry contains basic information about the specific
concept, including term name, preferred name for the concept, code identifier in
the ontology, abbreviation of the ontology it belongs to, and lexical similarity
score from comparison to the input query.
7.3.5 Auto-completion Search
If the user presses any key other than ‘Return’, the query is processed by perfor-
mAsynchronousQuery.js to produce auto-completion. Auto-completion requires
that the page is not reloaded. The JavaScript function performAsynchronous-
Query() uses AJAX to send an asynchronous query request to performQuery.php.
The performQuery.php queries the MySQL database and returns an array of the
results as JSON objects (see Fig. 7.6), which, in turn, are fed to typeahead.js
to create the auto-completion pop-up menu, as seen in Fig. 7.7. Each entity in
the auto-completion pop-up menu is dedicated to a single term. It presents four
different types of information about it. On the top-left part, the term name that
best matches the query is shown. This is not always the preferred name for the term. For this reason, the lower-left part of the entity always holds the preferred
term name for the concept. The lower-right hand side hosts the abbreviation
of the ontology/terminology from where the term is extracted. Finally, at the
upper-right hand side, the lexical similarity to the input query is shown. For this
Figure 7.5: Pressing the ‘Return’ key or clicking the ‘Search’ button submits the query to index.php and a table of search results is added to the interface.
Figure 7.6: Part of the JSON response from performQuery.php, for the input query ‘rash’.
Each JSON object represents a term matching the query, and contains information that can be
used for its presentation.
Figure 7.7: Pressing any other key except ‘Return’ submits the query through AJAX to
performQuery.php and an auto-completion pop-up menu is created from the JSON response.
project, the maximum number of entities that the auto-completion pop-up menu
can contain has been set to 8.
7.4 Error Correction
If no term matches are found for the input query, the application tries to guess
the intended query and match it to the closest term in the CONCEPTS and SYNONYMS tables. Returning a ‘No results’ screen was not preferred, as it is not helpful and can cause frustration to the user. The application uses soundex keys to perform elementary error correction for terms that sound similar but are spelt differently due to user error. An example is shown in Fig. 7.8, where the user input is ‘lyng’. Since there are no matches in the database, the application
suggests the term ‘lung’ as a possible correction for the user to choose. The mes-
sage takes the form ‘Did you mean <suggestion> instead of <no result query>?’.
To accept the correction, the user can simply click on the provided link, instead
of having to refine the query in the search box.
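Soundex keys map similar-sounding words to the same short code. A simplified classic Soundex encoder is sketched below; it omits the H/W refinement of the full algorithm, and the application may instead rely on ready-made implementations such as MySQL's SOUNDEX() or PHP's soundex(), which produce keys of the same family.

```java
public class Soundex {
    // Classic four-character Soundex code: keep the first letter, then
    // encode the remaining consonants as digit classes, dropping vowels
    // and collapsing adjacent duplicates.
    static String encode(String s) {
        String t = s.toUpperCase().replaceAll("[^A-Z]", "");
        if (t.isEmpty()) return "";
        // Digit class for each letter A..Z (0 = vowel or H/W/Y, ignored).
        String codes = "01230120022455012623010202";
        StringBuilder out = new StringBuilder();
        out.append(t.charAt(0));
        char prev = codes.charAt(t.charAt(0) - 'A');
        for (int i = 1; i < t.length() && out.length() < 4; i++) {
            char c = codes.charAt(t.charAt(i) - 'A');
            if (c != '0' && c != prev) out.append(c);
            prev = c;
        }
        while (out.length() < 4) out.append('0');
        return out.toString();
    }
}
```

Because ‘lung’ and ‘lyng’ differ only in a vowel, they encode to the same key, which is exactly what makes the ‘Did you mean’ suggestion possible.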
Figure 7.8: Error correction when the input query is ‘lyng’. The closest term is suggested, as a clickable link.
7.5 Term Information Presentation
Once the user selects a term, either from the table of results or from the auto-
completion pop-up menu, the terminfo.php script is called. The script accepts
four different types of information about the term:
1. term name,
2. code,
3. preferred concept name,
4. ontology it belongs to.
This information is passed using the GET method. The terminfo.php script
produces an XHTML page which presents this information (see Figures 7.10-
7.11). Furthermore, using the term code, the SIMILARITY table is queried to look for highly similar terms to the currently viewed term [1].
Using the D3 JavaScript library, the returned terms are mapped to SVG
circles, the size of which differs, depending on their semantic similarity score to
the currently viewed term. These circles are organized in a spiral, whose central
terms are the most similar to the currently viewed term. As we move towards
the edge of the spiral, terms become less and less similar to the viewed term.
Thus, larger circles reside at the center of the spiral, and their size decreases as
we move out to the periphery. Inside each circle, a substring of the term name is
shown. When the user places the mouse cursor over a circle, a tooltip with the
full term name and semantic similarity score to the viewed term is immediately
presented (see Fig. 7.9). When the user clicks on a circle, he/she is redirected to
the particular term’s information page.
[1] In the term information figures presented in this thesis, Wu-Palmer semantic similarity is used. This can easily be changed in the terminfo.php script.

Figure 7.9: When the user places the mouse cursor on a circle, a tooltip immediately appears, containing the full term name and the semantic similarity score with the viewed term.

Circle size is not the only tool used for classifying terms. It is also desirable that the user can distinguish whether a term is:

1. a descendant,
2. a sibling,
3. an ancestor,
4. not in the hierarchy,
when compared to the current term. To distinguish between the above cases,
different colors are used. Red is used for descendants. Green is used for ancestors
or siblings. Blue is used for terms not in the hierarchy. This last case is not valid
for NCIT (see Fig. 7.10) or ICDv9, but can be observed in MedDRA (see Fig.
7.11). When MedDRA is stripped of the leaf level (i.e., LLT terms), it can be
considered a valid hierarchy. At the same time, the removed LLT terms are not
in any hierarchy anymore, despite the fact that very close relations to PTs exist.
There must be a way to denote this type of similarity. In MedDRA, it is denoted
as RQ, meaning related or possibly synonymous terms.
The choice of color has dual usage. Different shades of the same color mean
that:
• due to same color, the terms are all of the same type (e.g. all ancestors of
the viewed term)
• due to different shade, each shade acts as a further grouping, denoting how
semantically close the terms are to the viewed term. For example, ancestor
terms, whose semantic similarity to the viewed term lies between 0.75 and
0.80, will have a lighter shade of green than ancestor terms whose semantic similarity to the viewed term lies between 0.90 and 0.95. This color clustering
is a redundant measure; after all, circle size also clusters terms according
to their semantic similarity score. Sometimes, though, circle sizes are very
close, and the eye might be tempted to consider them as equal, so a different
color shade removes this possibility.
In addition to the D3 visualization, an XML representation of the similar terms
is provided as an alternative. It may also be used in older browsers that do
not support the JavaScript libraries used. Each term entry in XML includes
basic term information, such as name and code, and a list of similar terms, as
shown in Fig. 7.12. Finally, the page is equipped with help tooltips, that provide
information about components that are present on the page (see Fig. 7.13).
Figure 7.10: Presentation page for the NCIT term ‘Recurrent NSCLC’. On the left side, the basic term information is shown, along with an XML representation of highly similar terms. On the right side, a visualization of highly similar terms is provided, using the D3 JavaScript library.
Figure 7.11: Presentation page for the MedDRA term ‘Rash’. The term has very close relations with terms that are not in the hierarchy. This is illustrated using blue color.
Figure 7.12: The XML representation of a term. It includes basic term information and
highly similar terms.
Figure 7.13: Help is provided through tooltips that activate on mouse-over.
7.6 Navigation
The main pages that are presented to the user during a search are only two: in-
dex.php, which acts as the main and results presentation screen, and terminfo.php,
which provides information about a chosen concept. The user can reach a specific
term by performing four different actions:
1. by clicking on a term entry, which appears in the auto-completion pop-up
menu (from either index.php or terminfo.php),
2. by clicking on a term entry, which appears in the results table of index.php,
3. by clicking inside a circle in the term visualization tool in terminfo.php,
4. by clicking on a suggested correction term in index.php.
Navigation is further assisted by exploiting the browser’s built-in functionality.
Navigating through pages can be performed through ‘Back’ and ‘Forward’ but-
tons, or explicitly through the history log of the browser. As far as individual
items are concerned, access to the search box can be achieved through the keyboard, using the ‘Tab’ key. The jQuery plugins used also support commonly
used keyboard shortcuts. As an example, the entries inside the auto-completion
pop-up menu can be selected using the ‘up’ and ‘down’ keys. Pressing the ‘Re-
turn’ key changes the page location to the appropriate term.
Chapter 8
Evaluation
The search application that was developed in this project is evaluated as follows:
• the failed queries of AstraZeneca’s previous search application are tested
again,
• the application is compared to the BioPortal online search service,
• the application’s potential use is commented on by an AstraZeneca search
specialist.
8.1 Testing the Failed Queries
In this section, the failed queries of the previous search application used at As-
traZeneca are re-tested, using the new search application that was developed in
this project. The failed queries and their reasons for failure have been given in
Tables 5.1 and 5.2 of Chapter 5. The results of testing the same queries with
the newly developed application are summarized in Table 8.1. Only two queries
did not produce better results, ‘DIHS’ and ‘NMDA Antagonist’ (see Figures 8.1
and 8.2), but this behavior was already expected from the specification; these two terms do not appear in the supported ontologies. They are listed neither as preferred terms nor as synonyms, so it is normal that they cannot be found.
Of the other terms, ‘Hepatotoxicity’ (see Fig. 8.3), ‘NSCLC’ (see Fig.
8.4) and ‘DRESS Syndrome’ (see Fig. 8.5) appear unambiguously in the auto-
completion pop-up menu, as the user starts typing, so the user can quickly jump
to the desired term page. The query ‘LHRH’ returns two different results, with
preferred names ‘GNRH1 wt Allele’ and ‘Gonadotrophin Releasing Hormone’,
respectively (see Fig. 8.6). The NCIT has listed ‘LHRH’ as synonym for both
concepts, so the user must decide which one is desired. In contrast to the
previous search application, though, the connection between ‘Gonadotrophin Re-
leasing Hormone’ and ‘LHRH’ is clear (i.e., the former is a preferred name for the
latter), so the user does not question the validity of the results.
Finally, the query ‘VEGFR’ shows a great improvement over the previous application's search results (see Fig. 8.7). The term ‘VEGFR’ appears as the best matching entity
in the results list, and contains the similar terms ‘Vascular Endothelial Growth
Factor Receptor 1’, ‘Vascular Endothelial Growth Factor Receptor 2’, ‘Vascular
Endothelial Growth Factor Receptor 3’, which are more specific terms. At this
point, it should be noted that both concepts ‘Vascular Endothelial Growth Factor
Receptor’ and ‘Vascular Endothelial Growth Factor Receptor 1’ contain ‘VEGFR’
as synonym. Since ‘VEGFR’ is the synonym which is closest lexically to the input
query (i.e. 100% match), it is the representative name for both the concepts. This
should not cause confusion, though; in both cases, the representative concept
name is immediately followed by the preferred term name.
Table 8.1: Testing previously failed queries.

DIHS: The term is not found (see Fig. 8.1). This is normal, since this abbreviation is not listed in the synonyms for the MedDRA term ‘Drug-induced hypersensitivity syndrome’.

NMDA Antagonist: No results (see Fig. 8.2), since the term does not appear in the currently supported ontologies. Also, no term is proposed for error correction.

Hepatotoxicity: The term is found (see Fig. 8.3). The user can see that it belongs to MedDRA.

NSCLC: The term is found (see Fig. 8.4). The preferred name is listed too.

DRESS Syndrome: The term is found (see Fig. 8.5). This project's search application supports MedDRA LLT terms.

LHRH: There are two results for ‘LHRH’ (see Fig. 8.6). Unlike in the previous search application, the user can now see that ‘Gonadotropin Releasing Hormone’ is a preferred term for ‘LHRH’.

VEGFR: Semantic similarity has grouped the similar terms (VEGFR-1, VEGFR-2, VEGFR-3) under the term ‘VEGFR’, which is an enhancement over the previous search application (see Fig. 8.7). The fact that ‘VEGFR-1’ contains ‘VEGFR’ as a synonym in NCIT might confuse matters in the listing, but the preferred term ‘Vascular Endothelial Growth Factor Receptor 1’ is also mentioned next to it, immediately clearing any doubts.
Figure 8.1: The term ‘DIHS’ is not found, but this is normal, since it is not part of any of
the supported ontologies. Instead, the term ‘DIOS’ is proposed, in case the user had misspelt the query.
Figure 8.2: The term ‘NMDA Antagonist’ is not found, but this is normal, since it is not
part of any of the supported ontologies. No soundex match is found, so no error corrections are
suggested.
Figure 8.3: The term ‘Hepatotoxicity’ is shown in the auto-completion dialogue.
Figure 8.4: The term ‘NSCLC’ is shown in the auto-completion dialogue.
Figure 8.5: The term ‘DRESS syndrome’ is shown in the auto-completion dialogue.
Figure 8.6: The query ‘LHRH’ produces two different 100%-matching results. Unlike in the
previous search application, the user can now see that ‘Gonadotropin Releasing Hormone’ is a
preferred term for ‘LHRH’.
Figure 8.7: The results for the query ‘VEGFR’ illustrate a semantic grouping of 4 similar terms, namely ‘VEGFR’, ‘Vascular Endothelial Growth Factor Receptor 1’, ‘Vascular Endothelial Growth Factor Receptor 2’, and ‘Vascular Endothelial Growth Factor Receptor 3’. The latter three are grouped under the parent term.
8.2 Comparison to BioPortal Search Services
Among other tools, BioPortal provides an online search form that allows users
to search ontologies and terminologies for terms. Comparison of this project’s
application to BioPortal does not aim to prove one better than the other; clearly,
BioPortal is a complete, multi-feature search application that allows searching of hundreds of ontologies and terminologies simultaneously. The intent of the
comparison is to highlight some of the different design choices that this project
has adopted, which could further improve the usability of search services provided
by BioPortal.
The BioPortal search interface is shown in Fig. 8.8. Similarly to this project’s
search application, the interface simply contains a search box. The interface
also offers advanced options, shown in Fig. 8.9. For comparison purposes, the
advanced option to narrow search to NCIT, MedDRA and ICD9CM is used (see
Fig. 8.10).
8.2.1 Auto-completion
BioPortal search does not offer auto-completion through the main search interface at all; auto-completion widgets exist only for individual ontologies. Therefore, the user is not assisted throughout the procedure, and needs to press the ‘Return’ key to check whether the query returns any results at all. Possibly, the justification for not providing auto-completion could be the large number of hosted ontologies (353, as of August 2013). On the other hand, even when the user chooses a very small subset of ontologies to search, again no auto-completion is provided.
Let us consider the auto-completion widgets for individual ontologies. The
widget for NCIT is chosen and ‘nsc’ is typed. The auto-completion pop-up menu
is shown in Fig. 8.11. This project’s auto-completion results for ‘nsc’ are shown in
Fig. 8.12. It can be observed that many of the terms present in BioPortal’s auto-
completion menu do not even contain ‘nsc’. BioPortal chooses to show only the
Figure 8.8: The BioPortal interface is a simple text box, similar to this project’s main page.
Figure 8.9: BioPortal also offers advanced options to improve the search results.
Figure 8.10: Only NCIT, MedDRA and ICD9CM are chosen for searching, out of the 353
ontologies offered by BioPortal, so that comparisons to this project’s work are achievable.
preferred names for terms. Indeed, consider the example of ‘Becatecarin’,
shown third in BioPortal’s auto-completion menu. This term is a preferred name
whose synonym list includes the term ‘NSC 655649’. Clearly, the search for ‘nsc’
matches ‘NSC 655649’, but instead of returning that term, BioPortal returns its
preferred name, ‘Becatecarin’, annotated as ‘synonym’ to indicate that the match
occurred on a synonym. For an inexperienced user, this is not clear. Unless the
user knows every synonym of a given concept, it might be confusing to see
result terms that do not even contain the search words.
This project’s application alleviates this problem: both the lexically closest
term to the query and its preferred name are shown, so the user is left in no
doubt about the result. This is very helpful in cases where the synonyms are
highly dissimilar. For example, the term with preferred name ‘Denatonium
Benzoate’ can be sought by any of its diverse synonyms: ‘THS-839’, ‘WIN 16568’,
‘Aversion’, ‘Anispray’ and ‘Lidocaine Benzyl Benzoate’ (see Figures 8.13-8.15).
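To make this design choice concrete, the following sketch shows how an index over both preferred names and synonyms can report the matched string together with its preferred name, so the user always sees why a result appeared. The term data is an invented toy example; the thesis does not describe its index structure at this level of detail.

```python
# Hypothetical sketch of synonym-aware auto-completion: every synonym is
# indexed, and each hit reports both the string that matched and its
# preferred name. The term data below is illustrative, not real NCIT data.

TERMS = {
    "Becatecarin": ["NSC 655649", "BMY-27557-14"],
    "NSC-127716": [],
}

def autocomplete(query, terms=TERMS):
    """Return (matched string, preferred name) pairs for a substring query."""
    q = query.lower()
    hits = []
    for preferred, synonyms in terms.items():
        for candidate in [preferred] + synonyms:
            if q in candidate.lower():
                hits.append((candidate, preferred))
    return hits
```

With this scheme, a query for ‘nsc’ surfaces ‘NSC 655649’ alongside ‘Becatecarin’, rather than showing ‘Becatecarin’ alone.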
8.2.2 Results Ranking
The main search application of BioPortal ranks results according to the ontology
they belong to. Let us examine the complete search results for ‘nsclc’,
both in BioPortal’s application (see Fig. 8.16) and in this project’s application
(see Fig. 8.17). BioPortal presents the closest preferred term name and groups
the remaining results from the same ontology under this term. Each entry
stands alone, and no hints are given about possible connections among terms.
Our application, on the other hand, does not group all the results of the same
ontology together. It provides a different kind of results grouping, according to semantic
Figure 8.11: Auto-completion pop-up menu of BioPortal NCIT widget when the user has
typed ‘nsc’. Only preferred terms are shown. The user might be confused when seeing the term
‘Becatecarin’ in the results, since it does not contain ‘nsc’.
Figure 8.12: Auto-completion pop-up menu of this project’s search application when the user
has typed ‘nsc’.
similarity. The user can then see which terms are indeed semantically very close.
The extra semantic grouping does come at the cost of extra computation
on the server side.
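The semantic grouping described above can be sketched as a greedy clustering over pre-computed pairwise similarity scores. The terms, scores and the 0.8 threshold below are illustrative assumptions, not values taken from the actual application.

```python
# Illustrative sketch of grouping search results by pre-computed semantic
# similarity rather than by source ontology. Scores and the 0.8 threshold
# are assumptions; the thesis computes the real similarities offline.

def group_by_similarity(results, sim, threshold=0.8):
    """Greedily cluster results whose pairwise similarity exceeds threshold."""
    groups = []
    for term in results:
        for group in groups:
            if all(sim.get(frozenset((term, other)), 0.0) >= threshold
                   for other in group):
                group.append(term)
                break
        else:
            groups.append([term])  # no sufficiently similar group found
    return groups
```

A greedy pass like this is cheap at query time precisely because the expensive pairwise scores are computed in advance.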
Figure 8.13: Searching for ‘Denatonium Benzoate’ through its preferred term name.
Figure 8.14: Searching for ‘Denatonium Benzoate’ through its synonym ‘THS-839’.
Figure 8.15: Searching for ‘Denatonium Benzoate’ through its synonym ‘WIN 16568’.
8.2.3 Error Correction
Error correction is not supported in BioPortal search: if the user misspells
even a single letter in the query, a ‘No Matches Found’ message appears. In this
project’s search application, soundex-based error correction is used to correct
simple spelling mistakes. The application suggests a term that might match the
intended query; the user can simply click on the term and is immediately
reassured that it exists. Otherwise, the user would remain uncertain, and
Figure 8.16: BioPortal search results rankings for ‘nsclc’. All terms are grouped according to
the ontology they belong to, under the preferred name of the most lexically-relevant term to
the query.
would possibly consult external sources, such as Google, to identify any possible
errors. Figures 8.18-8.21 illustrate how erroneous queries are handled in the two
applications, using the terms ‘nsclca’ and ‘caancer’ as queries. BioPortal’s
application offers no error correction, while our application suggests the
terms ‘nsclc’ and ‘cancer’.
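A minimal sketch of the classic Soundex algorithm illustrates how such suggestions can be produced; the vocabulary below and the exact Soundex variant the application uses are assumptions here.

```python
# Sketch of classic Soundex-based suggestion. The vocabulary is illustrative;
# the thesis's implementation may use a different Soundex variant.

def soundex(word):
    """Classic four-character Soundex code, e.g. soundex('cancer') == 'C526'."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    result, prev = word[0].upper(), codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "hw":  # 'h' and 'w' do not reset the previous code
            prev = code
    return (result + "000")[:4]

def suggest(query, vocabulary):
    """Suggest vocabulary terms whose Soundex code matches the query's."""
    target = soundex(query)
    return [term for term in vocabulary if soundex(term) == target]
```

Note that both document examples fall out naturally: ‘nsclca’ and ‘nsclc’ share the code N242, and ‘caancer’ and ‘cancer’ share C526, so the misspelt queries map onto the intended terms.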
8.2.4 Visualization
BioPortal includes a visualization for each term, which illustrates the term’s
position in the hierarchy (see Fig. 8.22). In our application, the visualization is
simplified and does not expose formal-logic syntax (e.g. subclassOf). Our
Figure 8.17: This project’s search results rankings for ‘nsclc’. Terms in the results are
rearranged into groups that show high semantic similarity.
Figure 8.18: BioPortal returns no search results for the erroneously spelt term ‘nsclca’.
Figure 8.19: BioPortal returns no search results for the erroneously spelt term ‘caancer’.
application attempts to hide the underlying ontology and simplify the data
visualization, so that inexperienced users can search without being overwhelmed by
Figure 8.20: This project’s search application returns a search suggestion of ‘nsclc’ for the
erroneously spelt term ‘nsclca’.
Figure 8.21: This project’s search application returns a search suggestion of ‘cancer’ for the
erroneously spelt term ‘caancer’.
Figure 8.22: BioPortal uses a graph to visualize hierarchical relations. Edges are annotated
with a description of the relationship between the connected nodes (e.g. subclassOf).
Figure 8.23: This project’s application focuses on inexperienced users and attempts to com-
pletely hide any formal-logic relationships that might confuse the user.
formal-logic references that would puzzle them (see Fig. 8.23). Ideally, users
would be allowed to choose between the two visualizations, so that users of all
experience levels benefit.
8.3 Comments from an AstraZeneca Search Specialist
This second part of the evaluation examines the search application’s potential
use in medical knowledge acquisition. A short interview was conducted with a
search specialist in research and development information at AstraZeneca. The
search specialist is a researcher responsible for running literature searches that
ensure patient safety and support other functions (e.g. predicting drug efficacy
and safety at an early stage of drug development).
Figure 8.24: Search results depicting causal associations between smoking and cancer, as
presented by the I2E text mining application.
In particular, the search specialist needs to examine the presence of certain term
relationships and patterns in a corpus of medical research documents, which are
retrieved from databases such as ‘Clinicaltrials.gov’.
Efficient full-text search can be achieved through a text mining application
named I2E, developed by Linguamatics. This tool features querying based on
natural language processing (NLP): it receives an NLP query as input, searches
a predefined collection of documents, and presents the relevant results in a
structured format. As an example, let us assume that the searcher wishes to
search a list of medical documents for associations between smoking and cancer.
The terms ‘smoking’ and ‘cancer’ are entered, along with the base form of the
verb ‘cause’, to denote the association. The results are shown in Fig. 8.24. Each
result row indicates the document in which the specified hit appears and provides
a textual excerpt of its context within the document. The tool also features plain
search for terms within a set of ontologies, as shown in Fig. 8.25. Each result row
contains the term’s preferred name, its code, and the path from the term’s parent
to the root.
To achieve full results coverage, the search specialist needs to ensure that all
possible variations of the input query have been examined. For example, an input
query of the form ‘has adverse event been seen in MEK inhibitors?’ should
consider all possible synonyms of the terms that compose the query. The term
‘MEK inhibitor’ may appear in the literature in various forms, including ‘MKK
Inhibitor’, ‘MAPK/ERK Kinase Inhibitor’, ‘MAP2K Inhibitor’, and ‘MAPKK
Inhibitor’. The term ‘adverse event’ may also be found as ‘AdverseEvent’,
‘Adverse Experience’ or ‘AE’. Similarly, the verb ‘cause’ may be replaced by similar
Figure 8.25: Search results for the term ‘MEK inhibitor’ in NCIT, when the I2E application
is used.
verb base forms such as ‘associate’ or ‘result’. Furthermore, when the number of
results is too large, the search specialist should be able to quickly refine the input
query and target more specific terms.
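The synonym substitutions above amount to taking the Cartesian product of each query slot’s variants. A hypothetical sketch follows: the synonym lists are the examples given in the text, while the flat string output is an invented simplification, since I2E’s real queries are structured rather than plain strings.

```python
# Hypothetical sketch of synonym-based query expansion. The synonym lists
# are the examples from the text; joining variants into flat strings is a
# simplification of I2E's structured query format.
from itertools import product

SYNONYMS = {
    "MEK inhibitor": ["MKK Inhibitor", "MAPK/ERK Kinase Inhibitor",
                      "MAP2K Inhibitor", "MAPKK Inhibitor"],
    "adverse event": ["AdverseEvent", "Adverse Experience", "AE"],
    "cause": ["associate", "result"],
}

def expand(template, synonyms=SYNONYMS):
    """Yield every variant of the template, substituting each slot's synonyms."""
    slots = [[slot] + synonyms.get(slot, []) for slot in template]
    for combo in product(*slots):
        yield " ".join(combo)
```

Even this small example yields 4 × 3 × 5 = 60 query variants, which illustrates why manually enumerating synonyms is impractical for the search specialist.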
The search application developed for this project can assist in finding synonyms
for biomedical terms, and in quickly changing the granularity of searches.
Each term page presents a complete list of synonyms for that term, retrieved
from an up-to-date version of the ontology the term belongs to. Furthermore,
visualizations offer quick browsing of similar terms, both of higher and
lower specificity. For example, by following red circles, the searcher can delve
deeper into the hierarchy and immediately view information about more specific
terms, without needing to search again.
The search specialist’s comments about the application were very positive. It
was commented that the application would be very helpful for refining queries
before feeding them to a tool like I2E. The interface was considered simple and
the search procedure intuitive. The auto-completion feature and the presence of
lexical similarity scores in the rankings greatly simplified the search procedure,
and allowed the search specialist to quickly reach her goal and focus on the
result rather than on the means of reaching it. Visualization of suggested terms
was valued most of all. Through the developed application, the search specialist
could easily browse neighborhoods of similar terms and refine the search
granularity on demand. The use of colors instead of typical expanding menu
hierarchies was also praised for its usability.
Chapter 9
Conclusions and Future Work
Ontologies are expected to play a major role in the discovery of new knowledge
in the biomedical sector. Providing user-friendly tools that help researchers
navigate ontologies efficiently, without requiring them to fully comprehend
ontological principles, makes it more likely that they will reach their final goals
quickly, without confusion or frustration.
9.1 Conclusions
In this thesis, proposals have been made for enhancing the user experience of
ontological search, through the design of a search application that features
enhanced search tools such as auto-completion, semantic grouping of results,
query reformulation and similar-concept suggestion. The outcome is a web-based
application that allows searching and browsing of ontologies of heterogeneous
structure and format. The application uses modern web technologies to provide
a user-friendly environment.
Focus has been placed on promoting usability and a positive user experience, by
designing the search service from a user-centric perspective, such that even
inexperienced users can quickly become acquainted with it. The search
application relies heavily on pre-calculated semantic similarity scores; semantic
similarity allows the relationships between terms to be expressed as decimal
numbers in the range [0, 1]. Mapping term relations to real numbers enables the
innovative visualizations and results clustering used in this application.
The chosen design for the search application manages to improve certain aspects
that even enterprise-strength ontological search applications, such as BioPortal,
have not considered yet.
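As an illustration of such a [0, 1] mapping, the Wu-Palmer measure scores two terms in an is-a hierarchy by the depth of their lowest common subsumer. The toy hierarchy below is invented, and the thesis’s pre-computed measure is not necessarily Wu-Palmer; this sketch only shows how hierarchy positions become bounded real numbers.

```python
# Illustrative only: Wu-Palmer similarity maps a pair of terms in an is-a
# hierarchy to a score in [0, 1]. The toy hierarchy is invented; the
# application's pre-computed measure may be defined differently.

def ancestors(node, parent):
    """Return the chain from node up to the root, inclusive."""
    chain = [node]
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain

def wu_palmer(a, b, parent):
    """2 * depth(lcs) / (depth(a) + depth(b)), depths counted from the root."""
    chain_a = ancestors(a, parent)
    others = set(ancestors(b, parent))
    lcs = next(n for n in chain_a if n in others)  # lowest common subsumer
    depth = lambda n: len(ancestors(n, parent)) - 1
    return 2 * depth(lcs) / (depth(a) + depth(b))

PARENT = {  # hypothetical is-a hierarchy, rooted at "Neoplasm"
    "Carcinoma": "Neoplasm",
    "Lung Carcinoma": "Carcinoma",
    "NSCLC": "Lung Carcinoma",
    "SCLC": "Lung Carcinoma",
}
```

Identical terms score 1.0, siblings score less, and increasingly distant terms approach 0, which is exactly the property the clustering and visualizations exploit.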
9.2 Future Work
The application can be further improved in the following ways:
• it may be connected to other medical applications. For example, it may
assist in directly feeding lists of terms to text mining applications.
• it may be enhanced to accept ontologies of OWL and Open Biomedical
Ontologies (OBO) formats. Currently, BioPortal versions of ontologies are
used to populate the local database, so the application relies on BioPortal.
• more features may be added to the interface, including advanced options
for searches, such as searching by code or searching only specific ontologies.
• the update of ontology versions and the calculation of semantic similarities
could be automated by checking BioPortal at fixed time intervals.
• it may be made compatible with older versions of web browsers. Since it
relies heavily on JavaScript and recent libraries, alternative methods for
presenting the visualizations might be needed. Currently, it has been tested
successfully in the latest versions of all major browsers.