University of Manchester
School of Computer Science
Degree Programme of Advanced Computer Science
Enhanced Ontological Searching of
Medical Scientific Information
Christos Karaiskos
A dissertation submitted to The University of Manchester for the degree of
Master of Science in the Faculty of Engineering and Physical Sciences
Master’s Thesis
2013
Contents
Abstract 7
Declaration 9
Intellectual Property Statement 11
Acknowledgements 13
List of Abbreviations 15
List of Tables 17
List of Figures 19
1 Introduction 25
1.1 Problem Context . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
1.3 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
1.4 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2 Ontologies 31
2.1 Modern Ontology Definition . . . . . . . . . . . . . . . . . . . . . 31
2.2 Ontology vs. Terminology . . . . . . . . . . . . . . . . . . . . . . 33
2.3 Notable Biomedical Ontologies and Terminologies . . . . . . . . . 34
2.3.1 SNOMED CT . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.3.2 NDF-RT . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.3.3 ICD-10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.3.4 MedDRA . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.3.5 NCI Thesaurus . . . . . . . . . . . . . . . . . . . . . . . . 38
3 Similarity Metrics 39
3.1 Similarity Metric vs. Distance Metric . . . . . . . . . . . . . . . . 39
3.2 Lexical Similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2.1 Character-based Similarity Measures . . . . . . . . . . . . 41
Longest Common Substring . . . . . . . . . . . . . . . . . 41
Hamming Similarity . . . . . . . . . . . . . . . . . . . . . 41
Levenshtein Similarity . . . . . . . . . . . . . . . . . . . . 41
Jaro Similarity . . . . . . . . . . . . . . . . . . . . . . . . 42
Jaro-Winkler Similarity . . . . . . . . . . . . . . . . . . . 42
N-gram Similarity . . . . . . . . . . . . . . . . . . . . . . . 43
3.2.2 Word-based Similarity Measures . . . . . . . . . . . . . . . 43
Dice Similarity . . . . . . . . . . . . . . . . . . . . . . . . 43
Jaccard Similarity . . . . . . . . . . . . . . . . . . . . . . . 44
Cosine Similarity . . . . . . . . . . . . . . . . . . . . . . . 44
Manhattan Similarity . . . . . . . . . . . . . . . . . . . . . 44
Euclidean Similarity . . . . . . . . . . . . . . . . . . . . . 45
3.3 Ontological Semantic Similarity . . . . . . . . . . . . . . . . . . . 45
3.3.1 Intra-ontology Semantic Similarity . . . . . . . . . . . . . 45
Distance-based Metrics . . . . . . . . . . . . . . . . . . . . 45
Information-Based Metrics . . . . . . . . . . . . . . . . . . 48
Feature-Based Measures . . . . . . . . . . . . . . . . . . . 52
3.3.2 Inter-ontology Semantic Similarity . . . . . . . . . . . . . 52
4 Search Interfaces 55
4.1 Information Seeking Models . . . . . . . . . . . . . . . . . . . . . 55
4.2 Query Specification . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.3 Presentation of Search Results . . . . . . . . . . . . . . . . . . . . 60
4.4 Query Reformulation . . . . . . . . . . . . . . . . . . . . . . . . . 62
5 Requirements 65
5.1 Feature Specification . . . . . . . . . . . . . . . . . . . . . . . . . 65
6 Design 69
6.1 Stage I: Access to Medical Ontologies . . . . . . . . . . . . . . . . 69
6.1.1 Database and Table Creation . . . . . . . . . . . . . . . . 70
6.1.2 Populating the Database Tables . . . . . . . . . . . . . . . 72
6.2 Stage II: Computation of Semantic Similarity . . . . . . . . . . . 76
6.2.1 Term Neighborhoods . . . . . . . . . . . . . . . . . . . . . 76
6.2.2 Semantic Similarity Calculation . . . . . . . . . . . . . . . 77
6.3 Stage III: Interface Design Data Presentation . . . . . . . . . . . 79
6.4 Summary of Technology Choices . . . . . . . . . . . . . . . . . . . 80
7 Implementation 83
7.1 Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
7.2 Search Entry Form . . . . . . . . . . . . . . . . . . . . . . . . . . 83
7.3 Handling the Input Query . . . . . . . . . . . . . . . . . . . . . . 88
7.3.1 Typing Speed . . . . . . . . . . . . . . . . . . . . . . . . . 88
7.3.2 Querying the Database . . . . . . . . . . . . . . . . . . . . 88
7.3.3 Ranking and Grouping of Search Results . . . . . . . . . . 89
7.3.4 Return-key or Mouse-click Search . . . . . . . . . . . . . . 91
7.3.5 Auto-completion Search . . . . . . . . . . . . . . . . . . . 91
7.4 Error Correction . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
7.5 Term Information Presentation . . . . . . . . . . . . . . . . . . . 96
7.6 Navigation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
8 Evaluation 103
8.1 Testing the Failed Queries . . . . . . . . . . . . . . . . . . . . . . 103
8.2 Comparison to BioPortal Search Services . . . . . . . . . . . . . . 109
8.2.1 Auto-completion . . . . . . . . . . . . . . . . . . . . . . . 109
8.2.2 Results Ranking . . . . . . . . . . . . . . . . . . . . . . . . 111
8.2.3 Error Correction . . . . . . . . . . . . . . . . . . . . . . . 113
8.2.4 Visualization . . . . . . . . . . . . . . . . . . . . . . . . . 114
8.3 Comments from an AstraZeneca Search Specialist . . . . . . . . . 117
9 Conclusions and Future Work 121
9.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
9.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
Bibliography 123
Number of Words in the Document: 25648
University of Manchester
School of Computer Science
Degree Programme of Advanced Computer Science
ABSTRACT OF
MASTER’S THESIS
Author: Christos Karaiskos
Title: Enhanced Ontological Searching of Medical Scientific Information
Supervisors: Prof. Andrew Brass (University of Manchester)
Dr. Jennifer Bradford (AstraZeneca)
Abstract: An enormous amount of biomedical knowledge is encoded in narrative
textual format. In an attempt to discover new or hidden knowledge, extensive
research is being conducted to extract and exploit term relationships from plain
text, with the aid of technology. A common approach for the identification
of biomedical entities in plain text involves the use of ontologies, i.e., knowledge
bases which provide formal machine-understandable representations of domains
of variable specificity. In addition to term extraction, ontologies may be used
as controlled vocabularies or as a means for automatic knowledge acquisition
through their inherent inference capabilities. Visualization of the content of
ontologies is thus very important for researchers in the biomedical domain.
Unfortunately, many of these researchers find it difficult to deal with formal logic
and would prefer that ontology search interfaces completely hide any structural
or functional references to ontologies. This thesis proposes a strategy for building
a web-based ontology search application that exploits ontologies behind the
scenes, transparently to the end user, and presents relevant concept information
in such a way that searchers can quickly and successfully find what they are
looking for. The proposed search interface features various search tools for
enhanced ontological searching, including term auto-completion, error correction,
clever results ranking, and similar term visualizations based on semantic
similarity metrics. Evaluation of the developed application shows that its features
can improve enterprise-strength ontology search applications, such as BioPortal.
Keywords: search interface design, ontology hiding, biomedical ontology,
semantic similarity, usability, data integration
Declaration
No portion of the work referred to in the dissertation has been submitted in
support of an application for another degree or qualification of this or any other
university or other institute of learning.
Intellectual Property Statement
i. The author of this dissertation (including any appendices and/or schedules
to this dissertation) owns certain copyright or related rights in it (the
‘Copyright’) and he has given The University of Manchester certain rights to use
such Copyright, including for administrative purposes.
ii. Copies of this dissertation, either in full or in extracts and whether in hard
or electronic copy, may be made only in accordance with the Copyright,
Designs and Patents Act 1988 (as amended) and regulations issued under
it or, where appropriate, in accordance with licensing agreements which the
University has entered into. This page must form part of any such copies
made.
iii. The ownership of certain Copyright, patents, designs, trade marks and other
intellectual property (the ‘Intellectual Property’) and any reproductions of
copyright works in the dissertation, for example graphs and tables
(‘Reproductions’), which may be described in this dissertation, may not be owned by
the author and may be owned by third parties. Such Intellectual Property
and Reproductions cannot and must not be made available for use
without the prior written permission of the owner(s) of the relevant Intellectual
Property and/or Reproductions.
iv. Further information on the conditions under which disclosure, publication
and commercialisation of this dissertation, the Copyright and any
Intellectual Property and/or Reproductions described in it may take place is
available in the University IP Policy (see
http://documents.manchester.ac.uk/display.aspx?DocID=487), in any relevant
Dissertation restriction declarations deposited in the University Library, The
University Library’s regulations (see
http://www.manchester.ac.uk/library/aboutus/regulations) and in The
University’s Guidance for the Presentation of Dissertations.
Acknowledgements
I am deeply grateful to my supervisors, Prof. Andrew Brass (University of
Manchester) and Dr. Jennifer Bradford (AstraZeneca), for their invaluable guidance and
support throughout the duration of this project. I have greatly benefited from
experiencing the different perspectives of academia and industry, which have both
contributed to shaping the final outcome of this project.
I would like to thank Sebastian Philipp Brandt (University of Manchester),
for his suggestions on making the search application even better. Also, I would
like to express my gratitude to Julie Mitchell (AstraZeneca), for taking the time
to evaluate the application, and Paul Metcalfe (AstraZeneca), for his advice on
improving the performance and security of the application.
Finally, I would like to thank Matina for her patience and love, and my
parents, Ioannis and Stavroula, for always being there.
List of Abbreviations
AI Artificial Intelligence
AJAX Asynchronous JavaScript and XML
API Application Programming Interface
CSS Cascading Style Sheets
DAG Directed Acyclic Graph
HLGT High Level Group Term
HLT High Level Term
HTTP Hypertext Transfer Protocol
IC Information Content
ICD International Classification of Diseases
JDBC Java Database Connectivity
JSON JavaScript Object Notation
LCS Least Common Subsumer
MedDRA Medical Dictionary for Regulatory Activities
NCIT National Cancer Institute Thesaurus
NDF-RT National Drug File Reference Terminology
NHS UK National Health Service
NLP Natural Language Processing
OBO Open Biomedical Ontologies
OWL Web Ontology Language
PHP PHP: Hypertext Preprocessor
PT Preferred Term
RDF Resource Description Framework
RDF-S Resource Description Framework Schema
REST Representational State Transfer
RF2 Release Format 2
SNOMED CT Systematized Nomenclature of Medicine Clinical Terms
SNOMED RT Systematized Nomenclature of Medicine Reference
Terminology
SOC System Organ Class
UMLS Unified Medical Language System
URI Uniform Resource Identifier
URL Uniform Resource Locator
UX User Experience
VA U.S. Department of Veterans Affairs
WHO World Health Organization
XHTML Extensible HyperText Markup Language
XML Extensible Markup Language
List of Tables
5.1 Documented failed queries and suggested reasons for failure. . . . 66
5.2 Documented failed queries and suggested reasons for failure (cont.). 67
6.1 ‘Ontologies’ database table structure . . . . . . . . . . . . . . . . 71
6.2 Examples of URI formats for BioPortal RESTful services. . . . . . 73
6.3 Technology choices for the project. . . . . . . . . . . . . . . . . . 81
7.1 PHP files used in the search application. . . . . . . . . . . . . . . 85
7.2 XHTML files used in the search application. . . . . . . . . . . . . 85
7.3 CSS files used in the search application. . . . . . . . . . . . . . . 86
7.4 JavaScript files used in the search application. . . . . . . . . . . . 86
8.1 Testing previously failed queries. . . . . . . . . . . . . . . . . . . . 105
List of Figures
2.1 The structure of the MedDRA terminology comprises a fixed-depth
hierarchy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.1 The Google search engine entry form. . . . . . . . . . . . . . . 57
4.2 Facebook uses grayed-out descriptive text to help in the formula-
tion of user queries. . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.3 Bing’s search interface features a powerful dynamic search sugges-
tion, where prefixes are highlighted with grayed-out font and the
remaining text is in bold. . . . . . . . . . . . . . . . . . . . . . . 58
4.4 The Safari browser’s embedded search interface explicitly states
which queries are suggestions and which belong to the user’s recent
search history. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.5 The Firefox browser’s embedded search interface contains recent
queries on top, and separates them from suggestions using a solid
line. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.6 Google’s search results page is a typical scrollable vertical list of
captions. Metadata facets that restrict results to a particular
type of information are also present in the interface (e.g. ‘Images’
tab). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.7 Amazon’s search interface provides facets as a left panel to the
results page, helping the user dynamically refine the initial search. 62
4.8 Pubmed’s results page includes term expansion in two ways. On
the right of the screen, there is a ‘Related searches’ panel that pre-
serves the initial query and adds a new related term to it. Also,
right below the entry form there is a ‘See also’ feature which sug-
gests complete or partial modifications in the initial query. . . . . 64
6.1 A part of the XML response for the ‘get all terms’ query of Table
6.2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
6.2 The provided methods of the ontoCAT API Adamusiak et al. (2011). 75
6.3 Populating the ‘Ontologies’ database is performed with the help of
the ontoCAT API. . . . . . . . . . . . . . . . . . . . . . . . . . . 75
7.1 The organization of the files that comprise the web application.
These files are responsible for the presentation, styling and inter-
active behavior of the web application. . . . . . . . . . . . . . . . 84
7.2 The main window of the search application. The search box is
placed at the top of the screen, with central horizontal alignment.
A submit button labeled ‘Search’ is also provided, to assist users
that prefer mouse-clicking. . . . . . . . . . . . . . . . . . . . . . . 87
7.3 Once the user clicks inside the search box, the grey help message
disappears and a blinking cursor takes its place. . . . . . . . . . . 87
7.4 Terms that would otherwise appear on their own table row are
grouped under a term that better lexically matches the query, when
their semantic similarity to that term exceeds a threshold. . . . . 90
7.5 Pressing the ‘Return’ key or clicking the ‘Search’ button submits
the query to index.php and a table of search results is added to the
interface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
7.6 Part of the JSON response from performQuery.php, for the input
query ‘rash’. Each JSON object represents a term matching the
query, and contains information that can be used for its presentation. 93
7.7 Pressing any other key except ‘Return’ submits the query through
AJAX to performQuery.php and an auto-completion pop-up menu
is created from the JSON response. . . . . . . . . . . . . . . . . . 93
7.8 Error correction when input query is ‘lyng’. The closest term is
suggested, as a clickable link. . . . . . . . . . . . . . . . . . . . . 95
7.9 When the user places the mouse cursor on a circle, a tooltip imme-
diately appears, containing the full term name and the semantic
similarity score with the viewed term. . . . . . . . . . . . . . . . . 97
7.10 Presentation page for the NCIT term ‘Recurrent NSCLC’. On the
left side, the basic term information is shown, along with an XML
representation of highly similar terms. On the right side, a visual-
ization of highly similar terms is provided, using the D3 JavaScript
library. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
7.11 Presentation page for the MedDRA term ‘Rash’. The term has
very close relations with terms that are not in the hierarchy. This
is illustrated using blue color. . . . . . . . . . . . . . . . . . . . . 100
7.12 The XML representation of a term. It includes basic term infor-
mation and highly similar terms. . . . . . . . . . . . . . . . . . . 101
7.13 Help is provided through tooltips that activate on mouse-over. . . 101
8.1 The term ‘DIHS’ is not found, but this is normal, since it is not
part of any of the supported ontologies. Instead, the term ‘DIOS’
is proposed, in case the user had misspelt the query. . . . . . . . . 106
8.2 The term ‘NMDA Antagonist’ is not found, but this is normal,
since it is not part of any of the supported ontologies. No soundex
match is found, so no error corrections are suggested. . . . . . . . 106
8.3 The term ‘Hepatotoxicity’ is shown in the auto-completion dialogue. 106
8.4 The term ‘NSCLC’ is shown in the auto-completion dialogue. . . . 106
8.5 The term ‘DRESS syndrome’ is shown in the auto-completion di-
alogue. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
8.6 The query ‘LHRH’ produces two different 100%-matching results.
Unlike in the previous search application, the user can now see that
‘Gonadotropin Releasing Hormone’ is a preferred term for ‘LHRH’. 107
8.7 The results for the query ‘VEGFR’ illustrate a semantic grouping
of 4 similar terms, namely ‘VEGFR’, ‘Vascular Endothelial Growth
Factor Receptor 1’, ‘Vascular Endothelial Growth Factor Receptor
2’, ‘Vascular Endothelial Growth Factor Receptor 3’. The latter
three are grouped under the parent term. . . . . . . . . . . . . . . 108
8.8 The BioPortal interface is a simple text box, similar to this project’s
main page. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
8.9 BioPortal also offers advanced options to improve the search results. 110
8.10 Only NCIT, MedDRA and ICD9CM are chosen for searching, out
of the 353 ontologies offered by BioPortal, so that comparisons to
this project’s work are achievable. . . . . . . . . . . . . . . . . . . 111
8.11 Auto-completion pop-up menu of BioPortal NCIT widget when
the user has typed ‘nsc’. Only preferred terms are shown. The
user might be confused when seeing the term ‘Becatecarin’ in the
results, since it does not contain ‘nsc’. . . . . . . . . . . . . . . . . 112
8.12 Auto-completion pop-up menu of this project’s search application
when the user has typed ‘nsc’. . . . . . . . . . . . . . . . . . . . . 112
8.13 Searching for ‘Denatonium Benzoate’ through its preferred term
name. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
8.14 Searching for ‘Denatonium Benzoate’ through its synonym ‘THS-
839’. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
8.15 Searching for ‘Denatonium Benzoate’ through its synonym ‘WIN
16568’. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
8.16 BioPortal search results rankings for ‘nsclc’. All terms are grouped
according to the ontology they belong to, under the preferred name
of the most lexically-relevant term to the query. . . . . . . . . . . 114
8.17 This project’s search results rankings for ‘nsclc’. Terms in the re-
sults are rearranged into groups that show high semantic similarity.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
8.18 BioPortal returns no search results for the erroneously spelt term
‘nsclca’. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
8.19 BioPortal returns no search results for the erroneously spelt term
‘caancer’. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
8.20 This project’s search application returns a search suggestion of
‘nsclc’ for the erroneously spelt term ‘nsclca’. . . . . . . . . . . . 116
8.21 This project’s search application returns a search suggestion of
‘cancer’ for the erroneously spelt term ‘caancer’. . . . . . . . . . 116
8.22 BioPortal uses a graph to visualize hierarchical relations. Edges
are annotated with a description of the relationship between the
connected nodes (e.g. subclassOf). . . . . . . . . . . . . . . . . . 116
8.23 This project’s application focuses on inexperienced users and at-
tempts to completely hide any formal-logic relationships that might
confuse the user. . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
8.24 Search results depicting causal associations between smoking and
cancer, as presented by the I2E text mining application. . . . . . 118
8.25 Search results for the term ‘MEK inhibitor’ in NCIT, when the
I2E application is used. . . . . . . . . . . . . . . . . . . . . . . . . 119
Chapter 1
Introduction
Ontologies are knowledge bases which provide formal machine-understandable
representations of domains of variable specificity. Given a domain of discourse,
concepts that belong to the domain are well documented in formal logic, along
with their inter-relations. Ontologies, as representations, cannot perfectly capture
the part of the world that they attempt to describe Davis et al. (1993). They
are based on the open world assumption, which states that the absence of
something from a knowledge base does not imply that it does not exist in the
real world Hustadt et al. (1994). As our knowledge about a domain increases,
ontologies are updated and they become more complex. This has become evident
in the biomedical domain, where ontologies have already attained a high degree of
specificity, and has led to their quick adoption for data integration and knowledge
discovery purposes.
1.1 Problem Context
Within biomedicine, ontologies can help researchers communicate by promoting
consistent use of biomedical terms and concepts. The construction of an
ontology itself involves mediating across multiple views and requires that a number
of domain experts reach a consensus that reflects the diverse viewpoints of the
community. Ontologies are viewed as tools that provide opportunities for new
knowledge acquisition, due to the complex semantic relations that they model.
Inferences in a huge ontology may reveal connections that the human eye would
miss. This is especially important in the pharmaceutical sector, where the drug
discovery process has slowed down significantly, and in the biological sector,
where attempts to decipher genomic patterns associated with disease are still
at an early stage. Another common use for ontologies in the biomedical domain
is as controlled vocabularies that feed filtered terms into computer applications.
Finally, ontologies may be used to connect terms found in plain text to their
semantic representations. Term extraction with the help of ontologies is a hot
topic in biomedicine, due to the vast amounts of medical information stored in
plain text. Due to the importance of ontologies, it is usual for researchers in the
biomedical field to require access to their content.
1.2 Motivation
In the past, AstraZeneca employees were provided with a web-based search form
that enabled them to look for concepts in one or more biomedical ontologies and
select the most suitable from a list of search results. The chosen concepts were, in
turn, conveyed to a text mining application. Understanding the results required
the user to be familiar with the content and structure of the ontology from which
the terms were retrieved. Unfortunately, most users did not feel comfortable
with the idea of ontologies and struggled, or even refused, to use the provided
interfaces, even though no logic-based content was there to confuse them.
In many cases, though, this was not solely the fault of the users. The interface
gave the users freedom to select the ontologies to be searched for the specified
query. Inexperienced users usually did not know or care about which ontology
contains the desired query term. For example, a user wished to search for ‘Non-
small cell lung carcinoma’, by its abbreviation ‘NSCLC’. Querying ‘NSCLC’ in
the MedDRA terminology returned no results, since the concept is not present
in the terminology (the difference between a terminology and an ontology is
described in Section 2.2). Although this behavior is correct, it seems wrong to
the inexperienced user and may lead to loss of trust in the system.
But even if the term is present in the ontology, the user should not be forced
to know its exact spelling. For example, querying for ‘NSCLC’ in the NCIT
thesaurus also returned no results, despite the fact that the actual concept exists
in the ontology. The searcher needed to know that the preferred term for the
‘NSCLC’ concept is ‘Non-small cell lung carcinoma’. Abbreviations and dissimilar
synonyms are common in the biomedical field, so expecting the user to know the
preferred term for each concept is considered problematic.
In addition to the above, the presentation of results was not always
straightforward. Terms that demonstrated a strong semantic relation to each
other were presented as stand-alone terms in the search results, subconsciously
misleading users into deducing that the terms were independent. It was up to
the user to judge the relevance of results to the query. For example, the results
for ‘Non-small cell lung carcinoma’ in NCIT included, among others, the terms
‘Non-small cell lung carcinoma’ and ‘Stage I non-small cell lung carcinoma’
equally spaced, in a way that users could not infer the connections between
them. In fact, the latter term is a specialization of the former. In practice,
users chose all the terms, even though they were looking for the broad term,
because they became confused and did not want to risk selecting only one.
This breakdown at the human-computer interface has motivated AstraZeneca
to try to build tools that take advantage of the ontology structure and, at the
same time, completely hide it from the user in order to facilitate the search
procedure.
1.3 Contribution
The outcome of this thesis is the development of a user-friendly search
application that allows users to find information about concepts present in a
medical ontology, without requiring the users to understand the underlying structure of
the ontology. Information about a concept includes its accession code within the
given ontology, the term for its preferred name, its definition and all available
synonym terms. In order to facilitate the search procedure and enhance User
Experience (UX), the search application includes features such as dynamic term
suggestion, spelling correction and similar term visualization tools.
The main challenge lies in the presentation of results; as stated in Section 1.2,
users are usually not sure which term(s) to choose when multiple similarly-spelt
terms appear. Ranking of terms is performed with the aid of both lexical
and semantic similarity. The former screens those terms that best match the user
query and ranks them according to a string relevance metric. These results are
processed by the latter, so that terms showing a strong semantic connection are
grouped together.
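The two-stage process described above, lexical screening followed by semantic grouping, can be sketched as follows. This is an illustrative sketch, not the thesis's actual implementation: `lexical_score` (based on Python's `difflib`) stands in for the string relevance metric, and `token_overlap` is a crude stand-in for a real ontological semantic similarity measure.

```python
from difflib import SequenceMatcher

def lexical_score(query, term):
    # String relevance in [0, 1]; difflib is only a stand-in for the
    # lexical similarity metric actually used by the application.
    return SequenceMatcher(None, query.lower(), term.lower()).ratio()

def token_overlap(a, b):
    # Placeholder for an ontological semantic similarity metric
    # (e.g., a distance- or information-content-based measure).
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def rank_and_group(query, terms, threshold=0.6):
    # Stage 1: lexical screening -- rank candidates by string relevance.
    ranked = sorted(terms, key=lambda t: lexical_score(query, t), reverse=True)
    # Stage 2: semantic grouping -- attach each term to the first
    # (better-ranked) group whose representative it is similar enough to.
    groups = []
    for term in ranked:
        for group in groups:
            if token_overlap(group[0], term) >= threshold:
                group.append(term)
                break
        else:
            groups.append([term])
    return groups
```

For instance, at a threshold of 0.65 the query ‘non-small cell lung carcinoma’ groups the exact match together with ‘Stage I non-small cell lung carcinoma’, while ‘Small cell lung carcinoma’ remains a separate group.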
Ideally, the search application should bridge across terms from multiple
ontologies. Due to the diversity in the format and annotation of different
ontologies, this is not a straightforward generalization. Most importantly, within
the biomedical community, the term ‘ontology’ is often used erroneously to
describe plain terminologies that, in fact, violate basic ontological principles (in
MedDRA, for instance, the synonym of a term may be a child node of the term
itself). Therefore, ontology-specific difficulties are expected to arise if semantic
similarity measures are to be deployed.
In summary, the goals of this thesis are the following:
1. To develop user-friendly search tools that allow users to build search queries
based on the terms present in a medical ontology, without requiring the users
to understand the actual structure of the ontology.
2. To exploit the semantic annotations of the underlying ontology in order to
enhance the quality and presentation of results.
3. To intermix results originating from different ontologies.
1.4 Thesis Organization
The thesis is organized into nine chapters. Chapter 2 includes an introduction
to ontologies and a brief description of some notable biomedical ontologies.
Chapter 3 presents the background needed for understanding the different
measures of lexical and semantic similarity. Chapter 4 discusses interface design
principles for user-centered search applications. In Chapter 5, the requirements
and feature specifications for the final search application are addressed. Chapter 6
describes the design considerations that were taken into account for the
ontological search application, while Chapter 7 presents the final implementation.
Chapter 8 includes the evaluation of the search application. Finally, conclusions
are drawn in Chapter 9, along with possible future directions.
Chapter 2
Ontologies
The term ‘ontology’ is an uncountable noun that originates in philosophy, with
roots tracing back to the ancient Greek philosophers Guarino (1998). It denotes
the study of the nature of existence at a fairly abstract level. In the world of
computer science, the word
‘ontology’ refers to the encoding of human knowledge in a format that allows
for computational use. This chapter includes an introduction to the modern
definition of ontology, along with a brief description of some of the most notable
biomedical ontologies.
2.1 Modern Ontology Definition
In Artificial Intelligence (AI), an ontology is commonly defined as a specification
of a (shared) conceptualization Gruber et al. (1995). A conceptualization refers
to an individual’s knowledge about a specific domain, acquired through
‘experience, observation or introspection’ Huang et al. (2010). Ontologies are shared
conceptualizations, meaning that multiple participants, usually domain experts,
contribute to their construction, maintenance and expansion. Conflicts are
certain to arise among the different participants, so an important aspect of
ontology design is to bridge across multiple views of the desired domain into a
single concrete representation. On the other hand, a specification is a transformation of
this shared conceptualization into a formal representation language.
The outcome of a formal representation of a domain is a collection of entities,
expressions and axioms. Entities include:
• concepts or classes, which are sets of individuals (e.g., ‘Country’, which
contains all countries),
• individuals, which are specific instances of classes (e.g., ‘Greece’ as an
instance of ‘Country’),
• data types (e.g. string, integer),
• literals, which are specific values of a given data type (e.g. 1, 2, 3, or
string values),
• properties (e.g. hasDisease, hasAge).
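A minimal sketch of how these entity kinds might be held in memory is given below. All names (‘Country’, ‘Greece’, `hasCapital`) are illustrative examples, not drawn from any real ontology, and a plain dictionary stands in for a proper ontology data model.

```python
# Illustrative in-memory store mirroring the entity kinds listed above.
ontology = {
    # concepts/classes: a class name maps to the set of its individuals
    "classes": {"Country": {"Greece"}},
    # individuals: an instance maps to its asserted class
    "individuals": {"Greece": "Country"},
    # properties, with an illustrative (domain, range datatype) signature
    "properties": {"hasCapital": ("Country", "string")},
    # literal values attached to individuals through properties
    "assertions": [("Greece", "hasCapital", "Athens")],
}

def instances_of(onto, cls):
    """Return all individuals asserted to be instances of a class."""
    return {ind for ind, c in onto["individuals"].items() if c == cls}
```

A real system would instead use an RDF/OWL toolkit, but the sketch makes the class/individual/property/literal distinction concrete.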
Expressions refer to descriptions of entities in a formal representation language.
The standardized family of languages for formal ontology representation is the
Web Ontology Language (OWL), which builds on the Extensible Markup Lan-
guage (XML), Resource Description Framework (RDF) and RDF-Schema (RDF-
S) standards to provide a highly expressive means for representing knowledge
McGuinness et al. (2004). The underlying format of the resulting OWL docu-
ment can vary among several types, with the most common being RDF/XML.
Finally, axioms relate entities/expressions. This connection can be made
class-to-class (i.e. SubClassOf), individual-to-class (i.e. ClassAssertion), property-
to-property (i.e. SubPropertyOf), among others. These relations can be asserted
explicitly or inferred by a reasoner. Inferences are made based on the logical relations between concepts. As an example of a simple inference, a concept's ancestors can be inferred automatically once its parent concept is specified.
An ontology may be visualized as a graph, in which concepts are nodes and
relations are edges between nodes. Furthermore, if transitive hierarchical rela-
tions are isolated (e.g. subsumption, also known as ‘is-a’ relation or hyponymy),
the ontology can be viewed as a taxonomy. The geometrical visualization of an
ontology will be presented in more detail in chapter 3.
2.2 Ontology vs. Terminology
A terminology is a collection of term names that are associated with a given
domain. A term is a mapping of a concrete concept to natural language. This
term-to-concept mapping is usually not one-to-one, especially in the biomedical
domain where term variation and term ambiguities arise Ananiadou and Mc-
Naught (2006). Term variation is a result of the richness of natural language and
refers to the existence of multiple terms for the description of the same concept.
For example, the terms ‘Transmembrane 4 Superfamily Member 1’, ‘TM4SF1’ and ‘L6 Antigen’ all point to the same protein. Term ambiguity occurs when a term is
mapped to more than one distinct concept. This is common when new abbrevia-
tions are introduced Liu et al. (2002). As an example, some of the concepts that
the acronym ‘CTX’ may map to are ‘Cardiac Transplantation’, ‘Clinical Trial
exemption’ and ‘Conotoxin’. Their disambiguation is a matter of context.
A terminology is not constrained to being a simple list of terms. In fact,
most terminologies feature some kind of structure, where terms that map to the
same concept are grouped together and semantic relationships between concepts
are explicitly or implicitly stated. Semantic relationships between terms include
synonymy and antonymy, while semantic relationships between concepts include
hyponymy, hypernymy, meronymy and holonymy Jurafsky and Martin (2000).
Synonymy exists when two terms are interchangeable, while antonymy denotes
that two terms have opposite meaning. Hyponymy introduces a parent-child, or
‘is-a’ relation between concepts. A concept is a hyponym of another concept,
if the former is subsumed by the latter and represents a more granular (more specific) concept.
Hyponymy is transitive: if concept a is a hyponym of concept b, and concept b is a hyponym of concept c, then a is also a hyponym of c. Hypernymy is the reverse relation
of hyponymy. Meronymy exists when a concept represents a part of another
concept. Holonymy is the opposite relation, where a concept has as parts some other concept(s).
The difference between a terminology and an ontology is not always clear, as
terminologies continue to improve their state of organization in a way that resem-
bles ontologies. The initial scope and aim of the two, though, are clearly different: the purpose of a terminology was, as the name implies, to collect all terms associated with a specified domain. On the other hand, the target of
an ontology has, from the start, been to provide a machine-readable specification
of a shared conceptualization. Despite their many common characteristics, ter-
minologies are not necessarily ontologies. If treated as ontologies, they may lead
to inconsistencies or wrong inferencing mechanisms Ananiadou and McNaught
(2006). An illustrative example is the case of MedDRA, which will be discussed
in Section 2.3.4.
2.3 Notable Biomedical Ontologies and Terminologies
Hundreds of biomedical ontologies and terminologies have been published online. According to BioPortal1 statistics, the top five most viewed ontologies or terminologies are SNOMED Clinical Terms, the National Drug File, the International Classification of Diseases, MedDRA and the NCI Thesaurus. This section gives a brief introduction to each of them.
1BioPortal is a biomedical ontology/terminology repository which provides online ontology presentation and manipulation tools (http://bioportal.bioontology.org/).
2.3.1 SNOMED CT
The Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT) is a biomedical terminology which covers most areas within medicine, such as drugs, diseases, operations, medical devices and symptoms. It may be used for the coding, retrieval and processing of clinical data. SNOMED CT is written purely in a formal logic-based syntax (the so-called Release Format 2, or RF2) and is organized into multiple independent hierarchies. It is the result of the merger of the UK National Health Service's (NHS) Read Codes and the SNOMED Reference Terminology (SNOMED RT), developed by the College of American Pathologists. The basic hierarchies, or axes, are ‘Clinical Finding’ and ‘Procedure’. The latest version contains more than 400,000 concepts and over 1,000,000 relationships, rendering SNOMED CT the most complete terminology in the medical domain. Only a few definitions are present in the terminology. Each concept contains a unique identifier and numerous synonymous terms that account
for term variation. Also, each concept is part of at least one hierarchy and may
have multiple ‘is-a’ relationships with higher level nodes. SNOMED CT is part
of the Unified Medical Language System (UMLS), a biomedical ontology and
terminology integration attempt which comprises hundreds of resources.
2.3.2 NDF-RT
The National Drug File Reference Terminology (NDF-RT) was introduced by the
U.S. Department of Veterans Affairs (VA) as a formalized representation for a
medication terminology, written in description logic syntax VHA (2012). The
terminology is organized into concept hierarchies, where each concept is a node
comprising a list of term synonyms and a unique identifier. As expected, top-level
concepts are more general than lower-level ones. The central hierarchy is named
DRUG KIND and indicates the types of medications, the preparations used in
them and clinical VA drug products. Other hierarchies include
• DISEASE KIND,
• INGREDIENT KIND,
• MECHANISM OF ACTION KIND,
• PHARMACOKINETICS KIND,
• PHYSIOLOGIC EFFECT KIND,
• THERAPEUTIC CATEGORY KIND,
• DOSE FORM and
• DRUG INTERACTION KIND.
Roles exist between different concepts and are specified only with existential restrictions (i.e. the OWL someValuesFrom construct). Mappings to other terminologies are also available. Currently, NDF-RT contains more than 45,000 concepts, organized in hierarchies of maximum depth 12.
2.3.3 ICD-10
The International Statistical Classification of Diseases and Related Health Prob-
lems (ICD) is a terminology which attempts to classify signs, symptoms and
causes of disease and morbidity WHO (1992). It appeared in the mid-19th cen-
tury and is now maintained by the World Health Organization (WHO). Currently
it is available in its 10th revision, although the 11th version is claimed to be at
the final stage before release. As a taxonomy, it has relatively small maximum
depth, equal to 6. Codes assigned to each concept tie it to a specific place in the
taxonomy, with each code having only a single parent. It is thus not a proper application of ontological principles2, since, in reality, it is not unusual for concepts to belong to more than one subsumer, and this is not modeled. In addition to
that, there exist categories such as ‘Not otherwise specified’ or ‘Other’, which are
not needed in an ontology; the open world assumption already covers the fact
that every ontology is incomplete, so stating it explicitly is redundant and may
interfere with the evolution of the ontology, as new terms are not classified under
their closest match.
2Nor was it meant to be; its intent is classification.
Figure 2.1: The structure of the MedDRA terminology comprises a fixed-depth hierarchy.
2.3.4 MedDRA
The Medical Dictionary for Regulatory Activities (MedDRA) is a terminology
that is concerned with biopharmaceutical regulatory processes. It contains terms
associated with all phases of the drug development cycle. MedDRA is organized
in a hierarchical structure of fixed depth, as seen in Fig. 2.1. System Organ
Classes (SOCs) represent the 26 predefined overlapping hierarchies to which terms belong. High Level Group Terms (HLGTs) and High Level Terms (HLTs) are
general term groupings, denoting disorders or complications. Preferred Terms
(PTs) denote the preferred name for a concept, while Lowest Level Terms (LLTs)
include terms of maximum specificity. LLTs may be connected with hyponymy,
meronymy or synonymy relationships to their PTs. This is the main problem in
trying to view MedDRA as an ontology. In a formal ontology, a concept cannot
be a child of itself. In MedDRA, this clearly happens when a PT and its LLTs
share a synonymy relation.
2.3.5 NCI Thesaurus
The National Cancer Institute Thesaurus (NCIT) is a controlled terminology
for cancer research. The thesaurus has been converted to formal OWL syntax
and is updated at fixed intervals. The conversion was not an easy one; many
inconsistencies and modeling dead-ends that were encountered in the conversion
procedure have been documented Ceusters et al. (2005), along with some clear
violations of ontological principles Schulz et al. (2010). The NCIT provides almost
100,000 concepts, with approximately 65% containing a definition.
Chapter 3
Similarity Metrics
Similarity metrics aim at measuring the lexical or semantic similarity between
terms. Lexical similarity focuses on terms that contain similar character or word
sequences, while semantic similarity tries to determine how close in meaning the
terms are. Lexical similarity is simpler to calculate, since string-based algorithms
only require plain text to function. On the other hand, semantic similarity re-
quires extra information about the terms present in plain text. This extra in-
formation is usually acquired with the help of a knowledge base (e.g. ontology,
terminology, etc.) or through statistical analysis of corpora, i.e., large collections
of text documents that resemble real-world usage of words.
3.1 Similarity Metric vs. Distance Metric
It is common in the literature to come across the term ‘semantic distance’ instead
of ‘semantic similarity’. A distance metric d(a, b), that compares entities a and
b, must satisfy the following properties:
1. d(a, b) = 0 if and only if a = b (zero property),
2. d(a, b) = d(b, a) (symmetric property),
3. d(a, b) ≥ 0 (non-negativity property),
4. d(a, c) ≤ d(a, b) + d(b, c) (triangle inequality).
On the other hand, the requirements for a similarity metric were formally intro-
duced not long ago Chen et al. (2009). The definition states that a similarity
metric s(a, b) must satisfy the following properties:
1. s(a, a) ≥ 0,
2. s(a, b) = s(b, a),
3. s(a, a) ≥ s(a, b),
4. s(a, b) + s(b, c) ≤ s(a, c) + s(b, b),
5. s(a, a) = s(b, b) = s(a, b) if and only if a = b.
The counter-intuitive 4th property can be proven using set theory. More specifically, if $|a \cap b|$ denotes the cardinality of common characteristics between $a$ and $b$, and $\bar{c}$ denotes the complement of $c$, the following equality holds:
\[
|a \cap b| = |a \cap b \cap c| + |a \cap b \cap \bar{c}|. \tag{3.1}
\]
Then,
\[
|a \cap b| + |b \cap c| = |a \cap b \cap c| + |a \cap b \cap \bar{c}| + |a \cap b \cap c| + |\bar{a} \cap b \cap c| \leq |a \cap c| + |b|, \tag{3.2}
\]
since $|a \cap b \cap c| \leq |a \cap c|$ and $|a \cap b \cap c| + |a \cap b \cap \bar{c}| + |\bar{a} \cap b \cap c| \leq |b|$. Deduction of
similarity from distance is a common procedure that requires simple operations.
Similarity is, intuitively, a decreasing function of distance. Conversion between
the two can take many forms Chen et al. (2009). In this thesis, all formulas will
be presented as similarity measures.
3.2 Lexical Similarity
String-based methods that calculate lexical similarity can be divided into character-
based and word-based. In this section, some of the most popular metrics are
presented. For a more complete survey of lexical similarity measures see Navarro
(2001) and Gomaa and Fahmy (2013).
3.2.1 Character-based Similarity Measures
In character-based similarity, strings are viewed as character sequences and at-
tempts are made to discover character relevance.
Longest Common Substring
The Longest Common Substring algorithm Gusfield (1997) tries to find the max-
imum number of consecutive characters that two strings share. It may be imple-
mented using a suffix tree or dynamic programming.
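As an illustrative sketch (not taken from any particular library), the dynamic-programming variant can be written in Python as follows:

```python
def longest_common_substring(a, b):
    """Length of the longest run of consecutive characters shared by a and b."""
    # dp[i][j] = length of the common suffix of a[:i] and b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
                best = max(best, dp[i][j])
    return best
```

For example, `longest_common_substring("biomedical", "medical")` returns 7, since ‘medical’ occurs verbatim inside ‘biomedical’.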
Hamming Similarity
Hamming similarity is a metric that can be applied to strings of equal length. It
is a simple metric that measures the number of common characters between two
strings. Given strings a and b, the formula for string similarity can be constructed
as follows:
\[
\mathrm{sim}_{ham}(a, b) = \frac{\sum_{i} \mathbf{1}(a_i = b_i)}{|a|}, \tag{3.3}
\]
where 1(·) is the indicator function and | · | denotes string length, measured in
characters.
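Eq. (3.3) translates directly into code; a minimal Python sketch (the function name is illustrative):

```python
def sim_ham(a, b):
    """Hamming similarity (Eq. 3.3): fraction of positions where a and b agree."""
    if len(a) != len(b):
        raise ValueError("Hamming similarity is defined for equal-length strings")
    return sum(x == y for x, y in zip(a, b)) / len(a)
```

For instance, ‘karolin’ and ‘kathrin’ agree in 4 of their 7 positions, giving a similarity of 4/7.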
Levenshtein Similarity
Levenshtein distance counts the number of character alterations that need to
be made in order to transform one string to another Levenshtein (1966). This
number is bounded by the length of the longer string, which is commonly used as a
normalizing measure that restrains the value of distance to [0, 1]. Mathematically,
normalized Levenshtein distance of terms a and b is computed using the following
formula:
\[
d_{lev}(a, b) = \frac{lev_{a,b}(|a|, |b|)}{\max\{|a|, |b|\}}, \tag{3.4}
\]
where $|\cdot|$ denotes string length in number of characters,
\[
lev_{a,b}(i, j) =
\begin{cases}
\max\{i, j\}, & \text{if } \min\{i, j\} = 0, \\
\min \left\{
\begin{array}{l}
lev_{a,b}(i-1, j) + 1, \\
lev_{a,b}(i, j-1) + 1, \\
lev_{a,b}(i-1, j-1) + \mathbf{1}(a_i \neq b_j)
\end{array}
\right\}, & \text{otherwise,}
\end{cases} \tag{3.5}
\]
and $\max\{\cdot\}$, $\min\{\cdot\}$ denote the maximum and minimum functions, respectively.
Converting normalized distance to similarity can be done as follows:
\[
\mathrm{sim}_{lev}(a, b) = 1 - d_{lev}(a, b). \tag{3.6}
\]
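A minimal Python sketch of Eqs. (3.4)-(3.6), using the standard two-row dynamic-programming table (names illustrative):

```python
def sim_lev(a, b):
    """Normalized Levenshtein similarity (Eqs. 3.4-3.6)."""
    if len(a) == 0 or len(b) == 0:
        return 1.0 if a == b else 0.0
    prev = list(range(len(b) + 1))  # row for lev(i-1, .); lev(0, j) = j
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)    # lev(i, 0) = i
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution
        prev = cur
    return 1.0 - prev[len(b)] / max(len(a), len(b))
```

The classic pair ‘kitten’/‘sitting’ has Levenshtein distance 3, so the normalized similarity is 1 − 3/7.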
Jaro Similarity
Jaro similarity Jaro (1989, 1995) takes into account both the number and sequence
of common characters present in the two strings. Let us consider strings $a = a_1 \ldots a_K$ and $b = b_1 \ldots b_L$. A character $a_i$ is said to be common with $b$ if the same character exists in $b$ within a window of $\min\{|a|, |b|\}/2$ positions from $b_i$. Let $a' = a'_1 \ldots a'_{K'}$ be those characters in $a$ that are common with $b$, and $b' = b'_1 \ldots b'_{L'}$ those characters in $b$ that are common with $a$. A transposition for $a', b'$ is a position $i$ in the strings $a', b'$ at which $a'_i \neq b'_i$. The number of transpositions for $a', b'$ divided by two is denoted as $T_{a',b'}$. Then, Jaro's formula for similarity is given by:
\[
\mathrm{sim}_{jaro}(a, b) = \frac{1}{3}\left(\frac{|a'|}{|a|} + \frac{|b'|}{|b|} + \frac{|a'| - T_{a',b'}}{|a'|}\right). \tag{3.7}
\]
It should be noted that Jaro similarity violates the symmetry property of the similarity metric definition in Section 3.1, therefore it is not a true similarity metric according to that definition.
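The definition above can be sketched in Python as follows. Note that the matching window here follows the min-based definition given in the text; many common implementations instead derive the window from the longer string:

```python
def sim_jaro(a, b):
    """Jaro similarity (Eq. 3.7), window of min(|a|,|b|)/2 around each position."""
    if not a or not b:
        return 0.0
    window = min(len(a), len(b)) // 2
    b_used = [False] * len(b)
    a_common = []
    # greedily pair each character of a with an unused equal character of b
    # lying inside the window around the same position
    for i, ch in enumerate(a):
        lo, hi = max(0, i - window), min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not b_used[j] and b[j] == ch:
                b_used[j] = True
                a_common.append(ch)
                break
    if not a_common:
        return 0.0
    b_common = [b[j] for j in range(len(b)) if b_used[j]]
    # transpositions: half the positions at which the common sequences differ
    t = sum(x != y for x, y in zip(a_common, b_common)) / 2
    m = len(a_common)
    return (m / len(a) + m / len(b) + (m - t) / m) / 3
```

For the pair ‘MARTHA’/‘MARHTA’, all six characters are common and there is one transposition, giving 17/18 ≈ 0.944.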
Jaro-Winkler Similarity
Jaro-Winkler similarity Winkler (1999) is a variation of Jaro similarity which
promotes strings with long common prefixes. The length of the longest prefix
common to both strings $a$ and $b$ is denoted as $P$. Then, with $P' = \min(P, 4)$, Jaro-Winkler similarity is given by:
\[
\mathrm{sim}_{j\&w}(a, b) = \mathrm{sim}_{jaro}(a, b) + \frac{P'}{10}\left(1 - \mathrm{sim}_{jaro}(a, b)\right). \tag{3.8}
\]
N-gram Similarity
A string can be split into n-grams, i.e. all possible consecutive character sequences
of length n in the string. As an example, the word ‘protein’ can be split into the 3-
grams ‘pro’, ‘rot’, ‘ote’, ‘tei’ and ‘ein’. When comparing two strings, the number
of common n-grams is computed and normalized by the maximum number of
n-grams. More specifically, given strings a and b, similarity is given by:
\[
\mathrm{sim}_{ngram}(a, b) = \frac{N_{com}}{N_{max}}, \tag{3.9}
\]
where Ncom denotes the number of common n-grams and Nmax denotes the max-
imum number of n-grams in either of the two strings.
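A Python sketch of Eq. (3.9); here n-grams are collected as sets, so repeated n-grams within a string count once (an assumption, since the text does not specify how duplicates are handled):

```python
def sim_ngram(a, b, n=3):
    """N-gram similarity (Eq. 3.9) with n-grams collected as sets."""
    grams_a = {a[i:i + n] for i in range(len(a) - n + 1)}
    grams_b = {b[i:i + n] for i in range(len(b) - n + 1)}
    if not grams_a or not grams_b:
        return 0.0
    n_com = len(grams_a & grams_b)           # common n-grams
    n_max = max(len(grams_a), len(grams_b))  # maximum number of n-grams
    return n_com / n_max
```

For example, ‘protein’ and ‘proteins’ share all five 3-grams of ‘protein’, while ‘proteins’ has six, yielding 5/6.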
3.2.2 Word-based Similarity Measures
As the name implies, word-based measures view the string as a collection of words.
Similarity measures dictate how similar two terms are word-wise, and no weight
is given on character similarity.
Dice Similarity
Dice similarity considers input strings a and b as sets of words A and B respec-
tively, and calculates similarity as follows:
\[
\mathrm{sim}_{dice}(a, b) = \frac{2|A \cap B|}{|A| + |B|}, \tag{3.10}
\]
where | · | denotes set cardinality in number of words.
Jaccard Similarity
Jaccard similarity counts the number of common words of the compared strings
and divides it by the number of distinct words in both strings, i.e.
\[
\mathrm{sim}_{jacc}(a, b) = \frac{|A \cap B|}{|A \cup B|}. \tag{3.11}
\]
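Both word-based measures reduce to simple set operations; a Python sketch of Eqs. (3.10) and (3.11), splitting strings on whitespace:

```python
def sim_dice(a, b):
    """Dice similarity (Eq. 3.10) over the word sets of the two strings."""
    A, B = set(a.split()), set(b.split())
    return 2 * len(A & B) / (len(A) + len(B))

def sim_jacc(a, b):
    """Jaccard similarity (Eq. 3.11): shared words over distinct words."""
    A, B = set(a.split()), set(b.split())
    return len(A & B) / len(A | B)
```

For ‘heart attack’ vs. ‘heart failure’, the one shared word out of three distinct words gives a Dice score of 0.5 and a Jaccard score of 1/3.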
Cosine Similarity
In order to compute cosine similarity, the compared strings should be converted to
vectors. The dimension of the resulting vectors will be equal to the total number
of distinct words present in both. Therefore, each element in the vector represents
one word. The vector values for each string are computed as follows: A vector
contains unitary values in positions that correspond to words that are contained
in the respective string. Similarly, a vector contains zero values in all positions
that correspond to words that are not present in the respective string. Given
strings a and b, the respective vectors a and b are computed. Cosine similarity
is then given by:
\[
\mathrm{sim}_{cos}(a, b) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\|\, \|\mathbf{b}\|}, \tag{3.12}
\]
where || · || denotes the Euclidean norm function.
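With the binary word vectors described above, Eq. (3.12) simplifies considerably: the dot product equals the number of shared words and each norm is the square root of the string's vocabulary size. A Python sketch:

```python
import math

def sim_cos(a, b):
    """Cosine similarity (Eq. 3.12) over binary word-occurrence vectors."""
    A, B = set(a.split()), set(b.split())
    # dot product = number of shared words; each norm = sqrt(|word set|)
    return len(A & B) / (math.sqrt(len(A)) * math.sqrt(len(B)))
```

For ‘heart attack’ vs. ‘heart failure’ this yields 1/(√2·√2) = 0.5.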
Manhattan Similarity
Taxicab geometry considers that distance between two points in a grid is given
by the sum of the absolute differences of their respective coordinates. The grid
resembles a uniform city road map, where diagonal movements are not permitted.
This is the reason why the distance metric in this space is often called Manhattan
distance or city block distance. Considering $N$-dimensional string vectors $\mathbf{a}$ and $\mathbf{b}$, Manhattan similarity can be computed as:
\[
\mathrm{sim}_{manh}(a, b) = 1 - \frac{\sum_{i=1}^{N} |a_i - b_i|}{N}, \tag{3.13}
\]
where N is a normalizing constant that represents the dimension of a and b.
Euclidean Similarity
Euclidean similarity also considers strings as vectors, and computes similarity as:
\[
\mathrm{sim}_{eucl}(a, b) = 1 - \sqrt{\frac{\sum_{i=1}^{N} |a_i - b_i|^2}{N}}. \tag{3.14}
\]
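A Python sketch of Eqs. (3.13) and (3.14) over the same binary word vectors used for cosine similarity (helper name illustrative):

```python
import math

def binary_vectors(a, b):
    """Binary word-occurrence vectors over the joint vocabulary of a and b."""
    words_a, words_b = set(a.split()), set(b.split())
    vocab = sorted(words_a | words_b)
    va = [1 if w in words_a else 0 for w in vocab]
    vb = [1 if w in words_b else 0 for w in vocab]
    return va, vb

def sim_manh(a, b):
    """Manhattan similarity (Eq. 3.13)."""
    va, vb = binary_vectors(a, b)
    return 1 - sum(abs(x - y) for x, y in zip(va, vb)) / len(va)

def sim_eucl(a, b):
    """Euclidean similarity (Eq. 3.14)."""
    va, vb = binary_vectors(a, b)
    return 1 - math.sqrt(sum((x - y) ** 2 for x, y in zip(va, vb)) / len(va))
```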
3.3 Ontological Semantic Similarity
An ontology is a collection of concepts and their inter-relationships. It may be
visualized as a graph, in which nodes represent concepts and edges represent the
relations between them. Usually, ontologies are viewed as taxonomies, where ‘is-
a’ and ‘part-of’ relations play the most important role. Viewing the ontology as a
taxonomy, one can apply semantic similarity metrics that exploit the hierarchical
structure. Probably the most famous object of semantic similarity tests is the
computational lexicon WordNet Miller (1995). In WordNet, closely related terms
are grouped together to form synsets. These synsets, in turn, form semantic rela-
tions with other synsets. WordNet is commonly referred to as a lexical ontology,
due to an obvious mapping of lexical hyponymy to ontological subsumption.
3.3.1 Intra-ontology Semantic Similarity
Intra-ontology semantic similarity metrics are meant to measure similarity be-
tween concepts that reside within the same ontology. These metrics can be
roughly divided into distance-based, information-based and feature-based.
Distance-based Metrics
Distance-based metrics take advantage of the ontological topology to compute
the similarity between concepts. This method requires viewing the ontology as
a rooted Directed Acyclic Graph (DAG), in which nodes are concepts and edges
among them are restricted to hierarchical relationships, with the most usual type
being ‘is-a’ relationships. At the top, there is a single concept, the root. The graph
is directed, starting from a low-level concept and directed towards its ancestors
through transitive relationships. The graph is also acyclic, since a finite path
from a source node to a destination node cannot return to the source node. In
other words, a node can never be a child of one of its children.
A simple look at an ontology from a geometric perspective may reveal im-
portant information about the similarity of concepts. As depth in the DAG
increases, concepts become increasingly specific, thus similarity is expected to
increase. Another important characteristic of the ontology DAG is that the path
between concepts is not always unique, therefore distance-based similarity will
depend on which path is chosen. Finally, the density of nodes is a good indicator
of similarity; as density increases, concepts approach each other and similarity
increases.
The accuracy of distance-based methods depends on the level of detail that
the ontology captures. A poorly structured ontology with many omissions might
yield misleading similarity results. Fortunately, a lot of effort has been made to
make biomedical ontologies as complete as possible, therefore network density in
biomedical ontologies is usually high.
The most straightforward way to measure the similarity of concept nodes is
given in Rada et al. (1989). In that work by Rada et al., all edges are assigned
a unitary weight and the distance between two concepts is equal to the number
of edges that are present in their shortest path. Let us consider two distinct
concepts c1 and c2 in the hierarchy. Each path i that connects these two concept
nodes may be represented as a set which includes all edges ek present in the path,
i.e.
\[
path_i(c_1, c_2) = \{e_1, e_2, \ldots, e_K\}, \tag{3.15}
\]
with cardinality $|path_i(c_1, c_2)| = K$. The distance between concepts c1 and c2 is,
then, equal to the shortest path that connects them, i.e.,
\[
d_{rada}(c_1, c_2) = \min_{\forall i} |path_i(c_1, c_2)|. \tag{3.16}
\]
Note that in the literature, there are cases (e.g. Al-Mubaid and Nguyen (2006)) where
Rada’s measure is used with node counting, instead of edge counting. In those
cases, each path is represented as a set of the nodes that compose it, including
the end nodes. The minimum distance can be converted into a similarity metric,
as in Resnik (1995):
\[
\mathrm{sim}_{rada}(c_1, c_2) = 2D - d_{rada}(c_1, c_2), \tag{3.17}
\]
where D is the maximum depth of the taxonomy. This method fails to capture
the intuition that concept nodes, which reside at the lower part of the hierarchy
and are separated by distance d, are more similar than higher-level nodes with the
same distance separation d. Also, its success highly depends on the uniformity of
edge distribution within the ontology. For these reasons, other approaches have
been proposed in order to achieve a more representative score of similarity.
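Edge counting via Eqs. (3.15)-(3.17) amounts to a breadth-first search over the undirected ‘is-a’ edges; a Python sketch on a small, purely illustrative taxonomy:

```python
from collections import deque

# toy 'is-a' taxonomy (purely illustrative): child -> parent
parents = {
    "disorder": "root",
    "heart_disease": "disorder",
    "lung_disease": "disorder",
    "arrhythmia": "heart_disease",
    "asthma": "lung_disease",
}

def neighbours(node):
    """Treat the hierarchical edges as undirected for path finding."""
    nbrs = [parents[node]] if node in parents else []
    nbrs += [child for child, parent in parents.items() if parent == node]
    return nbrs

def d_rada(c1, c2):
    """Shortest path between two concepts, in number of edges (Eq. 3.16)."""
    queue, seen = deque([(c1, 0)]), {c1}
    while queue:
        node, dist = queue.popleft()
        if node == c2:
            return dist
        for nb in neighbours(node):
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, dist + 1))
    return None  # concepts not connected

D = 3  # maximum depth of this toy taxonomy, in edges

def sim_rada(c1, c2):
    """Eq. 3.17: distance converted to similarity via the maximum depth."""
    return 2 * D - d_rada(c1, c2)
```

Here ‘arrhythmia’ and ‘asthma’ are four edges apart (up through ‘disorder’ and back down), so their similarity is 2·3 − 4 = 2.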
In Wu and Palmer (1994), the relative depth of the compared concepts in the
hierarchy is considered. In that work, Wu and Palmer introduce the Least Com-
mon Subsumer (LCS) of the compared concepts. The LCS is the hierarchically
deepest common ancestor of the compared concepts. Similarity for concepts c1
and c2 is then given as:
\[
\mathrm{sim}_{w\&p}(c_1, c_2) = \frac{2h}{N_1 + N_2 + 2h}, \tag{3.18}
\]
where N1 is the number of nodes in the path between concept c1 and the LCS,
N2 is the number of nodes between concept c2 and the LCS, and h is the depth
of the LCS, measured again in number of nodes.
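A Python sketch of Eq. (3.18) on a small illustrative taxonomy; path lengths are counted here in edges rather than nodes, which shifts each term by a constant but preserves the structure of the formula:

```python
# toy 'is-a' taxonomy (purely illustrative): child -> parent
parents = {
    "disorder": "root",
    "heart_disease": "disorder",
    "lung_disease": "disorder",
    "arrhythmia": "heart_disease",
    "asthma": "lung_disease",
}

def chain_to_root(c):
    """The concept itself followed by all of its ancestors up to the root."""
    chain = [c]
    while chain[-1] in parents:
        chain.append(parents[chain[-1]])
    return chain

def sim_wp(c1, c2):
    """Wu-Palmer similarity (Eq. 3.18), with path lengths counted in edges."""
    chain1, chain2 = chain_to_root(c1), chain_to_root(c2)
    # first element of chain1 also present in chain2 = deepest common ancestor
    lcs = next(node for node in chain1 if node in chain2)
    n1 = chain1.index(lcs)           # edges from c1 up to the LCS
    n2 = chain2.index(lcs)           # edges from c2 up to the LCS
    h = len(chain_to_root(lcs)) - 1  # depth of the LCS below the root
    return 2 * h / (n1 + n2 + 2 * h)
```

In this toy hierarchy, ‘arrhythmia’ and ‘asthma’ meet at ‘disorder’ (depth 1), giving 2/(2 + 2 + 2) = 1/3, whereas ‘arrhythmia’ and its parent ‘heart_disease’ score 0.8.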
In Li et al. (2003), the authors followed various strategies in their attempt
to calculate similarity as a function of the shortest path between the compared
concepts, the depth of their LCS and the local density of the ontology. They
found that the best performance was obtained when they used the following
non-linear function:
\[
\mathrm{sim}_{li}(c_1, c_2) = e^{-\alpha\, d_{rada}(c_1, c_2)} \cdot \frac{e^{\beta h} - e^{-\beta h}}{e^{\beta h} + e^{-\beta h}}, \tag{3.19}
\]
where $\alpha, \beta$ are non-negative parameters and $h = d_{rada}(LCS(c_1, c_2), root)$ denotes the minimum depth of the LCS. Distances are measured in number of edges.
Al-Mubaid and Nguyen attempt to combine path length and node depth in one
measure. In Al-Mubaid and Nguyen (2006), they view the DAG as a composition
of clusters, with each cluster having as root a child of the ontology root. The
usage of clusters aims to exploit local characteristics of different branches. Given
concepts c1 and c2, they first compute their so-called common specificity:
\[
CSpec(c_1, c_2) = D_c - h, \tag{3.20}
\]
where Dc denotes the depth of the specific cluster and h refers to the depth of the
LCS in the ontology, with both quantities measured in number of nodes. Then
similarity is computed as:
\[
\mathrm{sim}_{a\&n}(c_1, c_2) = \log\left((Path - 1)^{\alpha} \times (CSpec)^{\beta} + k\right), \tag{3.21}
\]
where Path is a modified version of Rada’s distance measure which is adapted
according to the largest cluster, and α, β, k are constants, whose default values
are unitary.
Information-Based Metrics
One of the first attempts to focus on nodes in the similarity formula is that
of Leacock and Chodorow Leacock and Chodorow (1998). This method uses
negative log likelihood in a way that resembles the formula of self-information
Cover and Thomas (2012), but does not really involve a valid probability. Instead,
a normalized form of the path length between the concepts is used:
\[
\mathrm{sim}_{l\&c}(c_1, c_2) = -\log\left(N_p / 2D\right), \tag{3.22}
\]
where Np is the number of nodes in the shortest path between concepts c1 and
c2. This variable also includes the end nodes.
Resnik, in Resnik (1995), continues down this path by replacing the normal-
ized path length with a probability measure P(·) to calculate the information
content (IC) of a concept. He considers all common subsumers CSi of concepts
c1 and c2 and calculates similarity as:
\[
\mathrm{sim}_{resn}(c_1, c_2) = \max_{\forall i}\left[-\log(P(CS_i))\right], \tag{3.23}
\]
or, equivalently,
\[
\mathrm{sim}_{resn}(c_1, c_2) = -\log(P(LCS)). \tag{3.24}
\]
Considering that the IC of a concept c is defined as the negative logarithm of its
probability, i.e. $IC(c) = -\log(P(c))$, equation (3.24) can also be written as:
\[
\mathrm{sim}_{resn}(c_1, c_2) = IC(LCS(c_1, c_2)). \tag{3.25}
\]
Probabilities are estimated with the help of a text corpus, i.e. a collection of
natural language excerpts, specifically chosen to provide a good representation of
actual term usage. When dealing with biomedical ontology concepts, collections
of Pubmed1 abstracts are commonly used as corpora to determine the probability
of each concept.
Given a corpus, the occurrence of a term which corresponds to concept c
essentially implies the occurrence of each and every concept that subsumes c
within the ontological structure. Conversely, the number of occurrences of a
concept c depends not only on the number of appearances of c itself in the corpus,
but also on every occurrence of its descendants in the hierarchy. Thus, the number
of occurrences of concept c is given by:
\[
occ(c) = \sum_{n \,\in\, subsumed(c)} count(n), \tag{3.26}
\]
where subsumed(c) represents c and its descendant concept nodes, and count(·)
denotes the number of occurrences of the specific concept within the given corpus.
Converting occurrences to probability can be done using:
\[
P(c) = \frac{occ(c)}{N}, \tag{3.27}
\]
where N is the total number of occurrences of ontology terms in the corpus.
This method results in higher probabilities for concepts residing at the top part
1http://www.ncbi.nlm.nih.gov/pubmed
of the hierarchy, with the root having unitary probability. Therefore, concepts
whose LCS lies lower in the hierarchy are more similar, since their LCS has low
probability (i.e., high IC).
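The occurrence propagation of Eqs. (3.26) and (3.27) can be sketched in Python; the taxonomy and the corpus counts below are made up purely for illustration:

```python
import math

# toy taxonomy (concept -> direct children) and toy corpus counts
children = {
    "root": ["disorder"],
    "disorder": ["heart_disease", "lung_disease"],
    "heart_disease": [],
    "lung_disease": [],
}
count = {"root": 0, "disorder": 2, "heart_disease": 5, "lung_disease": 3}

def occ(c):
    """Eq. 3.26: occurrences of c plus those of everything it subsumes."""
    return count[c] + sum(occ(child) for child in children[c])

N = occ("root")  # total occurrences of ontology terms in the corpus

def ic(c):
    """IC(c) = -log P(c), with P(c) as in Eq. 3.27."""
    return -math.log(occ(c) / N)
```

As expected, the root accumulates every occurrence (probability 1, IC 0), while ‘heart_disease’, seen in half of the occurrences, receives IC = log 2.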
A possible drawback of this method is that probabilities are tied to the choice
of corpus. So far, in the biomedical domain, there is no widely accepted corpus
that covers the domain needs Al-Mubaid and Nguyen (2006). This is due to the
fact that thousands of new terms and abbreviations appear in the literature every
year, thus a stable corpus might not function well. Since extensions of the corpus
would need to be considered at fixed intervals, it might not serve as a useful
benchmark.
Alternatively, computation of IC can be performed without the use of a corpus,
by solely relying on the structure of the ontology DAG. Intrinsic computation of
IC involves approximating the occurrence probability of a concept as a function
of multiple variables, such as number of descendant nodes, number of subsumers
or number of descendant nodes which are leaves in the ontology. In Seco et al.
(2004), the IC of a concept c is given by:
\[
IC_{seco}(c) = 1 - \frac{\log(descendants(c) + 1)}{\log(allConcepts)}, \tag{3.28}
\]
where descendants(c) returns the number of nodes that concept c subsumes, and
allConcepts denotes the number of all the available concepts in the ontology.
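A Python sketch of Eq. (3.28) on an illustrative taxonomy:

```python
import math

# toy taxonomy (concept -> direct children), purely illustrative
children = {
    "root": ["disorder"],
    "disorder": ["heart_disease", "lung_disease"],
    "heart_disease": ["arrhythmia"],
    "lung_disease": [],
    "arrhythmia": [],
}
all_concepts = len(children)

def descendants(c):
    """Number of concepts subsumed by c, excluding c itself."""
    return sum(1 + descendants(child) for child in children[c])

def ic_seco(c):
    """Eq. 3.28: intrinsic IC; leaves score 1, the root scores 0."""
    return 1 - math.log(descendants(c) + 1) / math.log(all_concepts)
```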
The IC function introduced by Seco et al. has the drawback that it assigns an IC
equal to one for every leaf node in the ontology, and also that concepts containing
the same number of descendant nodes are again given the same IC. An attempt to
distinguish the IC between leaf concepts was made in Zhou et al. (2008), by also
including the depth of the node in the calculation, normalized by the maximum
depth of the ontology. The proposed IC formula is given by:
\[
IC_{zhou}(c) = k \, IC_{seco}(c) + (1 - k)\,\frac{\log(depth(c) + 1)}{\log(maxDepth)}, \tag{3.29}
\]
where depth(c) represents the depth of the concept c in the hierarchy, maxDepth
is the maximum depth of the ontology, measured in number of nodes, and k is a weighting constant.
The authors in Sanchez et al. (2011) further improve the modeling of the IC
function. In that work, the IC function can also distinguish concepts that contain
the same number of descendants, due to the fact that the number of subsumers
of a concept is also used. The IC is given as:
\[
IC_{san}(c) = -\log\left(\frac{\frac{leaves(c)}{ancestors(c)} + 1}{allLeaves}\right), \tag{3.30}
\]
where leaves(c) is the number of nodes that are descendants of c and have no
children, ancestors(c) refers to the number of concepts which subsume c and
allLeaves denotes the total number of leaf nodes in the ontology. The IC func-
tions of equations (3.28), (3.29) and (3.30) can be used in equation (3.25) to
compute the similarity between two concepts without using a corpus.
Lin et al. use IC in an alteration of the similarity metric of Wu and Palmer
(1994). More specifically,
\[
\mathrm{sim}_{lin}(c_1, c_2) = \frac{2\,\mathrm{sim}_{resn}(c_1, c_2)}{IC(c_1) + IC(c_2)}. \tag{3.31}
\]
This approach aims to include the individual characteristics of the compared
nodes that Resnik’s approach neglected. Indeed, in Resnik’s measure, any two
pairs of nodes that have the same LCS produce the same similarity.
Jiang and Conrath follow a similar approach with Wu and Palmer (1994),
but avoid the scaling of similarity Jiang and Conrath (1997). Instead, they use a
distance metric as follows:
\[
d_{j\&c}(c_1, c_2) = IC(c_1) + IC(c_2) - 2\,\mathrm{sim}_{resn}(c_1, c_2). \tag{3.32}
\]
Various transformations have been applied to convert this distance to similarity.
Among these, the authors in Seco et al. (2004) consider a linear transformation
and present the following formula of similarity normalized in the interval [0,1]:
\[
\mathrm{sim}_{j\&c}(c_1, c_2) = 1 - \frac{d_{j\&c}(c_1, c_2)}{2}. \tag{3.33}
\]
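Given precomputed IC values for the two concepts and their LCS, the measures of Eqs. (3.25), (3.31), (3.32) and (3.33) differ only in how these three quantities are combined; a minimal Python sketch (the IC values in the example are made up):

```python
def sim_resnik(ic_lcs):
    """Eq. 3.25: similarity is simply the IC of the least common subsumer."""
    return ic_lcs

def sim_lin(ic1, ic2, ic_lcs):
    """Eq. 3.31: Resnik's value scaled by the ICs of the compared concepts."""
    return 2 * ic_lcs / (ic1 + ic2)

def sim_jc(ic1, ic2, ic_lcs):
    """Eqs. 3.32-3.33: Jiang-Conrath distance mapped linearly to a similarity."""
    d = ic1 + ic2 - 2 * ic_lcs
    return 1 - d / 2
```

For instance, with IC(c1) = 0.9, IC(c2) = 0.8 and IC(LCS) = 0.6, Lin's measure gives 1.2/1.7 ≈ 0.71 and the linearized Jiang-Conrath measure gives 0.75.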
Another example can be found in Zhu et al. (2009), in which an exponential
function is used for the similarity formula, along with a constant λ that accounts
for curve steepness:
\[
\mathrm{sim}_{j\&c}(c_1, c_2) = e^{-\frac{d_{j\&c}(c_1, c_2)}{\lambda}}. \tag{3.34}
\]
Feature-Based Measures
Feature-based measures do not necessarily conform to the similarity metric rules
of Chen et al. (2009), as they allow for similarity asymmetry. In feature-based
techniques, the two compared concepts are viewed as sets of features, in contrast
to the geometric view presented in previous sections. To calculate similarity, not
only the common features of the concepts are taken into account, but also the
differences between them. That way, common features improve similarity, while
different features penalize its value Tversky et al. (1977). Given concepts c1 and
c2, let C1 and C2 denote the sets that contain their features. Then, similarity
between the two can be given as:
simtve(c1, c2) = |C1 ∩ C2| / (|C1 ∩ C2| + µ|C1 − C2| + (1 − µ)|C2 − C1|), (3.35)
where µ is a weight which takes values in [0,1]. In Rodríguez et al. (1999), the µ
parameter is computed as follows:
µ = d(c1, LCS)/d(c1, c2) if d(c1, LCS) ≤ d(c2, LCS), and
µ = 1 − d(c1, LCS)/d(c1, c2) otherwise. (3.36)
This asymmetric function stems from Tversky’s observation that similarity might
not be symmetric. In one of Tversky’s examples, North Korea was said to be more
similar to Red China than the reverse.
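Equation (3.35) is straightforward to implement with set operations. The following is a minimal Java sketch; the class name and the representation of features as strings are illustrative:

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch of Tversky's feature-based similarity, equation (3.35).
// With mu != 0.5 the measure is asymmetric: similarity(c1, c2) generally
// differs from similarity(c2, c1).
public class TverskySimilarity {

    public static double similarity(Set<String> c1, Set<String> c2, double mu) {
        Set<String> common = new HashSet<>(c1);
        common.retainAll(c2);                  // C1 ∩ C2
        Set<String> onlyC1 = new HashSet<>(c1);
        onlyC1.removeAll(c2);                  // C1 − C2
        Set<String> onlyC2 = new HashSet<>(c2);
        onlyC2.removeAll(c1);                  // C2 − C1
        return common.size()
                / (common.size() + mu * onlyC1.size() + (1 - mu) * onlyC2.size());
    }
}
```

For example, with µ = 0.9 a concept with many distinctive features scores lower against a sparse concept than the reverse, mirroring Tversky's North Korea / Red China observation.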
3.3.2 Inter-ontology Semantic Similarity
Inter-ontology semantic similarity measures try to quantify the similarity between
concepts that belong to different ontologies. Fairly little research has been doc-
umented in this area, due to the inherent difficulty of comparing heterogeneous
structures. A common approach is to combine the different ontologies into a
single ontology through detailed concept mappings Gangemi et al. (1998). It is
clear that this is very challenging and requires the help of a domain expert, as
well as plenty of time and effort. Furthermore, not all biomedical terminologies
are consistent and their lack of homogeneity is a major problem. Simpler ap-
proaches have been proposed in the literature. A usual first step is to merge the
different ontologies under a dummy root. This approach is found in Rodríguez
and Egenhofer (2003), where the authors use a weighted version of Tversky’s
similarity which also takes into account geometrical features of the ontologies.
A similar route is followed by Petrakis et al. (2006), where the authors substi-
tute Tversky’s similarity with a form of Jaccard similarity. The drawback of
these cross-similarity metrics is that they do not consider term overlap in both
ontologies. Other methods rely on extensions of single ontology similarity met-
rics. Examples of such work can be found in Al-Mubaid and Nguyen (2006) and
Sanchez et al. (2012).
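The dummy-root merging step mentioned above can be sketched as follows; the data structures and names are illustrative and are not taken from any of the cited systems. Concept codes in the two ontologies are assumed to be disjoint:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative dummy-root merge: the roots of two ontologies are joined under
// an imaginary node, so that cross-ontology path-based measures (e.g. Rada's
// shortest path) become defined over the merged structure.
public class DummyRootMerge {

    // parentsA/parentsB map each concept code to the codes of its parents;
    // rootsA/rootsB list the top-level concepts of each ontology.
    public static Map<String, List<String>> merge(Map<String, List<String>> parentsA,
                                                  Map<String, List<String>> parentsB,
                                                  List<String> rootsA,
                                                  List<String> rootsB,
                                                  String dummyRoot) {
        Map<String, List<String>> merged = new HashMap<>(parentsA);
        merged.putAll(parentsB);
        // every former root now has exactly one parent: the imaginary node
        for (String root : rootsA) merged.put(root, List.of(dummyRoot));
        for (String root : rootsB) merged.put(root, List.of(dummyRoot));
        return merged;
    }
}
```

Any path between concepts of different ontologies in the merged hierarchy necessarily passes through the dummy root, which is exactly why such cross-similarities tend to be coarse.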
Chapter 4
Search Interfaces
Search has become one of the most commonly used tools for computer users.
It can be found everywhere, from stand-alone web-based search engines to em-
bedded search forms that appear in desktop applications and websites. To a large
extent, success of the search procedure depends on the users’ ability to formulate
their information needs, transforming them into queries that are highly likely to
produce desired results. For this reason, a lot of effort has been spent on improv-
ing the search interfaces and providing tools that will enhance user experience.
In this chapter, the basic characteristics of successful search interface design are
presented, with main focus on web-search interfaces.
4.1 Information Seeking Models
Information seeking models attempt to recognize and describe the strategies fol-
lowed by humans from the moment they sense a search need until the moment
they acquire desired results. The search procedure may be viewed as a repetition
of actions. In Sutcliffe and Ennis (1998), the authors identify the following four
actions in what is considered the standard model of information seeking:
1. Problem Identification
2. Articulation of Need
3. Query Formulation
4. Evaluation of Results
The first step refers to conceptualization of the search need, while the second step
involves expressing this need in words. The third step requires the user to trans-
form the articulated need into a format that will be accepted by the underlying
search system. Finally, the fourth step refers to the procedure of judging the
results critically, exploiting any relevant domain knowledge and deciding whether
the need is satisfied. A search may be characterized as ‘ok’, ‘failed’ or ‘unsatis-
factory’. An ‘ok’ search ends the cycle successfully. An ‘unsatisfactory’ search
may lead to reformulation of the query or re-articulation of the need, while a
completely ‘failed’ search might require re-identification of the problem.
Sutcliffe and Ennis’s model assumes that the need does not change, unless
results are disappointing. It does not capture the fact that users learn as they
search. This dynamic aspect of information seeking was captured in an earlier
work by Bates (1989). In that study, the user’s needs are assumed to change
as the process advances. Furthermore, Bates claims that the success of the search
procedure does not only depend on the final list of results, but on the selections
made along the way. This model is referred to as the berry-picking model, to
denote that it does not result in a single set of results. A simple example of
berry-picking is a user who issues a broad query such as ‘String similarity
algorithms’ and then refines it to ‘Jaro similarity’ after spotting that term in
the initial result list.
4.2 Query Specification
Queries are usually specified through rectangular entry forms, as in Fig. 4.1. The
width of these forms varies in size, with studies showing that wider forms promote
formulation of longer queries Franzen and Karlgren (2000); Belkin et al. (2003).
It has been observed that around 88% of search queries are composed of 1 to 4
Figure 4.1: The Google search engine entry form.
Figure 4.2: Facebook uses grayed-out descriptive text to help in the formulation of user
queries.
words, with mean length equal to 2.8 words per query Jansen et al. (2007). The
actual search is executed by pressing the return key or mouse-clicking a specified
button (e.g. magnifying glass in Bing). In some cases, entry forms decorate their
background with descriptive text that provides guidance for the user. An example
is Facebook’s search form, as seen in Fig. 4.2. The text disappears once the user
clicks inside the form. This usually helps to narrow down the search domain.
After query submission, processing of the query takes place before any attempt
to retrieve results. This process may include removal of stopwords (i.e. words
with high appearance probability such as ‘the’, ‘a’), normalization of words (e.g.
plural to singular) and permutation of word order. Boolean logic may also be used
in the case of multiple words per query. Returning results that contain all query
words (i.e. Boolean AND operator) seems more intuitive, although this might
sometimes lead to overly specific queries that return no results. The actual types
of processing are often hidden from the users, in an attempt to avoid confusion
and promote transparency Muramatsu and Pratt (2001).
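The preprocessing steps described above can be sketched as follows; the stopword list and the crude plural-stripping rule are illustrative placeholders for real linguistic resources such as full stopword lists and stemmers:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative query preprocessing: lowercasing, stopword removal and a
// naive plural-to-singular normalization.
public class QueryPreprocessor {

    // tiny illustrative stopword list; real systems use far larger ones
    private static final Set<String> STOPWORDS = Set.of("the", "a", "an", "of", "in");

    public static List<String> preprocess(String query) {
        List<String> terms = new ArrayList<>();
        for (String token : query.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.isEmpty() || STOPWORDS.contains(token)) {
                continue;                       // drop stopwords
            }
            if (token.endsWith("s") && token.length() > 3) {
                token = token.substring(0, token.length() - 1); // crude singularization
            }
            terms.add(token);
        }
        return terms;
    }
}
```

The resulting term list can then be combined with Boolean AND or OR semantics, as discussed above.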
Most modern search interfaces are equipped with dynamic search suggestion,
also known as auto-completion (See Fig. 4.3). As the user starts typing, a list of
Figure 4.3: Bing’s search interface features a powerful dynamic search suggestion, where
prefixes are highlighted with grayed-out font and the remaining text is in bold.
term suggestions appears under the entry form. The suggestions contained in the
list are usually queries whose prefix matches what has been typed so far, although
there are cases where interior matches are also included. The user can then mouse-
click the most relevant query or navigate through the list, using keyboard arrows.
Studies have shown that approximately one third of all search attempts in the
Yahoo Search Assist were performed through a dynamically suggested query
Anick and Kantamneni (2008). The dynamic search suggestion technique attempts
to minimize unneeded typing from the user side and can alleviate spelling errors
early. Most importantly, though, it reassures the user that results are available,
so there is no frustration from empty result pages.
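A minimal prefix-based suggester of the kind described above can be built on a sorted set, since all entries sharing a prefix form a contiguous range. This sketch is illustrative only: it ignores ranking by query popularity and the interior-match case:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Illustrative prefix-matching suggester: in a sorted set, all entries that
// share a given prefix are contiguous and reachable through tailSet().
public class PrefixSuggester {

    private final TreeSet<String> queries = new TreeSet<>();

    public void add(String query) {
        queries.add(query.toLowerCase());
    }

    public List<String> suggest(String prefix, int limit) {
        String p = prefix.toLowerCase();
        List<String> result = new ArrayList<>();
        for (String candidate : queries.tailSet(p)) {
            if (!candidate.startsWith(p) || result.size() >= limit) {
                break;              // left the prefix range, or got enough results
            }
            result.add(candidate);
        }
        return result;
    }
}
```

In a real interface the candidate pool would be past queries or ontology term labels, re-ranked by frequency before display.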
An important point to consider is that searchers often return to their previously
accessed information. In the empirical study undertaken by Tauscher and
Greenberg (1997), it was found that there is a 58%
chance that the next web page to be visited had been visited before. A more
recent study Zhang and Zhao (2011) about tabbed browsing, conducted in 2010,
also finds page revisitation to be around the same levels, at 59.3%. Various tools
Figure 4.4: The Safari browser’s embedded search interface explicitly states which queries are
suggestions and which belong to the user’s recent search history.
Figure 4.5: The Firefox browser’s embedded search interface contains recent queries on top,
and separates them from suggestions using a solid line.
exist to help users find their intended pages, including Uniform Resource Locator
(URL) history, bookmarking of pages, basic navigation buttons (e.g. ‘Back’ but-
ton for short term page revisit) and change of URL font color if page has already
been visited. Among other methods documented, users may save whole webpages
to their local disk or keep URLs in text documents, after enriching them with
comments Jones et al. (2002). Interestingly, a common approach to revisiting
documents is actually re-searching for them Obendorf et al. (2007). Users who
Figure 4.6: Google’s search results page is a typical scrollable vertical list of captions.
Metadata facets, which restrict results to a particular type of information, are also present in
the interface (e.g. ‘Images’ tab).
adopt this strategy attempt to re-create the conditions of their previous search, by
trying to formulate the exact same query. Another strategy requires past search
queries to appear as the user types, along with regular dynamic term sugges-
tion. Separation between suggested queries and previously generated ones varies
among interfaces, as can be seen in Figures 4.4 and 4.5.
4.3 Presentation of Search Results
Search applications usually present results as a vertical list of captions, distributed
along multiple pages (see Fig. 4.6). Each caption is a clickable entity which, as a
minimum requirement, comprises a title and an excerpt of the target document
Clarke et al. (2007). Usually, the excerpt includes some or all of the query terms,
as highlighted text. In most cases, highlighting is performed using bold font or
colored term background. Many search applications tend to group similar results,
that originate from the same source, into the same caption. That way, result
‘pollution’ from a few sources is avoided and diversity is promoted. The relevance
of search results is reflected in their order of appearance. Although relevance
scores were formerly used to grade the fit of the result to the query, they are
usually not present anymore in modern search applications. The reasons behind
their omission might be to avoid reverse-engineering of the ranking algorithms and
to reduce redundancy, since the ranking itself already reflects the importance of
results Hearst (2009).
It has been observed that users tend to click on the uppermost captions
Joachims et al. (2005). In the same study, it was found that the first caption
received more attention than its successors, even if its relevance was actually
lower. Furthermore, the majority of users often remain on the first page of re-
sults. The authors in Jansen et al. (2007) observed that only 30% continued to
look for relevant results in the second page of the results, and only 15% looked
even further. Usually, the patience of a user is a function of his/her experience
in using the system. More experienced users tend to be more patient than users
who are not accustomed to the search procedure. Inexperienced users, on the
other hand, often prefer to refine their query or simply accept that what they
search for cannot be found by the search application Hearst (2009).
Apart from plain lists of results, further organization of captions may be per-
formed, using some form of faceted browsing. Facets attempt to refine search
results, according to their characteristics. As an example, Amazon’s search in-
terface provides facets that correspond to the different departments that might
contain the desired item (see Fig. 4.7).
Figure 4.7: Amazon’s search interface provides facets as a left panel to the results page,
helping the user dynamically refine the initial search.
4.4 Query Reformulation
It is common that desired search results are not discovered with the first try.
Query reformulation is the procedure which attempts to transform the original
query to a format that will match the information retrieval system’s vocabulary.
Studies using query logs have shown that the number of reformulated queries may
reach up to 52% of all queries Jansen et al. (2005). It has been observed that,
if no help for query reformulation is given explicitly by the search application,
users tend to provide simple alterations of the initial query Hertzum and Frøkjær
(1996). This bias towards initial queries is referred to as anchoring, a term coined
by psychologists Tversky and Kahneman (1975).
One of the most common sources of search failure is query mistyping Cucerzan
and Brill (2004). A common approach, which aims to correct typographical errors,
is using a dictionary and finding the most similar term to the erroneous query
Kukich (1992). Among other techniques mentioned in that work are heuristic
rule-based corrections, probabilistic approaches that determine how often specific
sequences of characters are spelt wrong, and neural network models that train
the system to automatically identify errors. The outcome of the reformulation
procedure may be shown explicitly on the interface as a suggested query (e.g.
Google’s ‘Did you mean’), or be implicitly shown in the results. The former
approach is preferred, since it gives users freedom to decide whether their intent
is actually captured in the proposed correction. More recently, distributional
approaches that take advantage of user query logs are preferred, especially by
web-based search engines Li et al. (2006).
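The dictionary-based correction idea can be sketched with the Levenshtein distance of Section 3.2.1. This is an illustrative brute-force version; production systems index the dictionary (e.g. with tries or n-gram filters) rather than scanning it linearly:

```java
import java.util.List;

// Illustrative dictionary-based corrector: pick the dictionary term with the
// smallest Levenshtein (edit) distance to the mistyped query.
public class SpellCorrector {

    // classic dynamic-programming edit distance
    public static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    public static String correct(String query, List<String> dictionary) {
        String best = query;
        int bestDistance = Integer.MAX_VALUE;
        for (String term : dictionary) {
            int distance = levenshtein(query.toLowerCase(), term.toLowerCase());
            if (distance < bestDistance) {
                bestDistance = distance;
                best = term;
            }
        }
        return best;
    }
}
```

The chosen candidate would then be shown explicitly, in the style of Google's 'Did you mean', rather than substituted silently.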
Another dimension of query reformulation is term expansion. Term expansion
refers to the suggestion of queries that relate to the initial one in some way.
Choice of related queries might take the form of thesaurus-based term substitution
Dennis et al. (1998) or attempt to extend the present query, usually by adding
single words (see Fig. 4.8). Query suggestion might also be fetched from sessions
of users who previously searched for the same information. It has also been
proposed that search applications ask the user to provide relevance feedback
Ruthven and Lalmas (2003). Although theoretical studies approve of this feature,
its appearance in commercial applications is rare.
Figure 4.8: PubMed’s results page includes term expansion in two ways. On the right of
the screen, there is a ‘Related searches’ panel that preserves the initial query and adds a new
related term to it. Also, right below the entry form there is a ‘See also’ feature which suggests
complete or partial modifications in the initial query.
Chapter 5
Requirements
This chapter describes the objective of the project and the required functionality
for the application, as stated by the AstraZeneca side.
5.1 Feature Specification
The objective of this project is to deliver a search application that allows re-
searchers to quickly perform queries for terms included in medical ontologies and
gain access to information about the chosen terms in intuitive ways. The appli-
cation should not rely on the searcher’s knowledge about the structure of specific
ontologies. The interface should be enhanced with interactive tools that guide the
user towards the desired term; this includes term auto-completion, input query
error correction, suggestion of similar terms, clever ranking and grouping of search
results. The deliverable should be straightforward to use and easy to distribute
to users, independent of the different operating systems that they might use.
Furthermore, it should include the terminology MedDRA, which is widely used
by researchers within the company.
The previous search application used within AstraZeneca did not manage to
meet the users’ requirements and was abandoned, as users had to refer to external
sources (e.g. Google) to refine their searches, when the application presented un-
Table 5.1: Documented failed queries and suggested reasons for failure.

Query: Hepatotoxicity
Comments: Searcher did not find the term and decided to search online to find a synonym for it and reformulate the query as ‘Liver Disease’.
Suggested reason for failure: Wrong ontology choice by the user. The term is clearly in MedDRA. It is also not an LLT, so the application would find it.

Query: NSCLC
Comments: The acronym refers to ‘Non-Small Cell Lung Carcinoma’, a concept which is listed in NCIT. Search returned no results.
Suggested reason for failure: Although the abbreviation ‘NSCLC’ is documented in NCIT, it is not a preferred name, so it was bypassed by the program.

Query: DIHS
Comments: Searcher expected the concept ‘Drug-induced hypersensitivity syndrome’ in MedDRA. No results were returned.
Suggested reason for failure: DIHS does not appear as an abbreviation in MedDRA, so this behavior was normal. The searcher needed to explicitly specify the preferred name, which is ‘Drug-induced hypersensitivity syndrome’.

Query: DRESS Syndrome
Comments: Refers to the same concept as DIHS. It was not found.
Suggested reason for failure: The term exists as an LLT in MedDRA. The application did not search for LLTs.
wanted results. The users’ lack of knowledge around formal logic and ontological
structure played an important role towards this result. To quote the AstraZeneca
side, “Many of our users do not understand the concept of an ontology and, as
a result, at best, struggle to use such an interface and, at worst, refuse to use
the tool (e.g. they don’t understand the concept of parent/child or if there are
multiple terms which should they choose). What users are more familiar with is a
google-like interface whereby they are able to type in their search terms without
knowledge of an ontology or what that means for them.”
Although no log file containing extensive lists of query failures is available
Table 5.2: Documented failed queries and suggested reasons for failure (cont.).

Query: VEGFR
Comments: Searcher came across multiple returned terms and did not know which one(s) to choose. Therefore, all were chosen.
Suggested reason for failure: The application does not help the user visualize possible relationships among results. Also, NCIT lists VEGFR as a synonym for both ‘Vascular Endothelial Growth Factor Receptor’ and ‘Vascular Endothelial Growth Factor Receptor 1 (VEGFR-1)’, so it is up to the searcher to decide which one is needed.

Query: LHRH
Comments: Most relevant result was ‘Gonadotropin Releasing Hormone’. The searcher did not know that term, and did not understand why the results did not contain the query.
Suggested reason for failure: The preferred term for ‘LHRH’ is ‘Gonadotropin Releasing Hormone’.

Query: NMDA Antagonist
Comments: The searcher wanted to find a list of the different NMDA antagonists. No results were found in NCIT, MedDRA or ICD.
Suggested reason for failure: This is an ontology organization characteristic. For example, in NCIT, antagonists do not all reside under a general term ‘NMDA antagonist’. The NMDA antagonist ‘Ketamine’ is listed in NCIT as a subclass of ‘Anesthetic Substance’, while ‘Aptiganel’ is listed as a subclass of ‘Neuroprotective Agent’.
for AstraZeneca’s search application, examples of failed queries have been given.
The reasons behind query failure are diverse; Tables 5.1 and 5.2 list some of the
most characteristic failed queries, along with given or deduced justifications for the
reason of failure. It is clear that the failure of some queries was due to the content
of the ontologies, and therefore inevitable. Other causes of failure included a
wrong ontology choice by the user, incomplete term coverage by the search
application, and lack of help and guidance from the system (e.g., relevance
feedback or result visualization). These application-level failures should be
targeted and alleviated.
Chapter 6
Design
This chapter addresses the design considerations for each stage of the project. In
particular, three distinct stages can be identified: the first involves gaining access
to ontologies, the second is concerned with semantic similarity calculations, while
the third covers data presentation and interface design.
6.1 Stage I: Access to Medical Ontologies
The first design stage involves gaining access to medical ontologies and terminolo-
gies. It might be argued that ontologies should be exploited in a formal ontology
language representation, such as OWL. This was abandoned for the following
reasons:
1. Not all medical terminologies respect ontological principles, thus they are
not all representable in a formal ontology language.
2. Access to the original format of some structured vocabularies (e.g. Med-
DRA) is neither public, nor free.
3. Currently, using the Java OWL1 Application Programming Interface (API),
large OWL ontologies need to be kept in main memory for the whole
duration of program execution, a fact which would degrade performance
in the case of multiple ontologies.
1http://owlapi.sourceforge.net/
Fortunately, BioPortal2 has already represented hundreds of ontologies and ter-
minologies in a common format, which is publicly accessible through the web Noy
et al. (2009).
As a result of the above observations, it was decided that the best design
choice would be to maintain a local MySQL database with ontology terms. For
demonstration purposes, three different structured vocabularies are used in this
project:
• NCIT
• MedDRA
• ICDv9
They are downloaded from BioPortal and saved locally. Of these, only NCIT
is frequently updated, at approximately monthly intervals. The versions of
NCIT, MedDRA and ICDv9 used here contain 97946, 69389, and 22400 concepts,
respectively.
6.1.1 Database and Table Creation
Initially, a MySQL database named ‘Ontologies’ is created locally. The database
holds a total of seven tables, having the following names:
• CONCEPTS
• DEFINITIONS
• SYNONYMS
• ROOTS
2http://bioportal.bioontology.org/
• PARENTS
• SIMILARITY
• MDR RELATED
Table 6.1: ‘Ontologies’ database table structure

Table          Columns (name : type)
CONCEPTS       code : varchar(20), preferredName : text, ontology : varchar(15)
DEFINITIONS    code : varchar(20), definition : text
SYNONYMS       code : varchar(20), synonym : text
ROOTS          code : varchar(20), ontology : varchar(15)
PARENTS        code : varchar(20), parentCode : varchar(20)
SIMILARITY     termcode1 : varchar(20), termcode2 : varchar(20), rada : double, wu : double, resnik : double, li : double
MDR RELATED    code : varchar(20), relatedCode : varchar(20)
The ‘CONCEPTS’ table will hold basic information about the concepts that
are present in an ontology. More specifically, for each concept, a record which
contains its preferred name, code, and ontology will be inserted to the table. Due
to the fact that multiple definitions and synonyms might exist for a single con-
cept, these will be held in separate tables, ‘DEFINITIONS’ and ‘SYNONYMS’,
respectively. The ‘ROOTS’ table will contain all the top level terms of the on-
tology/terminology. Usually, multiple independent hierarchies exist, therefore
multiple ‘roots’ can be found. For example, MedDRA contains 26 parallel hi-
erarchical structures. These so-called ‘roots’ can be joined under a top-level
universal imaginary node, that guarantees the presence of a single root in the
ontology/terminology. The table ‘PARENTS’ will contain hierarchical informa-
tion about the terms. For each concept, all of its parents will be listed. This
table can be exploited to compute semantic similarity at the next stage. The
‘SIMILARITY’ table will hold semantic similarity scores between pairs of con-
cepts that belong to the same ontology. The similarity metrics used are those
of Rada, Wu-Palmer, Resnik and Li. Finally, the ‘MDR RELATED’ table will
contain MedDRA-specific concepts that do not clearly belong to any hierarchy
themselves, but are considered very close to terms that do. The detailed struc-
ture of the tables is shown in Table 6.1. All tables except ‘SIMILARITY’ will be
populated at this stage.
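The schema of Table 6.1 translates directly into DDL statements. The following sketch lists them as Java strings, with the JDBC execution indicated in a comment; the exact column types mirror the table, while the underscore in MDR_RELATED is an assumption here, since a space in an SQL identifier would need quoting:

```java
// Illustrative DDL for the 'Ontologies' tables of Table 6.1.
public class OntologiesSchema {

    public static final String[] CREATE_STATEMENTS = {
        "CREATE TABLE CONCEPTS (code VARCHAR(20), preferredName TEXT, ontology VARCHAR(15))",
        "CREATE TABLE DEFINITIONS (code VARCHAR(20), definition TEXT)",
        "CREATE TABLE SYNONYMS (code VARCHAR(20), synonym TEXT)",
        "CREATE TABLE ROOTS (code VARCHAR(20), ontology VARCHAR(15))",
        "CREATE TABLE PARENTS (code VARCHAR(20), parentCode VARCHAR(20))",
        "CREATE TABLE SIMILARITY (termcode1 VARCHAR(20), termcode2 VARCHAR(20), "
            + "rada DOUBLE, wu DOUBLE, resnik DOUBLE, li DOUBLE)",
        "CREATE TABLE MDR_RELATED (code VARCHAR(20), relatedCode VARCHAR(20))"
    };

    // Sketch of execution through JDBC (connection URL and credentials are
    // placeholders, not run here):
    //   try (Connection conn = DriverManager.getConnection(url, user, password);
    //        Statement st = conn.createStatement()) {
    //       for (String ddl : CREATE_STATEMENTS) st.executeUpdate(ddl);
    //   }
}
```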
6.1.2 Populating the Database Tables
The procedure for downloading the chosen ontologies and populating the database
tables relies on the BioPortal Representational State Transfer (RESTful) ser-
vices3. These services allow the transfer of medical ontology information, from
BioPortal servers to end user systems, through the Hypertext Transfer Protocol
(HTTP). The response is, by default, in XML format, with limited support for
JavaScript Object Notation (JSON) format. Complete support for JSON output
is scheduled for the next release. Accessing the BioPortal RESTful services is
performed through the usage of intuitive Uniform Resource Identifiers (URIs) of
predefined structure. All that is required for gaining access to the RESTful ser-
vices is a user-specific API key, which is immediately given when a free account is
created on the BioPortal website. Some examples of the types of available term
3http://www.bioontology.org/wiki/index.php/BioPortal_REST_services
services are given in Table 6.2. Quantities in brackets are user-defined. As an
example request, consider the ‘get all terms’ service for NCIT:
http://rest.bioontology.org/bioportal/virtual/ontology/1032/all?pagesize=50&pagenum=1&apikey=c6ae1b27-9f86-4e3c-9dcf-087e1156eabe. The virtual on-
tology id 1032 refers to NCIT. As stated before, the API key is a string identifier
which is received upon free registration to BioPortal. The response includes the
first 50 terms of the NCIT ontology. A (part of the) XML response is shown in
Fig. 6.1. It should be observed that the ‘get all terms’ service does not actually
return all terms from a specific ontology at once; for each request, the user must
provide a ‘terms-per-page’ number, and the particular page that he/she wishes to
view. All pages can be returned if the user continues issuing page requests with
increasing pagenum, provided that the user knows the number of concepts that
the ontology includes.
Table 6.2: Examples of URI formats for BioPortal RESTful services.

Get all terms:
http://rest.bioontology.org/bioportal/virtual/ontology/{ontologyid}/all?pagesize={pagesize}&pagenum={pagenum}&apikey={YourAPIKey}
Returns all terms of an ontology, page by page.

Get concept info:
http://rest.bioontology.org/bioportal/virtual/ontology/{ontologyid}/{conceptid}&apikey={YourAPIKey}
Returns information about a specific term, such as synonyms and definitions.

Get latest ontology version:
http://rest.bioontology.org/bioportal/virtual/ontology/{ontologyid}?apikey={YourAPIKey}
Returns the currently used version id of an ontology.
Figure 6.1: A part of the XML response for the ‘get all terms’ query of Table 6.2.
Access to BioPortal RESTful services can be achieved programmatically in a
simpler and automated manner, using the ontoCAT4 Java API. This API pro-
vides classes and methods tailored to the BioPortal services. It provides a high
level abstraction, that handles queries and XML responses behind the scenes and
returns lists of Java objects that contain the information needed to populate the
database tables. The provided methods are shown in Fig. 6.2.
The ontoCAT API method ‘getAllTerms()’ returns a list of all terms in the
ontology, which is what is needed in this project. Its drawback is that it keeps all
ontology terms in memory, causing a heavy memory burden which may lead to
‘out of memory’ exceptions when further processing is needed. For this reason,
I introduced a new function ‘getAllTermsPageByPage()’, which allows retrieving
and processing terms page by page in a loop. Then, memory can be released
after each iteration. In order to save information to the database tables, the
4http://www.ontocat.org/
Figure 6.2: The provided methods of the ontoCAT API Adamusiak et al. (2011).
‘getAllTermsPageByPage()’ method is called. It is chosen that pagesize=1, so
that only one concept per page is returned. Then, for each concept returned
by ontoCAT, the required information is saved to the appropriate table in the
‘Ontologies’ database. The procedure is shown in Fig. 6.3.
Figure 6.3: Populating the ‘Ontologies’ database is performed with the help of the ontoCAT API.
The Java application, which was developed for this project, requests all concepts of a BioPortal
ontology, page by page, using ontoCAT methods. OntoCAT acts as an inter-
mediary, responsible for accessing the RESTful services of BioPortal. It returns
Java object(s) back to the Java application, after processing the XML response
of BioPortal. Once the Java application receives information about a term, all
that is left is to choose the appropriate table(s) in the ‘Ontologies’ database and,
through the Java Database Connectivity (JDBC) API, insert record(s) of MySQL
format. Once all pages are processed, the Java application finishes execution and
all tables, except SIMILARITY, are populated.
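The page-by-page population loop can be sketched as follows. PageFetcher is a hypothetical stand-in for the ontoCAT call (the real getAllTermsPageByPage() talks to the BioPortal RESTful services), so that the control flow can be shown without network access:

```java
import java.util.List;

// Illustrative skeleton of the page-by-page population loop. The sink list
// stands where the JDBC inserts into the 'Ontologies' tables would go.
public class PageByPageLoader {

    // Hypothetical stand-in for the ontoCAT call: returns one page of term
    // names, or an empty list when pagenum is past the last page.
    public interface PageFetcher {
        List<String> fetchPage(int pagenum, int pagesize);
    }

    public static int loadAll(PageFetcher fetcher, int pagesize, List<String> sink) {
        int pagenum = 1;
        int loaded = 0;
        while (true) {
            List<String> page = fetcher.fetchPage(pagenum, pagesize);
            if (page.isEmpty()) break;   // no more pages
            sink.addAll(page);           // here: insert record(s) via JDBC
            loaded += page.size();
            pagenum++;                   // memory for this page can now be freed
        }
        return loaded;
    }
}
```

Because each page is processed and released before the next request, memory consumption stays bounded regardless of ontology size, which is the motivation given above for getAllTermsPageByPage().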
6.2 Stage II: Computation of Semantic Similarity
This stage deals with the calculation of semantic similarity scores between pairs
of concepts that reside in the same ontology. Semantic similarity scores will be
saved in the SIMILARITY table and will later be used in the search application
for the semantic grouping of search results and the suggestion of highly similar
terms to a term chosen by the user. To populate the SIMILARITY table, the
already populated tables CONCEPTS, PARENTS and ROOTS will be used.
6.2.1 Term Neighborhoods
Computing semantic similarity between all concept pairs in an ontology is a te-
dious task which requires a lot of computational and storage resources. Let us
consider NCIT as an example: there are 97946 concepts, yielding 97946² pairs5,
whose semantic similarity must be calculated. This is not the only burden; seman-
tic similarity calculation of a single pair is, by itself, a time-consuming process.
For example, even for the simple Rada edge-counting measure, all connecting
paths between two concepts must first be computed (i.e. a recursive process)
and, finally, the shortest one chosen. In large ontologies, it is not unusual that
5actually, due to the symmetric property of similarity, there is no need to calculate all 97946²
pairs. Also, self similarities can be avoided, depending on the similarity metric used. Still, the
numbers are huge.
multiple paths of variable length exist between two concepts, so finding the min-
imum path is not as trivial as it may seem.
In the final search application, semantic similarity will be used for suggesting
highly similar terms to the query or grouping highly similar terms. Therefore,
term pairs whose semantic similarity is low will never be needed. For example,
there is no point in storing or even computing the similarity between the NCIT
concept ‘Greece’ and the concept ‘Lung’, since the resulting very low score will
never be used in the search application itself. The term ‘Greece’ will never be
suggested as a highly similar term of ‘Lung’, and vice versa.
For the above reasons, the design choice for this project is to exploit the ge-
ometrical structure of ontologies/terminologies and, for each concept, calculate
semantic similarity only with concepts that are placed within a certain neighbor-
hood from it. Given a concept c, its neighborhood is chosen to contain:
• All concepts that are descendants of c at most two levels down in the hier-
archy.
• All concepts that are siblings of c.
• All concepts that are ancestors of c, at most two levels up in the hierarchy.
This choice greatly reduces the computational burden associated with semantic similarity computation in huge ontologies, without threatening the performance of the search application. Furthermore, valuable MySQL storage is not wasted.
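The neighborhood definition above can be sketched as follows, with hypothetical in-memory parent/child maps standing in for the PARENTS table; this is only an illustration of the rule, not the project's database-backed implementation.

```java
import java.util.*;

public class Neighborhood {
    // Hypothetical in-memory relations standing in for the PARENTS table.
    static Map<String, List<String>> parents = new HashMap<>();
    static Map<String, List<String>> children = new HashMap<>();

    // Concepts reachable within 'levels' steps by following the given relation.
    static Set<String> within(Map<String, List<String>> rel, String c, int levels) {
        Set<String> out = new HashSet<>();
        List<String> frontier = List.of(c);
        for (int i = 0; i < levels; i++) {
            List<String> next = new ArrayList<>();
            for (String x : frontier)
                next.addAll(rel.getOrDefault(x, List.of()));
            out.addAll(next);
            frontier = next;
        }
        return out;
    }

    // Neighborhood of c: descendants and ancestors at most two levels away,
    // plus siblings (children of c's parents, excluding c itself).
    static Set<String> neighborhood(String c) {
        Set<String> n = new HashSet<>();
        n.addAll(within(children, c, 2));
        n.addAll(within(parents, c, 2));
        for (String p : parents.getOrDefault(c, List.of()))
            for (String s : children.getOrDefault(p, List.of()))
                if (!s.equals(c)) n.add(s);
        return n;
    }
}
```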
6.2.2 Semantic Similarity Calculation
In this project, four different semantic similarity metrics have been chosen: Rada, Wu and Palmer, Resnik, and Li. Due to the lack of a specific corpus for Resnik similarity, Seco's formula is used, as presented in Chapter 3 (see 3.28). For the calculation of semantic similarity, I developed a Java application, which contains the following basic methods (method parameters and other utility methods are not shown, for simplicity):
• getAllPathsToRootDB()
• getMinimumPathToRootDB()
• getAllPathsBetweenTwoConceptsDB()
• getMinimumPathBetweenConceptsDB()
• computeLocalSimilarities()
• NormalizedRadaSimilarity()
• WuPalmerSimilarity()
• LiSimilarity()
• ResnikSimilarity()
The method getAllPathsToRootDB() uses the PARENTS table to recursively
build all paths between a concept and any of the roots of an ontology. Recursion
stops every time a concept which belongs to the ROOTS table is encountered.
The method getMinimumPathToRootDB() simply calls getAllPathsToRootDB()
and chooses the minimum path out of the returned ones. The method getAll-
PathsBetweenTwoConceptsDB() first computes each term’s paths to the root
separately, using the getAllPathsToRootDB() method. Then, it compares each
of the first term’s paths to root to each of the second term’s paths to root; if any
two paths have common nodes, it means that a common path (that passes through
their LCS) can be defined between the nodes; if no common nodes are present,
a common path only exists through the single (imaginary) root of the ontology.
The method getMinimumPathBetweenConceptsDB() simply calls getAllPathsBetweenTwoConceptsDB() and selects the shortest path.
The methods NormalizedRadaSimilarity(), WuPalmerSimilarity(), LiSimilar-
ity(), and ResnikSimilarity() call the previously mentioned path building methods
with two concepts as arguments, and produce a numerical value that corresponds
to the particular similarity metric. The method computeLocalSimilarities() is the one that is called from main(). This method is responsible for computing the neighborhood of a term, calling NormalizedRadaSimilarity(), WuPalmerSimilarity(), LiSimilarity(), and ResnikSimilarity() on each pair of concepts, and saving the results to the SIMILARITY table.
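Once path lengths and depths are available, the four metrics reduce to simple formulas. The sketch below uses the standard textbook forms, with the usual Li parameters (alpha = 0.2, beta = 0.6) and one common way of normalizing the Rada distance into a [0, 1] similarity; the exact constants and normalization used in the project may differ.

```java
public class Metrics {
    // One common normalization of Rada distance: similarity decreases
    // linearly with path length, relative to twice the maximum depth.
    static double normalizedRada(int pathLength, int maxDepth) {
        return 1.0 - (double) pathLength / (2.0 * maxDepth);
    }

    // Wu-Palmer: twice the depth of the least common subsumer over the
    // sum of the two concept depths.
    static double wuPalmer(int depthLcs, int depthA, int depthB) {
        return 2.0 * depthLcs / (depthA + depthB);
    }

    // Li et al.: combines shortest path length l and subsumer depth h;
    // tanh(b*h) equals (e^{bh} - e^{-bh}) / (e^{bh} + e^{-bh}).
    static double li(int l, int h) {
        double a = 0.2, b = 0.6;
        return Math.exp(-a * l) * Math.tanh(b * h);
    }

    // Seco's intrinsic information content, from the number of hyponyms
    // (descendants) of a concept and the total ontology size; Resnik
    // similarity is then the IC of the least common subsumer.
    static double secoIC(int hyponyms, int totalConcepts) {
        return 1.0 - Math.log(hyponyms + 1) / Math.log(totalConcepts);
    }
}
```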
6.3 Stage III: Interface Design and Data Presentation
At the end of stage II, the ‘Ontologies’ database is complete and does not need
further changes. The third stage deals with querying the available data and
presenting it to the end user. It has been chosen to utilize web technologies
for developing the search application. Building the search application in a web
environment presents, among others, the following advantages:
• The files reside on a central server, and not on each of the clients’ machines
individually. Updates may be done transparently.
• Access to the application by client systems is independent of their operating
system.
• The application can benefit from the browsers’ built-in functionality (e.g.
no need to provide separate back-forward buttons).
• The application can benefit from the huge variety of interactivity tools that
have been designed for webpages.
The information to be presented is fetched from the populated MySQL tables using the server-side scripting language PHP (PHP: Hypertext Preprocessor).
Presentation and styling are achieved using the Extensible HyperText Markup
Language (XHTML) and Cascading Style Sheets (CSS), respectively. Auto-
completion is performed using Asynchronous JavaScript and XML (AJAX) which
returns data in JSON format, to be fed to the Twitter Typeahead jQuery plu-
gin7. To further favor interactivity, various jQuery plugins are selected, includ-
ing Tipsy8 and Throttle/Debounce9. Finally, for visualization purposes, the D3
framework10 is used. The major advantage of all the above technology choices
is that they are widely used, cross-platform and open-source, meaning that they
are actively maintained, highly portable and modifiable. More details about their
usage will be presented in chapter 7.
6.4 Summary of Technology Choices
A summary of the technology choices for the project is shown in Table 6.3. The
table is divided into sections that refer to the three stages described previously.
The technologies, languages, frameworks and APIs used at each particular stage
are mentioned.
[7] https://github.com/twitter/typeahead.js
[8] https://github.com/jaz303/tipsy
[9] http://benalman.com/projects/jquery-throttle-debounce-plugin/
[10] http://d3js.org/
Table 6.3: Technology choices for the project.

Stage I (Access to Medical Ontologies/Terminologies): Java; BioPortal RESTful Web API; ontoCAT Java API; JDBC API; MySQL
Stage II (Computation of Semantic Similarity): Java; JDBC API; MySQL
Stage III (Interface Design and Data Presentation): PHP; MySQL; AJAX; XHTML; CSS; JavaScript; D3; jQuery Twitter Typeahead; jQuery Tipsy; jQuery Throttle/Debounce; JSON
Chapter 7
Implementation
This chapter provides a thorough description of the features that are present in
the final search application. It introduces the visual interface, which is respon-
sible for interaction with the end user. Furthermore, it familiarizes the reader
with the functionality of the individual components that are responsible for the
presentation, styling and interactive behavior of the application.
7.1 Structure
The organization of the files used for building the web application is listed in
Fig. 7.1. The functionality of each file is briefly described in Tables 7.1, 7.2, 7.3
and 7.4.
7.2 Search Entry Form
As mentioned in section 4.2, queries are usually less than or equal to 4 words.
That result reflects query specification in web-based search engines, where users can search for any topic they wish. In the more granular biomedical domain, users usually attempt more targeted searches. Furthermore, the application
to be deployed in this project is aimed at term searching, instead of document
searching. Thus, users are aware that they are searching for short-length terms
instead of multi-page documents, and it is likely that queries are even shorter
than the average 2.8 words. Indeed, the example queries given by AstraZeneca
are comprised of at most two words. Also, due to the auto-completion feature,
lengthy terms will not need to be typed, but simply chosen from a dynamic list.
Despite the fact that short queries are expected, a wide entry form is chosen, to resemble a ‘Google-like’ experience and provide better visibility for the auto-completion features.
Figure 7.1: The organization of the files that comprise the web application. These files are
responsible for the presentation, styling and interactive behavior of the web application.
Table 7.1: PHP files used in the search application.
File Description
mysqli_connect.php Script which establishes a connection to the MySQL
‘Ontologies’ database. This script should not be pub-
licly accessible, for security reasons.
index.php The main page. It also handles enter-key or mouse-
click searches, by querying the ‘Ontologies’ database
and presenting the search results table.
performQuery.php Script which queries the ‘Ontologies’ database and
echoes a JSON array of the results.
terminfo.php Presents information about a specific term, including
its code, definitions, and synonyms. A visualization
of highly similar terms is shown, using d3.v3.min.js
and jquery.tipsy.js. Also, an XML version of the vi-
sualization is shown.
Combinatorics.php Performs permutations of a set of items (e.g. words
of the query).
JaccardSimilarity.php Computes the Jaccard lexical similarity between two
strings.
Table 7.2: XHTML files used in the search application.
File Description
header.xhtml Contains the shared header information among all
web pages. This includes the search box.
footer.xhtml Contains the shared footer information among all
web pages.
The search box can be seen in Fig. 7.2, inside the main window of the search
application (index.php). The search box is placed at the top-central part of the
interface. It is visible on every page that a user visits, so that new queries can be
performed anytime the user wishes. The box is characterized by rounded corner
Table 7.3: CSS files used in the search application.
File Description
contentStyle.css Defines styles for the web application interface.
tipsy.css Defines styles for building interactive tooltips.
type.css Defines styles for the auto-completion function.
Table 7.4: JavaScript files used in the search application.
File Description
d3.v3.min.js A JavaScript library that allows binding arbitrary
objects to the DOM. It facilitates the development
of visualization tools.
hogan-2.0.0.js A JavaScript library that allows the sharing of tem-
plates between client and server.
jquery-1.10.1.js A JavaScript library which facilitates DOM manipu-
lation, event handling, animation and AJAX.
jquery.ba-throttle-debounce.js A plug-in for throttle and debounce. Throttle limits
the rate of execution of handlers. Debouncing en-
sures that a function is executed only once within a
certain time period.
jquery.tipsy.js A jQuery plugin for creating Facebook-like tooltips.
typeahead.js A jQuery plug-in for auto-completion, developed by
Twitter. It may receive an array of JSON objects to
build the auto-completion pop-up menu.
performAsynchronousQuery.js A script which calls performQuery.php and feeds the
returned JSON object array to typeahead.js.
edges, a CSS3 feature. Also, a helpful message is set as a placeholder when the
search box is out of focus. This message informs the user of the type of query that
should be input. Once the user clicks inside the box, the grey message disappears
and a blinking cursor appears (see Fig. 7.3). If the user clicks anywhere else
within the page, the message reappears.
Figure 7.2: The main window of the search application. The search box is placed at the
top of the screen, with central horizontal alignment. A submit button labeled ‘Search’ is also
provided, to assist users that prefer mouse-clicking.
Figure 7.3: Once the user clicks inside the search box, the grey help message disappears and
a blinking cursor takes its place.
7.3 Handling the Input Query
The user may input a multi-word query in the provided search box. Handling the input query depends on the speed at which the user is typing, and on the keys or buttons that are pressed or clicked. To trigger the search, the user can choose among pressing the Return key, selecting a term from the pop-up auto-completion menu, or clicking the button labeled ‘Search’, which is placed on the right side of the search input form.
7.3.1 Typing Speed
If a user presses keys at a fast pace, there is no need to burden the server with consecutive requests, since only the last response will be examined by the user. To achieve such functionality, a debounce function is used (defined in jquery.ba-throttle-debounce.js), which ensures that only the last event within a certain millisecond time period is taken into account. Thus, unintended requests are avoided and the application's performance is maintained at high levels.
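A debounce function of the kind provided by jquery.ba-throttle-debounce.js can be illustrated in Java with a scheduled executor; this is only an analogous sketch, not the plug-in's actual implementation.

```java
import java.util.concurrent.*;

public class Debouncer {
    private final ScheduledExecutorService exec =
        Executors.newSingleThreadScheduledExecutor();
    private ScheduledFuture<?> pending;
    private final long delayMs;

    Debouncer(long delayMs) { this.delayMs = delayMs; }

    // Each call cancels the previously scheduled task, so only the last
    // call within the delay window actually runs.
    synchronized void call(Runnable task) {
        if (pending != null) pending.cancel(false);
        pending = exec.schedule(task, delayMs, TimeUnit.MILLISECONDS);
    }

    void shutdown() { exec.shutdown(); }
}
```

A burst of calls closer together than the delay therefore triggers exactly one request, which is the behaviour relied on while the user is typing.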
7.3.2 Querying the Database
Once a query has been approved for processing, it is sanitized, i.e. it is ensured
that its format is appropriate for insertion into a formal MySQL query and that
SQL injections are avoided. The formed MySQL query searches for terms that
contain the input words as prefixes, in the CONCEPTS and SYNONYMS tables
of the ‘Ontologies’ database. For example, an input query ‘can lun’ will return,
among others, the terms ‘lung cancer’ and ‘cancer of lung’, since all input
words are found as prefixes of words included in the terms. On the other hand,
the query ‘carc lun’ will not return the above two terms, since the ‘carc’ term
is not matched. It should be noted that the order of the input query words is not important. Also, mid-word matches are not supported, so a query ‘ance’ will not return the term ‘cancer’.
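In the application this rule is expressed as a MySQL query, but the matching behaviour described above can be mirrored as a standalone predicate, purely for illustration:

```java
public class PrefixMatch {
    // True if every query word is a prefix of at least one word of the
    // term, regardless of word order; mid-word matches are rejected.
    static boolean matches(String query, String term) {
        String[] termWords = term.toLowerCase().split("\\s+");
        for (String q : query.toLowerCase().split("\\s+")) {
            boolean found = false;
            for (String w : termWords)
                if (w.startsWith(q)) { found = true; break; }
            if (!found) return false;
        }
        return true;
    }
}
```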
Finally, it has been chosen that only a single result is returned per concept; a
single concept might have multiple synonyms that match the same query. For
example, the query ‘lung ca’ returns both ‘lung cancer’ and ‘lung carcinoma’,
terms which correspond to the same concept. Presenting both terms in the results
would be redundant, so only the lexically closest term to the query is presented
(i.e., ‘lung cancer’). Thus, a term appearing in the results is not always the
preferred term for a concept, but the term that best matches the given input
query.
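The per-concept de-duplication can be sketched as follows; the score function and the {code, term} pair representation are illustrative stand-ins for the application's actual lexical scoring and database rows.

```java
import java.util.*;
import java.util.function.ToDoubleFunction;

public class Dedup {
    // Keep, for each concept code, only the matched term name with the
    // highest lexical score against the query.
    static Map<String, String> bestPerConcept(List<String[]> hits,
            ToDoubleFunction<String> score) {
        Map<String, String> best = new HashMap<>();
        for (String[] h : hits) {
            String code = h[0], term = h[1];
            String current = best.get(code);
            if (current == null
                    || score.applyAsDouble(term) > score.applyAsDouble(current))
                best.put(code, term);
        }
        return best;
    }
}
```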
7.3.3 Ranking and Grouping of Search Results
Lexical similarity determines the ranking of search results, independent of how
the search is triggered. For each term returned from the database query, the
lexical similarity of its term name is computed against the input query. The final
score is the maximum of a character-based and a word-based lexical similarity. In
this project, Levenshtein and Jaccard similarities are used, implemented as PHP
functions. The similarity takes a value in [0, 1] and is converted to a percentage
for visual purposes.
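A plain-Java sketch of this scoring follows (the application implements it as PHP functions; tokenization and case handling here are assumptions):

```java
import java.util.*;

public class LexicalScore {
    // Character-based: Levenshtein similarity, i.e. one minus the edit
    // distance divided by the length of the longer string.
    static double levenshteinSim(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                d[i][j] = Math.min(
                    Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                    d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 1.0
            : 1.0 - (double) d[a.length()][b.length()] / longest;
    }

    // Word-based: Jaccard similarity of the two word sets.
    static double jaccardSim(String a, String b) {
        Set<String> sa = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\s+")));
        Set<String> sb = new HashSet<>(Arrays.asList(b.toLowerCase().split("\\s+")));
        Set<String> inter = new HashSet<>(sa);
        inter.retainAll(sb);
        Set<String> union = new HashSet<>(sa);
        union.addAll(sb);
        return union.isEmpty() ? 1.0 : (double) inter.size() / union.size();
    }

    // Final lexical score: the maximum of the two similarities.
    static double score(String query, String term) {
        return Math.max(levenshteinSim(query, term), jaccardSim(query, term));
    }
}
```

Taking the maximum lets a word-order change (caught by Jaccard) or a small typo (caught by Levenshtein) still score highly.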
Semantic similarity determines the grouping of search results. For each term in the results, its semantic similarity to all the remaining result terms that reside lower in the table is retrieved. This is achieved through MySQL queries
to the SIMILARITY table. Highly similar terms (i.e., whose semantic similarity
score is larger than a threshold, 0.75 or 75% in this project), are grouped together.
From the semantic group, the term with highest lexical similarity to the query
acts as the main concept in the table row, and similar terms appear indented.
This choice preserves the lexical ranking. As an example, a search for ‘Lung’ is
shown in Fig. 7.4. The terms ‘Right Lung’ and ‘Left Lung’ are highly similar
to ‘Lung’, so they are presented in the same row. The main term which shelters the
rest is ‘Lung’, since it is lexically identical to the input query. Semantic grouping
is performed only in the return-key or mouse-click search cases, and not in the
Figure 7.4: Terms that would appear on their own table row are grouped under a more lexically-matching term to the query, when their semantic similarity to that term is higher than a threshold.
auto-completion menu.
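The grouping step can be sketched as follows; the nested similarity map is an illustrative stand-in for lookups against the SIMILARITY table.

```java
import java.util.*;

public class SemanticGrouping {
    // Group a lexically-ranked result list: each term either joins the
    // first earlier group whose head term it resembles above the threshold
    // (0.75 in this project), or starts a new group of its own. Group
    // heads keep their lexical rank, so the overall ranking is preserved.
    static List<List<String>> group(List<String> ranked,
            Map<String, Map<String, Double>> sim, double threshold) {
        List<List<String>> groups = new ArrayList<>();
        for (String term : ranked) {
            boolean placed = false;
            for (List<String> g : groups) {
                double s = sim.getOrDefault(g.get(0), Map.of())
                             .getOrDefault(term, 0.0);
                if (s >= threshold) { g.add(term); placed = true; break; }
            }
            if (!placed) groups.add(new ArrayList<>(List.of(term)));
        }
        return groups;
    }
}
```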
7.3.4 Return-key or Mouse-click Search
If the user presses the Return key or clicks on the ‘Search’ button, the query is
processed by index.php. The form is submitted using the HTTP GET method, as
can be seen from the URL of Fig. 7.5. The index.php script receives the ‘query’ string through the predefined $_GET variable in PHP. After the MySQL database is
queried, results are presented in an array with clickable entries that redirect
to the specific term information page. Lexical ranking and semantic grouping
are performed. Each array entry contains basic information about the specific
concept, including term name, preferred name for the concept, code identifier in
the ontology, abbreviation of the ontology it belongs to, and lexical similarity
score from comparison to the input query.
7.3.5 Auto-completion Search
If the user presses any key other than ‘Return’, the query is processed by perfor-
mAsynchronousQuery.js to produce auto-completion. Auto-completion requires
that the page is not reloaded. The JavaScript function performAsynchronous-
Query() uses AJAX to send an asynchronous query request to performQuery.php.
The performQuery.php queries the MySQL database and returns an array of the
results as JSON objects (see Fig. 7.6), which, in turn, are fed to typeahead.js
to create the auto-completion pop-up menu, as seen in Fig. 7.7. Each entity in
the auto-completion pop-up menu is dedicated to a single term. It presents four
different types of information about it. On the top-left part, the term name that
best matches the query is shown. This is not always the preferred name for the term. For this reason, the lower-left part of the entity always holds the preferred
term name for the concept. The lower-right hand side hosts the abbreviation
of the ontology/terminology from where the term is extracted. Finally, at the
upper-right hand side, the lexical similarity to the input query is shown. For this
Figure 7.5: Pressing the ‘Return’ key or clicking the ‘Search’ button submits the query to index.php and a table of search results is added to the interface.
Figure 7.6: Part of the JSON response from performQuery.php, for the input query ‘rash’.
Each JSON object represents a term matching the query, and contains information that can be
used for its presentation.
Figure 7.7: Pressing any other key except ‘Return’ submits the query through AJAX to
performQuery.php and an auto-completion pop-up menu is created from the JSON response.
project, the maximum number of entities that the auto-completion pop-up menu
can contain has been set to 8.
7.4 Error Correction
If no term matches are found for the input query, the application tries to guess
the intended query and match it to the closest term in the CONCEPTS and SYNONYMS tables. Returning a ‘No results’ screen was not preferred, as it is not helpful and can cause frustration to the user. The application uses soundex keys to perform elementary error correction for terms that sound similar but are spelt differently due to user error. An example is shown in Fig. 7.8, where the user input is ‘lyng’. Since there are no matches in the database, the application
suggests the term ‘lung’ as a possible correction for the user to choose. The mes-
sage takes the form ‘Did you mean <suggestion> instead of <no result query>?’.
To accept the correction, the user can simply click on the provided link, instead
of having to refine the query in the search box.
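Soundex keys map similar-sounding words to the same short code. A simplified classic Soundex encoder is sketched below; it omits the H/W refinement of the full algorithm, and the application may instead rely on ready-made implementations such as MySQL's SOUNDEX() or PHP's soundex(), which produce keys of the same family.

```java
public class Soundex {
    // Classic four-character Soundex code: keep the first letter, then
    // encode the remaining consonants as digit classes, dropping vowels
    // and collapsing adjacent duplicates.
    static String encode(String s) {
        String t = s.toUpperCase().replaceAll("[^A-Z]", "");
        if (t.isEmpty()) return "";
        // Digit class for each letter A..Z (0 = vowel or H/W/Y, ignored).
        String codes = "01230120022455012623010202";
        StringBuilder out = new StringBuilder();
        out.append(t.charAt(0));
        char prev = codes.charAt(t.charAt(0) - 'A');
        for (int i = 1; i < t.length() && out.length() < 4; i++) {
            char c = codes.charAt(t.charAt(i) - 'A');
            if (c != '0' && c != prev) out.append(c);
            prev = c;
        }
        while (out.length() < 4) out.append('0');
        return out.toString();
    }
}
```

Because ‘lung’ and ‘lyng’ differ only in a vowel, they encode to the same key, which is exactly what makes the ‘Did you mean’ suggestion possible.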
Figure 7.8: Error correction when the input query is ‘lyng’. The closest term is suggested, as a clickable link.
7.5 Term Information Presentation
Once the user selects a term, either from the table of results or from the auto-
completion pop-up menu, the terminfo.php script is called. The script accepts
four different types of information about the term:
1. term name,
2. code,
3. preferred concept name,
4. ontology it belongs to.
This information is passed using the GET method. The terminfo.php script
produces an XHTML page which presents this information (see Figures 7.10-
7.11). Furthermore, using the term code, the SIMILARITY table is queried to look for highly similar terms to the currently viewed term [1].
Using the D3 JavaScript library, the returned terms are mapped to SVG
circles, the size of which differs, depending on their semantic similarity score to
the currently viewed term. These circles are organized in a spiral, whose central
terms are the most similar to the currently viewed term. As we move towards
the edge of the spiral, terms become less and less similar to the viewed term.
Thus, larger circles reside at the center of the spiral, and their size decreases as
we move out to the periphery. Inside each circle, a substring of the term name is
shown. When the user places the mouse cursor over a circle, a tooltip with the
full term name and semantic similarity score to the viewed term is immediately
presented (see Fig. 7.9). When the user clicks on a circle, he/she is redirected to
the particular term’s information page.
[1] In the term information figures presented in this thesis, Wu-Palmer semantic similarity is used. This can easily be changed in the terminfo.php script.

Figure 7.9: When the user places the mouse cursor on a circle, a tooltip immediately appears, containing the full term name and the semantic similarity score with the viewed term.

Circle size is not the only tool used for classifying terms. It is also desirable that the user can distinguish whether a term is:

1. a descendant,
2. a sibling,
3. an ancestor,
4. not in the hierarchy,
when compared to the current term. To distinguish between the above cases,
different colors are used. Red is used for descendants. Green is used for ancestors
or siblings. Blue is used for terms not in the hierarchy. This last case is not valid
for NCIT (see Fig. 7.10) or ICDv9, but can be observed in MedDRA (see Fig.
7.11). When MedDRA is stripped of the leaf level (i.e., LLT terms), it can be
considered a valid hierarchy. At the same time, the removed LLT terms are not
in any hierarchy anymore, despite the fact that very close relations to PTs exist.
There must be a way to denote this type of similarity. In MedDRA, it is denoted
as RQ, meaning related or possibly synonymous terms.
The choice of color has dual usage. Different shades of the same color mean
that:
• due to same color, the terms are all of the same type (e.g. all ancestors of
the viewed term)
• due to different shade, each shade acts as a further grouping, denoting how
semantically close the terms are to the viewed term. For example, ancestor
terms, whose semantic similarity to the viewed term lies between 0.75 and
0.80, will have a lighter shade of green than ancestor terms whose semantic similarity to the viewed term lies between 0.90 and 0.95. This color clustering
is a redundant measure; after all, circle size also clusters terms according
to their semantic similarity score. Sometimes, though, circle sizes are very
close, and the eye might be tempted to consider them as equal, so a different
color shade removes this possibility.
In addition to the D3 visualization, an XML representation of the similar terms
is provided as an alternative. It may also be used in older browsers that do
not support the JavaScript libraries used. Each term entry in XML includes
basic term information, such as name and code, and a list of similar terms, as
shown in Fig. 7.12. Finally, the page is equipped with help tooltips, that provide
information about components that are present on the page (see Fig. 7.13).
Figure 7.10: Presentation page for the NCIT term ‘Recurrent NSCLC’. On the left side, the basic term information is shown, along with an XML representation of highly similar terms. On the right side, a visualization of highly similar terms is provided, using the D3 JavaScript library.
Figure 7.11: Presentation page for the MedDRA term ‘Rash’. The term has very close relations with terms that are not in the hierarchy. This is illustrated using blue color.
Figure 7.12: The XML representation of a term. It includes basic term information and
highly similar terms.
Figure 7.13: Help is provided through tooltips that activate on mouse-over.
7.6 Navigation
The main pages that are presented to the user during a search are only two: in-
dex.php, which acts as the main and results presentation screen, and terminfo.php,
which provides information about a chosen concept. The user can reach a specific
term by performing four different actions:
1. by clicking on a term entry, which appears in the auto-completion pop-up
menu (from either index.php or terminfo.php),
2. by clicking on a term entry, which appears in the results table of index.php,
3. by clicking inside a circle in the term visualization tool in terminfo.php,
4. by clicking on a suggested correction term in index.php.
Navigation is further assisted by exploiting the browser’s built-in functionality.
Navigating through pages can be performed through ‘Back’ and ‘Forward’ but-
tons, or explicitly through the history log of the browser. As far as individual
items are concerned, access to the search box can be achieved through the keyboard, using the ‘Tab’ key. The jQuery plugins used also support commonly
used keyboard shortcuts. As an example, the entries inside the auto-completion
pop-up menu can be selected using the ‘up’ and ‘down’ keys. Pressing the ‘Re-
turn’ key changes the page location to the appropriate term.
Chapter 8
Evaluation
The search application that was developed in this project is evaluated as follows:
• the failed queries of AstraZeneca’s previous search application are tested
again,
• the application is compared to the BioPortal online search service,
• the application’s potential use is commented on by an AstraZeneca search
specialist.
8.1 Testing the Failed Queries
In this section, the failed queries of the previous search application used at As-
traZeneca are re-tested, using the new search application that was developed in
this project. The failed queries and their reasons for failure have been given in
Tables 5.1 and 5.2 of Chapter 5. The results of testing the same queries with
the newly developed application are summarized in Table 8.1. Only two queries
did not produce better results, ‘DIHS’ and ‘NMDA Antagonist’ (see Figures 8.1
and 8.2), but this behavior was already expected from the specification; these two terms do not appear in the supported ontologies. They are listed neither as preferred terms nor as synonyms, so it is normal that they cannot be found.
Of the other terms, ‘Hepatotoxicity’ (see Fig. 8.3), ‘NSCLC’ (see Fig.
8.4) and ‘DRESS Syndrome’ (see Fig. 8.5) appear unambiguously in the auto-
completion pop-up menu, as the user starts typing, so the user can quickly jump
to the desired term page. The query ‘LHRH’ returns two different results, with
preferred names ‘GNRH1 wt Allele’ and ‘Gonadotrophin Releasing Hormone’,
respectively (see Fig. 8.6). The NCIT has listed ‘LHRH’ as synonym for both
concepts, so the user must decide which one is desired. In contrast to the
previous search application, though, the connection between ‘Gonadotrophin Re-
leasing Hormone’ and ‘LHRH’ is clear (i.e., the former is a preferred name for the
latter), so the user does not question the validity of the results.
Finally, the query ‘VEGFR’ shows a great improvement over the previous application's search results (see Fig. 8.7). The term ‘VEGFR’ appears as the best matching entity
in the results list, and contains the similar terms ‘Vascular Endothelial Growth
Factor Receptor 1’, ‘Vascular Endothelial Growth Factor Receptor 2’, ‘Vascular
Endothelial Growth Factor Receptor 3’, which are more specific terms. At this
point, it should be noted that both concepts ‘Vascular Endothelial Growth Factor
Receptor’ and ‘Vascular Endothelial Growth Factor Receptor 1’ contain ‘VEGFR’
as synonym. Since ‘VEGFR’ is the synonym which is closest lexically to the input
query (i.e. 100% match), it is the representative name for both the concepts. This
should not cause confusion, though; in both cases, the representative concept
name is immediately followed by the preferred term name.
Table 8.1: Testing previously failed queries.

DIHS: The term is not found (see Fig. 8.1). This is normal, since this abbreviation is not listed in the synonyms for the MedDRA term ‘Drug-induced hypersensitivity syndrome’.

NMDA Antagonist: No results (see Fig. 8.2), since the term does not appear in the currently supported ontologies. Also, no term is proposed for error correction.

Hepatotoxicity: The term is found (see Fig. 8.3). The user can see that it belongs to MedDRA.

NSCLC: The term is found (see Fig. 8.4). The preferred name is listed too.

DRESS Syndrome: The term is found (see Fig. 8.5). This project's search application supports MedDRA LLT terms.

LHRH: There are two results for ‘LHRH’ (see Fig. 8.6). Unlike in the previous search application, the user can now see that ‘Gonadotropin Releasing Hormone’ is a preferred term for ‘LHRH’.

VEGFR: Semantic similarity has grouped the similar terms (VEGFR-1, VEGFR-2, VEGFR-3) under the term ‘VEGFR’, which is an enhancement over the previous search application (see Fig. 8.7). The fact that ‘VEGFR-1’ contains ‘VEGFR’ as a synonym in NCIT might confuse matters in the listing, but the preferred term ‘Vascular Endothelial Growth Factor Receptor 1’ is also mentioned next to it, immediately clearing any doubts.
Figure 8.1: The term ‘DIHS’ is not found, but this is normal, since it is not part of any of
the supported ontologies. Instead, the term ‘DIOS’ is proposed, in case the user had misspelt the query.
Figure 8.2: The term ‘NMDA Antagonist’ is not found, but this is normal, since it is not
part of any of the supported ontologies. No soundex match is found, so no error corrections are
suggested.
Figure 8.3: The term ‘Hepatotoxicity’ is shown in the auto-completion dialogue.
Figure 8.4: The term ‘NSCLC’ is shown in the auto-completion dialogue.
Figure 8.5: The term ‘DRESS syndrome’ is shown in the auto-completion dialogue.
Figure 8.6: The query ‘LHRH’ produces two different 100%-matching results. Unlike in the
previous search application, the user can now see that ‘Gonadotropin Releasing Hormone’ is a
preferred term for ‘LHRH’.
Figure 8.7: The results for the query ‘VEGFR’ illustrate a semantic grouping of 4 similar terms, namely ‘VEGFR’, ‘Vascular Endothelial Growth Factor Receptor 1’, ‘Vascular Endothelial Growth Factor Receptor 2’, and ‘Vascular Endothelial Growth Factor Receptor 3’. The latter three are grouped under the parent term.
8.2 Comparison to BioPortal Search Services
Among other tools, BioPortal provides an online search form that allows users
to search ontologies and terminologies for terms. Comparison of this project’s
application to BioPortal does not aim to prove one better than the other; clearly,
BioPortal is a complete, multi-feature search application that allows searching of hundreds of ontologies and terminologies simultaneously. The intent of the
comparison is to highlight some of the different design choices that this project
has adopted, which could further improve the usability of search services provided
by BioPortal.
The BioPortal search interface is shown in Fig. 8.8. Similarly to this project’s
search application, the interface simply contains a search box. The interface
also offers advanced options, shown in Fig. 8.9. For comparison purposes, the
advanced option to narrow search to NCIT, MedDRA and ICD9CM is used (see
Fig. 8.10).
8.2.1 Auto-completion
BioPortal search does not offer auto-completion through the main search interface at all; auto-completion widgets exist only for individual ontologies. Therefore, the user is not assisted throughout the procedure, and needs to press the ‘Return’ key to check whether the query returns any results at all. Possibly, the justification for not providing auto-completion could be the large number of hosted ontologies (353, as of August 2013). On the other hand, even when the user chooses a very small subset of ontologies to search, again no auto-completion is provided.
Let us consider the auto-completion widgets for individual ontologies. The
widget for NCIT is chosen and ‘nsc’ is typed. The auto-completion pop-up menu
is shown in Fig. 8.11. This project’s auto-completion results for ‘nsc’ are shown in
Fig. 8.12. It can be observed that many of the terms present in BioPortal’s auto-
completion menu do not even contain ‘nsc’. BioPortal chooses to show only the
Figure 8.8: The BioPortal interface is a simple text box, similar to this project’s main page.
Figure 8.9: BioPortal also offers advanced options to improve the search results.
Figure 8.10: Only NCIT, MedDRA and ICD9CM are chosen for searching, out of the 353
ontologies offered by BioPortal, so that comparisons to this project’s work are achievable.
preferred names for terms. Indeed, consider the example of ‘Becatecarin’,
shown third in BioPortal’s auto-completion menu. This term is a preferred name
whose synonym list includes the term ‘NSC 655649’. Clearly, the search for ‘nsc’
matches ‘NSC 655649’, but instead of returning that term, BioPortal returns its
preferred name, ‘Becatecarin’, annotated as ‘synonym’ to indicate that the match
occurred on a synonym. For an inexperienced user, this is not clear. Unless the
user knows every synonym of a given concept, it might be confusing to see
result terms that do not even contain the search words.
This project’s application alleviates this problem: both the lexically closest
term to the query and its preferred name are shown, so the user is left in no
doubt about the result. This is very helpful in cases where the synonyms are
highly dissimilar. For example, the term with preferred name ‘Denatonium
Benzoate’ can be sought by any of its diverse synonyms: ‘THS-839’, ‘WIN 16568’,
‘Aversion’, ‘Anispray’ and ‘Lidocaine Benzyl Benzoate’ (see Figures 8.13-8.15).
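To make this design choice concrete, the following sketch shows how an index over both preferred names and synonyms can report the matched string together with its preferred name, so the user always sees why a result appeared. The term data is an invented toy example; the thesis does not describe its index structure at this level of detail.

```python
# Hypothetical sketch of synonym-aware auto-completion: every synonym is
# indexed, and each hit reports both the string that matched and its
# preferred name. The term data below is illustrative, not real NCIT data.

TERMS = {
    "Becatecarin": ["NSC 655649", "BMY-27557-14"],
    "NSC-127716": [],
}

def autocomplete(query, terms=TERMS):
    """Return (matched string, preferred name) pairs for a substring query."""
    q = query.lower()
    hits = []
    for preferred, synonyms in terms.items():
        for candidate in [preferred] + synonyms:
            if q in candidate.lower():
                hits.append((candidate, preferred))
    return hits
```

With this scheme, a query for ‘nsc’ surfaces ‘NSC 655649’ alongside ‘Becatecarin’, rather than showing ‘Becatecarin’ alone.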
8.2.2 Results Ranking
The main search application of BioPortal ranks results according to the ontology
they belong to. Let us examine the complete search results for ‘nsclc’,
both in BioPortal’s application (see Fig. 8.16) and in this project’s application
(see Fig. 8.17). BioPortal presents the closest preferred term name and groups
the remaining results from the same ontology under this term. Each entry
stands alone, and no hints are given about possible connections among terms.
Our application, on the other hand, does not group all the results of the same
ontology together. It provides a different kind of results grouping, according to semantic
Figure 8.11: Auto-completion pop-up menu of BioPortal NCIT widget when the user has
typed ‘nsc’. Only preferred terms are shown. The user might be confused when seeing the term
‘Becatecarin’ in the results, since it does not contain ‘nsc’.
Figure 8.12: Auto-completion pop-up menu of this project’s search application when the user
has typed ‘nsc’.
similarity. The user can then see which terms are indeed semantically very close.
The extra semantic grouping does come at the cost of extra computation
on the server side.
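The semantic grouping described above can be sketched as a greedy clustering over pre-computed pairwise similarity scores. The terms, scores and the 0.8 threshold below are illustrative assumptions, not values taken from the actual application.

```python
# Illustrative sketch of grouping search results by pre-computed semantic
# similarity rather than by source ontology. Scores and the 0.8 threshold
# are assumptions; the thesis computes the real similarities offline.

def group_by_similarity(results, sim, threshold=0.8):
    """Greedily cluster results whose pairwise similarity exceeds threshold."""
    groups = []
    for term in results:
        for group in groups:
            if all(sim.get(frozenset((term, other)), 0.0) >= threshold
                   for other in group):
                group.append(term)
                break
        else:
            groups.append([term])  # no sufficiently similar group found
    return groups
```

A greedy pass like this is cheap at query time precisely because the expensive pairwise scores are computed in advance.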
Figure 8.13: Searching for ‘Denatonium Benzoate’ through its preferred term name.
Figure 8.14: Searching for ‘Denatonium Benzoate’ through its synonym ‘THS-839’.
Figure 8.15: Searching for ‘Denatonium Benzoate’ through its synonym ‘WIN 16568’.
8.2.3 Error Correction
Error correction is not supported in BioPortal search: if the user misspells
even a single letter in the query, a ‘No Matches Found’ message appears. In this
project’s search application, soundex-based error correction is used to correct
simple spelling mistakes. The application suggests a term that might match the
intended query; the user can simply click on the term and is immediately
reassured that it exists. Otherwise, the user would remain uncertain, and
Figure 8.16: BioPortal search results rankings for ‘nsclc’. All terms are grouped according to
the ontology they belong to, under the preferred name of the most lexically-relevant term to
the query.
would possibly consult external sources, such as Google, to identify any possible
errors. Figures 8.18-8.21 illustrate how erroneous queries are handled in the two
applications, using the terms ‘nsclca’ and ‘caancer’ as queries. BioPortal’s
application offers no error correction, while our application suggests the
terms ‘nsclc’ and ‘cancer’.
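A minimal sketch of the classic Soundex algorithm illustrates how such suggestions can be produced; the vocabulary below and the exact Soundex variant the application uses are assumptions here.

```python
# Sketch of classic Soundex-based suggestion. The vocabulary is illustrative;
# the thesis's implementation may use a different Soundex variant.

def soundex(word):
    """Classic four-character Soundex code, e.g. soundex('cancer') == 'C526'."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    result, prev = word[0].upper(), codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "hw":  # 'h' and 'w' do not reset the previous code
            prev = code
    return (result + "000")[:4]

def suggest(query, vocabulary):
    """Suggest vocabulary terms whose Soundex code matches the query's."""
    target = soundex(query)
    return [term for term in vocabulary if soundex(term) == target]
```

Note that both document examples fall out naturally: ‘nsclca’ and ‘nsclc’ share the code N242, and ‘caancer’ and ‘cancer’ share C526, so the misspelt queries map onto the intended terms.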
8.2.4 Visualization
BioPortal includes a visualization for each term, which illustrates the term’s
position in the hierarchy (see Fig. 8.22). In our application, the visualization is
simplified and does not expose formal-logic syntax (e.g. subclassOf). Our
Figure 8.17: This project’s search results rankings for ‘nsclc’. Terms in the results are
rearranged into groups that show high semantic similarity.
Figure 8.18: BioPortal returns no search results for the erroneously spelt term ‘nsclca’.
Figure 8.19: BioPortal returns no search results for the erroneously spelt term ‘caancer’.
application attempts to hide the underlying ontology and simplify the data
visualization, so that inexperienced users can search without being overwhelmed by
Figure 8.20: This project’s search application returns a search suggestion of ‘nsclc’ for the
erroneously spelt term ‘nsclca’.
Figure 8.21: This project’s search application returns a search suggestion of ‘cancer’ for the
erroneously spelt term ‘caancer’.
Figure 8.22: BioPortal uses a graph to visualize hierarchical relations. Edges are annotated
with a description of the relationship between the connected nodes (e.g. subclassOf).
Figure 8.23: This project’s application focuses on inexperienced users and attempts to com-
pletely hide any formal-logic relationships that might confuse the user.
formal-logic references that would puzzle them (see Fig. 8.23). Ideally, users
would be allowed to choose between the two visualizations, so that users of all
experience levels benefit.
8.3 Comments from an AstraZeneca Search Specialist
This second part of the evaluation examines the search application’s potential
use in medical knowledge acquisition. A short interview was conducted with a
search specialist in research and development information at AstraZeneca. The
search specialist is a researcher responsible for running literature searches that
ensure patient safety and support other functions (e.g. predicting drug efficacy
and safety at an early stage of drug development).
Figure 8.24: Search results depicting causal associations between smoking and cancer, as
presented by the I2E text mining application.
In particular, the search specialist needs to examine the presence of certain term
relationships and patterns in a corpus of medical research documents, which are
retrieved from databases such as ‘Clinicaltrials.gov’.
Efficient full-text search can be achieved through a text mining application
named I2E, developed by Linguamatics. This tool features querying based on
natural language processing (NLP): it receives an NLP query as input, searches
a predefined collection of documents, and presents the relevant results in a
structured format. As an example, let us assume that the searcher wishes to
search a list of medical documents for associations between smoking and cancer.
The terms ‘smoking’ and ‘cancer’ are entered, along with the base form of the
verb ‘cause’, to denote the association. The results are shown in Fig. 8.24. Each
result row indicates the document in which the specified hit appears and provides
a textual excerpt of its context within the document. The tool also features plain
search for terms within a set of ontologies, as shown in Fig. 8.25. Each result row
contains the term’s preferred name, its code, and the path from the term’s parent
to the root.
To achieve full results coverage, the search specialist needs to ensure that all
possible variations of the input query have been examined. For example, an input
query of the form ‘has adverse event been seen in MEK inhibitors?’ should
consider all possible synonyms of the terms that compose the query. The term
‘MEK inhibitor’ may appear in the literature in various forms, including ‘MKK
Inhibitor’, ‘MAPK/ERK Kinase Inhibitor’, ‘MAP2K Inhibitor’, and ‘MAPKK
Inhibitor’. The term ‘adverse event’ may also be found as ‘AdverseEvent’,
‘Adverse Experience’ or ‘AE’. Similarly, the verb ‘cause’ may be replaced by similar
Figure 8.25: Search results for the term ‘MEK inhibitor’ in NCIT, when the I2E application
is used.
verb base forms such as ‘associate’ or ‘result’. Furthermore, when the number of
results is too large, the search specialist should be able to quickly refine the input
query and target more specific terms.
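The synonym substitutions above amount to taking the Cartesian product of each query slot’s variants. A hypothetical sketch follows: the synonym lists are the examples given in the text, while the flat string output is an invented simplification, since I2E’s real queries are structured rather than plain strings.

```python
# Hypothetical sketch of synonym-based query expansion. The synonym lists
# are the examples from the text; joining variants into flat strings is a
# simplification of I2E's structured query format.
from itertools import product

SYNONYMS = {
    "MEK inhibitor": ["MKK Inhibitor", "MAPK/ERK Kinase Inhibitor",
                      "MAP2K Inhibitor", "MAPKK Inhibitor"],
    "adverse event": ["AdverseEvent", "Adverse Experience", "AE"],
    "cause": ["associate", "result"],
}

def expand(template, synonyms=SYNONYMS):
    """Yield every variant of the template, substituting each slot's synonyms."""
    slots = [[slot] + synonyms.get(slot, []) for slot in template]
    for combo in product(*slots):
        yield " ".join(combo)
```

Even this small example yields 4 × 3 × 5 = 60 query variants, which illustrates why manually enumerating synonyms is impractical for the search specialist.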
The search application developed for this project can assist in finding synonyms
for biomedical terms, and in quickly changing the granularity of searches.
Each term page presents a complete list of synonyms for that term, retrieved
from an up-to-date version of the ontology the term belongs to. Furthermore,
visualizations offer quick browsing of similar terms, both of higher and
lower specificity. For example, by following red circles, the searcher can delve
deeper into the hierarchy and immediately view information about more specific
terms, without needing to search again.
The search specialist’s comments about the application were very positive. It
was commented that the application would be very helpful for refining queries
before feeding them to a tool like I2E. The interface was considered simple and
the search procedure intuitive. The auto-completion feature and the presence of
lexical similarity scores in the rankings greatly simplified the search procedure,
and allowed the search specialist to quickly reach her goal and focus on the
result rather than on the means of reaching it. Visualization of suggested terms
was valued most of all. Through the developed application, the search specialist
could easily browse neighborhoods of similar terms and refine the search
granularity on demand. The use of colors instead of typical expanding menu
hierarchies was also praised for its usability.
Chapter 9
Conclusions and Future Work
Ontologies are expected to play a major role in the discovery of new knowledge
in the biomedical sector. Providing user-friendly tools that help researchers
navigate ontologies efficiently, without requiring them to fully comprehend
ontological principles, makes it more likely that they will reach their final goals
quickly, without confusion or frustration.
9.1 Conclusions
In this thesis, proposals have been made for enhancing the user experience of
ontological search, through the design of a search application that features
enhanced search tools such as auto-completion, semantic grouping of results,
query reformulation and similar-concept suggestion. The outcome is a web-based
application that allows searching and browsing of ontologies of heterogeneous
structure and format. The application uses modern web technologies to provide
a user-friendly environment.
Focus has been placed on promoting usability and a positive user experience, by
designing the search service from a user-centric perspective, such that even
inexperienced users can quickly become acquainted with it. The search
application relies heavily on pre-calculated semantic similarity scores; semantic
similarity allows the relationships between terms to be expressed as decimal
numbers in the range [0, 1]. Mapping term relations to real numbers enables the
innovative visualizations and results clustering used in this application.
The chosen design for the search application manages to improve certain aspects
that even enterprise-strength ontological search applications, such as BioPortal,
have not considered yet.
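As an illustration of such a [0, 1] mapping, the Wu-Palmer measure scores two terms in an is-a hierarchy by the depth of their lowest common subsumer. The toy hierarchy below is invented, and the thesis’s pre-computed measure is not necessarily Wu-Palmer; this sketch only shows how hierarchy positions become bounded real numbers.

```python
# Illustrative only: Wu-Palmer similarity maps a pair of terms in an is-a
# hierarchy to a score in [0, 1]. The toy hierarchy is invented; the
# application's pre-computed measure may be defined differently.

def ancestors(node, parent):
    """Return the chain from node up to the root, inclusive."""
    chain = [node]
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain

def wu_palmer(a, b, parent):
    """2 * depth(lcs) / (depth(a) + depth(b)), depths counted from the root."""
    chain_a = ancestors(a, parent)
    others = set(ancestors(b, parent))
    lcs = next(n for n in chain_a if n in others)  # lowest common subsumer
    depth = lambda n: len(ancestors(n, parent)) - 1
    return 2 * depth(lcs) / (depth(a) + depth(b))

PARENT = {  # hypothetical is-a hierarchy, rooted at "Neoplasm"
    "Carcinoma": "Neoplasm",
    "Lung Carcinoma": "Carcinoma",
    "NSCLC": "Lung Carcinoma",
    "SCLC": "Lung Carcinoma",
}
```

Identical terms score 1.0, siblings score less, and increasingly distant terms approach 0, which is exactly the property the clustering and visualizations exploit.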
9.2 Future Work
The application can be further improved in the following ways:
• it may be connected to other medical applications. For example, it may
assist in directly feeding lists of terms to text mining applications.
• it may be enhanced to accept ontologies of OWL and Open Biomedical
Ontologies (OBO) formats. Currently, BioPortal versions of ontologies are
used to populate the local database, so the application relies on BioPortal.
• more features may be added to the interface, including advanced options
for searches, such as searching by code or searching only specific ontologies.
• the update of ontology versions and the calculation of semantic similarities
could be automated by checking BioPortal at fixed time intervals.
• it may be made compatible with older versions of web browsers. Since it
relies heavily on JavaScript and recent libraries, alternative methods for
presenting the visualizations might be needed. Currently, it has been tested
successfully in the latest versions of all major browsers.