Project Background Report COMP60990
Student Name: Thamer Omer Ba-Dhfari
Supervisor: Prof. Andy Brass
Date: 9 May 2011
Hypothesis Formulation in Ontology Space: Applications to large and complex data from primary care
Table of Contents
List of Tables
Abstract
1. Introduction
1.1. Need of the Study
1.2. Report Overview
2. Background Research
2.1. Medical Classification and Coding Systems
2.1.1. Read Codes
2.2. Pharmacogenomics and Drug Response
2.3. Semantic Similarities
2.3.1. Survey of Semantic Similarities Measures
2.3.2. Implementation Tools for Semantic Similarity Measures
2.4. Multidimensional scaling
2.4.1. Principal Component Analysis
3. Methodology
3.1. Problem Statement
3.2. Aims and Objectives
3.3. System Analysis
3.3.1. Literature Survey
3.3.2. Strategy of Literature Search
3.3.3. Sources of Data and Data Collection
3.3.4. Project Plan
3.4. Proposed Work Strategy
3.4.1. Project Data Set
3.4.2. Semantic similarity
3.4.3. Principal Components
References
Appendix A – Project Gantt chart
List of Tables
2.1.1 A hierarchy in the Read Codes
Abstract
IT has contributed to many areas of modern health care systems. One major
breakthrough has been the advent of electronic patient records (EPR). These
records contain valuable information regarding patients’ history of disease,
medications and laboratory data, and they facilitate the storage and
retrieval of such information. EPR have now been adopted in many healthcare
systems around the world, and the UK was one of the first adopters of this
technology: almost all primary care data is captured and stored in
electronic records.
However, these records need further investigation and analysis to answer
problems relating to pharmaceutical development. One of these problems
concerns the ability to predict how individuals respond to certain drugs. This
could be done by in-depth study of the records of individual patients, but due
to the high complexity and large volume of these records, new approaches
are needed. Approaches such as semantic similarity and principal component
analysis can be used to overcome these problems.
In this study, we propose a method for addressing such problems. The project
data set is provided by project partners and contains a large number of
patient records, around 250,000. These records are encoded using Read codes.
The first stage of the proposed method is to apply semantic similarity
measures to map the records into a vector space. The next stage is to apply
principal component analysis to this vector space to map the data into a
simple metric space, which allows us to visualise the data and to apply
different data mining approaches.
1. Introduction
Patient medical history was recorded in paper-based medical records until these were
recently transformed into computerised records, known as electronic patient records
(EPR). EPR are used to keep track of a patient’s history of diseases and medications,
which can be an effective help in diagnosing diseases. Moreover, the medical
information stored in EPR is complex and rich; on its website, the Healthcare
Information and Management Systems Society (HIMSS) [1] defines EPR as follows:
“The Electronic Health Record (EHR) is a longitudinal electronic record of
patient health information generated by one or more encounters in any
care delivery setting. Included in this information are patient
demographics, progress notes, problems, medications, vital signs, past
medical history, immunizations, laboratory data and radiology reports.”1
Rector et al. [2] state that EPR contain more than factual information about the
patient; they also include clinicians’ observations at consultations. Several
coding systems have been proposed as standards for capturing, encoding and
exchanging information between healthcare systems [3]. Read codes are one of these
proposals and are used in the UK by almost all primary care practitioners.
General practitioners (GPs) use Read codes to record patients’ health conditions
during consultations, as well as administrative information used by the National
Health Service (NHS) [4, 5].
1.1. Need of the Study
It is clear that medical records are valuable resources both for the NHS, for planning
future health services, and for researchers working on new medical developments. An
adequate understanding of the information included in these records has the
potential to help in discovering and solving many medical problems. In areas such as
drug development, it is essential to know whether a patient has any allergies or
a genetic variant that could cause an adverse drug reaction [6]. Further
investigation and analysis should therefore be conducted on such medical records.
Due to the structure and high dimensionality of Read-coded data, it is challenging to
interpret and visualise. To achieve this, certain techniques need to be
applied to medical records. Among the most promising techniques used to date
for interpreting medical resources are data mining notions such as semantic
similarity measures and principal component analysis. Both techniques are used
to map Read codes from their original structure to a simple metric space,
allowing us to conduct further research.
1 N.B. electronic patient records (EPR) are also known as electronic health records (EHR)
1.2. Report Overview
The remainder of this report is divided into two chapters, organised as follows:
Chapter 2 gives a brief overview of the concepts and topics related to the project
domain and describes how they may contribute to solving the study problem.
Different computational procedures are also explained and discussed.
Chapter 3 states the study problem alongside its aims and objectives. It also
discusses the approach that will be used to achieve the study aims, together with
the implementation methods and tools.
2. Background Research
2.1. Medical Classification and Coding Systems
The main aim of medical records is to keep track of a patient’s health history and to
store a complete reference of the medications prescribed. Paper-based
records were later transformed into electronic records, known as electronic
medical records (EMR) [7]. EMR make storing and retrieving medical records
easier. On the other hand, the medical terms and concepts used in
these records and, more generally, in the clinical domain expand continuously. It
has therefore become essential to classify medical terminology so that a specific
term can be found and retrieved easily. Medical terms need to be placed into
categories or classes, which provide a structured grouping of terms and concepts
organised on the basis of some common attribute, quality or property. A
classification of clinical terms facilitates communication between different
healthcare departments. As a result, many applications in medicine and medical
information, such as statistical analysis of diseases and clinical decision support
systems, have successfully adopted different forms of medical classification
systems [3].
Communication between healthcare departments becomes more efficient when
a medical classification system is used. However, there is still a chance that a
term could be understood differently from its intended meaning.
Therefore, coding systems have been introduced. A medical coding system
labels each medical term with a unique code, which allows health professionals
to recognise and identify the term without confusion [8]. Medical codes can be
numeric or alphanumeric, as we discuss later; these codes cover diagnostic,
procedural and pharmaceutical terms [3]. There are several medical coding
systems, including ICD2, Read codes and SNOMED CT3. The World Health
Organisation (WHO) publishes the ICD system, which encodes diagnostic
terminology for general epidemiological and clinical use and for health
administration [9], whereas Read codes and SNOMED CT are intended to record the
full detail of patients’ medical records [10].
In the following section, we discuss Read codes further.
2 ICD: International Classification of Diseases.
2.1.1. Read Codes
Read codes are comprehensive and arranged in a hierarchical structure. In the UK,
Read codes are used by almost all general practitioners (GPs), since they are
recommended by the Joint Computing Group of the British Medical Association
(CGBMA), the Royal College of General Practitioners (RCGP) and the Primary Health
Care Specialist Group [11]. GPs use this coding system through their computerised
systems, which encode multiple patient details including demographics, lifestyle,
symptoms, signs, past history of diseases, family history, diagnostic and
therapeutic procedures, medications and a variety of administrative items [12].
This system enables GPs to make effective use of computer systems to communicate
with other IT systems, such as those in hospitals. It also provides clinicians with
ready access to patient records for reporting, audit, research and clinical
decision support [13].
The Read code system has evolved through three major versions. The first version
was developed in the early 1980s by Dr James Read, a general medical practitioner.
This version used alphanumeric codes with four characters and included 57,128
terms and 40,927 concepts [14]. In 1990, a second version was introduced with the
same technical properties as the first version, except that the code structure was
extended to five characters; this version is also known as the 5-byte Read code.
The extension allowed the system to capture a larger number of concepts and to
cover more healthcare areas, such as secondary care. Furthermore, the second
version made the code characters case sensitive (A to Z, a to z and 0 to 9). This
led to an expansion of the number of codes stored to a total of 125,914 terms and
88,995 concepts. The third version attempted to address some of the technical
issues in the earlier versions, such as the hierarchical relationship between codes.
However, in spite of these improvements, GPs still use the second version of Read
codes [11].
The hierarchical structure of the Read code system reflects the number of levels of
detail. For instance, the 5-byte Read code provides 5 levels of detail, such that
a code becomes more detailed the further it lies from the root. Table 2.1.1
shows a sample of the 5-byte Read code hierarchy, in which 5
alphanumeric characters are used to represent a specific type of asthma.
3 SNOMED CT: Systematized Nomenclature of Medicine Clinical Terms.
Table 2.1.1 - A hierarchy in the Read Codes [15]
Hierarchy Level   Read Code   Term
1                 H....       Respiratory system disease
2                 H3...       Chronic obstructive pulmonary disease
3                 H33..       Asthma
4                 H331.       Intrinsic asthma
5                 H3311       Intrinsic asthma with status asthmaticus
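The prefix structure of the hierarchy in Table 2.1.1 can be illustrated with a short
Python sketch. It assumes that a 5-byte Read code is padded on the right with dots
and that the parent of a code is obtained by replacing its last significant
character with a dot. The function names are illustrative only and are not part of
any Read code software.

```python
from typing import List


def read_code_level(code: str) -> int:
    """Return the hierarchy level (1-5) of a 5-byte Read code such as 'H33..'."""
    if len(code) != 5:
        raise ValueError("a 5-byte Read code has exactly five characters")
    # The level is the number of significant (non-padding) characters.
    return len(code.rstrip("."))


def read_code_ancestors(code: str) -> List[str]:
    """List the ancestors of a code, from its parent up to the level-1 chapter."""
    level = read_code_level(code)
    # Drop one significant character at a time and pad back to five with dots.
    return [code[:k].ljust(5, ".") for k in range(level - 1, 0, -1)]


# Ancestors of the most specific code in Table 2.1.1.
print(read_code_ancestors("H3311"))
```

For the code H3311 this yields the four less specific codes listed in the table,
ordered from the parent (H331.) to the chapter (H....).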
2.2. Pharmacogenomics and Drug Response
Pharmacogenomics can be defined as a branch of pharmacology. The term
“pharmacogenomics” combines two sciences: pharmacology, which is the study of
drug action on living organisms [16], and genomics, which is a discipline in genetics
concerned with the study of the entire DNA sequence of organisms and their
genetic variations [17]. Pharmacogenomics combines these two sciences and aims
to study the different drug responses that might occur in particular individuals or
ethnic groups depending on their genetic makeup [18]. Furthermore, the evolution of
pharmacogenomics is contributing to areas such as drug response, drug targeting,
drug metabolism and drug development [19]. In addition, techniques such as
genotyping are used alongside pharmacogenomics. Genotyping of
single-nucleotide polymorphisms (SNPs) and copy number variations (CNVs) is being
used to understand common genetic factors responsible for different responses to
drugs. However, non-genetic factors, such as environmental factors, might also
affect drug response, and this needs further study [20].
Pharmacogenomics has been applied to some serious diseases such as
cancer, HIV, diabetes and cardiovascular diseases (CVDs). For example, in CVDs,
pharmacogenomics is concerned with the study of drug metabolism. Specific
enzymes, such as the cytochrome P450 (CYP) enzymes, are being investigated. In
general, CYP enzymes are responsible for metabolising different classes of drugs,
including cardiovascular drugs [21]. A better understanding of variations in these
enzymes might provide a clearer picture of the variability in drug responses and
might lead to new advances in drug discovery.
Despite the promise that pharmacogenomics has shown in both pharmacology
and genetics, there are still areas in some serious diseases yet to be
explored. Moreover, questions have been raised regarding whether or not
pharmacogenomics is a suitable basis for prescribing medication according to
specific variants in a patient’s genes instead of the ordinary way of prescribing.
Other issues concern the cost-effectiveness of pharmacogenomics. For example,
when prescribing a drug such as warfarin, which can cause a serious adverse
reaction in a patient, genetic tests need to be done in order to decide the
recommended dosage; these tests might be expensive and time-consuming. Others
argue that these tests might reduce the hospitalisation of patients and that
pharmacogenomics would therefore be cost-effective for both the patient and
the healthcare system [22].
2.3. Semantic Similarities
Semantic similarity quantifies how similar concepts or terms are
to each other in terms of their meaning or content. It is commonly used for
ontology learning and information retrieval [23, 24]. Currently, one of the most
important applications of semantic similarity is in the Gene Ontology (GO)4, where
it is used to compare genes and proteins based on the similarity of their
functions [25-28]. Another application of semantic similarity is in WordNet5,
which is a lexical database for the English language [29]. WordNet organises words
into sets of synonyms called synsets and gives the semantic relations between
those sets [30, 31].
Various measures have been proposed in the literature to determine semantic
similarity either between two concepts or between two sets of concepts. Measures
used to compare two concepts are classified into two main categories, edge-based
measures and node-based measures; there are also hybrid measures. In the
following section we present these different measures along with a brief discussion
of their properties.
2.3.1. Survey of Semantic Similarities Measures
Measures between two concepts
o Edge based measures
In an edge-based (or distance-based) approach, the links between terms and their
types are considered to be the data source. The similarity is calculated from the
depth of the terms in the hierarchy and the paths between them: measures that
follow this approach select the shortest of all possible paths between the
terms. Since these measures depend on the paths and the structure of the
taxonomy, it becomes difficult to apply them in some fields, such as
lexical databases.
Researchers have proposed several measures, including Wu and
Palmer’s similarity measure, Leacock and Chodorow’s similarity
measure, Pozo’s measure and the IntelliGO measure.
4 http://www.geneontology.org/
5 http://wordnet.princeton.edu/
Wu & Palmer similarity measure (WP Measure)
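This measure is based on the number of ‘is-a’ relations between the
concepts and their least common subsumer (LCS), that is, their deepest
common ancestor [32]. In this measure, twice the depth of the LCS is
divided by the sum of the depths of both terms, so the results of this
similarity measure range between 0 and 1. It is measured as
follows:
\[ sim_{WP}(c_1, c_2) = \frac{2 \cdot depth(c_{LCS})}{depth(c_1) + depth(c_2)} \]
Leacock and Chodorow similarity measure (LC measure)
This measure is based on the length of the
shortest path between the concepts and the maximum depth of the
taxonomy [33]. It is measured as follows:
\[ sim_{LC}(c_1, c_2) = -\log \frac{len(c_1, c_2)}{2 \cdot D} \]
where len(c1, c2) is the length of the shortest path between c1 and c2 and
D is the maximum depth of the taxonomy.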
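As an illustration of the two edge-based measures, the following Python sketch
applies them to 5-byte Read codes. It rests on several assumptions that are not
taken from the cited papers: the depth of a code is its number of significant
characters, the least common subsumer is the longest shared prefix (with a virtual
root of depth 0 above the chapters), the maximum depth D is 5, and the path length
is counted in nodes (edges plus one) so that identical codes do not produce log(0).
The formulas used are sim_WP = 2 * depth(LCS) / (depth(c1) + depth(c2)) and
sim_LC = -log(len(c1, c2) / (2 * D)).

```python
import math

MAX_DEPTH = 5  # assumption: a 5-byte Read code has at most five levels


def depth(code: str) -> int:
    """Depth of a code = number of significant (non-dot) characters."""
    return len(code.rstrip("."))


def lcs_depth(c1: str, c2: str) -> int:
    """Depth of the least common subsumer = length of the shared prefix."""
    shared = 0
    for x, y in zip(c1.rstrip("."), c2.rstrip(".")):
        if x != y:
            break
        shared += 1
    return shared


def wu_palmer(c1: str, c2: str) -> float:
    """Wu and Palmer similarity in [0, 1]."""
    return 2 * lcs_depth(c1, c2) / (depth(c1) + depth(c2))


def leacock_chodorow(c1: str, c2: str) -> float:
    """Leacock and Chodorow similarity with node-counted path length."""
    edges = depth(c1) + depth(c2) - 2 * lcs_depth(c1, c2)
    return -math.log((edges + 1) / (2 * MAX_DEPTH))


print(wu_palmer("H331.", "H33.."))
print(leacock_chodorow("H331.", "H33.."))
```

Codes from different chapters share no prefix, so under these assumptions their
Wu and Palmer similarity is 0.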
o Node based measures
This approach ignores the position of the node and regards the
contents of the node, along with its properties, as the data source. The
similarity of two terms is computed as a combination of the common and
distinctive features of both entities. This approach was used by
Resnik, Lin, and Jiang and Conrath in their measures. These measures
are information content (IC) based, and the calculation of the similarity
depends on the frequencies of the two terms involved and that of their
most informative common ancestor (MICA). In contrast, GraSM (Graph-based
Similarity Measure) takes a different route: it considers the average IC of
all disjunctive common ancestors. The following paragraphs briefly explain
these methods.
Resnik’s Measure
Resnik’s measure [34] relies on information content (IC). It
calculates the similarity between two terms as the IC of
their most informative common ancestor (MICA). The equation
of this metric is as follows:
\[ sim_{Res}(c_1, c_2) = IC(c_{MICA}) \]
where
\[ IC(c) = -\log P(c) \]
and P(c) is the probability of occurrence of term c, estimated from its
frequency:
\[ P(c) = \frac{freq(c)}{N} \]
Here, freq(c) counts each occurrence of term c together with the occurrences
of its descendant terms, and N is the total number of terms in the data set
being compared.
The minimum value of this measure is 0, whilst there is no maximum value.
However, all pairs of terms that share the same MICA receive the same
similarity, however far each term lies from that ancestor, which can be
considered a deficiency of this approach [35].
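The calculation of information content and of Resnik's similarity can be sketched
in Python as follows. The sketch uses a small list of observed Read codes that is
invented purely for illustration (the G.... entries are hypothetical), takes N as
the number of observed code occurrences, and assumes that the ancestors of a code
are its dot-padded prefixes. It is a sketch of the idea rather than the procedure
used in any cited tool.

```python
import math
from collections import Counter
from typing import Dict, List


def ancestors_and_self(code: str) -> List[str]:
    """The code itself followed by its dot-padded prefixes up to level 1."""
    sig = code.rstrip(".")
    return [sig[:k].ljust(5, ".") for k in range(len(sig), 0, -1)]


def information_content(observed: List[str]) -> Dict[str, float]:
    """IC(c) = -log P(c); the count of a code includes all of its descendants."""
    counts: Counter = Counter()
    for code in observed:
        for c in ancestors_and_self(code):
            counts[c] += 1
    n = len(observed)
    # log(n / k) equals -log(k / n) and avoids a negative zero for k == n.
    return {c: math.log(n / k) for c, k in counts.items()}


def resnik(c1: str, c2: str, ic: Dict[str, float]) -> float:
    """IC of the most informative common ancestor (0.0 if there is none)."""
    common = set(ancestors_and_self(c1)) & set(ancestors_and_self(c2))
    return max((ic[a] for a in common if a in ic), default=0.0)


# Toy occurrences, invented for illustration only.
observed = ["H3311", "H331.", "H33..", "H33..", "H3...", "H....", "G....", "G...."]
ic = information_content(observed)
print(resnik("H3311", "H331.", ic))
```

In this toy data the MICA of H3311 and H331. is H331. itself, which covers two of
the eight occurrences, so the similarity is log 4.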
Jiang and Conrath Measure (JC Measure)
After reviewing the deficiency in the previous measure, Jiang
and Conrath [36] suggested a semantic measure that takes into account
not only the IC of the MICA but also each concept’s own IC. The
result is a semantic distance rather than a semantic
similarity, given by the following equation:
\[ dist_{JC}(c_1, c_2) = IC(c_1) + IC(c_2) - 2 \cdot IC(c_{MICA}) \]
As semantic similarity and semantic distance are inversely
related, a larger distance between the concepts means that
they are less similar.
A related similarity measure can be derived from this distance,
for example with the commonly used conversion:
\[ sim_{JC}(c_1, c_2) = \frac{1}{1 + dist_{JC}(c_1, c_2)} \]
Lin’s semantic measures
This measure uses the same elements as the previous measure. Lin
[37] relates the information shared by the two terms, that is, the IC of
their MICA, to the IC of the terms themselves. Lin’s measure is as follows:
\[ sim_{Lin}(c_1, c_2) = \frac{2 \cdot IC(c_{MICA})}{IC(c_1) + IC(c_2)} \]
The result of this measure ranges between 0 and 1 and indicates the
ratio of the information shared in common to the IC of the concepts being
compared.
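The Jiang and Conrath and Lin measures can be sketched in Python directly from IC
values. The sketch assumes the standard forms dist_JC = IC(c1) + IC(c2) -
2 * IC(MICA) and sim_Lin = 2 * IC(MICA) / (IC(c1) + IC(c2)); the conversion of the
distance into a similarity through 1 / (1 + dist) is one common choice and is an
assumption here. The IC values in the example are illustrative, not project data.

```python
import math


def jiang_conrath_distance(ic1: float, ic2: float, ic_mica: float) -> float:
    """Semantic distance: larger values mean less similar concepts."""
    return ic1 + ic2 - 2.0 * ic_mica


def jiang_conrath_similarity(ic1: float, ic2: float, ic_mica: float) -> float:
    """Assumed conversion of the distance into a similarity in (0, 1]."""
    return 1.0 / (1.0 + jiang_conrath_distance(ic1, ic2, ic_mica))


def lin(ic1: float, ic2: float, ic_mica: float) -> float:
    """Ratio of the shared information to the information of both concepts."""
    total = ic1 + ic2
    if total == 0.0:
        # Both concepts carry no information (e.g. two root terms); the ratio
        # is undefined, and 0.0 is returned here by convention (an assumption).
        return 0.0
    return 2.0 * ic_mica / total


# Illustrative IC values: c1 covers 1/8 of the data, c2 and the MICA 2/8.
ic_c1, ic_c2, ic_mica = math.log(8), math.log(4), math.log(4)
print(jiang_conrath_distance(ic_c1, ic_c2, ic_mica))
print(lin(ic_c1, ic_c2, ic_mica))
```

With these values the distance is log 2 and Lin's similarity is 0.8, which shows
how both measures, unlike Resnik's, change when a term lies below its MICA.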
GraSM (Graph-based Similarity Measure)
Couto et al. [27] proposed a new approach to finding semantic
similarity, known as GraSM. Instead of using only the
concepts’ most informative common ancestor, it considers all their
disjunctive common ancestors, where two ancestors are disjunctive if there are
independent paths from both ancestors to the concept. GraSM
produces lower semantic similarity values since it considers the
average IC of all disjunctive common ancestors.
GraSM considers that a1 and a2 are disjunctive ancestors
of c if there is a path from a1 to c not containing a2 and a path
from a2 to c not containing a1. This can be represented by:
\[ DisjAnc(c) = \{ (a_1, a_2) \mid (\exists p : p \in Paths(a_1, c) \wedge a_2 \notin p) \wedge (\exists p : p \in Paths(a_2, c) \wedge a_1 \notin p) \} \]
where Paths(a, c) is the set of paths from a to c.
Given two concepts c1 and c2, their common disjunctive
ancestors are the most informative common ancestors among the
disjunctive ancestors of c1 and c2; that is, a1 is a common
disjunctive ancestor of c1 and c2 if, for each common ancestor a2 that is
more informative than a1, a1 and a2 are disjunctive ancestors
of c1 or c2. The equation is as follows:
\[ CommonDisjAnc(c_1, c_2) = \{ a_1 \mid a_1 \in CommonAnc(c_1, c_2) \wedge \forall a_2 : [ a_2 \in CommonAnc(c_1, c_2) \wedge IC(a_1) \le IC(a_2) \wedge a_1 \ne a_2 ] \Rightarrow [ (a_1, a_2) \in DisjAnc(c_1) \cup DisjAnc(c_2) ] \} \]
where CommonAnc(c1, c2) is the set of common ancestors of c1 and c2.
GraSM defines the shared information of both concepts, c1 and
c2, as the average of the IC of their common disjunctive ancestors.
The equation is as follows:
\[ Share_{GraSM}(c_1, c_2) = \frac{1}{|CommonDisjAnc(c_1, c_2)|} \sum_{a \in CommonDisjAnc(c_1, c_2)} IC(a) \]
Pesquita et al. [28] showed that using GraSM along with
measures including Jiang and Conrath's, Lin's and Resnik's
resulted in increased performance of semantic similarity on
some data sets, while on other data sets the results were
inconclusive.
o Hybrid Measures
Some measures have been proposed that derive their
functionality from both edge-based and node-based approaches and thus
combine the advantages of both. However, in some cases,
their results might not be close to the similarity perceived by people [35].
Othman’s Measures
Othman [38] proposed a semantic similarity metric, which is a
combination of distance measure and node content. In this
measure edges are weighted by the depth of the node and the
difference in IC between the nodes linked by that edge.
Othman’s measures are as follows:
Wang’s Measure
This measure computes semantic similarity by comparing the
locations of two terms in the ontology graph and their semantic relations
with their ancestor terms [39]. The semantic similarity is calculated
by adding up the semantic contributions of all common
ancestors to each of the terms and dividing by the total semantic
contribution of each term’s ancestors to that term. This measure
has been applied successfully to the Gene Ontology (GO). In GO,
terms are represented as a directed acyclic graph (DAG). For
example, term A is represented as DAG_A = (A, T_A, E_A), where T_A
is the set of terms in DAG_A, and E_A is the set of edges
connecting the GO terms in DAG_A. For any term t in DAG_A =
(A, T_A, E_A), its semantic value with respect to term A, S_A(t), is
defined as follows:
\[ S_A(t) = \begin{cases} 1 & \text{if } t = A \\ \max \{ w_e \cdot S_A(t') \mid t' \in children(t) \} & \text{if } t \ne A \end{cases} \]
where term A contributes to itself with a value of 1 (S_A(A) = 1), and w_e is
the semantic contribution factor for the edge e in E_A linking term t
with its child term t'. Once S_A(t) has been obtained for all terms in
DAG_A, the semantic value of term A, SV(A), is calculated as follows:
\[ SV(A) = \sum_{t \in T_A} S_A(t) \]
Given DAG_A = (A, T_A, E_A) and DAG_B = (B, T_B, E_B) for GO terms
A and B respectively, the semantic similarity between these two
terms, S(A, B), is calculated as:
\[ S(A, B) = \frac{\sum_{t \in T_A \cap T_B} \left( S_A(t) + S_B(t) \right)}{SV(A) + SV(B)} \]
where S_A(t) is the semantic value of GO term t related to
term A, and S_B(t) is the semantic value of GO term t related to
term B.
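The calculation of Wang's measure can be sketched in Python on a small directed
acyclic graph. The graph below is hypothetical and only serves as an example; the
sketch also assumes a single semantic contribution factor of 0.8 for every edge,
whereas in practice the factor may depend on the relation type. S_A(t) is taken as
the maximum over the children t' of t of w_e * S_A(t'), with S_A(A) = 1.

```python
from typing import Dict, List


def s_values(term: str, parents: Dict[str, List[str]], w: float = 0.8) -> Dict[str, float]:
    """S_A(t) for the term A itself and every ancestor t in DAG_A."""
    s = {term: 1.0}
    frontier = [term]
    while frontier:
        child = frontier.pop()
        for parent in parents.get(child, ()):
            candidate = w * s[child]
            # Keep the largest contribution reaching this ancestor.
            if candidate > s.get(parent, 0.0):
                s[parent] = candidate
                frontier.append(parent)
    return s


def wang(a: str, b: str, parents: Dict[str, List[str]], w: float = 0.8) -> float:
    """S(A, B): shared semantic contributions over SV(A) + SV(B)."""
    sa, sb = s_values(a, parents, w), s_values(b, parents, w)
    common = sa.keys() & sb.keys()
    shared = sum(sa[t] + sb[t] for t in common)
    return shared / (sum(sa.values()) + sum(sb.values()))


# Hypothetical DAG: each term maps to its parent terms; "A" is the root.
parents = {"D": ["B", "C"], "B": ["A"], "C": ["A"], "E": ["C"]}
print(wang("D", "E", parents))
```

In this example D and E share the ancestors C and A, so only the contributions of
those two terms enter the numerator.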
Measures between two sets of concepts
o Pair-wise
In a pair-wise approach, the terms of both sets are paired with each other
and the semantic similarity of each pair is calculated. Common ways of
combining the pairwise values include the arithmetic average approach, the
maximum approach and the best-match average.
Arithmetic Average (AVG) approach
In the AVG approach [25, 26], the semantic similarity is the average
over all pairs of terms from the two sets (A, B). It uses the following
formula:
\[ sim_{AVG}(A, B) = \frac{1}{|A| \cdot |B|} \sum_{a \in A} \sum_{b \in B} sim(a, b) \]
Maximum (MAX) approach
In this approach [25], the similarity is the maximum similarity
over all pairs of terms from the two sets (A, B), calculated as follows:
\[ sim_{MAX}(A, B) = \max_{a \in A, \, b \in B} sim(a, b) \]
Best-Match Average (BMA) approach
In BMA [40], every term in the first set (A) is compared with its most
similar term in the second set (B), and vice versa. The result is the
average of these best matches between the two sets. BMA is calculated
using the following formula:
\[ sim_{BMA}(A, B) = \frac{1}{2} \left( \frac{1}{|A|} \sum_{a \in A} \max_{b \in B} sim(a, b) + \frac{1}{|B|} \sum_{b \in B} \max_{a \in A} sim(a, b) \right) \]
Pesquita et al. [28, 41] suggested that the BMA approach could
be the best of the pair-wise approaches because it
provides a good balance between the MAX and AVG approaches
by considering all terms and not only the most significant
match.
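The three pair-wise approaches can be compared on a small matrix of term-to-term
similarities, as in the Python sketch below. The matrix values are invented for
illustration; rows correspond to the terms of set A and columns to the terms of
set B. The BMA variant shown averages the best matches in both directions, which
is one common formulation and is an assumption here.

```python
from typing import List

Matrix = List[List[float]]


def avg_similarity(m: Matrix) -> float:
    """AVG: mean over all pairs of terms."""
    values = [v for row in m for v in row]
    return sum(values) / len(values)


def max_similarity(m: Matrix) -> float:
    """MAX: the single best pair."""
    return max(v for row in m for v in row)


def bma_similarity(m: Matrix) -> float:
    """BMA: average of the best matches, taken in both directions."""
    row_best = [max(row) for row in m]        # best match in B for each a
    col_best = [max(col) for col in zip(*m)]  # best match in A for each b
    return 0.5 * (sum(row_best) / len(row_best) + sum(col_best) / len(col_best))


# Invented similarities between three terms of A (rows) and two of B (columns).
sim_matrix = [[1.0, 0.2],
              [0.4, 0.6],
              [0.0, 0.8]]
print(avg_similarity(sim_matrix), max_similarity(sim_matrix), bma_similarity(sim_matrix))
```

For this matrix the AVG value is pulled down by the poorly matching pairs, the MAX
value reflects only the single identical pair, and the BMA value lies between
the two.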
In summary, a considerable number of semantic similarity measures have
been developed. Pesquita et al. [41] discuss the problem of selecting a
measure and identify three steps. Firstly, the choice of
measure depends on whether the comparison is
between two concepts or between two sets of concepts. Secondly, measures differ in
the level of detail they capture. For example, graph-based measures such as GraSM
are used for generalised similarity, whereas measures such as best-match average
are used for finding more detailed similarity between concepts. Thirdly, through
detailed analysis of the terms and sets (or bags) of concepts in the given data
set, the researcher can decide on the proper measure based on the results of
computations. This can be done by using tools that implement several semantic
similarity measures at once.
2.3.2. Implementation Tools for Semantic Similarity Measures
Many tools have been developed to calculate semantic similarity measures. Such
tools fall mainly into three categories: web tools, standalone tools and R
packages. Firstly, web tools are used to compare and calculate semantic similarities
in a simple way without requiring any maintenance or updating by the user, though
they offer only certain options. These tools include FuSSiMeG6, ProteInOn7,
G-SESAME8 and FunSimMat9.
Secondly, standalone tools are more stable than web tools and more capable of
performing complex computations. Unlike web tools, however, standalone
applications require local installation and regular updates and maintenance.
Standalone applications include DynGO and the UTMGO tool.
R packages are the third type of tool used to calculate semantic similarities.
Tools of this type can be combined with other packages, such as those for
visualisation or statistical analysis. Examples of these packages include tools
provided by the Bioconductor project such as SemSim10, GOvis11 and csbl.go12.
2.4. Multidimensional scaling
Multidimensional scaling (often abbreviated to MDS) encompasses statistical
methods for reducing the dimensionality of a given data set. This is done by
mapping the distances between data points in a high-dimensional space into a
lower-dimensional space. MDS searches for a configuration of points in the
low-dimensional space, each representing one object, such that the distances
between these points match the corresponding dissimilarities in the
high-dimensional space as closely as possible [42]. Possible applications of MDS
include data visualisation to identify similarities or dissimilarities in the
data and the application of different data mining methods. Different models,
including metric multidimensional scaling and non-metric multidimensional
scaling, have been used to search for the space and the associated configuration
of points [43].
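One metric variant, classical (Torgerson) MDS, can be sketched in Python with
NumPy. This is offered only as an illustration of the idea of recovering
coordinates from pairwise distances; it is not stated in the cited sources to be
the model used in this project, and the distance matrix in the example is made up.

```python
import numpy as np


def classical_mds(dist, k=2):
    """Embed n objects in k dimensions from an n x n matrix of distances."""
    d2 = np.asarray(dist, dtype=float) ** 2
    n = d2.shape[0]
    # Double centring turns squared distances into inner products.
    centring = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centring @ d2 @ centring
    eigvals, eigvecs = np.linalg.eigh(b)        # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]       # indices of the k largest
    top = np.clip(eigvals[order], 0.0, None)    # guard against tiny negatives
    return eigvecs[:, order] * np.sqrt(top)


# Made-up distances between three objects lying on a line at 0, 3 and 7.
distances = [[0, 3, 7],
             [3, 0, 4],
             [7, 4, 0]]
coords = classical_mds(distances, k=1)
print(coords.ravel())
```

The recovered coordinates are only defined up to translation and reflection, but
the distances between them reproduce the input distances.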
2.4.1. Principal Component Analysis
Because they work in a similar way to MDS, related methods such
as principal component analysis (PCA) have also been used. PCA is an effective
statistical tool for reducing the dimension of a given high-dimensional data set.
It is used for data visualisation, compression, feature selection and feature
extraction [43].
6 http://xldb.fc.ul.pt/rebil/ssm
7 http://xldb.fc.ul.pt/biotools/proteinon
8 http://bioinformatics.clemson.edu/G-SESAME
9 http://funsimmat.bioinf.mpi-inf.mpg.de
10 http://www.bioconductor.org/packages/2.2/bioc/html/SemSim.html
11 http://bioconductor.org/packages/2.3/bioc/html/GOstats.html
12 http://csbi.ltdk.helsinki.fi/csbl.go/
Furthermore, PCA is considered to be a powerful technique for analysing data due to
its simplicity as a non-parametric method for retrieving information from data sets.
By applying PCA to a particular data set, we can more easily identify the underlying
factors behind the observed data, and the similarity patterns become
clearer when visualised [44].
In order to extract the important information from the given data set, PCA calculates a
set of new variables called principal components (PCs), which are obtained as
uncorrelated linear combinations of the original variables of the data set [45].
Multiple steps are followed to calculate the principal components [46]. Firstly, we
provide a data set; for the purpose of this study, we are given a data set of n data
points in d-dimensional space. With d = 2, for example, each point is a pair
(x_i, y_i). The next step is to subtract the mean of both x and y, so that all the
x values have the mean of x subtracted and all the y values have the mean of y
subtracted from them. The mean is calculated as follows, and likewise for y:
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]
Next, the covariance matrix for the data set with d dimensions is calculated using the
following formula:
\[ C^{d \times d} = (c_{i,j}), \quad c_{i,j} = cov(Dim_i, Dim_j) \]
where C is a matrix with d rows and d columns, and Dim_i is the i-th
dimension, and
\[ cov(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1} \]
After calculating the covariance matrix, the next step is to find the eigenvectors and
eigenvalues of this matrix. Once the eigenvectors are found, they are ordered by
eigenvalue, highest to lowest, where the eigenvector with the highest eigenvalue is
the first PC of the data set. The second PC is orthogonal to
the first PC and has the next highest eigenvalue; the other principal components
are computed likewise [45]. There are d principal components in total; to keep the
results as precise as possible, all of them would have to be retained.
In practice only the components with the largest eigenvalues are kept, which
reduces the dimensionality but could lead to ignoring important information.
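The steps described above (centring the data, computing the covariance matrix,
finding and ordering its eigenvectors, and projecting onto the leading components)
can be sketched in Python with NumPy. The function name and the small
two-dimensional data set are illustrative assumptions; the covariance uses the
n - 1 denominator.

```python
import numpy as np


def pca(data, n_components):
    """Project the rows of `data` onto the leading principal components."""
    x = np.asarray(data, dtype=float)
    centred = x - x.mean(axis=0)                 # subtract the mean of each dimension
    cov = np.cov(centred, rowvar=False)          # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)       # symmetric matrix, ascending order
    order = np.argsort(eigvals)[::-1]            # reorder from highest to lowest
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :n_components]       # one principal component per column
    scores = centred @ components                # coordinates in the reduced space
    return scores, eigvals, components


# Illustrative data lying exactly on the line y = 2x.
points = [[0, 0], [1, 2], [2, 4], [3, 6]]
scores, eigvals, components = pca(points, n_components=1)
print(eigvals)
```

Because the example points lie on a single line, the second eigenvalue is
(numerically) zero, so dropping the second component loses no information in this
particular case.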
PCA has a number of limitations, such as issues related to the reduced dimensions
and whether or not the principal components are statistically independent.
Therefore, several extensions and related methods, such as independent component
analysis (ICA), principal coordinate analysis (PCoA) and kernel PCA, have been
proposed to overcome such limitations. ICA has been tried and tested in
applications where standard PCA has shown insufficient results, such as signal and
image processing [44].
PCA can be implemented through different numerical computation tools such as
ViSta¹³, Scilab¹⁴, GNU Octave¹⁵ and Weka¹⁶, which are free software, and the
commercial statistical software MATLAB¹⁷.
3. Methodology
3.1. Problem Statement
Since the introduction of the concept of personalised medicine, many questions have
been raised as to whether the traditional way of prescribing medications is still
appropriate, given the finding that some treatments can cause severe drug
reactions in particular groups of people [21]. One way to avoid this is by delivering
the right dose of the drug tailored to an individual's genetic makeup [47]. This
problem is one of the challenges facing modern pharmaceutical development.
Data held in health care systems can have an important impact on
solving such a problem. However, the complexity and the large volumes of medical
data can be an obstacle to interpreting them. Therefore, approaches adapted
from data mining, such as semantic similarity and principal component analysis,
which have shown promising results [26, 48, 49], are worth considering in order to
come up with an effective solution for predicting possible drug responses. These two
approaches will be applied to medical data by mapping them from
an ontology space to a simple metric space, providing ways of effectively
visualising the data and applying data mining techniques.
3.2. Aims and Objectives
The main aim of the present study is to provide an insight into the invaluable
resources in the data recorded by primary care health professionals, in order to
develop better ways of predicting how patients respond to particular treatments.
This will help to maximise the effectiveness of medication by tailoring
dosages to a particular patient's specific needs. Subsidiary aims are to evaluate
and validate whether computer science approaches such as semantic
similarity measures and principal component analysis are promising for
interpreting the large and complex data that emerge from the work of
GPs, and for allowing both visualisation and data mining to be applied.
¹³ http://forrest.psych.unc.edu/research/index.html
¹⁴ http://www.scilab.org/
¹⁵ http://www.gnu.org/software/octave/
¹⁶ http://www.cs.waikato.ac.nz/~ml/weka/
¹⁷ http://www.mathworks.com/products/matlab/
In order to achieve the main aims of the study, we will attempt to achieve the
following objectives:
• Developing a better understanding of health care systems and how medical
  data is captured, exchanged and encoded.
• Understanding the concept of pharmacogenomics and its related applications
  to personalised medicine.
• Analysing the given data set with regard to its large size and complexity and
  its medical coding schema.
• Drawing up a work strategy for mapping the project data set.
• Implementing and testing the proposed strategy.
• Evaluating the performance of the proposed system.
• Improving the work strategy based on results and feedback.
Additionally, the subsidiary aims will be achieved in terms of the following
objectives:
• Exploring strategies such as SSMs and PCA in order to apply them to the
  given data set for hypothesis formulation and data mining.
• Undertaking a literature review of the applications of SSMs in Gene Ontology
  (GO) and WordNet.
• Reviewing the existing semantic similarity metrics to determine those most
  appropriate for implementation on the project data set.
• Collecting the existing implementation tools for both SSMs and PCA and
  determining the most appropriate tools.
3.3. System Analysis
3.3.1. Literature Survey
In healthcare systems, data such as medical records are generated in large volumes
and stored in complex forms. As a result, researchers have found it difficult to
analyse and investigate such resources. As mentioned earlier, computer science
techniques have been used in several applications that deal with
huge amounts of data. One of these techniques is known as semantic similarity
measures (SSMs). SSMs are used to map a set of documents or terms into a metric
space based on the likeness of their meaning or semantic content. They have been
successfully applied to biomedical ontologies. Another technique is
principal component analysis (PCA), which is mainly used in the context of this
project as a dimension reduction tool.
A considerable amount of literature has been published on SSMs and PCA. The first
stage of this study involved two literature reviews. Firstly, a systematic review of the
current literature was conducted to identify the existing semantic similarity
metrics and the ways they are applied. This was followed by
the collection and analysis of studies in order to select a suitable measure to be
implemented on the given data set. This review also included research into the uses
of PCA on high-dimensional data, and a consideration of its implementation tools in
order to determine the software that should be used. Secondly, a survey of literature
relating to medical terminologies, classifications and coding systems was undertaken
in order to fully understand the nature of the medical records in primary care and
how they are organised and structured.
3.3.2. Strategy of Literature Search
The criteria for accepting literature on SSMs focused specifically on publications
about semantic similarity metrics applied to fields such as the Gene Ontology (GO)¹⁸
or linguistics (WordNet)¹⁹. Acceptance was also based on
the publication date, which was restricted to 2000 onwards;
other publications were excluded. However, some older articles and books relating to
PCA were accepted, since this mathematical procedure was developed prior to 2000.
Other criteria were adopted for searching for, and retrieving,
appropriate medical studies relating to medical coding systems: the search focused
mainly on Read codes and excluded studies of other coding systems.
3.3.3. Sources of Data and Data Collection
The sources of data for this study are both primary and secondary. This work
attempted to identify existing knowledge related to SSMs, PCA and Read codes. The
data collection was done through interviews with the project partners and a survey
of the available literature.
Primary Data
The collection of the primary data will take place in the second stage of this
study²⁰. The project data set will be obtained from the project partners.
Multiple methods of data collection such as interviews, personal discussions
and observations will be used.
Secondary Data
The process of collecting secondary data took place in the first stage of the
study. Different online databases and libraries were used to review the
literature. These included the John Rylands University Library catalogue²¹, the
U.S. National Library of Medicine (PubMed)²², the ISI Web of Knowledge²³
¹⁸ http://www.geneontology.org/
¹⁹ http://wordnet.princeton.edu/
²⁰ See project plan section.
²¹ http://catalogue.library.manchester.ac.uk/
²² http://www.ncbi.nlm.nih.gov/pubmed/
²³ http://wok.mimas.ac.uk/
and the ACM Digital Library²⁴. These databases were searched using the
following search terms: “semantic similarities (measures/metrics/scores)”,
“semantic similarities in gene ontology”, “principal component analysis”,
“medical (terminology/classification/coding) systems” and “Read codes”. In
addition, related journals and reports were hand-searched through Google
Scholar using terms corresponding to those listed above.
3.3.4. Project Plan
The project has two main stages (February 2011 to September 2011). The first stage
was the period from February through to the end of May. In this period we were
required to work on three assignments for the research methods and professional
skills course. These assignments included a project statement, a project plan and a
project website. This stage was intended to create effective and solid background
research. Working on these pieces of coursework also helped in planning and
collecting relevant literature regarding the study topic. On 9th May we are required to
submit a background report containing a review of the literature relating to the project
topic, along with a description of the purpose, objectives and deliverables of the
project.
The second stage starts in June and ends in September with the submission of the
dissertation. After the review of existing techniques in the first stage, the second
stage will be allocated to the implementation of the proposed work strategy. In this
stage, there will be some visits to AstraZeneca Research to receive the project data
set, which will help in implementing and testing the proposed methods and in
accomplishing the study objectives. The writing of the final study report is expected
to take place during the period from July to September. This stage will include
regular meetings with the academic supervisor to demonstrate the progress of the
work. Full details of the different project tasks can be found in Appendix A, in
which the project plan is represented using a Gantt chart.
3.4. Proposed Work Strategy
To achieve the main study aims, the following steps are planned:
Step 1: Apply an SSM to the data set to map it into semantic space as follows:
Apply the measure to the individual Read codes of a patient, and then
compute the semantic similarity between two sets of Read codes. By repeating
this process on each patient's record, a vector of similarities for each patient
will be generated. Another semantic similarity calculation will take place
between two patient records. These two calculations will result in mapping the
data into a vector space.
²⁴ http://portal.acm.org
Step 2: Apply PCA to the data set to reduce the dimensionality as follows:
After mapping patients' records into vectors of similarities, we can find
the principal components by performing PCA. As a result, the data will be mapped
to a low-dimensional vector space, in which we can readily apply
different machine learning techniques and visualise the data.
Further details are described in the following sections.
3.4.1. Project Data Set
The data set obtained from the project partners contains GP records from Salford, a
city in the North West of England. This data contains anonymised patient records for
nearly a quarter of a million individuals; the data set is huge and complex and
requires a number of processing steps before the proposed methods can be applied to
it. These records are captured and described by a clinical coding scheme such as Read
codes. The structure of Read code records will help us to implement techniques
such as SSMs and PCA, since the given records are encoded in such a way that
each patient is associated with a set of Read codes during a given period of time.
3.4.2. Semantic similarity
Applying semantic similarity measures to the project data set is done in two
steps:
Compute similarity between two Read codes of one patient:
Here, we compare two Read codes $c_1$ and $c_2$ of an individual patient to
find their semantic similarity by using the GraSM [27] approach with Resnik's
measure [34]²⁵. Resnik defines this measure as the information content (IC)
of the most informative common ancestor (MICA):
$$sim_{Res}(c_1, c_2) = IC(c_{MICA})$$
where
$$c_{MICA} = \arg\max_{c \in CA(c_1, c_2)} IC(c)$$
We can find the common ancestors of the two concepts $c_1$ and $c_2$ by the
following:
$$CA(c_1, c_2) = Anc(c_1) \cap Anc(c_2)$$
where $Anc(c)$ denotes the set of ancestors of $c$ in the hierarchy.
The IC of a code $c$ can be defined as the following:
$$IC(c) = -\log p(c)$$
where $p(c)$ is the probability of occurrence of $c$ (or any of its descendants) in
the data set.
This process will be applied to each record of a patient.
²⁵ See “Survey of Semantic Similarities Measures” section for more information about both measures.
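The calculation above can be sketched in Python as follows. This is a minimal
illustration under stated assumptions, not the project implementation: the hierarchy
in PARENT and the codes in it are hypothetical toy values rather than real Read
codes, each code is treated as an ancestor of itself, and p(c) is estimated as the
number of occurrences of c or its descendants divided by the total number of recorded
codes. Because every toy code has a single parent, GraSM's handling of multiple
disjunctive ancestors does not arise and is not shown.

```python
import math
from collections import Counter

# Hypothetical toy hierarchy (child -> parent); these are not real Read codes.
PARENT = {
    "A": "ROOT", "B": "ROOT",
    "A1": "A", "A2": "A", "B1": "B",
    "A11": "A1", "A12": "A1",
}


def ancestors(code):
    """Return the code itself plus all of its ancestors up to the root."""
    chain = [code]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def information_content(records):
    """IC(c) = -log p(c), where p(c) counts c and all of its descendants."""
    counts = Counter()
    for record in records:
        for code in record:
            # An occurrence of a code also counts for each of its ancestors.
            counts.update(ancestors(code))
    total = counts["ROOT"]
    return {code: -math.log(count / total) for code, count in counts.items()}


def resnik(c1, c2, ic):
    """IC of the most informative common ancestor of c1 and c2."""
    common = set(ancestors(c1)) & set(ancestors(c2))
    return max(ic[c] for c in common)


# Hypothetical patient records, each a list of toy codes.
records = [["A11", "A12"], ["A11", "B1"], ["A2"], ["B1"]]
ic = information_content(records)
sibling_similarity = resnik("A11", "A12", ic)    # common ancestor A1
unrelated_similarity = resnik("A11", "B1", ic)   # only the root is shared
```

Two codes that share only the root receive a similarity of zero, since the root
covers every occurrence and therefore carries no information.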
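Compute similarity between two patient records:
In this step we will use the arithmetic average (AVG) approach to find the
similarity between two patient records, given that each record is a set of Read
codes, $P_1 = \{a_1, \ldots, a_n\}$ and $P_2 = \{b_1, \ldots, b_m\}$.
The AVG measure is calculated as follows:
$$sim_{AVG}(P_1, P_2) = \frac{1}{n \cdot m} \sum_{i=1}^{n} \sum_{j=1}^{m} sim_{Res}(a_i, b_j)$$
where $a_i \in P_1$ and $b_j \in P_2$, $sim_{Res}$ is Resnik's measure from step 1,
and $sim_{AVG}(P_1, P_2)$ is the similarity between two patient records $P_1$ and
$P_2$. That is, we calculate the average of the pairwise similarities using the
measure mentioned in step 1.
By this stage, we can represent the similarities of patient records obtained
above as vectors in a similarity space in order to apply principal component
analysis.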
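The averaging step can be sketched as follows. This is a minimal sketch in Python
under stated assumptions: the function names are hypothetical, the code similarity
is passed in as a parameter, and shared_prefix_length is only a toy stand-in for the
Resnik measure of step 1, used so that the example runs on its own. The records and
codes are likewise hypothetical.

```python
def average_similarity(record_a, record_b, code_sim):
    """Arithmetic average of code_sim over all pairs of codes (AVG approach)."""
    if not record_a or not record_b:
        raise ValueError("both records must contain at least one code")
    total = sum(code_sim(a, b) for a in record_a for b in record_b)
    return total / (len(record_a) * len(record_b))


def similarity_matrix(records, code_sim):
    """Pairwise record similarities; assumes code_sim is symmetric."""
    n = len(records)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            value = average_similarity(records[i], records[j], code_sim)
            matrix[i][j] = matrix[j][i] = value
    return matrix


def shared_prefix_length(a, b):
    """Toy code similarity (assumption): number of leading characters shared."""
    length = 0
    for x, y in zip(a, b):
        if x != y:
            break
        length += 1
    return length


# Hypothetical patient records, each a list of toy codes.
records = [["A11", "A12"], ["A11", "B1"], ["B1"]]
matrix = similarity_matrix(records, shared_prefix_length)
```

Each row of the resulting matrix is the vector of similarities of one patient record
to all records, which is the representation passed on to the next step. Note that
under the AVG approach the similarity of a record to itself is not necessarily the
largest value in its row.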
3.4.3. Principal Components
In order to achieve the main aims of this study, further calculations should be done
on the vector space generated in the previous step. In other words, the patient
records, which are mapped into similarity space, are still represented in a
high-dimensional space, in which we can neither readily apply machine learning
techniques nor visualise the data. As discussed earlier in the background research
section of this report, PCA is used in such cases since it offers dimensionality
reduction and gives a better understanding of the variance and structure of the
data [50].
Finally, the data set will be ready by this stage. We will then be able to implement
different data mining methods and also to project these data into a
simple metric space for visualisation.
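The reduction step can be sketched as follows, treating each row of a patient
similarity matrix as the feature vector of one patient. This is an illustrative
sketch only: the similarity values are hypothetical, the function names are
assumptions, and power iteration with deflation is used here simply to keep the
example within the Python standard library. In practice one of the tools listed in
the background section, or a linear algebra library, would be used instead.

```python
import math
import random


def centre_and_covariance(rows):
    """Subtract column means and return (centred rows, covariance matrix)."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centred = [[row[j] - means[j] for j in range(d)] for row in rows]
    cov = [
        [sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    return centred, cov


def top_eigenpairs(matrix, k, iterations=500, seed=0):
    """Largest k eigenpairs of a symmetric positive semi-definite matrix,
    found by power iteration with deflation."""
    d = len(matrix)
    work = [row[:] for row in matrix]
    rng = random.Random(seed)
    pairs = []
    for _component in range(k):
        v = [rng.random() for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _step in range(iterations):
            w = [sum(work[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-15:  # no variance left to extract
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue for the unit vector v.
        eigenvalue = sum(
            v[i] * work[i][j] * v[j] for i in range(d) for j in range(d)
        )
        pairs.append((eigenvalue, v))
        # Deflation: remove the component that has just been found.
        for i in range(d):
            for j in range(d):
                work[i][j] -= eigenvalue * v[i] * v[j]
    return pairs


def reduce_dimensions(rows, k):
    """Project the rows onto their first k principal components."""
    centred, cov = centre_and_covariance(rows)
    pairs = top_eigenpairs(cov, k)
    scores = [
        [sum(c[j] * vec[j] for j in range(len(c))) for _, vec in pairs]
        for c in centred
    ]
    return [value for value, _ in pairs], scores


# Hypothetical similarity matrix for four patients forming two groups.
similarities = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.9],
    [0.2, 0.1, 0.9, 1.0],
]
eigenvalues, scores = reduce_dimensions(similarities, 2)
```

In this toy example the first component separates the first two patients from the
last two, so the four-dimensional similarity vectors can be plotted in two
dimensions without losing the grouping.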
References
1. Healthcare Information and Management Systems Society. Electronic Health Record. Available from: http://www.himss.org/ASP/topics_ehr.asp [Accessed April 2011]
2. Rector, A., W. Nowlan, and S. Kay. Foundations for an electronic medical record. Methods Inf Med, 1991. 30(3): p. 179-186.
3. Benson, T., Principles of health interoperability HL7 and SNOMED. 2010, London: Springer Verlag.
4. Gillam, S. and A.N. Siriwardena, The Quality and Outcomes Framework: Qof-Transforming General Practice. 2010: Radcliffe Publishing.
5. NHS Confederation and British Medical Association. Investing in general practice: the new GMS contract. British Medical Association, 2003.
6. Nebeker, J.R., P. Barach, and M.H. Samore. Clarifying Adverse Drug Events: A Clinician's Guide to Terminology, Documentation, and Reporting. Annals of Internal Medicine, 2004. 140(10): p. 795-801.
7. Shortliffe, E.H. The evolution of electronic medical records. Academic Medicine, 1999. 74(4): p. 414-419.
8. Coiera, E., Guide to Medical Informatics, the Internet and Telemedicine. 1997: Chapman and Hall, Ltd. 376.
9. World Health Organization (WHO). International Classification of Diseases (ICD). 2011. Available from: http://www.who.int/classifications/icd/en/ [Accessed 30 April 2011]
10. Cimino, J. Review paper: coding systems in health care. Methods Inf Med, 1996. 35(4-5): p. 273-84.
11. de Lusignan, S. Codes, classifications, terminologies and nomenclatures: definition, development and application in practice. Informatics in Primary Care, 2005. 13(1): p. 65-70.
12. House of Commons: Health Committee, The electronic patient record: Sixth Report of Session 2006-07. 2007.
13. Booth, N. What are the Read Codes? Health Libraries Review, 1994. 11(3): p. 177-182.
14. Bentley, T.E., C. Price, and P.J.B. Brown. Structural and lexical features of successive versions of the Read Codes. in proceedings of the 1996 Annual Conference of the Primary Health Care Specialist Group of the British Computer Society. 1996. Cambridge: UK.
15. Scottish Clinical Information Management in Primary Care. Read Codes "Making IT Work For You" Good Practice Guide (GPG) : RCGP and CEPpc. 2003. Available from: http://www.scimp.scot.nhs.uk/gpg/doc_page67.shtml [Accessed 25 April 2011]
16. Vallance, P. and T.G. Smart. The future of pharmacology. British Journal of Pharmacology, 2006. 147(S1): p. S304-S307.
17. Griffiths, W.M., et al., An introduction to genetic analysis. 2000, New York, USA: WH Freeman and Company.
18. U.S. Department of Energy Genome Program's Biological and Environmental Research Information System (BERIS). Pharmacogenomics. 08 September 2010. Available from: http://www.ornl.gov/sci/techresources/Human_Genome/medicine/pharma.shtml [Accessed 25 April 2011]
19. Centre for Genetics Education. Pharmacogenetics/Pharmacogenomics. June 2007. Available from: http://www.genetics.edu.au/factsheet/fs25 [Accessed April 2011]
20. Zhang, W., R. Huang, and M. Dolan. Integrating epigenomics into pharmacogenomic studies. Pharmacogenomics and personalized medicine, 2008(1): p. 7-14.
21. Barone, C., S.S. Mousa, and S.A. Mousa. Pharmacogenomics in cardiovascular disorders: Steps in approaching personalized medicine in cardiovascular medicine. Pharmacogenomics and personalized medicine, 2009. 2: p. 59-67.
22. Ginsburg, G.S., M.P. Donahue, and L.K. Newby. Prospects for Personalized Cardiovascular Medicine: The Impact of Genomics. Journal of the American College of Cardiology, 2005. 46(9): p. 1615-1627.
23. Cimiano, P., A. Hotho, and S. Staab. Learning concept hierarchies from text corpora using formal concept analysis. J. Artif. Int. Res., 2005. 24(1): p. 305-339.
24. Muller, C., I. Gurevych, and M. Muhlhauser, Integrating Semantic Knowledge into Text Similarity and Information Retrieval, in Proceedings of the International Conference on Semantic Computing. 2007, IEEE Computer Society. p. 257-264.
25. Lord, P., R. Stevens, and A. Brass. Semantic similarity measures as tools for exploring the gene ontology. in The 8th Pacific Symposium on Bio-computing 2003. 2003.
26. Lord, P., et al. Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation. Bioinformatics, 2003. 19(10): p. 1275 - 83.
27. Couto, F.M., M.J. Silva, and P.M. Coutinho. Measuring semantic similarity between Gene Ontology terms. Data & Knowledge Engineering, 2007. 61(1): p. 137-152.
28. Pesquita, C., et al. Metrics for GO based protein semantic similarity: a systematic evaluation. BMC Bioinformatics, 2008. 9(Suppl 5): p. S4.
29. Fellbaum, C., Wordnet: an electronic lexical database. 1998, Cambridge: MIT Press.
30. Li, Y., Z.A. Bandar, and D. McLean. An Approach for Measuring Semantic Similarity between Words Using Multiple Information Sources. IEEE Trans. on Knowl. and Data Eng., 2003. 15(4): p. 871-882.
31. Varelas, G., et al., Semantic similarity methods in wordNet and their application to information retrieval on the web, in Proceedings of the 7th annual ACM international workshop on Web information and data management. 2005, ACM: Bremen, Germany. p. 10-16.
32. Wu, Z. and M. Palmer, Verbs semantics and lexical selection, in Proceedings of the 32nd annual meeting on Association for Computational Linguistics. 1994, Association for Computational Linguistics: Las Cruces, New Mexico.
33. Leacock, C. and M. Chodorow. Combining Local Context and WordNet Similarity for Word Sense Identification. WordNet: A Lexical Reference System and its Application, 1998: p. 265-283.
34. Resnik, P. Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. Journal of Artificial Intelligence Research, 1999. 11: p. 95 - 130.
35. Songmei, C. and L. Zhao, An Improved Semantic Similarity Measure for Word Pairs, in Proceedings of the 2010 International Conference on e-Education, e-Business, e-Management and e-Learning. 2010, IEEE Computer Society.
36. Jiang, J.J. and D.W. Conrath. Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy. in International Conference Research on Computational Linguistics (ROCLING X). 1997.
37. Lin, D. An Information-Theoretic Definition of Similarity. in Proc. International Conference on Machine Learning (ICML). 1998.
38. Othman, R.M., S. Deris, and R.M. Illias. A genetic similarity algorithm for searching the Gene Ontology terms and annotating anonymous protein sequences. Journal of Biomedical Informatics, 2008. 41(1): p. 65-81.
39. Wang, J.Z., et al. A new method to measure the semantic similarity of GO term. Bioinformatics, 2007. 23(10): p. 1274 - 1281.
40. Schlicker, A., et al. A new measure for functional similarity of gene products based on Gene Ontology. BMC Bioinformatics, 2006. 7: p. 302.
41. Pesquita, C., et al. Semantic Similarity in Biomedical Ontologies. PLoS Comput Biol, 2009. 5(7): p. e1000443.
42. Cox, T. and M. Cox, Multidimensional Scaling, Second Edition. 2000: Chapman & Hall/CRC.
43. Borg, I. and P.J.F. Groenen, Modern multidimensional scaling: Theory and applications. 2005: Springer.
44. Shlens, J. A tutorial on principal component analysis. Systems Neurobiology Laboratory, University of California at San Diego, 2005.
45. Abdi, H. and L.J. Williams. Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics, 2010. 2(4): p. 433-459.
46. Smith, L.I. A tutorial on principal components analysis. Cornell University, USA, 2002. 52.
47. Sadée, W. and Z. Dai. Pharmacogenetics/genomics and personalized medicine. Human Molecular Genetics, 2005. 14(suppl 2): p. R207-R214.
48. Couto, F., M. Silva, and P. Coutinho. Semantic similarity over the gene ontology: family correlation and selecting disjunctive ancestors. In CIKM '05: Proceedings of the 14th ACM international conference on Information and knowledge management, 2005: p. 343 - 344.
49. Pedersen, T., S. Patwardhan, and J. Michelizzi, WordNet::Similarity: measuring the relatedness of concepts, in Demonstration Papers at HLT-NAACL 2004. 2004, Association for Computational Linguistics: Boston, Massachusetts. p. 38-41.
50. Jolliffe, I.T., Principal component analysis. Vol. 2. 2002: Springer Series in Statistics. 487.
Appendix A – Project Gantt chart