PROKINO: DESIGN AND DEVELOPMENT OF ONTOLOGY ON PROTEIN
KINASES
by
GURINDER PAL SINGH GOSAL
(Under the Direction of KRZYSZTOF J. KOCHUT and NATARAJAN KANNAN)
ABSTRACT
The prominent role protein kinases play in cell regulation and disease has
given rise to an abundance of information about the structure, function, interactions and
evolution of these proteins. This information, however, is currently spread across several
heterogeneous resources, an obstacle to the kind of integrative approaches needed in
utilizing existing knowledge for research related to diseases. We have designed and
developed an ontology for protein kinases, ProKinO, that serves as a useful and efficient
representation of the integrated knowledge about these complex proteins which are
intimately involved in the genesis and behavior of cancer cells. ProKinO captures
concepts and relationships important to protein kinases and the instances are populated
from disparate resources including KinBase, COSMIC, Protein Data Bank, UniProt and
Pfam. ProKinO has potential applications in text mining in the protein kinase literature;
cancer genome annotation in cancer genome sequencing studies and research related to
protein kinases and associated domains.
INDEX WORDS: Kinase, Ontology, Semantic Web, Text Mining, Annotation.
PROKINO: DESIGN AND DEVELOPMENT OF ONTOLOGY ON PROTEIN
KINASES
by
GURINDER PAL SINGH GOSAL
BA, Punjabi University, India, 1991
MCA, Punjabi University, India, 1996
A Thesis Submitted to the Graduate Faculty of The University of Georgia in Partial
Fulfillment of the Requirements for the Degree
MASTER OF SCIENCE
ATHENS, GEORGIA
2010
PROKINO: DESIGN AND DEVELOPMENT OF ONTOLOGY ON PROTEIN
KINASES
by
GURINDER PAL SINGH GOSAL
Major Professor: Krzysztof J. Kochut
Co-Major Professor: Natarajan Kannan Committee: John A. Miller
Hamid R. Arabnia Electronic Version Approved: Maureen Grasso Dean of the Graduate School The University of Georgia May 2010
iv
DEDICATION
I would like to dedicate this thesis to my parents Kartar Singh Gosal and Balwant
Kaur Gosal, my wife Ruby, my son Harjap and my entire dear and near ones. I would have
never been able to complete this thesis without their love and support. I would like to make a
very special mention of my ancestral village Kakrala which has always have been a source of
inspiration throughout my life.
v
ACKNOWLEDGEMENTS
I am extremely grateful to my major professor, Dr. Krzysztof J. Kochut, for his
invaluable supervision and encouragement throughout my research and academic study in
The University of Georgia. I am also grateful to my co-major professor, Dr. Natarajan
Kannan, for his role as my guide in the biomedical domain on which my thesis work is
primarily based upon. I would also like to thank Dr. John A. Miller and Dr. Hamid R.
Arabnia for serving on my advisory committee, and for their valuable advice and guidance in
my research.
I would like to thank my fellow researchers in Dr. Kochut’s LSDIS Research Group
in the Computer Science department and also my fellow lab members of in Dr. Kannan’s
Evolutionary Systems Biology Group in the Biochemistry and Molecular Biology
department, for their help and support throughout my period of study and research at The
University of Georgia.
vi
TABLE OF CONTENTS
Page
ACKNOWLEDGEMENTS .................................................................................................v
LIST OF TABLES ........................................................................................................... viii
LIST OF FIGURES ........................................................................................................... ix
CHAPTER
1 INTRODUCTION...........................................................................................1
1.1 Motivation and Need.............................................................................1
1.2 Contributions ........................................................................................3
1.3 Scope .....................................................................................................4
2 BACKGROUND AND LITERATURE SURVEY .......................................6
2.1 Semantic Web Vision ...........................................................................6
2.2 Ontologies ...........................................................................................17
2.3 Biomedical Knowledge Management .................................................21
2.4 Biomedical Ontologies ........................................................................22
2.5 Biomedical Ontologies Applications ..................................................24
2.6 Information Sources in Protein Kinase Domain .................................27
2.7 Challenges in Integrating Protein Kinase Knowledge ........................30
3 DESIGN OF PROKINO...............................................................................33
3.1 Heterogeneous Knowledge Sources ...................................................33
3.2 ProKinO Design ..................................................................................43
vii
3.3 Architecture of Systems Based on ProKinO ........................................53
4 PROKINO LIFE CYCLE ............................................................................55
4.1 Data Acquisition .................................................................................56
4.2 Data Integration and Ontology Population .........................................60
4.3 Curation and Ontology Modification ..................................................64
4.4 ProKinO Revisions .............................................................................65
4.5 Ontology Dissemination and Evaluation ............................................66
5 POTENTIAL APPLICATIONS OF PROKINO .......................................68
5.1 ProKinO Browsing..............................................................................68
5.2 Text Mining ........................................................................................70
5.3 Cancer Genome Annotation ................................................................71
6 CONCLUSION AND FUTURE WORK ....................................................73
REFERENCES .................................................................................................................76
viii
LIST OF TABLES
Page
Table 1: Kinase distribution by major groups in human systems .....................................36
Table 2: Classes in ProKinO ............................................................................................46
Table 3: Object properties in ProKinO .............................................................................48
Table 4: Data properties in ProKinO ................................................................................51
ix
LIST OF FIGURES
Page
Figure 1: The Semantic Web layered specification ............................................................8
Figure 2: A RDF statement depicting a triple graphically ................................................10
Figure 3: Web Ontology Language (OWL) and its three sublanguages ...........................11
Figure 4: An ontology fragment in OWL ........................................................................14
Figure 5: A view of Open Link’s Virtuoso Generic Endpoint .........................................16
Figure 6: Protein kinases phosphorylation ........................................................................27
Figure 7: Challenges in integrating protein kinase knowledge .........................................31
Figure 8: Human kinome poster .......................................................................................34
Figure 9: The Protein kinase classification in groups by KinBase ...................................35
Figure 10: Pfam display of protein domain architecture ................................................39
Figure 11: Crystal structure of protein kinase gene product EGFR ................................40
Figure 12: The Universal Protein Resource (UniProt) databases .....................................42
Figure 13: Conceptualization of ProKinO ........................................................................45
Figure 14: Architecture of systems based on ProKinO ....................................................53
Figure 15: Jena example to create an ontology model ......................................................57
Figure 16: Data acquisition from ProKinO sources ..........................................................59
Figure 17: A snapshot of Protégé editor showing populated ProKinO ............................62
Figure 18: Knowledge discovery through populated ProKinO ........................................63
Figure 19: A snapshot of elementary browsing of ProKinO ............................................69
1
CHAPTER 1
INTRODUCTION
1.1 Motivation and Need
There has been a tremendous growth in the information about the structure,
function, interaction and evolution of protein kinases which play a very important role in
cellular function and disease. There are more than 50,000 protein kinase sequences (from
diverse organisms), nearly 500 crystal structures, over 500 non-synonymous mutations,
and tens of thousands of published articles on protein kinases and this list is ever
increasing. Integrating and analyzing these existing data can provide a detailed
understanding of the relationships between sequence, structure, function and disease in
the protein kinase family. However, the data and information about protein kinases
domain is spread across different resources with each resource maintaining its own
repository often in different formats. The difficulty in integrating data from these
disparate sources and heterogeneous data formats, has posed major challenges for the
protein kinase community in utilizing existing knowledge for research related to diseases
like cancer.
Ontologies provide a solution to the above mentioned problem because by
representing data as concepts and integrating data by introducing relationships between
these concepts, ontologies provide a framework for capturing, organizing and
representing knowledge in a way that computers can process and also humans can
2
understand. The ontologies are frequently used to deal with the heterogeneity of database
schemas of different information sources by providing a shareable, consistent and formal
description of the semantics [1]. A well identified, conceptualized, designed and
populated ontology representing knowledge about a particular domain can be the basis
for a range of applications serving that domain as well as associated areas that are built
upon the knowledge weaved in ontology [2].
There have been many efforts in the recent past in the biomedical community to
represent fundamental, as well as specialized domain knowledge in the form of
ontologies. Many biomedical ontologies are available in the public domain ranging from
the highly developed Gene Ontology (GO) [3] and the Sequence Ontology project [4] to
very domain specific ontologies, such as those maintained under the umbrella of Open
Biomedical Ontologies Foundry [5]. Highly successful biomedical ontologies such as the
Gene Ontology have served as a vehicle for knowledge representation for the biological
community for over a decade. The Gene Ontology has also been used for various
applications including biomedical literature mining and genome annotation.
There are a few ontologies related to the protein families which are present in
public domain such as in Open Biomedical Ontologies (OBO) Foundry but these are,
however, not able to describe the domain of protein kinases completely. Protein kinases
are a large family of proteins that are implicated in many human cancers and are one of
the very few families that have been extensively studied both from the basic and clinical
point of view. Keeping in view of the immense importance of protein kinases in the
protein family, a greater need is observed to have a specialized ontology for protein
3
kinases. Protein Kinases Ontology (ProKinO) serves as a shared vocabulary to leverage
knowledge that can be used in various useful applications in the protein kinase domain.
1.2 Contributions
There are a number of ontologies in the biological and medical domain which
have performed beyond expectations after their inception. The Open Biomedical
Ontologies (OBO) library contains many of such ontologies which are shared across
different domains. The list of OBO ontologies includes some ontologies which are in
protein related domains such as Protein Ontology (PRO) [6], Protein-protein interaction
[7], and Protein-modification [8], but there exists a gap as far as completely serving the
domain of protein kinases is concerned.
ProKinO is an effort to capture the basic and clinical research data about protein
kinases scattered in disparate and heterogeneous sources and integrate this data in the
form of formal and explicit conceptualization such as ontology. ProKinO has been
developed to capture the current state of knowledge on protein kinase sequence, structure,
function, motif and disease. ProkinO provides a controlled vocabulary of terms, their
hierarchy, and relationships among them in the Protein Kinase domain, unifying
information from heterogeneous resources to provide consistent representation of
sequence, structure, motif, function and mutation data. We have identified entities and
concepts related to the area of interest, i.e., protein kinases in the ProKinO ontology, and
then the relationships between these entities and concepts were identified. ProKinO has
been developed: (i) to formally specify the concepts and their relationships in the domain
of protein kinases to provide a sharable and consistent vocabulary, (ii) to integrate
4
sequence, structure, function, motif and disease information on protein kinases in a
machine readable format, (iii) to allow protein kinase researchers to navigate diverse
forms of data in one place, (iv) to annotate cancer genomes, mine protein kinase
literature, and to allow the development of other important applications focusing on data
exploration and inference in an efficient way.
1.3 Scope
There are many databases and sources available in the protein kinase and
associated biomedical domain and most of these use different terminology and formats to
serve the community. ProKinO automatically extracts information from diverse sources:
KinBase [9]; Catalogue of Somatic Mutations in Cancer (COSMIC) [10]; Protein Data
Bank (PDB) [11]; Protein Families database (Pfam) [12] and The Universal Protein
Resource (UniProt) [13] to populate it to serve as a compendium of specialized
knowledge about protein kinase domain. ProKinO will help to build software applications
based on the knowledge integrated inside it such as annotating the vast amounts of
sequence data generated from cancer genome sequencing studies and mining the wealth
of literature accumulated on protein kinases. This ontology can become the basis for text
mining the wealth of protein kinase literature by not only giving direct access to the facts
stated in the text but also uncovering the indirect relationship between the entities. The
system will go beyond simple keyword searching and provide the user capability to query
the literature with the knowledge of ontology in a way that the information extraction is
more efficient than simple keyword searching. For instance, the synonyms of protein
kinase genes captured in the ontology with object property hasOtherName (e.g., gene
5
AMPa1 is also known with other name PRKAA1) can make the information extraction
from literature related to this domain more pertinent. In the same way the ProKinO is
packed with this type of integrated knowledge which can be utilized for mining the
literature more efficiently. So this will be a text mining and advanced search system for
executing highly specialized queries created with the use of ProKinO and over the
electronically available publications in the area of Protein Kinases. We also plan to focus
on the creation of an automated cancer genome annotation system based on ProKinO. To
consistently and accurately annotate protein kinase mutations in upcoming cancer
genome sequencing studies, the integrated knowledge in ProKinO will be used. This will
make it possible to provide a consistent annotation for protein kinase mutations
discovered in cancer genomes and allow cancer researchers to prioritize mutations for
experimental studies.
6
CHAPTER 2
BACKGROUND AND LITERATURE SURVEY
This chapter describes the background information about the concepts that are
relevant with the design and development of biomedical ontologies. The initial part of the
chapter includes definitions and background information about the Semantic Web vision,
tools and technologies enabling Semantic Web and about ontologies and biomedical
knowledge management. The later part of the chapter focuses on presenting the related
work in the area of biomedical ontologies and their applications.
2.1 Semantic Web Vision
There has been a rapid progress of The World Wide Web since its inception
nearly 20 years ago. The World Wide Web, abbreviated as WWW and commonly known
as The Web, has grown exponentially in size after its coming into light and has changed
at an exponential rate even not imagined by the researchers working in this area [14]. As
of March 2009, the index-able web contains at least 27.08 billion pages [15], Google
Search had discovered one trillion unique URLs [16] and as of March 2010, over 116.9
million websites operated [17]. The Web has matured over the time to be a web of
documents that are readable to humans but difficult for computers to manipulate for
getting the meaningful information. The Semantic Web is a step forward in overcoming
this limitation of Web and of late the attempt is to provide computer processable
7
meanings to the Web. This will greatly enhance its capability to allow machines to use
the Web content. According to Tim Berners-Lee, the father of the Web, “The Semantic
Web is not a separate Web but an extension of the current one, in which information is
given well-defined meaning, better enabling computers and people to work in
cooperation” [18]. The Web has been envisioned as a universal medium for data,
information, and knowledge exchange by Tim Berners-Lee. To achieve success in the
Semantic Web, some fundamental changes in the structure of current Web must be
brought by giving semantics to the information by inducing concepts and the
relationships among them. Apart from the primary function of giving structure to the web
content, the Semantic Web has been striving for improving the task for data integration
across various Web applications and prompting partnerships.
There are many factors that can contribute to the success of this new approach of
expanding the current Web to give understanding to data. There has been a concentrated
and coordinated effort to make the Semantic Web a reality by different collaborative
working groups which follow certain design principles and use different enabling
technologies under the umbrella of the World Wide Consortium (W3C) [19]. One of
these collaborative groups is the W3C Semantic Web Activity Working Group [20]
which has been working on a number of standards in the Semantic Web area. Many
technologies have already made their place in the realization of Semantic Web and many
more are being added to the arsenal because of the larger interest in Semantic Web
research. Several technologies and tools which are being used by the Semantic Web
community are provided mostly by the open source community. In the literature the
8
major Semantic Web technologies have been visualized as a layered specification
popularly represented as a Semantic Web Layer Cake.
The various tools, technologies, standards and conventions which are basic
building blocks of the Semantic Web and their layered organization, are shown in the
Figure 1. The architecture has the Uniform Resource Identifier (URI) as the foundation of
the organization with the Extensible Markup Language (XML) on top of that. Further, the
Resource Description Framework (RDF) and RDF Schema, as well as query language
SPARQL and the Web Ontology Language (OWL) are placed in the middle of the cake.
Figure 1: The Semantic Web layered specification [21]
The upper layers consist of logic, proof and trust which are still being explored.
The digital signature (Cryptography) is relied upon by some layers to ensure security.
9
Some of the important components of the Semantic Web are briefly discussed in the
subsequent sections.
2.1.1 Resource Description Framework (RDF) and RDF Schema
The World Wide Consortium has developed Resource Description Framework
(RDF) as a simple metadata data model for describing and creating relationships among
resources. The Consortium had published a specification of RDF's data model as a W3C
Recommendation in 1999 and the new version was published as a set of related
specifications in 2004. In RDF the information is represented in a minimally restrictive
way and RDF's simplification offers greater sharing. RDF defines a resource as an object
and this object is uniquely identifiable by a Uniform Resource Identifier (URI) which is a
formatted string used for identifying abstract or physical resources. The design of RDF is
based on the goals of having a simple data model, an extensible URI-based vocabulary,
using an XML-based syntax and supporting the use of XML schema data types and
having formal semantics which provide a dependable basis for reasoning about the
meaning of the RDF expressions [22].
The fundamental structure of any expression in RDF is very simple and consists
of a collection of triples. Each triple in RDF is made of a subject, a predicate also called
property and an object. An RDF graph consists of set of such triples and the set of nodes
of an RDF graph is the set of subjects and objects of triples in the graph. Figure 2 depicts
a triple consisting of subject, a gene ACK represented as a URI, having a predicate
hasFunctionalDomain with a property value SH3_2.
H
<
<
S
R
re
cl
C
S
co
em
2
h
u
Here is a com
http://om.cs.u
http://om.cs.u
The R
emantic We
RDFS allows
estrictions an
lasses in the
Cancer class
chema can
onstraints c
mbraces man
.1.2 Web On
The W
as made its
sed to bui
mplete RDF t
uga.edu/.../AC
uga.edu /.../SH
Figur
Resource De
eb and can b
s resources to
nd extra rel
e RDF Schem
can be defin
inherit pro
can be appl
ny concepts
ntology Lang
Web Ontolog
place as a s
ild ontolog
triple:
CK><http://o
H3_2>.
re 2: A RDF
escription Fr
be regarded a
o be defined
ationships c
ma can be o
ned as a subc
operties fro
ied to them
from RDFS
guage (OWL
gy Languag
standard ont
gies that a
10
om.cs.uga.edu
F statement d
ramework S
as an extens
d with classe
can be defin
organized in
class of the D
om other pr
m. The mor
S and is discu
L)
ge (OWL), w
tology langu
are explicit
u/.../hasFunct
depicting a tr
Schema (RD
sible knowle
s, properties
ned, which is
a hierarchic
Disease clas
roperties an
re rich Sem
ussed below
which has b
uage of the S
representa
tionalDomain
riple graphic
DF-S) is a s
edge represen
s and values.
s not the ca
cal fashion.
ss. The prop
nd also dom
mantic Web
.
been recomm
Semantic W
ations of t
n>
cally.
step ahead in
ntation langu
With RDF-
ase for RDF
For example
erties in the
main and r
language O
mended by W
Web. OWL ca
terms and
n the
uage.
-S the
. The
e, the
RDF
range
OWL
W3C,
an be
their
in
re
re
O
to
su
ex
p
an
su
as
nterrelationsh
epresent ma
ecommendat
OIL were rev
o be known
ublanguage
xpressive po
Figu
OWL-
ower and it
nd simple co
upporting OW
RDF
s important
hips. The e
achine interp
tions of XM
vised to give
as the Web
groupings (
ower provide
ure 3: Web
-Lite: This i
supports the
onstraints. B
WL-Lite tha
Schema doe
modeling c
expressive p
pretable con
ML, RDF, an
richer know
Ontology La
(shown in F
ed by each g
Ontology La
s the sublan
e community
Because of it
an the other t
es not includ
capabilities.
11
power of OW
ntent on the
d RDF-S. T
wledge repres
anguage (OW
Figure 3) tha
group.
anguage (OW
nguage of OW
y of users w
ts lesser com
two sublang
de a number
OWL has
WL in term
Web is gre
The Web ont
sentation lan
WL). The O
at are differ
WL) and its
WL which i
who require o
mplexity, it
guages.
of features w
been design
ms of provid
eater as com
tology langu
nguage for th
WL languag
rentiated on
three sublan
s having the
only a simpl
is easier to
which are ge
ned to addr
ding faciliti
mpared to e
uages DAML
he Semantic
ge comes in
the basis o
nguages.
e least expre
le class hiera
provide tool
enerally rega
ess many o
es to
arlier
L and
Web
three
of the
essive
archy
ls for
arded
of the
12
shortcomings of RDF/RDF-S. For example, the equality claims about instances can be
made in OWL-Lite which is not possible in RDF Schema. For example,
• SameIndividual(instance, instance)
Equivalence between classes and properties can also be described by OWL-Lite which is
not directly available in RDF-S. For example,
• EquivalentClasses(ClassA, ClassB)
• EquivalentProperties(Property1, Property2)
Additional property modeling capabilities are available in Owl-Lite. For example,
one property can be inverse of the other property. A property can be transitive, symmetric
or functional. Functional property is a case of maximum cardinality of 1 for the value of
that property. For example, a person can have one biological mother so (ObjectProperty
hasBiologicalMother Functional). There can be many cardinality constraints that are
allowed in OWL-Lite but these are just special cases of more generalized constraints
allowed in the more expressive sublanguage, OWL-DL.
OWL-DL: This sublanguage of OWL has greater expressive power than OWL-
Lite and gets its name, DL, because of its use of Description Logic. This sublanguage of
OWL supports the user community which requires a highly expressive language that is at
the same time computationally complete and decidable. The computational completeness
means that all computations can be completed in finite time and decidable means that all
conclusions are computable.
OWL-DL further extends the advantages of OWL-Lite over RDF-S. There are
more powerful language constructs than those available in OWL-DL. The equivalence
axiom discussed in OWL-Lite can be stated more strongly with the disjoint construct. For
13
example, DisjointClasses(ClassA, ClassB) expresses that both classes are disjoint. OWL-
DL removes the restriction on cardinality constraints and we can provide general
cardinality constraints, with values allowed from 0 to n, as compared to only 0/1 values
allowed in OWL-Lite. So an instance of the Parent class can have a property
hasChildern, with cardinality allowed more than 1. With OWL-DL we can use arbitrary
algebraic expressions and it provides the capability of disjunction, conjunction and
negation. Some examples of these functionalities are shown here:
• ComplementOf(InstanceX, InstanceY) - Negation
• SubClassOf(ClassA UnionOf(ClassC,ClassD)) - Disjunction
• SubClassOf(ClassA IntersectionOf(ClassC,ClassD)) - Conjunction
OWL-Full: This sublanguage of OWL has the maximum expressive power among
all OWL sublanguages and is preferred by the user community who want maximum
syntactic freedom of RDF, although it may not be guaranteed for computational
completeness. It is not possible to do automated reasoning on OWL-Full ontologies.
RDF-S is tolerant in the sense that it allows a class to be instance of another class
and even a class to be instance of itself. OWL-Full is similar in providing this capability
that is otherwise not available in OWL-DL and OWL-Lite. However, the practical
realization of OWL-Full is not available to provide different conceptualizations defined
in it in terms of reasoning tools. A snapshot of ontology fragment in OWL is shown in
Figure 4.
Which sublanguage of OWL is to be used depends upon the requirements of the
user. If more expressive power is required, then OWL-DL may be preferable to OWL-
14
Lite. Owl-Full may be advisable when users require even more meta-modeling
capabilities.
<owl:Thing rdf:about="#CDK11-G-helix">
<rdf:type rdf:resource="#SubDomainX"/>
<rdfs:label xml:lang="en"
>CDK11 G-helix subdomain</rdfs:label>
<hasEndLocation>236</hasEndLocation>
<hasSubDomainSequence>CIFAELLTSEPIFHC</hasSubDomainSequence>
<hasStartLocation>222</hasStartLocation>
<rdfs:comment xml:lang="en"
>This is a Protein Kinase Gene CDK11 's sub domain part G-helix</rdfs:comment>
</owl:Thing>
Figure 4: An ontology fragment in OWL (An excerpt from ProKinO OWL file).
2.1.3 SPARQL (SPARQL Protocol and RDF Query Language)
SPARQL pronounced as “Sparkle” is a query language for the Semantic Web
which is used to query sets of RDF graphs. We can also use it to query with OWL, e.g.
using ARQ. ARQ is a query engine for Jena that supports the SPARQL RDF Query
language. SPARQL is specification provided by the working group of World Wide Web
Consortium (W3C) known as RDF Data Access Working Group (DAWG). At the start of
the year 2008, SPARQL became an official recommendation of the W3C as a standard
ontology query language. SPARQL allows one to fetch values from structured and semi-
15
structured data, the capability to discover data by querying unknown relationships and
execute composite joins of different databases in a query [23].
The Structured Query Language (SQL) is used to query data from a relational
database and SPARQL resembles SQL to a degree. Although a triple store and a
relational database are basically different, SPARQL has been made similar to SQL to
facilitate the developers familiar with SQL. The relations in the form of tables are the
basis of storing data in relational databases and foreign keys are used to relate the data. In
the case of RDF, URIs are used to link to any other data in any triple store.
SPARQL queries are usually run on endpoints which take queries as input and
produce results in different formats, such as XML, RDF, and HTML. A SPARQL
endpoint is a conformant SPARQL protocol service that enables users (human or other)
to query a knowledge base via the SPARQL language and having results returned in one
or more machine-processable formats [24]. The endpoints for SPARQL are usually
classified as Generic endpoints which can query any Web-accessible RDF data (for
example, dataset from FOAF or from our ProKinO) and Specific endpoints that are meant
for particular datasets (for example, Virtuoso’s DBPedia SPARQL Endpoint is an
endpoint available specifically and tailor-made only for a dataset of DBPedia). Generic
endpoint examples are:
• sparql.org (by HP's ARQ)
• OpenLink's Virtuoso
• Redland's Rasqal.
SPARQL queries are called Graph Patterns because to get data extracted from the
triple store using SPARQL, we define a pattern that matches the statements in the graph.
A
sn
ca
A SPARQL q
napshot depi
Figure 5: A
This q
ancers that g
PRE
SELE WHE
query input
icted in Figu
A view of Op
query in the
gene ABL1 is
EFIX prokino:<
ECT ?Disease
ERE { prokino
}
in Open Lin
ure 5.
pen Link’s V
SPARQL la
s associated
<http://om.cs.u
o:ABL1 prokino
16
nk’s Virtuos
Virtuoso Gen
anguage show
with in ProK
uga.edu/....../Pr
o:associatedW
so, a Generi
neric Endpo
wn below is
KinO:
rokino/>
With ?Disease.
ic Endpoint,
int (A query
used to find
is shown in
y for ProKinO
d all types of
n this
O).
f
17
2.2 Ontologies
The term ontology in Computer Science has its origin lying in a branch of
Philosophy called Metaphysics where it is defined as a systematic study of existence [25].
One of the simplest and frequently used definitions of Ontology is given by Thomas
Gruber in his extensively cited paper "Toward Principles for the Design of Ontologies
Used for Knowledge Sharing". He defines ontology as an explicit specification of a
conceptualization [26]. World Wide Web Consortium (W3C) has defined ontology in the
following words: “An ontology defines the terms used to describe and represent an area
of knowledge. Ontologies are used by people, databases, and applications that need to
share domain information (….) Ontologies include computer-usable definitions of basic
concepts in the domain and the relationships among them (…). They encode knowledge in
a domain and also knowledge that spans domains. Ontologies are considered to make
that knowledge reusable” [27].
The reasons for building ontologies in a particular domain are to make common
understanding of structure sharable among different stakeholders in the domain, to enable
reuse of the domain knowledge, to explicitly describe the domain, delineate the domain
knowledge from operational commitment and to explore the domain knowledge [28]. The
ontologies are used as a tool of data integration and knowledge representation in many
disciplines such as the Semantic Web, Artificial Intelligence, Natural Language
Processing, Information Retrieval, Bio-informatics, Software Engineering, Education,
and so forth.
Building ontology is not a trivial task, especially if the ontology has to be truly
representative of the domain of discourse. The task requires relevant data to be extracted,
18
formalized and then integrated into the ontology. Furthermore, the ontology has to be
accepted by the user community [29]. Ontologies are made up of classes, properties and
individuals to make up a foundation for representing the knowledge.
• Classes in the ontologies represent the main concepts in the field of interest. The
example of a class in a biomedical ontology can be a Mutation class which captures
various mutations in genes as a concept and the specific mutations become the
instances of the Mutation class. There can be a hierarchy of further subclasses under
parent classes for more specialized representation of the domain of discourse, which
is usually described as a taxonomic hierarchy.
• The concepts in themselves are not able to fully describe the domain and we need
more information to describe the internal organization of concepts. The properties in
ontologies are used to capture the various features and characteristics of the concepts
represented in the form of classes. The properties can be used for further
characterizing the instances of classes or showing the relationships between instances
of various classes. For example, the Mutation class can have a property
hasPrimarySite to show that a particular mutation has a primary site of cancer
occurrence, such as “skin” and at the same time can have a property foundIn to show
its relationship with an instance of another class, Gene.
• Individuals are used to represent elements or entities in an ontology. Defining an
individual instance of a class requires selecting a class, forming an individual instance
of that class, and loading the instance values. For example, we can create an
individual instance carcinoma to represent a specific type of Cancer.
19
Ontologies that are built using OWL-Lite or OWL-DL can be processed by a reasoner
which is also known as classifier. The main functionality provided by the reasoner is to
check whether or not one class is a subclass of another class. These checks can be useful
in determining the inferred ontology class hierarchy. These classifiers can also be used
for checking the consistency of the ontology. Depending upon the conditions described
for a class, the classifier can check whether or not it is feasible for the class to have any
instances. If it is not possible, the class is considered inconsistent by reasoner.
The ontologies are classified into various categories by the stakeholders using
different criteria. The classification can vary in scope and description, level of hierarchy
and the level of formalism. An upper ontology, also known as a top-level ontology,
foundation ontology or generic ontology, is an ontology which describes very general
concepts and that can be applicable to several domains. This type of ontology is created
to support very broad semantic interoperability between a large numbers of ontologies
accessible under it. The Suggested Upper Merged Ontology (SUMO) [30] and its domain
ontologies form the largest formal public ontology that are being used for research and
applications in search, linguistics and reasoning. Basic Formal Ontology (BFO) [31]
grows out of a philosophical orientation and is focused on the task of providing a genuine
upper ontology which can be used in support of domain ontologies developed for
scientific research, as for example in biomedicine within the framework of the OBO
Foundry. IDEAS [32] is a formal, higher-order, 4D (four dimensionalism) upper ontology
which is extensional (see Extension (metaphysics)), using physical existence as its
criterion for identity. This ontology is well suited to managing change over time and
identifying elements with a degree of precision that is not possible using names alone.
20
Domain ontologies are developed to represent the knowledge about specific domains like
biomedical ontologies. Protein ontology (PRO) is an example of the domain ontology.
The ontology can also be a Task ontology used to accomplish certain tasks. Application
ontologies are used to extend specific applications.
There are various criterions used for building ontologies but all culminate into
mainly two broad approaches. One approach is to reuse the existing ontologies available
in the public domain to share already captured knowledge using variety of techniques.
One technique of reusing ontologies is to include one ontology into the other in the form
of a combined ontology while the other method is to refine a generic ontology to fulfill
the specific needs of the new ontology. So, if possible, the approach is always to consider
reusing the existing relevant ontology instead of duplicating the effort and increasing the
complexity for user community. However, if this is not the case, the best approach is to
build ontology from scratch. In our case as ontology was built to serve a specific domain
of protein kinases that is not covered by any of the already existing ontologies, ProKinO
followed the approach of building ontology by acquiring the data from the disparate
sources and representing the integrated knowledge in it.
Different ontology design principles have evolved over time and are
conventionally followed by developers to produce good quality ontologies. As ontologies
are shared within the domain communities as well as outside, these design principles, if
followed, result in more acceptable and useful ontologies. The ontologies should be
collaborative, extensible, portable, user friendly, modifiable, and well documented.
21
2.3 Biomedical Knowledge Management
Many knowledge experts make a distinction among data, information and
knowledge in the literature. The data alone may not carry much significance, but when
the same data is given some conceptual context and processed in some meaningful way to
provide useful results, it is termed as information. The knowledge is what we achieve
when some understanding is given to the information. Biomedical scientific data is
available in abundance but to manage the large amount of data, we have to apply the
process of biomedical knowledge management. The Knowledge Management is defined
as a method of gathering, processing, organizing, storing and analyzing information about
a particular domain and then further disseminating this knowledge in the form of useful
applications.
The integration of the data and representation of the data as knowledge is an
essential part of the process of Knowledge Management. Data about a particular domain
may be scattered across the different datasets and knowledge could be inherent in the data
integration and further analysis of that data. For example, data about a certain mutation
related to a gene may be available but extracting the knowledge about the sequence (that
may be present in gene’s sequence data) in which this mutation occurs, may require
processing and inference to produce this piece of knowledge.
Today, computers play a far greater role in acquiring, processing, organizing,
integrating, storing and distributing the knowledge than ever before. The data integration
from different sources is not an easy task as there can be disparity in the definitions of
schema of these sources as well as in the naming conventions [33]. The tools for
managing biomedical knowledge can be based on technologies used in the Semantic Web
22
that work on the mission of dealing with heterogeneous resources. Ontologies, one of
these technologies and also seen as a central component of the Semantic Web, play a
central role in integrating knowledge. The ontologies support data integration by ways of
Warehousing and Mediation approaches of integrating data [34]. In the Warehousing
approach to data integration a common vocabulary is constructed by standardizing the
data integration from different sources with common format transformation, whereas in
Mediation approach a global schema is designed and mapped to local schemas of sources
[35]. How knowledge is represented also plays a significant role in managing the
knowledge. One of the more popular approaches of representing knowledge is to use
ontologies. Biomedical ontologies are controlled vocabularies for shared use across
different biological and medical domains and can be seen as the tool of data integration
and knowledge representation for managing biomedical knowledge. ProKinO is a
specialized biomedical ontology in this effort, which manages protein kinase domain
knowledge by integrating data about the domain, as well as representing the knowledge
created from this integration of data.
2.4 Biomedical Ontologies
The biomedical ontologies are seen as an answer to the challenges in seamless
integration of data from heterogeneous sources using different schema and naming
methods. They are a very important tool of data integration of basic and clinical research
and representing the knowledge benefitting biomedical communities. The effort that was
initiated by the development of small ontologies using some rudimentary tools by experts
23
in restricted domains of interest has now advanced into a scenario in which large popular
ontologies are being built by groups of people from various disciplines.
Many endeavors have taken place in the field of biomedical ontologies and
National Center for Biomedical Ontology (NCBO) [36] is one of the efforts dedicated to
this cause. NCBO was formed in 2006 and was funded by National Institutes of Health
(NIH) [37] and a component of National Centers for Biomedical Computing. It has been
described as a consortium of world-class researchers committed to developing technology
and infrastructure that boost biomedical research. The stated vision of NCBO says that all
biomedical knowledge and data are disseminated on the Internet using principled
ontologies in such a way that the knowledge and data are semantically interoperable and
useful for furthering biomedical science and clinical care. They also define their mission
as to create software and support services for the application of principled ontologies in
biomedical science and clinical care, ranging from tools for application developers to
software for end-users. NCBO is involved in the development of ontologies from the
OBO family. Other ontology related research is being conducted both in Europe and the
US. The National Center for Ontological Research (NCOR) was established by The
University at Buffalo and the Stanford University by partnering with a number of
institutions drawn from academia, government, and industry and it is working for the
advancement of ontological investigation and focusing on the development of tools and
procedures for quality assurance of ontologies [38]. Also in place is The European Center
for Ontological Research (ECOR) that was founded at the Saarland University in
Saarbrücken, Germany (in 2004), and represents a new approach in applying ontology to
a variety of problems [39].
24
2.5 Biomedical Ontologies Applications
Biomedical ontologies are becoming increasingly popular among the biomedical
community with every passing day. Biomedical ontologies are being developed in all
possible types of ontologies including very generic ontologies capturing common high
level concepts, mid-level biomedical ontologies, domain-specific ontologies, and task
specific ontologies. We can find examples of all these types of ontologies in the form of
well established biomedical ontologies available in the public domain. There have been
many collaborative efforts in the field of biomedical ontologies going on to serve the
community. The Open Biomedical Ontologies Foundry collaborative effort and one of its
most significant ontology, the Gene Ontology (GO), are discussed below:
2.5.1 Open Biomedical Ontologies (OBO) Foundry:
The OBO foundry is an open, inclusive and collaborative platform provided to all
those biological researchers and biomedical domain community members who are
involved in the design, development and publishing of ontologies serving the domain.
The goal of OBO is to develop a set of interoperable reference ontologies, validated by
humans, for all major domains of biomedical research [40]. In 2001, an umbrella body
named Open Biology Ontologies Foundry was formed by Ashburner and Lewis to serve
the community of developers and publishers of life-science ontologies. This started as a
joint effort in which a group of OBO ontology developers agreed in advance to embrace
an emerging set of principles spelling out best practices in the development of biomedical
ontologies. These principles require that the ontologies (i) are open and available to be
used by all without any constraint, (ii) can be expressed in a common formal language,
25
(iii) use relations which are unambiguously defined, (iv) provide procedures for user
feedback and for identifying successive versions, and (v) are developed in a
collaborative effort [41].
As of this writing, OBO foundry contains 8 ontologies under the “OBO foundry
candidate ontologies” category and 86 under the “Other ontologies and terminologies of
interest” category and the list continues to grow. OBO is supported by the National
Centre for Biomedical Ontology (NCBO) through its BioPortal [42]. These ontologies are
becoming widely accepted as a reference in the biomedical domain. There is a continuous
effort from the OBO foundry to improve effectiveness and transparency, and facilitate
interactions among ontology communities in order to help in providing semantic
integration of biological data from the multiplicity of sources generating data and in
promoting reasoning for information extraction from these data.
2.5.2 NCBO BioPortal
Recently, a very useful resource has been made available in the public domain for
accessing biomedical ontologies via Web services and Web browsers for ontologies
developed in OWL, RDF, OBO format and Protégé frames. This tool is provided by
BioPortal (http:// bioportal.bioontology.org) [43], funded by National Institutes of Health
[37], that facilitates the browsing, searching and visualizing ontologies and acts as a
repository of biomedical ontologies. Many Web 2.0 features have been incorporated in it
to make the system behave as a comprehensive ontology repository. The resource is very
useful in the context because it also provides a web interface to support community-based
participation which can play a greater role in evolution and evaluation of ontologies.
26
BioPortal also provides features for the community to add notes to ontology terms,
mappings between terms and ontology reviews based on criteria such as usability,
domain coverage, quality of content, and documentation and support. BioPortal is
described as ‘one-stop shopping’ that helps not only to programmatically access
biomedical ontologies, but also for providing support to integrate data from a variety of
biomedical resources.
BioPortal contains a large number of ontologies in its repository and the list is
growing rapidly. In March 2008, the repository included 72 ontologies (300, 000 total
classes) which almost doubled within one year to 134 ontologies (680, 000 total classes).
That number has further grown to include 193 ontologies (as April 2010). BioPortal
contains ontologies on a number of subjects, such as anatomy, phenotype, experimental
conditions, imaging, chemistry, health and many more. Metadata collected for each
ontology include keyword terms, text descriptions, version information, release date,
ontology author contact information and links to documentation and the ontology content
and metadata can be updated automatically or by user submission [42]. In addition
BioPortal provides Web services and Web interface from which prior ontology versions
can be accessed and downloaded.
2.5.3 Gene Ontology (GO)
The Gene ontology (GO), is an immensely popular ontology in the biomedical
field. The Gene Ontology project [44] describes its project as a major bioinformatics
initiative with the aim of standardizing the representation of gene and gene product
attributes across species and databases. The GO ontology is used to capture the
kn
it
co
2
by
ph
fu
ce
ab
p
m
tr
nowledge ab
t addresses th
Our e
onsidered as
.6 Informat
Protei
y chemical
hosphorylati
unctional ch
ellular locat
bout 518 pro
ercentage of
majority of c
ransduction.
bout biologi
he need for c
endeavor of
s a bio-ontolo
tion Sources
in Kinase: A
lly adding
ion, as dep
hange of the
tion, or asso
otein kinase
f human pro
cellular path
Signal trans
Figure 6: Pr
cal processe
consistent de
f developin
ogy applicat
s in Protein
A “protein ki
phosphate
picted in Fi
e target pro
ociation with
genes and th
oteins may b
hways may
sduction resu
rotein kinase
27
es, cellular c
escriptions o
g a special
tion serving
Kinase Dom
inase” is a ki
groups to
gure 6. T
otein (substr
h other prot
hey constitut
e modified b
be regulate
ults in a chan
es phosphory
components
of gene produ
lized ontolo
protein kina
main
inase enzym
o them an
he effect o
rate) by cha
teins [45]. T
te nearly 2%
by kinase ac
ed, especial
nge in the ce
ylation (adap
and molecu
ucts in differ
ogy, ProKin
ase and relate
me that chang
nd this pro
of this proc
anging the
The human g
% of all huma
ctivity (up to
lly those inv
ell behavior.
pted from [4
ular function
rent databas
nO, also ca
ed communi
ges other pro
ocess is c
ess is usua
enzyme act
genome con
an genes. A
o 30% of all
volved in s
46]).
s and
es.
an be
ities.
oteins
called
ally a
tivity,
ntains
large
l) and
signal
28
Highly regulated activity of protein kinases has big effects on a cell. By
controlling protein kinases’s location in the cell relative to their substrates, or by binding
of activator proteins or inhibitor proteins, or small molecules, phosphorylation may turn
them on or off. Deregulated kinase activity is a frequent cause of disease, particularly
cancer, where kinases regulate many aspects that control cell growth, movement and
death. Drugs which inhibit specific kinases are being developed to treat several diseases,
and some are currently in clinical use, including Gleevec (imatinib) and Iressa (gefitinib)
[47] .
In the public domain, there are many database resources available containing
information related to the protein kinase domain. Our ontology, ProKinO, is populated
with data extracted from a number of different publicly available biological database
resources. As mentioned earlier, these sources in the protein kinase community include:
KinBase, Catalogue of Somatic Mutations in Cancer (COSMIC), Protein data Bank
(PDB), protein families database (Pfam), and The Universal Protein Resource (UniProt).
These sources contain a large amount of information about the domain of discourse but
they are stored in different formats and are using different concept specifications.
KinBase: This is an interactive kinase database available at Kinase.com, which
made an effort to provide a platform for a broad analysis of protein phosphorylation in
normal and disease state. The KinBase has defined all kinases into a hierarchy of
classification divided into protein kinase groups, families and subfamilies. This
classification is helpful in evaluating the kinase function and growth by comparison of
related kinases.
29
COSMIC: The Catalogue of Somatic Mutations in Cancer (COSMIC) is a
database dedicated to hosting the information about the somatic acquired mutations
relating to human cancers. A huge amount of information about mutations is available in
the published scientific literature and COSMIC is a project to combine together
information about publications, samples and mutations in one place.
Pfam: The Protein Family Database (Pfam) is a comprehensive database for
conserved protein families, extensively used by the researchers in the biomedical domain
consisting of collection of multiple sequence alignments and profile hidden Markov
models (HMMs). There are many proteins available in nature as domains mix in different
ways to give a wide range of results. The function of proteins can be better understood
with the identification of domains within these proteins and Pfam HMM is regarded as a
great source in this regard. Each Pfam HMM embodies a protein family or domain.
UniProt: The Universal Protein Resource (Uniprot) is considered as a
comprehensive catalogue of protein sequences and functional annotations. UniProt is a
freely and easily accessible global resource for storing and interconnecting information
from voluminous and disparate sources. The researchers and scientists are greatly helped
by UniProt in conducting interactive and custom-tailored analyses of proteins of interest.
PDB: The Protein Data Bank (PDB) acts as a comprehensive source of three-
dimensional structures of macromolecular complexes of proteins, nucleic acids, and other
biological molecules. The knowledge about the shape of molecules provided by PDB can
be used to understand a structure's role in human health, disease and in drug
development.
30
The information is extracted from these diverse sources, and then populated into
ProKinO. This includes the protein kinase genes information about their classification in
groups, families and subfamilies, their sequences, FastaFormat, chromosomal position
and the mutation information about the mutations associated with protein kinase genes,
mutation primary sites, primary histology, mutation amino acid, mutation description,
mutation residue, and cancer types. The information about the functional features, such as
the modified residue, signal peptide, topological domain, cellular location, tissue
specificity as well as about the crystal structures and the identification and classification
of functional domains in protein kinase sequences is also captured and represented in
ProKinO.
2.7 Challenges in Integrating Protein Kinase Knowledge
The biological researchers and scientists use specific theories and models for the
data related to their work in the domain of interest and often end up applying different
descriptions of the same data [48]. There has been an exponential growth in the research
and clinical data resulting in many isolated repositories of knowledge designed,
developed and maintained by separate groups working in the same or related area of
interest. Whenever a need makes obligatory for a user to access the information from
these disparate sources, many challenges are posed for utilizing knowledge stored in
them. The researcher has to make customized arrangements for fetching data from
required sources and integrating for the current need fulfillment and later on any new
query will require going through the adaption again. The protein Kinase knowledge is
al
so
as
se
fu
fe
d
U
d
re
lso not avai
ources often
For e
ssociated w
equence of
unctional fea
etching infor
atabase for
UniProt for f
ifferent form
equire custom
ilable from
n in heterogen
example, if
with a partic
the gene in
ature (for ex
rmation from
gene and s
functional fe
mats that m
mized proce
Figure 7: C
a single uni
neous data f
we have a
cular mutati
n question i
xample, tissu
m the COSM
sequence da
eatures. The
may be using
essing for ans
Challenges in
31
ified source,
formats.
a complex q
ion and als
is needed al
ue specificit
MIC databa
ata, from th
e fetched inf
g different
swering the
n integrating
, and has to
query to fin
o informati
long with th
ty) of this g
se for muta
he Pfam for
formation fr
nomenclatur
query, as sh
g protein kin
o be collecte
nd the prot
ion about th
he functiona
gene. This qu
ation data, fr
functional
rom these so
re for same
hown in Figu
nase knowled
ed from diff
tein kinase
he structure
al domain a
uery will re
rom the Kin
domain and
ources will b
e entity and
ure 7.
dge
ferent
gene
e and
and a
equire
nBase
d the
be in
d will
32
We can very well imagine the challenges considering the voluminous data and
numerous queries that may have to be handled. We will see in later sections how a well
populated ProKino, containing knowledge from various sources, can be helpful in dealing
with these kinds of challenges.
33
CHAPTER 3
DESIGN OF PROKINO
In this chapter we discuss the design of ProKinO, starting with looking at the
heterogeneous knowledge sources used in ProKinO development such as Kinbase,
COSMIC, Protein data Bank, Pfam and UniProt and then moving to the ProKinO design
discussing ProKinO classes and properties. Finally, we describe the overall architecture
of ProKinO, describing all its components at the end of this chapter.
3.1 Heterogeneous Knowledge Sources
The ProKinO ontology is an attempt to provide a common vocabulary of concepts
and relationships between those concepts about protein kinase domain. Our goal has been
to make ProKino a unified compendium of knowledge, captured from disparate
knowledge sources related to the domain of protein kinase. Here we discuss briefly the
protein kinases and the sources of protein kinase related knowledge, as well as the
information they contain.
3.1.1 KinBase
One of the well recognized sources in the domain of the protein kinases is an
interactive kinase database KinBase from Kinase.com. Kinase.com is produced and
managed by Gerard Manning's lab at the Salk Institute in California. Sucha Sudarsanam,
34
of the biotech company Sugen developed Kinase.com in 1999 to support the publication
of Sugen's analysis of the protein kinases. The site was designed and enlarged with
several bioinformatics tools to use the data by Jonathan Bingham of Sugen. The site was
further developed in 2002-2003 to support KinBase, an interactive kinase database. The
system was shaped and administered by Glen Charydczak and was designed by Gerard
Manning [49].
There has been a non-redundant set of 518 human protein kinase genes identified
based on a comprehensive approach using human genome analysis in KinBase. This
collection in KinBase is made up of published human genome sequences as well as of
other sequence databases and also including directed cloning and sequencing of
individual genes. The set of protein kinase genes in KinBase includes most human
members of the eukaryotic protein kinase super family, and many atypical kinases and
almost all human protein phosphorylation [50]. A popular poster of human kinome by
KinBase is shown in Figure 8.
Figure 8: Human kinome poster (source: [51]).
35
The collection of eukaryotic protein kinases is categorized into ten groups. The
classification of protein kinases into groups by KinBase is depicted in Figure 9.
Figure 9: The Protein kinases classification in groups by KinBase (Atypical and Other groups not shown) [50].
The ten groups are classified in KinBase as:
• AGC group (including cyclic-nucleotide and calcium-phospholipid-dependent
kinases, ribosomal S6-phosphorylating kinases, G protein- coupled kinases and close
relatives of these kinases)
• Atypical group
• CAMKs (calmodulin-regulated kinases)
• CK1 group (casein kinase 1 and close relatives)
• CMGC group (including cyclin-dependent kinases, mitogen-activated protein kinases,
CDK-like kinases and glycogen synthase kinase)
• RGC group (receptor gyanylate cyclase kinases)
• STE group (MAP Kinase cascade kinases),
• Tyrosine kinase group (TKs) and
36
• TKL group (Tyrosine kinase like family) - which are a cluster of serine-threonine
kinases resembling TKs.
• Other group: Another extensive, miscellaneous group called 'Other' is also considered
for those proteins that do not fit in any of the predefined sets.
These protein kinase groups are further subdivided into families and sub families in KinBase. This distribution into groups, families and subfamilies is shown in Table 1.
Table 1: Kinase distribution by major groups in human systems (Source & Adaption:[50])
Groups Families Sub Families Human Kinases
AGC 14 21 63
CAMK 17 33 74
CK1 3 5 12
CMGC 8 24 61
Other 37 39 83
STE 3 13 47
Tyrosine Kinase 30 30 90
Tyrosine Kinase-Like 7 13 43
RGC 1 1 5
Atypical-PDHK 1 1 5
Atypical-Alpha 1 2 6
Atypical-RIO 1 3 3
Atypical-A6 1 1 2
Atypical-Other 7 7 9
Atypical-ABC1 1 1 5
Atypical-BRD 1 1 4
Atypical-PIKK 1 6 6
Total 134 201 518
The sequences of all these protein kinase genes in KinBase database are also
available from kinase.com [52]. The data is extracted from this well recognized and the
fundamental source of information on protein kinases.
37
3.1.2 COSMIC (Catalogue of Somatic Mutations in Cancer)
Catalogue of Somatic Mutations in Cancer abbreviated as COSMIC is the largest
resource for information about the somatic acquired mutations associated with human
cancer used by the biomedical community and available freely in the public domain [53].
Mutations are changes in a DNA sequence caused by transcription inaccuracies in genetic
material, induced cellular processes by the organism itself, damage due to mutagenic
chemicals or viruses or radiations, or errors during meiosis. The mutations are mapped to
a single version of each gene sequence in COSMIC and somatic mutation frequencies are
also made available.
After being launched in 2004, there has been a gigantic increase in the size of the
COSMIC database. Presently (v43, August 2009), COSMIC is said to be holding the
details of 1.5-million experiments performed through 13423 genes in almost 370000
tumours, describing over 90,000 individual mutations. The main two sources of data
inclusion in COSMIC are the publications in the scientific literature and the output from
the large analyses from the Cancer Genome Project (CGP) at the Wellcome Trust Sanger
Institute, UK [54]. This has been stated that most of the world’s literature on the genes
known to have tumor-promoting point mutations is curated in COSMIC [54]. The
curation of fusion events (when two genes fuse together) between multiple genes, which
are also common occurrences in cancer, is also continually updated.
One of the important studies in cancer research is finding the mutated genes that
are implicated in cancer. The COSMIC database is integrated into ProKinO for capturing
the mutation knowledge about genes related to protein kinase so that we can relate
otherwise heterogeneous information to be available in a single integrated representation.
38
3.1.3 Pfam (Protein Families Database)
Pfam, a wide-ranging database for conserved protein families, was originated in
1998 and since then has been growing with deposition of new protein sequences. Pfam
which is a collection of multiple sequence alignments and profile hidden Markov models
(HMMs) is publically available via servers in the UK [55], Sweden [56] and USA [57].
Pfam’s latest release contains a collection of nearly 12000 families each
represented by multiple sequence alignments and hidden Markov models. There are
domains which are functional regions that form the proteins. Entries in Pfam can be
families which are sets of related proteins; domains which are structural units found in
many protein frameworks; repeats, which are units forming a steady structure when many
copies are there but unsteady in isolation, and motifs which can be a short unit found
outside globular domains. A clan is a higher level grouping of related families, produced
by Pfam, that has come up from a single evolutionary source. The resemblance in tertiary
structures, or, in common sequence motifs when structures are not available give their
evolutionary substantiation.
Pfam families can be categorized in two levels of quality: Pfam-A and Pfam-B.
Entries coming up from the core sequence database, known as Pfamseq, which is built
from the most recent release of UniProtKB [58] at a specific time, are termed as Pfam-A
entries. To cover more known proteins alongwith Pfam-A families, Pfam generates Pfam-
B families using the Automatic Domain Decomposition Algorithm (ADDA) database
[59]. Pfam-B families have no associated annotation or literature reference and are of
much lower quality than Pfam-A families, as their alignments have not been manually
checked by a Pfam curator. If we perform a search on protein kinases against the Pfam
39
library of HMMs, we can find its domain architecture (for a protein shown in Figure 10)
i.e., find which domains it carries.
Figure 10: Pfam display of protein domain architecture (source: [57]).
The data is extracted from this source about identification and classification of functional
domains in protein kinase sequences to be captured in ProKinO.
3.1.4 Protein Data Bank (PDB)
The Protein Data Bank (PDB) contained 7 structures to start with when it was
founded in 1971, at the Brookhaven National Laboratory. After that, PDB has grown by
large proportions to become a large repository for three-dimensional structures of
macromolecular complexes of proteins, nucleic acids, and other biological molecules.
The knowledge about the shape of the molecules of life is useful in understanding their
working. This knowledge can go a long way in deducing a structure's role in disease and
developing the drugs for these diseases.
There was another step in this connection when The Worldwide Protein Data
Bank (wwPDB) [60] was formed in 2003 to keep a single PDB archive of
macromolecular structural data which is freely and openly available to the global
40
community. There are many participating organizations of the wwPDB. These
organizations are basically the centers that collect, process and distribute the PDB data.
wwPDB was joined by The Biological Magnetic Resonance Data Bank (BMRB) [61] in
the year 2006. Many utilities are extended by wwPDB with providing databases and
websites that can be helpful for obtaining different views and analysis of the structural
data contained within the PDB archive.
PDB has many tools and resources available for searching based on annotations
relating to sequence, structure and function and then using the results further for analysis
or visualization. The curation and annotation of PDB data is done as per the agreed
principles by the PDB that supports a website to perform simple and complex queries on
the data, analyze, and visualize the results. The crystal structure of a protein kinase gene
product EGFR (whose PDBId is given by the property hasStructure in ProKinO) is
shown in Figure 11.
Figure 11: Crystal structure of protein kinase gene product EGFR (ProKinO PDBId:
1IVO, source: [62]).
41
Information is extracted from the Protein Data Bank to be included in ProKinO
about the crystal structures related to protein kinases and PDB database references for
that structure.
3.1.5 The Universal Protein Resource (UniPort)
There has been tremendous growth of the proteomics data and also soaring
production of genome sequencing data in the recent past. This has resulted in a huge
wealth of protein sequences and associated data for a large number of organisms. There
was an observed need for a kind of storehouse that can serve as a global collection of
protein sequences with all-inclusive coverage and a methodical approach to protein
annotation, incorporation, integration and standardization of data from the various
sources and UniProt was an initiative in this direction [63].
The Universal Protein Resource (UniProt) is supported by the UniProt
Consortium, a collaboration between the European Bioinformatics Institute (EBI), the
Swiss Institute of Bioinformatics (SIB) and the Protein Information Resource (PIR).
UniProt is mainly supported by the National Institutes of Health (NIH). The mission of
UniProt is to offer the scientific community a wide-ranging, high quality and freely
accessible resource of protein sequence and functional information.
The UniProt databases are the UniProt Knowledgebase (UniProtKB), the UniProt
Archive (UniParc), The UniProt Metagenomic and Environmental Sequences (UniMES),
and the UniProt Reference Clusters (UniRef). UniProt Knowledgebase (UniProtKB) is
the database for broad curated protein information, including function, classification, and
cross-reference. UniProtKB is further made up of UniProtKB/Swiss-Prot and
42
UniProtKB/TrEMBL. The first one is manually annotated and is reviewed and second
one is automatically annotated and not reviewed. Sequences and their identifiers are
stored in the UniProt Archive (UniParc). The UniProt Reference Clusters (UniRef)
databases help to obtain complete coverage of sequence space at several resolutions while
hiding redundant sequences by presenting clustered sets made of sequences from the
UniProtKB and selected UniProt Archive records [64]. The UniProt Metagenomic and
Environmental Sequences (UniMES) database is for metagenomic and environmental
data. These databases are shown in Figure 12.
Figure 12: The Universal Protein Resource (UniProt) databases [65].
Significantly, the UniProt consortium in 2008 completed the first draft of the
complete human proteome in UniProtKB/Swiss-Prot, thereby providing manually
annotated representation of all presently known human protein-coding genes in UniProt
release 14.0 with 20325 entries [63].
43
Information extracted from UniProt and represented in ProKinO includes
functional features, such as the modified residue, signal peptide, topological domain,
cellular location, tissue specificity, organism, and database cross references for
Wikipedia, PubMed, MEDLINE and UniProt.
3.2 ProKinO Design
Ontology development is a complex process involving a series of systematic and
well organized steps in the process. This is not merely a technological process but
requires a deep understanding of the area about which the knowledge is being represented
in the ontology. The ProKinO realization includes following certain principles for its
design and development. One important aspect of ontology design is to keep in mind the
interests of the intended user community and that the design must incorporate the needs
of user in its final requirements.
In the case of ProKinO, the golden rule of reusing the existing ontologies in the
public domain for specialization or generalization was considered. However, we
recognized the need for a knowledge repository serving the specific domain of protein
kinase due to their important role in diseases like cancer. The domain expertise was
provided by the Evolutionary Systems Biology Group (ESBG) in the Biochemistry and
Molecular Biology department at The University of Georgia that is researching
extensively in the area of protein kinases, with the help of the protein kinase community.
On the other hand, the technical facilitation was extended by the Large Scale Distribution
Information Systems Lab (LSDIS) in Computer Science department at The University of
Georgia. The specification and conceptualization of ProKinO was finalized after a long
44
deliberation period during which ProKinO has undergone numerous revisions. In the
subsequent sections, we discuss the current specification of ProKinO design. However,
the ontology may be revised in the future, as the requirements of user community may
evolve.
3.2.1 ProKinO Classes
Ontologies are made of components namely classes (slots), properties
(relationships) and individuals (members or instances) to fully capture the knowledge
about a particular domain. The steps in ontology construction include developing the
class hierarchy and defining a number of properties and relationships characterizing the
classes. The main concepts in ProKinO have been defined as major classes and are
shown in Figure 13.
The classes defined in ProKinO are used to represent the sequence, structure,
function, sub-domain, and mutation data. We used a combination of top-down and
bottom-up approaches to first define some main concepts and then specialize or
generalize them, depending on further needs. All the major classes and sub classes
specified in the design of ProKinO are shown in Table 2.
46
Table 2: Classes in ProKinO.
Class Name Sub Class Name (Level-1)
Sub Class Name (Level-2)
Sub Class Name (Level-3)
DbXref - - - The DbXref class is used to represent the various database cross references for the instances stored in the Gene class.
This class contains cross references for the UniProt, PDB, PubMed, MEDLINE, Wikipedia and KinBase databases.
Disease Cancer - - The Disease class represents the different diseases with which protein kinase genes are associated with. The Disease
class have these diseases represented under further sub classes of diseases (Cancer, initially).
FunctionalDomain - - - The FunctionalDomain class represents the functional domains identified and classified in protein kinase sequences.
FunctionalFeature ModifiedResidue
SignalPeptide
TopologicalDomain
TransmembraneRegion
-
-
The FunctionalFeature class represents data about the various functional features related to the protein kinase genes
for which sub classes of ModifiedResidue, SignalPeptide, TopologicalDomain, TransmembraneRegion have been
designed to represent the different groups of functional features of gene.
Gene - - - The Gene class in ProKinO represents all the protein kinase genes which are classified into the protein kinase groups,
families, and sub families.
Mutation - - - The Mutation class is used to represent all the somatic cancer mutations that are linked with the protein kinase genes.
Organism - - The Organism class represents all the organisms (Human, initially) for the protein kinase genes.
ProteinKinase (groupname) group (familyname) family (subfamilyname) subfamily
The ProteinKinase class represents classification of the protein kinase genes by having the groups of genes represented
by the ProteinKinaseGroup as a sub class under it. The family of protein kinase genes is represented by the
ProteinKinaseFamily sub class designed under the ProteinKinaseGroup class. Similarly the sub family of protein
kinase genes is represented by the ProteinKinaseSubFamily sub class designed under the ProteinKinaseFamily class.
Sequence - - - The Sequence class of ProKinO represents the sequences for the protein kinase genes.
Structure - - - The Structure class of ProKinO represents crystal structures of the proteins kinase genes.
SubDomain SubDomain(I to XI) - - The SubDomain class represents the motif data related to the protein kinase genes with sub classes of SubDomainI to
SubDomainXI under it capturing the various motifs related to the genes.
47
3.2.2 ProKinO Properties
Once we have defined the major classes in the ontology, the next step is to
describe additional specifics of the concepts by defining their relationships in the form of
object and data properties in ProKinO. The object properties capture the relationships
between the members of various classes to make the collected information integrated as
per the agreed conceptualization. How a gene in the Gene class is related with a disease
in the Disease class is formalized with the object property associatedWith. On the other
hand, the correlatesTo property connects a Disease member, i.e. a particular disease
(cancer for our purposes), to the gene in the Gene class. These are descriptions of only a
few object properties discussed above. The full list of object properties conceptualized in
ProKinO is given in Table 3.
As we need information to describe the internal organization of concepts captured
in ontology, the data properties in ProKinO are defined apart from object properties.
There are many properties of classes considered, such as what type of values, permissible
values, and cardinality while defining relationships. The property hasFASTAFormat is
defined as a data property in ProKinO to represent the FASTA-formatted sequences of all
protein kinase genes, and it is of data type string. Similarly, there exist another property,
hasOtherName, which is used to store the other names by which a gene may be known in
literature (synonyms). For example, a protein kinase gene ADCK3 has four other names
CABC1, COQ8, MGC4849 and LOC56997. There is a large list of data properties
defined in ProKinO as shown in Table 4.
48
Table 3: Object Properties in ProKinO.
Object Property Domain (Class) Range (Class)
associatedWith Gene Disease The object property associatedWith represents the relationship between the instances of the Gene class with the
instances of the Disease class. This relationship represents how the protein kinase genes are associated with the
different diseases. For example, the gene “ABL1” is associated with cancer disease “giloma”.
contains Sequence Mutation The object property contains represents the relationship between the instances of the Sequence class with the instances
of the Mutation class. This relationship represents which sequences of the protein kinase genes contain which
mutations represented in Mutation class. For example, the gene sequence “Seq-ABL1” (sequence of the gene “ABL1”)
contains mutation “Q252H”.
correlatesTo Disease Gene The object property correlatesTo represents the relationship between the instances of the Disease class with the
instances of the Gene class. This relationship represents how a disease correlates to the protein kinase genes. For
example, the cancer disease “giloma” correlates to the genes “ABL1”, “FYN”, “TBK1” and many others.
foundIn Mutation Gene The object property foundIn represents the relationship between the instances of the Mutation class with the instances
of the Gene class. This relationship represents what mutations are found in which protein kinase genes. For example,
mutation “L387M” (identified by mutation id 12624) is found in the gene “ABL1”.
hasFunctionalDomain Gene FunctionalDomain The object property hasFunctionalDomain represents the relationship between the instances of the Gene class with the
instances of the FunctionalDomain class. This relationship represents which protein kinase gene has which functional
domain. For example, the gene “ABL1” has the functional domains of “Pkinase” and “Pkinase_Tyr”, “F-actin_bind”
and “SH3_1”.
hasFunctionalFeature Gene FunctionalFeature The object property hasFunctionalFeature represents the relationship between the instances of the Gene class with the
instances of the FunctionalFeature class. The FunctionalFeature class further has sub classes such as
ModifiedResidue, SignalPeptide, TopologicalDomain, and TransmembraneRegion. This relationship represents which
protein kinase gene has which functional features represented in these FunctionalFeature sub classes. For example, the
gene “ABL1” has a functional feature of “phosphoserine”.
hasFunctionalRelationship FunctionalDomain Gene The object property hasFunctionalRelationship represents the relationship between the instances of the
FunctionalDomain class with the instances of the Gene class. This relationship represents which functional domain has
a functional relationship with which protein kinase genes. For example, the functional domain “F-actin_bind” has a
functional relationship with the genes “ABL1” and ABL2”.
hasGene ProteinKinase Gene The object property hasGene represents the relationship between the instances of the ProteinKinase class with the
instances of the Gene class. This relationship represents which protein kinase genes represented in groups, families or
49
subfamilies of protein kinases are related with the genes represented in the Gene class.
hasMEDLINEId Gene DbXref The object property hasMEDLINEId represents the relationship between the instances of the Gene class with the
instances of the DBXref class related to the MEDLINE cross references. This relationship represents the protein kinase
genes’s MEDLINE database cross references. For example, the gene “ABL1” has the MEDLINE ids “93101588”,
“95199229” and many others.
hasMutation Gene Mutation The object property hasMutation represents the relationship between the instances of the Gene class with the instances
of the Mutation class. This relationship represents which protein kinase genes are having which mutations. For
example, the gene “ABL1” has the mutations “Q252H”, “R47G” and many others.
hasMutationDbXref Mutation DbXref The object property hasMutationDbXref represents the relationship between the instances of the Mutation class with
the instances of the DBXref class. This relationship represents a mutation’s Mutation (NM) database cross reference.
For example, the mutation “Q456Q” (mutation id 1110) has Mutation database cross reference “NM_004333”.
hasPDBId Gene DbXref The object property hasPDBId represents the relationship between the instances of the Gene class with the instances of
the DBXref class related to the PDB cross references. This relationship represents the protein kinase genes’s PDB
database cross references. For example, the gene “ABL1” has a PDB id “1OPL”.
hasPfamId Gene DbXref The object property hasPfamId represents the relationship between the instances of the Gene class with the instances of
the DBXref class related to the Pfam cross references. This relationship represents the protein kinase genes’s Pfam
database cross references.
hasProteinKinaseDbXref ProteinKinase DbXref The object property hasProteinKinaseDbXref represents the relationship between the instances of the ProteinKinase
class with the instances of the DBXref class related to the protein database cross references. This relationship
represents protein kinase genes’s KinBase database cross references.
hasPubMedId Gene DbXref The object property hasPubMedId represents the relationship between the instances of the Gene class with the
instances of the DBXref class related to PubMed cross references. This relationship represents protein kinase genes’s
PubMed database cross references. For example, the gene “ACK” has the PubMed ids “16641997”, ”17344846”,
”15951569” and many other.
hasSequence ProteinKinase Sequence The object property hasSequence represents the relationship between the instances of the ProteinKinase class with the
instances of the Sequence class. This relationship represents which protein kinase genes under the groups, families or
subfamilies of the protein kinases are having which sequences stored under the Sequence class.
hasStructure Gene Structure The object property hasStructure represents the relationship between the instances of the Gene class with the instances
of the Structure class. This relationship represents which protein kinase gene has which crystal structure. For example,
the gene “ABL1” has a structure “1OPL”;
50
hasSubDomain Gene SubDomain The object property hasSubDomain represents the relationship between the instances of the Gene class with the
instances of the SubDomain class. This relationship represents which protein kinase gene has which sub domains that
are represented under various sub classes (SubDomain I to IX) of the SubDomain class. For example, the gene “ABL1”
has sub domains for all the sub domains from SubDomain I to IX such as for G-helix, the sub domain “ABL1-G-helix”
has a start location of 446, end location of 460 and subsequence “VLLWEIATYGMSPYP”.
hasUniProtId Gene DbXref The object property hasUniProtId represents the relationship between the instances of the Gene class with the
instances of the DBXref class related to the UniProt cross references. This relationship represents the protein kinase
genes’s UniProt database cross references. For example, the gene “ACK” has the UniProt ids “Q07912”, ”Q6ZMQ0”,
”Q8N6U7” and many other.
hasWikipediaId Gene DbXref The object property hasWikipediaId represents the relationship between the instances of the Gene class with the
instances of the DBXref class related to the Wikipedia cross references. This relationship represents the protein kinase
genes’s Wikipedia database cross references. For example, the gene “MUSK” has a Wikipedia id “Musk_protein”.
implicatedIn Mutation Disease The object property implicatedIn represents the relationship between the instances of the Mutation class with the
instances of the Disease class. This relationship represents which mutations are found implicated in which diseases
(Cancer initially). For example, the mutation “L387M” (identified by the mutation id 12624) is implicated in the
disease “haematopoietic_neoplasm”.
locatedIn Mutation Structure The object property locatedIn represents the relationship between the instances of the Mutation class with the instances
of the Structure class. This relationship represents which mutations are located in which structures of the protein
kinase genes. For example, the mutation “L387M” (identified by the mutation id 12624) is located in the
structure”1OPL “.
occursIn Mutation Sequence The object property occursIn represents the relationship between the instances of the Mutation class with the instances
of the Sequence class. This relationship represents which mutations are found in which sequences of the protein kinase
genes. For example, the mutation “L387M” (identified by mutation id 12624) is found in the sequence “Seq_ABL1”
i.e.
"MGQQPGKVLGDQRRPSLPALHFIKGAGKKESSRHGGPHCNVFVE………………………………………………
…………………………….WEMERYTFCVSYVDSIQQMRNKFAFREAINKLENNLRELQICPATAGSGPAATQD
FSKLLSSVKEISDIVQR" (Complete sequence not shown)
51
Table 4: Data Properties in ProKinO.
Data Property Domain (Class) Range (Literal)
chromosomalPosition Gene String The data property chromosomalPosition represents the chromosomal position of the protein kinase genes in the Gene
class and it takes the string value. For example, the gene “ABL1” has chromosomal position “9q34.2”.
hasCancerType Mutation String The data property hasCancerType represents types of cancers, a particular mutation in the Mutation class has and this
property takes the string value. For example, the mutation “S605F” (mutation id 1135) has a cancer type “malignant-
malenoma”.
hasCellularLocation Gene String The data property hasCellularLocation represents the cellular location of the protein kinase genes in the Gene class
and it takes the string value. For example, the gene “ABL1” has a cellular location “Cytoplasm”.
hasChromosome Mutation String The data property hasChromosome represents the chromosome of a particular mutation in the Mutation class and it
takes the string value. For example, the mutation “S605F” (mutation id 1135) has chromosome “7”.
hasStartLocation and hasEndLocation
FunctionalFeature SubDomain
FunctionalDomain
The data properties hasStartLocation and hasEndLocation are used to represent the start and end locations of the
different functional features represented in the FunctionalFeature class and the subclasses under this class. These
properties are also used to represent the start and end locations of the different instances in the SubDomain class. These
properties take the string values. For example, the functional feature ADCK1-signalpeptide has a start location “1” and
an end location “17”. ACK-B3-strand sub domain has a start location “173” and an end location “187”.
hasFASTAFormat Sequence String The data property hasFASTAFormat represents the FASTA format of the sequences of the protein kinase genes and it
takes the string value. For example, the gene “ABL1” has a FASTA format sequence:
"MGQQPGKVLGDQRRPSLPALHFIKGAGK…………………………………………….GPAATQDFSKLLSSVKEI
SDIVQR" (Complete sequence not shown)
hasMutationAA Mutation String The data property hasMutationAA represents the mutation Amino Acid of a particular mutation in the Mutation class
and it takes the string value. For example, the mutation “S605F” (mutation id 1135) has mutation AA as “p.S605F”.
hasMutationDescription Mutation String The data property hasMutationDescription represents the mutation description of a particular mutation in the Mutation
class and it takes the string value. For example, the mutation “S605F” (mutation id 1135) has mutation description as
“substitution-Missense”.
hasMutationId Mutation String The data property hasMutationId represents the mutation id (COSMIC) of a particular mutation in the Mutation class
and it takes the string value. For example, the mutation “S605F” has a mutation id as “1135”.
52
hasOrganism Gene String The data property hasOrganism represents the organism of the protein kinase genes in the Gene class and it takes the
string value. For example, the gene “ABL1” has an organism “Homo Sapiens”.
hasOtherName Gene String The data property hasOtherName represents the other names (Synonyms) of the protein kinase genes in the Gene class
and it takes the string value. For example, the gene “ABL1” has other names of “p150”, “c-ABL”, “ABL”, “JKT7” and
many others.
hasPosition FunctionalFeature String The data property hasPosition represents the position of the different functional features represented in the
FunctionalFeature class and the subclasses under this class and it takes the string value. For example, the functional
feature “ABL1-phosphoserine” (Modified Residue of the gene “ABL1”) has position “50”.
hasPrimaryName Gene String The data property hasPrimaryName represents the primary or common name (most representative name in literature)
of the protein kinase genes in the Gene class and it takes the string value. For example, the gene “ABL1” has a primary
Name “ABL1”.
hasPrimarySite Mutation String The data property hasPrimarySite represents the primary cancer site of a particular mutation in the Mutation class and
it takes the string value. For example, the mutation “S605F” (mutation id “1135”) has a primary site of ”skin”.
hasPubMedPMID Mutation String The data property hasPUBMEDPMID represents the PUBMED PMID of a particular mutation in the Mutation class
and it takes the string value. For example, the mutation “S605F” (mutation id “1135”) has a PUBMED PMID’s
”16773193” and “15331929”.
hasSubDomainSequence SubDomain String
The data property hasSubDomainSequence represents the subsequence of a particular sub-domain in SubDomain class
for the protein kinase genes and it takes the string value. For example, ACK-B3-strand sub domain has a subsequence
"SGKTVSVAVKCLKPD".
hasTissueSpecificity Gene String The data property hasTissueSpecificity represents the tissue specificity of the protein kinase genes in the Gene class
and it takes the string value. For example, the gene “ABL1” has tissue specificity “Widely expressed”.
3
in
ar
on
(C
U
.3 Architect
Fully
ntegrated so
rchitecture o
ntology pop
COSMIC),
Universal Pro
ture of Syst
developed
ource of kno
of systems b
pulation sou
Protein dat
otein Resour
Figure
ems Based o
and populat
owledge Pro
ased on Pro
urces KinB
ta Bank (PD
rce (UniProt)
14: Architec
53
on ProKinO
ted, ProKinO
otein Kinase
KinO is dep
Base, Catalo
DB), protei
).
cture of syst
O
O is intende
es (initially
picted in Fig
ogue of Som
in families
tems based o
ed to serve
only Huma
ure 14. The
matic Muta
database (P
on ProKinO
as a unified
an). The ov
figure show
ations in Ca
Pfam), and
d and
verall
ws the
ancer
The
54
Data is acquired from these sources automatically and is analyzed and stored in
ProKinO as classes, properties, and instances. To assure that the ontology is up-to-date
and correct, ProKinO will be continually updated by an automatic population process,
described above. Once developed and verified, ProKinO will be publicly available and
accessible from our Web Portal. ProKinO will also be made available from NCBO
BioPortal after it gets included in the BioPortal. In addition to freely downloadable OWL
files containing ProKinO and its prior versions, the portals will provide access to a live
version of the ontology so that portal visitors can browse and explore the ontology.
Furthermore, we plan to create a Web Service providing a programmatic access to
ProKinO.
55
CHAPTER 4
PROKINO LIFE CYCLE
Ontology development is a complex task involving a series of systematic and well
organized steps in the process. Ontology Engineering is a relatively new field but there
has already been a lot of work and research in this field during the last decade.
Ontological Engineering refers to the set of activities that concern the ontology
development process, the ontology life cycle, the methods and methodologies for
building ontologies, and the tool suites and languages that support them [66].
Ontologies in the past have been created by different people by following
different methods and methodologies. The type of ontology being developed, the domain
of discourse to be captured, the expertise of the developers of the ontology, all these
factors drive the selection of one method over another while building ontologies.
Further, what type of applications of ontology have been envisioned, the time and
resources available for the task and other related factors also influence the choice of the
approach. Ontology Life Cycle identifies the stages through which the ontology should
go through during its life time. There are certain sets of activities that are performed in
each stage and different models have been proposed by researchers for formalizing how
the different stages can be related in terms of their order. In developing ProKinO, certain
sets of activities which were performed to accomplish the task are discussed in the
subsequent sections.
56
4.1 Data Acquisition
The data sources of interest to biomedical community are often large, dissimilar
in structure and content, scattered, separately controlled, and rapidly changing [67]. One
of the vital steps in building ontologies is to acquire the knowledge from these kinds of
sources that is to be integrated and represented in the ontologies. This poses several
challenges in information extraction and knowledge acquisition from these disparate
sources. In the development of ProKinO we faced similar challenges in acquiring the data
from the identified sources of the domain knowledge. These data sources, such as
KinBase, COSMIC, PDB, Pfam, and UniProt, provide a range of information related to
the protein kinases but they are stored in different formats and controlled and maintained
by different groups.
We used some open source tools for the overall development process of
ProKinO. We have used Jena API as software support to build and populate ProKinO.
This is an open source tool which was developed by a team at HP lead by Brian Mcbride
and is widely used for ontology development. There is another useful API, called OWL
API, is available for building ontologies, but we used Jana API as it is more mature and
stable. The Jena Ontology API is language-neutral and the Java class names do not
mention the underlying language. For example, the OntClass Java class can represent an
OWL class, RDFS class, or DAML class and to represent the differences between the
various representations, each of the ontology languages has a profile, which lists the
permitted constructs and the names of the classes and properties [68] . Jena is very easy
to use and can be used for parsing, creating and searching RDF models. The key RDF
package for the application developer is com.hp.hpl.jena.rdf.model. The API has been
57
defined in terms of interfaces so that application codes can work with different
implementations without change. Jena includes interfaces for model, statement, resource,
property, object, literal etc. An example for Jena API used in Java code to create an
empty ontology model is shown in Figure 15.
public class CreateProkinoBaseModel {
/** * This method is used to create an Ontology model * @param owlFileName The name of the file in which OWL file has to be stored for the * model that is created. * @return The ontology model. */
public static OntModel createBaseModel( String owlFileName) {
OntModel basemodel; // create an empty model basemodel = ModelFactory.createOntologyModel( OntModelSpec.OWL_MEM_RULE_INF); PrintStream out;
try {
out = new PrintStream(new FileOutputStream(owlFileName)); basemodel.write(out);
}
catch (FileNotFoundException e) {
// TODO Auto‐generated catch block e.printStackTrace();
} return basemodel; } // method createBaseModel end
} // CreateProkinoBaseModel class end
Figure 15: Jena example to create an ontology model.
Protégé is a free, open-source software system that has become a very popular
tool for constructing knowledge-based applications with ontologies which provides a
suite of tools to the user community. There are many knowledge-modeling compositions
and procedures implemented by Protege which facilitate in creation, visualization, and
58
manipulation of ontologies in various representation formats. Protégé can be extended by
way of a plug-in architecture and a Java-based Application Programming Interface (API)
for building knowledge-based tools and applications [69].
The Protégé platform supports two main ways of modeling ontologies:
The Protégé-Frames editor is a way of modeling ontologies that provides support to
users in constructing and storing frame-based domain ontologies, customizing data entry
forms, and entering instance data by presenting a full-fledged user interface and
knowledge server. Protégé-Frames use the Open Knowledge Base Connectivity protocol
(OKBC) to implement a knowledge model. The Protégé-OWL editor is an expansion of
Protégé that supports the Web Ontology Language (OWL) which is the standard ontology
language recommended by the World Wide Web Consortium to prop up the Semantic
Web vision. The Protégé-OWL editor provides users support to load and save OWL and
RDF ontologies, edit and visualize classes, properties, define logical class characteristics
as OWL expressions and execute reasoners such as description logic classifiers [69].
For acquiring data from different sources we have written customized software in
Java which performs many required functions in a single unified manner. This software
first of all automatically fetches the relevant data from KinBase about Protein kinase
genes and the associated attributes and after processing the data populates ProKino. Then,
this populated information in our ontology further becomes the basis of acquiring and
parsing the data from other sources of knowledge. The software then automatically
retrieves the needed information by running certain BLAST searches against the protein
kinases data provided by ProKinO to produce pertinent data about functional domains,
motifs, structures, and functional features. These whole sets of data are processed and
p
ex
d
k
d
pr
an
u
so
arsed furthe
xample, the
atabase by r
inases. Thes
efined for th
rotein kinas
nd then fetch
sing already
ources for Pr
r by our soft
BLAST sof
running BLA
se files are p
hat relevant
es is acquire
hing all mut
y available k
roKinO is sh
Figur
ftware to extr
ftware fetche
AST automa
parsed to cre
part of kno
ed by retriev
tation related
knowledge i
hown in Figu
re 16: Data a
59
ract needed
es large num
atically to ge
eate the ProK
owledge. Th
ving the late
d data for m
n ProKinO.
ure 16.
acquisition fr
knowledge
mber of Pfam
et the functio
KinO class,
he knowledg
est mutation
mutations link
The data ac
from ProKinO
to be stored
m-HMM file
onal domain
properties a
ge about mut
n data dump
ked to the p
cquired from
O sources
d in ProKinO
es from the
n data for pr
and instance
tations relat
s from COS
rotein kinas
m these disp
O. For
Pfam
rotein
e data
ted to
SMIC
es by
parate
60
For instance, from KinBase, we acquire data regarding kinase genes and their
species, their corresponding groups, families and subfamilies, chromosomal location, and
the synonyms and acronyms used for protein kinase genes in the literature. Likewise,
from PDB, we acquire information which includes the PDB ID, three dimensional
coordinates and structure abstract. From COSMIC, we retrieve information regarding the
mutation location, mutation type, tissue type, and cancer type and literature reference.
From the Pfam database, we acquire information about identification and classification of
functional domains in protein kinase sequences, and from UniProt knowledge is acquired
about functional features of protein kinase genes.
4.2 Data Integration and Ontology Population
After acquiring the data relevant to ProKinO from disparate heterogeneous
sources and processing further that data, we integrate this knowledge in ProKinO and
automatically populate the ontology. Our software populates ProKinO with the acquired
knowledge automatically. To integrate the various classes, i.e. the diverse forms of data
in ProKinO, we have developed object properties and relationships that relate sequence,
structure, function and mutation data in meaningful ways. For example, the 'occursIn'
relationship between Mutation and Sequence classes relates a mutation entry in the
Mutation class with a gene entry in the Sequence class. The other relationships among
various classes further integrate these, otherwise disparate, sources of data with each
other.
The automatic process of data population from the KinBase database integrates
and populates the information about the protein kinase genes and their species, their
corresponding groups, families and subfamilies, the other names by which they are
61
known across the domain and about the chromosomal position. Also, the knowledge
about the sequences of protein kinase genes with the information about their FASTA
Formats is populated. The acquired knowledge becomes the basis for automatically
populating the ProKinO ontology by inserting the instances of the genes for Protein
Kinase Group, Protein Kinase Family, Protein Kinase Sub-Family classes in Gene class,
instances for Sequence class, and the data properties for the corresponding classes.
The automatic process of mining COSMIC database in the ProKinO ontology
population extracts the information about the mutations related to cancer by capturing
MutationID, MutationAA, Chromosome, Primary site, Pubmed_PMID, Description and
other relevant information from COSMIC. Then, ProKinO is populated with instances of
classes Mutation and Disease along with their corresponding data properties.
Information integrated and populated from UniProt is about functional features
such as the modified residue, signal peptide, topological domain, about cellular location,
tissue specificity and organism and about Wikipedia, PubMed, MEDLINE and UniProt
database cross references. The data property of hasFunctionalFeature is used to capture
functional features for protein kinase genes depicting the modified residue, signal peptide
and the topological domain. The information populated from Protein Data Bank is about
the crystal structures and PDB database references. ProKinO is populated with instances
of class Structure, as well as their corresponding object and data properties.
Pfam and related Hidden Markov Models (HMM) provide information about
identification and classification of functional domains in the protein kinase sequences.
The ProKinO ontology population from the Pfam resource is done by integrating the
information about the functional domains related to protein kinases and then populating
P
pr
in
E
th
fo
roKinO with
roperties. A
n Figure 17.
Fig
Example Rev
Earlie
he challenge
ormats. Now
h instances o
snapshot of
gure 17: A s
visited
er in one of
es faced in
w, once hav
of class Fun
f the populat
snapshot of P
the previous
integrating
ving the kno
62
nctionalDom
ted ontology
Protégé edito
s sections, w
information
owledge int
main, along w
y displayed in
or showing p
we have disc
n from dispa
egrated in p
with its corr
n the Protég
populated Pr
cussed an ex
arate source
populated P
responding o
é editor is sh
roKinO
xample to po
s using diff
ProKinO, we
object
hown
ortray
ferent
e can
re
th
w
fu
ha
evisit the sam
hat we have
we can get th
urther we
asFunctiona
me example
a mutation
he ProteinKin
can get
alDomain ob
Figure 18:
query to sho
(33750 in th
nase gene w
the funct
bject propert
Knowledge
63
ow the solut
his example
with foundIn
tional dom
ty as PH, Pk
e discovery th
tion provided
e) stored as a
object prope
mains relate
kinase, Pkina
hrough popu
d by our ont
a member o
erty of ProK
ed with t
ase_C and Pk
ulated ProKi
tology. Assu
f class Muta
KinO ( AKT1
this gene
kinase_Tyr.
inO
uming
ation,
) and
with
64
The sequence and structure of this gene associated with the in question mutation
can be retrieved as Seq_AKT1 and 3CTQU respectively with properties hasSequence and
hasStructure. The functional feature of cellular location of this gene is provided by
hasCellularLocation as Cytoplasm. This example of knowledge discovery is depicted in
Figure 18.
We should keep in mind that this output of the whole linked information provided
as a unit is basically coming from disparate and separately formatted sources, for which
otherwise we would have to use customized processes to obtain the same results.
Clearly, we see that although the information required may be stored in many
heterogeneous sources in the original databases yet we can get the specific answers to the
queries by navigating through the classes and properties in ProKinO. We believe that
ProKinO will emphasize the usefulness of this type of integration of knowledge in the
form of ontology.
4.3 Curation and Ontology Modification
The ontologies are bound to evolve with the time because with any changes in the
sources of knowledge in the domain, suitable changes to the schema of representation of
knowledge will have to be incorporated, to make the ontology up-to-date and consistent.
Curation: To assure that the ontology is up-to-date and correct, ProKinO will be
continually updated by the automatic population processes described above.
Furthermore, to assure the maximum correctness of the ProKinO data, a domain expert
will be designated to monitor and verify the ontology updates. The ontology curator will
also be charged with introducing any needed schema corrections and its extensions.
Scientists working in the area of protein kinases will be able to submit relevant data for
65
inclusion in ProKinO, using a convenient GUI-type interface. The curator will be
charged with reviewing such submissions and adding such entries upon approval.
4.4 ProKinO Revisions
As an ontology undergoes necessary modifications, the newly updated ontologies
are saved as versions. Ontology versioning is the concept of keeping multiple versions of
ontologies to mange changes and evolution in ontologies. The compatibility of versions
must be checked for instance-data preservation (no data lost between versions unless
explicitly warranted), ontology preservation (query is satisfied in both versions),
consequence preservation (all the facts could be inferred equally from the new version as
inferred from older one) and consistency preservation (no logical inconsistencies) [70].
As ProKinO is an integration of knowledge from disparate sources and the
integration is done without modifying the original sources, it has to be in consonance
with all these dynamic sources that are subject to frequent modifications. For the
knowledge integrated in the ontology to be current and consistent with the existing data
available in the parent sources and to make any changes in the conceptualization and
specification, ProKinO will be subjected to regular revisions, as agreed upon by the
community. We will be keeping the different versions of ProKinO along with the
information about the differences among these versions. The ontology lifecycle will be
tracked by a versioning system and any prior versions of ProKinO will be easily
accessible.
66
4.5 Ontology Dissemination and Evaluation
When the ontologies are disseminated in the form of applications based on the
knowledge contained in them, they can be considered as good quality and valuable
ontologies only if they are able to serve the intended purposes. To look for ontologies that
can be viewed as complete is actually an unrealistic goal. As ontology building is an
expensive and time consuming process. Once the ontology is developed by following the
selected development approaches, it must be evaluated for its quality by following certain
evaluation criteria.
An ontology is as good as the knowledge it contains, so for evaluating specialized
ontologies, their evaluation should be based on the specific needs of the users of that
domain. The simple measures of precision and recall are not that easy to be applied to
ontologies as is the case of other knowledge extraction methods. In case of specialized
ontologies these metrics may assume different notions in different kinds of applications.
Specialized ontologies, such as ProKinO, can be evaluated on the basis of satisfying the
specified needs of intended users. So, we tested and performed an evaluation of ProKinO
based on the requirements of the protein kinase community.
The domain experts in Evolutionary Systems Biology Group (ESBG) lab at The
University of Georgia evaluated the ontology by designing their specific queries
dependent on the knowledge existing in it and then checking the results manually. As this
was a basic evaluation conducted by a subset of user community we expect the ontology
to be further evaluated on the basis of usage by wider protein kinase community while
ProKinO being available in public domain. Every ontology has its ultimate evaluation of
quality and success based on whether it is used and accepted widely by the community or
67
not. Therefore, we are providing a mechanism, so that ProKinO is easily available to all
through our web service. To further strengthen the process of evolution and refinement,
our community will be provided the facility to give its feedback for further improvement.
68
CHAPTER 5
POTENTIAL APPLICATIONS OF PROKINO
We envision a set of semantic, ontology-based bioinformatics applications
utilizing knowledge represented in ProKinO. We plan to create an ontology browsing
visualization tool, available via a standard Web browser. Also, we are going to use
ProKinO in two major applications. First, to mine the wealth of scientific literature data
that is accumulating on protein kinases and second, to annotate the vast amounts of
sequence data generated from cancer genome sequencing studies. A variety of other
applications are possible, as well.
5.1 ProKinO Browsing
For ontologies to be fully adapted by a user community the knowledge contained
in them must be accessible by simple means of browsing through the concepts and
relationships of ontologies. There are many editors (e.g. Protégé) which are providing
the very basic browsing about going through the ontology constituents. But it has been
seen that most website based ontology editors use separate HTML pages not just for each
entity, but for each view of those entities and this distances the user and the ontology
itself [71]. There are many general purpose ontology browsing tools such as the one
offered by OwlDoc plugin (shown in Figure 19 for ProKinO) available in public domain.
However, there is a distinct need to provide specialized browsers to deal with specific
ontologies. A specialized ontology browser must take into account the specific
re
co
a
b
ac
k
fr
elationships
oncepts from
form most s
Figure
We ar
e available v
ccess to the
inase comm
riendly navig
present in t
m the less im
suitable to th
e 19: A snap
re creating an
via a standa
ProKinO kn
munity to po
gational capa
the ontology
mportant one
he intended o
shot of elem
n ontology b
ard Web bro
nowledge bu
ose queries
abilities.
69
y and disting
es. Furtherm
ontology aud
mentary brow
browsing and
owser. This
ut also certai
for knowle
guish among
ore, the con
dience.
wsing of ProK
d visualizati
tool will no
in specific c
dge extracti
g important
cepts should
KinO (using
on tool for P
ot only prov
capabilities s
ion along w
and often vi
d be visualiz
g OwlDoc).
ProKinO tha
vide fundam
sought by pr
with general
isited
zed in
at will
mental
rotein
user
70
5.2 Text Mining
The main motivation behind the text mining systems is that most of the world’s
published scientific data is in unstructured or semi-structured form. This becomes more
significant when the researcher or scientist have to deal with huge wealth of literature in
biomedical domain. The overwhelming information bewilders the potential user and he
is not able to keep up with the relevant publications in his own discipline leave aside the
related disciplines. So a constant effort is always neede to search for solutions to absorb
the high flow of new scientific literature. Text-mining tools are becoming indispensable
now for extracting information from the biomedical literature. The natural language
processing field has distinguished between information retrieval and information
extraction with information retrieval said to be recovering a pertinent subset of
documents while information extraction seen as a process of obtaining pertinent
information from documents [72].
Text mining can be defined as a knowledge extracting method to extract useful
and previously unknown information from a document set of texts through the
identification of facts inherent and inexplicit in the data. Biomedical Text mining is based
on using the automated text mining tools for extracting the vast amount of knowledge
existing in the biomedical literature. A biomedical text mining tool includes a component
of text mining which extracts biomedical concepts and entities in the literature and then
the relationships between biomedical entities are detected. Relation Extraction methods
range from co-occurrence, as the simplest way to detect relations between biomedical
entities is to collect texts or sentences in which they co-occur, to patterns, detecting
71
individual hypothetical instances of relations, which can be aggregated over a corpus, and
then to fuller parsing, producing more elaborate syntactic information [73].
The text mining systems can have a wide scope ranging from very simple tasks of
recognizing named entities and their categorization to more complex tasks of
summarizing, question answering, processing non-textual texts. All these approaches can
be further used for making literature based discoveries that can become the foundation of
new hypothesis to research upon. More recently, the focus has shifted for text mining
systems to address the user specific needs and the specialized applications are developed
for solving the user problems.
We intend to develop a text mining system which would allow scientists to
formulate advanced search queries, unlike the typical “bag of terms” queries adopted by
most search engines today. In essence, scientists would rely on our system to
automatically integrate the knowledge from several information sources while
formulating an advanced query, which otherwise would be very challenging. Such an
advanced query, represented as a graph, would include concepts (and their synonyms)
and data instances, all semantically interconnected by relevant relationships retrieved
from ProKinO. The query would be used to search for publications in which the concepts
and data items as recognized in a document section would match such a query graph, or
its significant sub-graphs.
5.3 Cancer Genome Annotation
One of our main focuses at present is on creation of an automated cancer genome
annotation system based on ProKinO. Genomic annotation is defined as the process of
72
marking the genes and other biological features in a DNA sequence. Dr. Owen White was
the first person to develop a software system of genome annotation in 1995. He was the
member of the team at The Institute for Genomic Research [74] that sequenced and
analyzed the first genome of a free-living organism to be decoded.
Cancers arise due to the buildup of mutations in critical genes that change normal
programmes of cell proliferation, differentiation and death [75]. The cancer research has
its focus on finding the mutated genes that are implicated in cancer development. A
‘census’ of cancer genes that are also known as oncogenesis, indicates that mutations in
more than 1% of genes contribute to human cancer and the protein kinase is the domain
most commonly found among known cancer genes [76]. The coding sequences of the
protein kinases make a much larger sample of cancer genome to look into the general
patterns of somatic mutation in human cancers [77].
To consistently and accurately annotate protein kinase mutations in upcoming
cancer genome sequencing studies we introduced a class in ProKinO to provide a short
description regarding the structural location and evolutionary conservation for every
mutation. Once the annotation class is sufficiently populated, we will develop an
application that will essentially transfer information from the annotation class to a newly
identified mutation in cancer genome sequencing studies. We realize that novel mutations
that do not exist in ProKinO cannot be annotated this way. However, by frequently
updating the ontology, we will be able to address this issue. In this way we will be able to
provide a consistent annotation for protein kinase mutations discovered in cancer
genomes and allow cancer researchers to prioritize mutations for experimental studies.
73
CHAPTER 6
CONCLUSION AND FUTURE WORK
Protein kinases are a large family of proteins that are implicated in many diseases
such as human cancer and have been broadly studied both from the basic and clinical
point of view. In the public domain, there are a few ontologies available that are serving
the domains of protein families, but none of these is directly related to the protein kinases
and there is a need in satisfying the requirements of the protein kinase community. To
fill this existing need and keeping in view the huge importance of protein kinases in
protein family, we have developed a Protein Kinases Ontology (ProKinO) that is a
comprehensive and a specialized ontology for protein kinases.
The data and information about protein kinases domain is spread across several
heterogeneous resources and most of these resources are storing data in different formats
following different schema. There are many challenges faced by the protein kinase
community due to the difficulty in integrating data from these disparate sources and
heterogeneous data formats and that becomes a hindrance in utilizing existing knowledge
for research related to diseases like cancer. ProKinO is an endeavor to deal with this
problem and provide a framework for detailed understanding of the relationships between
sequence, structure, function and disease in the protein kinase family.
ProKinO has been developed to capture, integrate and represent sequence,
structure, function, motif and disease information on protein kinases and provide a
sharable and consistent vocabulary to formally specify concepts and their relationships in
74
the domain of protein kinases. Our ProKinO population system does this by
automatically extracting information from diverse sources, such as Protein Data Bank
(PDB), Protein Families database (Pfam), KinBase, Catalogue of Somatic Mutations in
Cancer (COSMIC), and The Universal Protein Resource (UniProt) and then integrating
data to automatically populate knowledge in the ontology. The ProKinO then serves as a
knowledge base about the protein kinase domain allowing the protein kinase community
to navigate this specialized knowledge in one place and also building applications of their
interest.
ProKinO has become the basis for ongoing work on developing a text mining
system that mines the wealth of protein kinase literature accumulating constantly due to a
huge growth in the information about the structure, function, interaction and evolution of
protein kinases. The text mining system would allow scientists to formulate advanced
search queries, unlike the typical “bag of terms” queries adopted by most search engines
today. In essence, scientists would rely on our system to automatically integrate the
knowledge from several information sources while formulating an advanced query,
which otherwise would be very challenging.
To provide an elementary access to the ProKinO knowledge along with certain
capabilities specifically sought by protein kinase community, we are building an ontology
browsing and visualization tool for ProKinO. This browsing visualization tool will be
available via a standard Web browser. Through this utility researchers can pose queries
for knowledge extraction along with general user friendly navigational capabilities.
The integrated knowledge in ProKinO will be used to consistently and accurately
annotate protein kinase mutations in upcoming cancer genome sequencing studies. This
75
way the protein kinase mutations discovered in cancer genomes can be provided a
consistent annotation and cancer researchers can prioritize mutations for experimental
studies.
The OBO foundry establishes the main requirements for joining OBO and one has
to agree to adopt and refine a set of principles that prove effective for ontology
development in serving the biomedical research community. As OBO principles have
been followed in the development of ProKinO and it serves an important domain that is
not already covered in any of the available ontologies, we hope it to be a part of OBO in
the near future. The efforts are underway to get ProKinO included in the Open
Biomedical Ontologies Foundry.
In the future, one goal can be to integrate the pathway and sequence variation data
for visualization, analysis and modeling purposes using the knowledge present in
ProKinO. Presently, ProKinO has been developed for protein kinases which are a specific
family of proteins and it focuses only on human organism, but in the future this work can
be extended to include knowledge for other related important biomedical families of
proteins such as phosphates and hydrolyses, and also incorporating other organisms.
76
REFERENCES
1. Noy, N.F. and M. Klein, Ontology Evolution: Not the Same as Schema Evolution.
Knowledge and Information Systems, 2004. 6: p. 428–440. 2. Janik, M. and K.J. Kochut, Wikipedia in Action: Ontological Knowledge in Text
Categorization, in Proceedings of the 2008 IEEE International Conference on Semantic Computing. 2008, IEEE Computer Society. p. 268-275.
3. The Gene Ontology Project. 25 March 2010, date last accessed]; Available from:
http://www.geneontology.org/. 4. Sequence Ontology Project (SO). 25 March 2010, date last accessed]; Available
from: http://www.sequenceontology.org/. 5. The Open Biological and Biomedical Ontologies (OBO) Foundry. 25 March
2010, date last accessed]; Available from: http://www.obofoundry.org/. 6. Natale, D.A., et al., Framework for a protein ontology. BMC Bioinformatics,
2007. 8 Suppl 9: p. S1. 7. Protein-protein Interaction Ontology in OBO Foundry. 25 March 2010, date last
accessed]; Available from: http://www.obofoundry.org/cgi-bin/detail.cgi?id=psi-mi.
8. Protein-modification ontology in OBO Foundry. 25 March 2010, date last
accessed]; Available from: http://www.obofoundry.org/cgi-bin/detail.cgi?id=psi-mod.
9. The Kinase Database at Sugen/Salk. 25 March 2010, date last accessed];
Available from: http://kinase.com/kinbase/. 10. Catalogue of Somatic Mutations in Cancer (COSMIC) 25 March 2010, date last
accesssed]; Available from: http://www.sanger.ac.uk/genetics/CGP/cosmic/. 11. RCSB Protein Data Bank: An Information Portal to Biological Macromolecular
Structures. 25 March 2010, date last accessed]; Available from: http://www.pdb.org/pdb/home/home.do.
12. Protein Families Database (Pfam) 25 March 2010, date last accessed]; Available
from: http://pfam.sanger.ac.uk/.
77
13. The Universal Protein Resource (Uniprot) 25 March 2010, date last accessed]; Available from: http://www.uniprot.org/.
14. Hendler, J., et al., Web science: an interdisciplinary approach to understanding
the web. Commun. ACM, 2008. 51(7): p. 60-69. 15. The Size of the World Wide Web. Retrieved 25 March 2010]; Available from:
http://www.worldwidewebsize.com/. 16. Jesse , A. and H. Nissan. The Official Google Blog: "We knew the Web was
big...". Retrieved 25 July 2008]; Available from: http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html.
17. Domain Counts & Internet Statistics. Retrieved 25 March 2010]; Available
from: http://www.domaintools.com/internet-statistics/. 18. Lee, T.B., J. Handler, and O. Lassila, The Semantic Web. Scientific American,
2001. 19. World Wide Consortium. 25 March 2010, date accesssed last]; Available from:
http://www.w3.org/. 20. World Wide Web Consortium: Semantic Web. 25 March 2010]; Available from:
http://www.w3.org/2001/sw/. 21. http://www.w3.org/2001/sw/. World Wide Web Consortium: Semantic Web Layer
Cake. 25 March 2010, date last accessed; Available from: http://www.w3.org/2007/03/layerCake.png.
22. Resource Description Framework (RDF): Concepts and Abstract Syntax. 2004;
Available from: http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/. 23. Lee Feigenbaum, E.P.h. SPARQL By Example: A Tutorial. 28 March 2010, date
last accessed]; Available from: http://www.cambridgesemantics.com/2008/09/sparql-by-example.
24. SPARQL endpoint. 25 March 2010, date last accessed]; Available from: http://semanticweb.org/wiki/SPARQL_endpoint.
25. Gruber, T.R., A translation approach to portable ontology specifications.
Knowledge Acquisition, 1993. Vol. 5: p. 199-199. 26. Gruber, T.R., Toward principles for the design of ontologies used for knowledge
sharing. International Journal of Human-Computer Studies, 1995. Vol. 43(4-5): p. 907-928.
78
27. W3C: Ontologies in Semantic Web Context. 25 March 2010]; Available from: http://www.w3.org/2001/sw/SW-FAQ#whont.
28. Noy, N.F. and D.L. McGuinness, Ontology Development 101: A Guide to
Creating Your First Ontology. 2001. 29. Bard, J., Ontologies: Formalising biological knowledge for bioinformatics.
Bioessays, 2003. 25(5): p. 501-6. 30. The Suggested Upper Merged Ontology (SUMO) 25 March 2010, date last
accessed]; Available from: http://www.ontologyportal.org/. 31. Basic Formal Ontology. 25 March 2010, Date last accessed]; Available from:
http://www.ifomis.org/bfo. 32. IDEAS group. 25 March 2010, date last accessed]; Available from:
http://en.wikipedia.org/wiki/IDEAS_Group. 33. Antezana, E., M. Kuiper, and V. Mironov, Biological knowledge management:
the emerging role of the Semantic Web technologies. Brief Bioinform, 2009. 10(4): p. 392-407.
34. Bodenreider, O., Biomedical ontologies in action: role in knowledge
management, data integration and decision support. Yearb Med Inform, 2008: p. 67-79.
35. Bodenreider, O., Ontology and Data Integration in Biomedicine: Success Stories
and Challenging Issues. 2008, Berlin Heidelberg New York: Springer: Proceedings of the Fifth International Workshop on Data Integration in the Life Sciences. p. 1-4.
36. The National Centre for Biomedical Ontology (NCBO). 26 March 2010];
Available from: http://bioontology.org. 37. National Institutes of Health. 1 April 2010, date last accessed; Available from:
http://www.nih.gov/. 38. The National Center for Ontological Research (NCOR). 26 March 2010, date
last accessed]; Available from: http://ncor.us/. 39. The European Centre for Ontological Research. 26 March 2010, date last
accessed]; Available from: http://www.ecor.uni-saarland.de/home.html. 40. Smith, B., et al., The OBO Foundry: coordinated evolution of ontologies to
support biomedical data integration. Nat Biotechnol, 2007. 25(11): p. 1251-5.
79
41. The Open Biological and Biomedical Ontologies 26 March 2010, date last accessed]; Available from: http://www.obofoundry.org/.
42. Noy, N.F., et al., BioPortal: ontologies and integrated data resources at the click
of a mouse. Nucleic Acids Res, 2009. 37(Web Server issue): p. W170-3. 43. NCBO BioPortal. 1 April 2010, date last accessed]; Available from:
http://bioportal.bioontology.org/. 44. The Gene Ontology (GO). 29 March 2010, date last accessed]; Available from:
http://www.geneontology.org/. 45. Protein Kinases. 26 March 2010, date last accessed]; Available from:
http://en.wikipedia.org/wiki/Protein_kinase. 46. Wikipedia. Protein Kinase Phosphorylation. 25 March 2010, date last accessed;
Available from: http://en.wikipedia.org/wiki/Protein_kinase. 47. Protein Kinase Research. 26 March 2010, date last accessed]; Available from:
http://www.kinaseresearch.com/. 48. Sidhu, A.S., T.S. Dillon, and E. Chang, An Ontology for Protein Data Models, in
Proceedings of the 2005 IEEE Engineering in Medicine and Biology 27th Annual Conference. 2005: Shanghai, China,. p. 1-4.
49. kinase.com. 26 March 2010, date last accessed; Available from:
http://kinase.com/about/Acknowledgements.html. 50. Manning, G., et al., The protein kinase complement of the human genome.
Science, 2002. 298(5600): p. 1912-34. 51. Kinase.com. Human Kinome Poster. 24 March 2010, date last accessed];
Available from: http://kinase.com/human/kinome/. 52. Sequences of Protein Kinases. 26 March 2010, date last accessed; Available
from: http://kinase.com/kinbase/FastaFiles/Human_kinase_protein.fasta. 53. Catalogue Of Somatic Mutations In Cancer. 26 March 2010, date last accessed];
Available from: http://www.sanger.ac.uk/genetics/CGP/cosmic/. 54. Forbes, S.A., et al., COSMIC (the Catalogue of Somatic Mutations in Cancer): a
resource to investigate acquired mutations in human cancer. Nucleic Acids Res, 2010. 38(Database issue): p. D652-7.
80
55. Protein Families (Pfam) Representation with Multiple Sequence Alignments and Hidden Markov Models. 26 March 2010, date last accessed]; Available from: http://pfam.sanger.ac.uk/.
56. Protein Families (Pfam) Representation with Multiple Sequence Alignments and
Hidden Markov Models. 27 March 2010, date last accessed]; Available from: http://pfam.sbc.su.se/.
57. Finn, R.D., et al., The Pfam protein families database. Nucleic Acids Res, 2010.
38(Database issue): p. D211-22. 58. The UniProt Knowledgebase (UniProtKB). 26 March 2010, date last accessed];
Available from: http://www.uniprot.org/help/uniprotkb. 59. Heger, A., et al., ADDA: a domain database with global coverage of the protein
universe. Nucleic Acids Res, 2005. 33(Database issue): p. D188-91. 60. The Worldwide Protein Data Bank (wwPDB). 24 March 2010, date last
accessed]; Available from: http://www.wwpdb.org/. 61. Biological Magnetic Resonance Data Bank. 26 March 2010, date last accessed];
Available from: http://www.bmrb.wisc.edu/. 62. Crystal Structure of the Complex of Human Epidermal Growth Factor and
Receptor (1IVO) Extracellular Domains. 26 March 2010, date last accessed]; Available from: http://www.pdb.org/pdb/explore/.
63. The Universal Protein Resource (UniProt) 2009. Nucleic Acids Res, 2009.
37(Database issue): p. D169-74. 64. The UniProt Reference Clusters (UniRef) 24 March 2010, date last accessed];
Available from: http://www.uniprot.org/help/uniref. 65. About Uniprot. 25 March 2010, date last accessed]; Available from:
http://www.uniprot.org/help/about. 66. Gómez-Pérez A, F.-L.M., Corcho O, Ontological Engineering. Springer–Verlag,
London, United Kingdom, 2003. 67. Vasant Honavar, C.A., Doina Caragea, Adrian Silvescu, Jaime Reinoso-Castillo,
and Drena Dobbs, Ontology-Driven Information Extraction and Knowledge Acquisition from Heterogeneous,Distributed, Autonomous Biological Data Sources. 2001.
68. Jena Ontology API. 25 March 2010, date last accessed; Available from:
http://jena.sourceforge.net/ontology/index.html.
81
69. The Protégé Ontology Editor and Knowledge Acquisition System. 25 March 2010, date last accessed]; Available from: http://protege.stanford.edu/.
70. Natalya F. Noy, M.K., Ontology Evolution: Not the Same as SchemaEvolution.
Knowledge and Information Systems, 2004. 6: p. 428–440. 71. Christopher Brewster, H.A., An Dasmahapatra Data Driven Ontology Evaluation,
in International Conference on Language Resources and Evaluation 2004. 72. Muller, H.M., E.E. Kenny, and P.W. Sternberg, Textpresso: an ontology-based
information retrieval and extraction system for biological literature. PLoS Biol, 2004. 2(11): p. e309.
73. Zweigenbaum, P., et al., Frontiers of biomedical text mining: current progress.
Brief Bioinform, 2007. 8(5): p. 358-75. 74. The Institute for Genomic Research (TIGR) 27 March 2010, date last accessed];
Available from: http://www.jcvi.org/. 75. Davies, H., et al., Mutations of the BRAF gene in human cancer. Nature, 2002.
417(6892): p. 949-54. 76. Futreal, P.A., et al., A census of human cancer genes. Nat Rev Cancer, 2004. 4(3):
p. 177-83. 77. Greenman, C., et al., Patterns of somatic mutation in human cancer genomes.
Nature, 2007. 446(7132): p. 153-8.