
PROKINO: DESIGN AND DEVELOPMENT OF ONTOLOGY ON PROTEIN

KINASES

by

GURINDER PAL SINGH GOSAL

(Under the Direction of KRZYSZTOF J. KOCHUT and NATARAJAN KANNAN)

ABSTRACT

The prominent role protein kinases play in cell regulation and disease has given rise to an abundance of information about the structure, function, interactions and evolution of these proteins. This information, however, is currently spread across several heterogeneous resources, which is an obstacle to the integrative approaches needed to use existing knowledge in disease-related research. We have designed and developed an ontology for protein kinases, ProKinO, that serves as a useful and efficient representation of the integrated knowledge about these complex proteins, which are intimately involved in the genesis and behavior of cancer cells. ProKinO captures concepts and relationships important to protein kinases, and its instances are populated from disparate resources including KinBase, COSMIC, Protein Data Bank, UniProt and Pfam. ProKinO has potential applications in text mining of the protein kinase literature, in cancer genome annotation in cancer genome sequencing studies, and in research related to protein kinases and associated domains.

INDEX WORDS: Kinase, Ontology, Semantic Web, Text Mining, Annotation.

PROKINO: DESIGN AND DEVELOPMENT OF ONTOLOGY ON PROTEIN

KINASES

by

GURINDER PAL SINGH GOSAL

BA, Punjabi University, India, 1991

MCA, Punjabi University, India, 1996

A Thesis Submitted to the Graduate Faculty of The University of Georgia in Partial

Fulfillment of the Requirements for the Degree

MASTER OF SCIENCE

ATHENS, GEORGIA

2010

© 2010

GURINDER PAL SINGH GOSAL

All Rights Reserved

Major Professor: Krzysztof J. Kochut

Co-Major Professor: Natarajan Kannan

Committee: John A. Miller
Hamid R. Arabnia

Electronic Version Approved:

Maureen Grasso
Dean of the Graduate School
The University of Georgia
May 2010

DEDICATION

I would like to dedicate this thesis to my parents, Kartar Singh Gosal and Balwant Kaur Gosal, my wife Ruby, my son Harjap, and all my near and dear ones. I would never have been able to complete this thesis without their love and support. I would like to make a very special mention of my ancestral village, Kakrala, which has always been a source of inspiration throughout my life.

ACKNOWLEDGEMENTS

I am extremely grateful to my major professor, Dr. Krzysztof J. Kochut, for his invaluable supervision and encouragement throughout my research and academic study at The University of Georgia. I am also grateful to my co-major professor, Dr. Natarajan Kannan, for his role as my guide in the biomedical domain, on which my thesis work is primarily based. I would also like to thank Dr. John A. Miller and Dr. Hamid R. Arabnia for serving on my advisory committee, and for their valuable advice and guidance in my research.

I would like to thank my fellow researchers in Dr. Kochut’s LSDIS Research Group in the Computer Science department, and my fellow lab members in Dr. Kannan’s Evolutionary Systems Biology Group in the Biochemistry and Molecular Biology department, for their help and support throughout my period of study and research at The University of Georgia.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES

CHAPTER

1 INTRODUCTION
    1.1 Motivation and Need
    1.2 Contributions
    1.3 Scope

2 BACKGROUND AND LITERATURE SURVEY
    2.1 Semantic Web Vision
    2.2 Ontologies
    2.3 Biomedical Knowledge Management
    2.4 Biomedical Ontologies
    2.5 Biomedical Ontologies Applications
    2.6 Information Sources in Protein Kinase Domain
    2.7 Challenges in Integrating Protein Kinase Knowledge

3 DESIGN OF PROKINO
    3.1 Heterogeneous Knowledge Sources
    3.2 ProKinO Design
    3.3 Architecture of Systems Based on ProKinO

4 PROKINO LIFE CYCLE
    4.1 Data Acquisition
    4.2 Data Integration and Ontology Population
    4.3 Curation and Ontology Modification
    4.4 ProKinO Revisions
    4.5 Ontology Dissemination and Evaluation

5 POTENTIAL APPLICATIONS OF PROKINO
    5.1 ProKinO Browsing
    5.2 Text Mining
    5.3 Cancer Genome Annotation

6 CONCLUSION AND FUTURE WORK

REFERENCES

LIST OF TABLES

Table 1: Kinase distribution by major groups in human systems
Table 2: Classes in ProKinO
Table 3: Object properties in ProKinO
Table 4: Data properties in ProKinO

LIST OF FIGURES

Figure 1: The Semantic Web layered specification
Figure 2: A RDF statement depicting a triple graphically
Figure 3: Web Ontology Language (OWL) and its three sublanguages
Figure 4: An ontology fragment in OWL
Figure 5: A view of Open Link’s Virtuoso Generic Endpoint
Figure 6: Protein kinases phosphorylation
Figure 7: Challenges in integrating protein kinase knowledge
Figure 8: Human kinome poster
Figure 9: The Protein kinase classification in groups by KinBase
Figure 10: Pfam display of protein domain architecture
Figure 11: Crystal structure of protein kinase gene product EGFR
Figure 12: The Universal Protein Resource (UniProt) databases
Figure 13: Conceptualization of ProKinO
Figure 14: Architecture of systems based on ProKinO
Figure 15: Jena example to create an ontology model
Figure 16: Data acquisition from ProKinO sources
Figure 17: A snapshot of Protégé editor showing populated ProKinO
Figure 18: Knowledge discovery through populated ProKinO
Figure 19: A snapshot of elementary browsing of ProKinO

CHAPTER 1

INTRODUCTION

1.1 Motivation and Need

There has been tremendous growth in information about the structure, function, interaction and evolution of protein kinases, which play a very important role in cellular function and disease. There are more than 50,000 protein kinase sequences (from diverse organisms), nearly 500 crystal structures, over 500 non-synonymous mutations, and tens of thousands of published articles on protein kinases, and these numbers keep increasing. Integrating and analyzing these existing data can provide a detailed understanding of the relationships between sequence, structure, function and disease in the protein kinase family. However, the data and information about the protein kinase domain are spread across different resources, each maintaining its own repository, often in a different format. The difficulty of integrating data from these disparate sources and heterogeneous formats has posed major challenges for the protein kinase community in utilizing existing knowledge for research related to diseases like cancer.

Ontologies provide a solution to this problem. By representing data as concepts and integrating data through relationships between these concepts, ontologies provide a framework for capturing, organizing and representing knowledge in a way that computers can process and humans can understand. Ontologies are frequently used to deal with the heterogeneity of the database schemas of different information sources by providing a shareable, consistent and formal description of the semantics [1]. A well identified, conceptualized, designed and populated ontology representing knowledge about a particular domain can be the basis for a range of applications, serving that domain as well as associated areas, that are built upon the knowledge woven into the ontology [2].

There have been many efforts in the recent past in the biomedical community to

represent fundamental, as well as specialized domain knowledge in the form of

ontologies. Many biomedical ontologies are available in the public domain ranging from

the highly developed Gene Ontology (GO) [3] and the Sequence Ontology project [4] to

very domain specific ontologies, such as those maintained under the umbrella of Open

Biomedical Ontologies Foundry [5]. Highly successful biomedical ontologies such as the

Gene Ontology have served as a vehicle for knowledge representation for the biological

community for over a decade. The Gene Ontology has also been used for various

applications including biomedical literature mining and genome annotation.

A few ontologies related to protein families are available in the public domain, for example in the Open Biomedical Ontologies (OBO) Foundry, but they do not describe the domain of protein kinases completely. Protein kinases are a large family of proteins that are implicated in many human cancers and are one of the very few families that have been extensively studied from both the basic and the clinical point of view. Given the immense importance of protein kinases among protein families, there is a clear need for a specialized ontology for protein kinases. The Protein Kinases Ontology (ProKinO) serves as a shared vocabulary to leverage knowledge that can be used in various useful applications in the protein kinase domain.

1.2 Contributions

A number of ontologies in the biological and medical domain have exceeded expectations since their inception. The Open Biomedical Ontologies (OBO) library contains many such ontologies, which are shared across different domains. The list of OBO ontologies includes some in protein-related domains, such as the Protein Ontology (PRO) [6], Protein-protein interaction [7], and Protein-modification [8], but a gap remains as far as completely serving the domain of protein kinases is concerned.

ProKinO is an effort to capture the basic and clinical research data about protein kinases scattered across disparate and heterogeneous sources and to integrate these data in the form of a formal and explicit conceptualization, that is, an ontology. ProKinO has been developed to capture the current state of knowledge on protein kinase sequence, structure, function, motif and disease. ProKinO provides a controlled vocabulary of terms, their hierarchy, and the relationships among them in the protein kinase domain, unifying information from heterogeneous resources to provide a consistent representation of sequence, structure, motif, function and mutation data. We first identified the entities and concepts related to the area of interest, i.e., protein kinases, and then identified the relationships between these entities and concepts. ProKinO has been developed: (i) to formally specify the concepts and their relationships in the domain of protein kinases to provide a sharable and consistent vocabulary, (ii) to integrate sequence, structure, function, motif and disease information on protein kinases in a machine-readable format, (iii) to allow protein kinase researchers to navigate diverse forms of data in one place, and (iv) to annotate cancer genomes, mine the protein kinase literature, and allow the development of other important applications focusing on efficient data exploration and inference.

1.3 Scope

There are many databases and sources available in the protein kinase and associated biomedical domains, and most of them use different terminology and formats to serve the community. ProKinO is populated automatically with information extracted from diverse sources: KinBase [9]; the Catalogue of Somatic Mutations in Cancer (COSMIC) [10]; the Protein Data Bank (PDB) [11]; the Protein Families database (Pfam) [12]; and the Universal Protein Resource (UniProt) [13]. It thereby serves as a compendium of specialized knowledge about the protein kinase domain. ProKinO will help to build software applications based on the knowledge integrated in it, such as annotating the vast amounts of sequence data generated by cancer genome sequencing studies and mining the wealth of literature accumulated on protein kinases.

This ontology can become the basis for text mining the protein kinase literature, not only by giving direct access to the facts stated in the text but also by uncovering indirect relationships between entities. Such a system will go beyond simple keyword searching and let the user query the literature with the knowledge in the ontology, so that information extraction is more efficient. For instance, the synonyms of protein kinase genes captured in the ontology with the object property hasOtherName (e.g., gene AMPa1 is also known by the other name PRKAA1) can make information extraction from the literature in this domain more pertinent. ProKinO holds much integrated knowledge of this type, which can be utilized for mining the literature more efficiently. The result will be a text mining and advanced search system for executing highly specialized queries, created with the use of ProKinO, over the electronically available publications in the area of protein kinases.

We also plan to focus on the creation of an automated cancer genome annotation system based on ProKinO. The integrated knowledge in ProKinO will be used to annotate protein kinase mutations consistently and accurately in upcoming cancer genome sequencing studies, allowing cancer researchers to prioritize mutations for experimental studies.
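The synonym-aware literature search described in this section can be illustrated with a minimal sketch. It assumes the hasOtherName assertions have already been read out of the ontology into a plain dictionary; the function names `expand` and `search` and the sample sentences are hypothetical and are not part of ProKinO.

```python
import re

# Synonyms as they might be read from hasOtherName assertions.
# Only the AMPa1 / PRKAA1 pair comes from the text; the structure is assumed.
synonyms = {"AMPa1": {"PRKAA1"}}


def expand(term, synonyms):
    """Return the term together with every name recorded as its synonym."""
    names = {term} | synonyms.get(term, set())
    for canonical, others in synonyms.items():
        if term in others:
            names |= {canonical} | others
    return names


def search(term, documents, synonyms):
    """Return the documents that mention the term or any of its synonyms."""
    names = expand(term, synonyms)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, sorted(names))) + r")\b")
    return [doc for doc in documents if pattern.search(doc)]


# Hypothetical sentences standing in for literature abstracts.
documents = [
    "PRKAA1 expression was measured.",
    "AMPa1 was phosphorylated.",
    "An unrelated sentence.",
]
hits = search("AMPa1", documents, synonyms)
```

A plain keyword search for AMPa1 would miss the first sentence; expanding the query through the synonym table retrieves both mentions.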

CHAPTER 2

BACKGROUND AND LITERATURE SURVEY

This chapter describes background on the concepts relevant to the design and development of biomedical ontologies. The first part of the chapter includes definitions and background information about the Semantic Web vision, the tools and technologies enabling the Semantic Web, ontologies, and biomedical knowledge management. The later part of the chapter presents related work in the area of biomedical ontologies and their applications.

2.1 Semantic Web Vision

The World Wide Web has progressed rapidly since its inception nearly 20 years ago. The World Wide Web, abbreviated as WWW and commonly known as the Web, has grown exponentially in size and has changed at a rate not imagined even by the researchers working in this area [14]. As of March 2009, the indexable web contained at least 27.08 billion pages [15]; Google Search had discovered one trillion unique URLs [16]; and as of March 2010, over 116.9 million websites were in operation [17]. The Web has matured over time into a web of documents that are readable by humans but difficult for computers to process for meaningful information. The Semantic Web is a step toward overcoming this limitation by giving Web content computer-processable meaning, which will greatly enhance the ability of machines to use that content. According to Tim Berners-Lee, the father of the Web, “The Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation” [18]. Tim Berners-Lee has envisioned the Web as a universal medium for data, information, and knowledge exchange. For the Semantic Web to succeed, some fundamental changes must be made to the structure of the current Web: information must be given semantics by introducing concepts and the relationships among them. Apart from its primary function of giving structure to Web content, the Semantic Web also aims to improve data integration across Web applications and to encourage partnerships.

Many factors can contribute to the success of this new approach of extending the current Web to give meaning to data. Different collaborative working groups have made a concentrated and coordinated effort to make the Semantic Web a reality; they follow certain design principles and use different enabling technologies under the umbrella of the World Wide Web Consortium (W3C) [19]. One of these collaborative groups is the W3C Semantic Web Activity Working Group [20], which has been working on a number of standards in the Semantic Web area. Many technologies have already found their place in the realization of the Semantic Web, and more are being added because of the growing interest in Semantic Web research. Many of the technologies and tools used by the Semantic Web community are provided by the open source community. In the literature, the major Semantic Web technologies are visualized as a layered specification, popularly known as the Semantic Web Layer Cake.

The various tools, technologies, standards and conventions that are the basic building blocks of the Semantic Web, and their layered organization, are shown in Figure 1. The architecture has the Uniform Resource Identifier (URI) as its foundation, with the Extensible Markup Language (XML) on top of it. The Resource Description Framework (RDF) and RDF Schema, as well as the query language SPARQL and the Web Ontology Language (OWL), are placed in the middle of the cake.

Figure 1: The Semantic Web layered specification [21]

The upper layers consist of logic, proof and trust, which are still being explored. Some layers rely on digital signatures (cryptography) to ensure security. Some of the important components of the Semantic Web are briefly discussed in the subsequent sections.

2.1.1 Resource Description Framework (RDF) and RDF Schema

The World Wide Web Consortium developed the Resource Description Framework (RDF) as a simple metadata data model for describing resources and creating relationships among them. The Consortium published a specification of RDF's data model as a W3C Recommendation in 1999, and a new version was published as a set of related specifications in 2004. RDF represents information in a minimally restrictive way, and this simplicity makes sharing easier. RDF defines a resource as an object that is uniquely identifiable by a Uniform Resource Identifier (URI), a formatted string used for identifying abstract or physical resources. The design of RDF is based on the following goals: a simple data model; an extensible URI-based vocabulary; an XML-based syntax; support for XML Schema data types; and formal semantics that provide a dependable basis for reasoning about the meaning of RDF expressions [22].

The fundamental structure of any expression in RDF is very simple and consists of a collection of triples. Each triple is made of a subject, a predicate (also called a property) and an object. An RDF graph is a set of such triples, and the set of nodes of an RDF graph is the set of subjects and objects of the triples in the graph. Figure 2 depicts a triple whose subject is the gene ACK, represented as a URI, whose predicate is hasFunctionalDomain, and whose object (the property value) is SH3_2.
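The triple and graph definitions above can be made concrete with a small sketch. It models a triple as a named tuple and a graph as a set of triples; the base URI is a placeholder (the real ProKinO path is elided in the thesis), and the helper names `nodes` and `to_ntriples` are illustrative only.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement: subject, predicate (property) and object."""
    subject: str
    predicate: str
    obj: str


# Hypothetical namespace; the thesis elides the real ProKinO URI path.
BASE = "http://example.org/prokino/"

# An RDF graph is a set of triples.
graph = {
    Triple(BASE + "ACK", BASE + "hasFunctionalDomain", BASE + "SH3_2"),
    Triple(BASE + "AMPa1", BASE + "hasOtherName", BASE + "PRKAA1"),
}


def nodes(triples):
    """Return the node set of a graph: all subjects and all objects."""
    return {t.subject for t in triples} | {t.obj for t in triples}


def to_ntriples(t):
    """Write a triple of URIs as one N-Triples-style line."""
    return f"<{t.subject}> <{t.predicate}> <{t.obj}> ."
```

Predicates label the edges of the graph, so they do not appear in the node set unless they also occur as a subject or object of some triple.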

 

Here is a complete RDF triple:

<http://om.cs.uga.edu/.../ACK><http://om.cs.uga.edu/.../hasFunctionalDomain>
<http://om.cs.uga.edu/.../SH3_2>.

Figure 2: A RDF statement depicting a triple graphically.

The Resource Description Framework Schema (RDF-S) is a step ahead in the Semantic Web and can be regarded as an extensible knowledge representation language. RDFS allows resources to be defined with classes, properties and values. With RDF-S the restrictions and extra relationships can be defined, which is not the case for RDF. The classes in the RDF Schema can be organized in a hierarchical fashion. For example, the Cancer class can be defined as a subclass of the Disease class. The properties in the RDF Schema can inherit properties from other properties and also domain and range constraints can be applied to them. The more rich Semantic Web language OWL embraces many concepts from RDFS and is discussed below.

2.1.2 Web Ontology Language (OWL)

The Web Ontology Language (OWL), which has been recommended by W3C, has made its place as a standard ontology language of the Semantic Web. OWL can be used to build ontologies that are explicit representations of terms and their interrelationships. The expressive power of OWL in terms of providing facilities to represent machine interpretable content on the Web is greater as compared to earlier recommendations of XML, RDF, and RDF-S. The Web ontology languages DAML and OIL were revised to give richer knowledge representation language for the Semantic Web to be known as the Web Ontology Language (OWL). The OWL language comes in three sublanguage groupings (shown in Figure 3) that are differentiated on the basis of the expressive power provided by each group.

Figure 3: Web Ontology Language (OWL) and its three sublanguages.

OWL-Lite: This is the sublanguage of OWL which is having the least expressive power and it supports the community of users who require only a simple class hierarchy and simple constraints. Because of its lesser complexity, it is easier to provide tools for supporting OWL-Lite than the other two sublanguages.

RDF Schema does not include a number of features which are generally regarded as important modeling capabilities. OWL has been designed to address many of the shortcomings of RDF/RDF-S. For example, the equality claims about instances can be made in OWL-Lite which is not possible in RDF Schema. For example,

• SameIndividual(instance, instance)

Equivalence between classes and properties can also be described by OWL-Lite which is

not directly available in RDF-S. For example,

• EquivalentClasses(ClassA, ClassB)

• EquivalentProperties(Property1, Property2)

Additional property modeling capabilities are available in OWL-Lite. For example, one property can be declared the inverse of another, and a property can be transitive, symmetric or functional. A functional property has a maximum cardinality of 1 for its value. For example, a person can have only one biological mother, so hasBiologicalMother can be declared functional: (ObjectProperty hasBiologicalMother Functional). OWL-Lite allows cardinality constraints, but these are special cases of the more general constraints allowed in the more expressive sublanguage, OWL-DL.
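The property characteristics named above can be illustrated on plain sets of (subject, value) pairs. This is only a sketch under simplifying assumptions: it treats differently named individuals as distinct, whereas an OWL reasoner makes no such assumption and would instead infer that two values of a functional property denote the same individual. All function names and individuals here are hypothetical.

```python
def functional_conflicts(pairs):
    """Return subjects with more than one distinct value for the property."""
    values = {}
    for subject, value in pairs:
        values.setdefault(subject, set()).add(value)
    return {s for s, vs in values.items() if len(vs) > 1}


def inverse(pairs):
    """Return the inverse property: every (s, o) becomes (o, s)."""
    return {(o, s) for s, o in pairs}


def transitive_closure(pairs):
    """Add (a, c) whenever (a, b) and (b, c) hold, until nothing new appears."""
    closure = set(pairs)
    while True:
        extra = {(a, d) for a, b in closure for c, d in closure if b == c} - closure
        if not extra:
            return closure
        closure |= extra


# hasBiologicalMother as a functional property (hypothetical individuals).
has_biological_mother = {("alice", "carol"), ("bob", "carol")}
is_biological_mother_of = inverse(has_biological_mother)
```

Two children sharing a mother is consistent with a functional property; only one subject having two different values would be flagged by `functional_conflicts`.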

OWL-DL: This sublanguage of OWL has greater expressive power than OWL-Lite and gets its name, DL, from its correspondence with Description Logic. It supports the user community that requires a highly expressive language that is at the same time computationally complete and decidable. Computational completeness means that all conclusions are guaranteed to be computable, and decidability means that all computations finish in finite time.

OWL-DL further extends the advantages of OWL-Lite over RDF-S with more powerful language constructs than those available in OWL-Lite. In addition to the equivalence axioms discussed for OWL-Lite, classes can be declared disjoint. For example, DisjointClasses(ClassA, ClassB) expresses that the two classes have no instances in common. OWL-DL removes the restriction on cardinality constraints, so we can provide general cardinality constraints with values from 0 to n, as compared to only the values 0 and 1 allowed in OWL-Lite. So an instance of the Parent class can have a property hasChildren with a cardinality greater than 1. With OWL-DL we can use arbitrary Boolean class expressions built with disjunction, conjunction and negation. Some examples of these constructs are shown here:

• ComplementOf(ClassA) - Negation

• SubClassOf(ClassA UnionOf(ClassC,ClassD)) - Disjunction

• SubClassOf(ClassA IntersectionOf(ClassC,ClassD)) - Conjunction
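One way to read these class constructors is through the sets of individuals that belong to each class. The sketch below assumes a small, closed universe of hypothetical individuals, which is a simplification: OWL itself reasons over all possible interpretations rather than one fixed set.

```python
# Hypothetical individuals and class memberships, for illustration only.
universe = {"a", "b", "c", "d", "e"}
class_c = {"a", "b"}
class_d = {"b", "c"}

union_cd = class_c | class_d          # UnionOf(ClassC, ClassD): disjunction
intersection_cd = class_c & class_d   # IntersectionOf(ClassC, ClassD): conjunction
complement_c = universe - class_c     # ComplementOf(ClassC): negation


def is_subclass(sub, sup):
    """SubClassOf holds when every member of sub is also a member of sup."""
    return sub <= sup


def are_disjoint(x, y):
    """DisjointClasses holds when the two classes share no member."""
    return x.isdisjoint(y)
```

Under this reading, a class is always disjoint from its own complement, and any class that is a subclass of the intersection is also a subclass of the union.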

OWL-Full: This sublanguage of OWL has the maximum expressive power among all OWL sublanguages and is preferred by users who want the full syntactic freedom of RDF, although it offers no computational guarantees. Complete automated reasoning over OWL-Full ontologies cannot be guaranteed.

RDF-S is tolerant in the sense that it allows a class to be an instance of another class, and even a class to be an instance of itself. OWL-Full provides the same capability, which is not available in OWL-DL and OWL-Lite. However, no practical reasoning tool supports every feature of OWL-Full. A snapshot of an ontology fragment in OWL is shown in Figure 4.

Which sublanguage of OWL to use depends upon the requirements of the user. If more expressive power is required, then OWL-DL may be preferable to OWL-Lite. OWL-Full may be advisable when users require even more meta-modeling capabilities.

    <owl:Thing rdf:about="#CDK11-G-helix">
        <rdf:type rdf:resource="#SubDomainX"/>
        <rdfs:label xml:lang="en">CDK11 G-helix subdomain</rdfs:label>
        <hasEndLocation>236</hasEndLocation>
        <hasSubDomainSequence>CIFAELLTSEPIFHC</hasSubDomainSequence>
        <hasStartLocation>222</hasStartLocation>
        <rdfs:comment xml:lang="en">This is a Protein Kinase Gene CDK11 &#39;s sub domain part G-helix</rdfs:comment>
    </owl:Thing>

Figure 4: An ontology fragment in OWL (An excerpt from ProKinO OWL file).
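Because the fragment in Figure 4 is RDF/XML, its data properties can be read with an ordinary XML parser. The following sketch wraps a shortened copy of the fragment in an rdf:RDF root so that it parses on its own; the rdf and owl namespace URIs are the standard W3C ones, while the default namespace is a placeholder because the excerpt does not show the one ProKinO uses.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the Figure 4 fragment. The default namespace below is a
# placeholder assumption; the excerpt does not show ProKinO's own namespace.
DOC = """<rdf:RDF
    xmlns="http://example.org/prokino#"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#">
  <owl:Thing rdf:about="#CDK11-G-helix">
    <rdf:type rdf:resource="#SubDomainX"/>
    <hasStartLocation>222</hasStartLocation>
    <hasEndLocation>236</hasEndLocation>
    <hasSubDomainSequence>CIFAELLTSEPIFHC</hasSubDomainSequence>
  </owl:Thing>
</rdf:RDF>"""

NS = {
    "p": "http://example.org/prokino#",
    "owl": "http://www.w3.org/2002/07/owl#",
}

root = ET.fromstring(DOC)
thing = root.find("owl:Thing", NS)

# Read the three data properties of the subdomain individual.
start = int(thing.findtext("p:hasStartLocation", namespaces=NS))
end = int(thing.findtext("p:hasEndLocation", namespaces=NS))
sequence = thing.findtext("p:hasSubDomainSequence", namespaces=NS)

# Sanity check: residues 222..236 inclusive should match the sequence length.
consistent = len(sequence) == end - start + 1
```

A check of this kind, comparing the recorded locations with the sequence length, is one simple way such populated instances could be validated.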

2.1.3 SPARQL (SPARQL Protocol and RDF Query Language)

SPARQL, pronounced “sparkle”, is a query language for the Semantic Web that is used to query sets of RDF graphs. It can also be used to query OWL ontologies, e.g. using ARQ, a query engine for Jena that supports the SPARQL RDF query language. SPARQL is a specification provided by the World Wide Web Consortium (W3C) working group known as the RDF Data Access Working Group (DAWG). At the start of 2008, SPARQL became an official W3C Recommendation as a standard ontology query language. SPARQL makes it possible to fetch values from structured and semi-structured data, to discover data by querying unknown relationships, and to execute composite joins of different databases in a single query [23].

The Structured Query Language (SQL) is used to query data in a relational database, and SPARQL resembles SQL to a degree. Although a triple store and a relational database are fundamentally different, SPARQL has been made similar to SQL to help developers who are familiar with SQL. Relational databases store data in relations, in the form of tables, and use foreign keys to relate the data. In RDF, URIs are used to link to any other data in any triple store.

SPARQL queries are usually run on endpoints, which take queries as input and produce
results in different formats, such as XML, RDF, and HTML. A SPARQL endpoint is a
conformant SPARQL protocol service that enables users (human or other) to query a
knowledge base via the SPARQL language and have results returned in one or more
machine-processable formats [24]. SPARQL endpoints are usually classified as generic
endpoints, which can query any Web-accessible RDF data (for example, a dataset from
FOAF or from our ProKinO), and specific endpoints, which are meant for particular
datasets (for example, Virtuoso’s DBpedia SPARQL endpoint is tailor-made for the
DBpedia dataset only). Examples of generic endpoints are:

• sparql.org (by HP's ARQ)

• OpenLink's Virtuoso

• Redland's Rasqal.
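To make the idea of an endpoint more concrete, the sketch below builds (but does not send) an HTTP request of the kind the SPARQL protocol describes, where the query text travels in a `query` parameter. The endpoint URL is a hypothetical placeholder and the query is a generic one; neither is taken from the services listed above.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint address, used only for illustration.
ENDPOINT = "http://example.org/sparql"

# A generic query that asks for any five triples.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o . } LIMIT 5"


def build_sparql_request(endpoint, query):
    """Encode the query into the URL and ask for SPARQL XML results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+xml"})


request = build_sparql_request(ENDPOINT, QUERY)
print(request.full_url)
```

Passing the resulting object to `urllib.request.urlopen` would submit the query; that step is left out here because the endpoint address is only a placeholder.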

SPARQL queries are called Graph Patterns because to get data extracted from the

triple store using SPARQL, we define a pattern that matches the statements in the graph.  

 

A SPARQL query input in OpenLink’s Virtuoso, a Generic Endpoint, is shown in this
snapshot depicted in Figure 5.

Figure 5: A view of OpenLink’s Virtuoso Generic Endpoint (A query for ProKinO).

This query in the SPARQL language shown below is used to find all types of
cancers that gene ABL1 is associated with in ProKinO:

PREFIX prokino:<http://om.cs.uga.edu/....../Prokino/>
SELECT ?Disease
WHERE { prokino:ABL1 prokino:associatedWith ?Disease. }
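The way a graph pattern selects statements can be sketched in a few lines of Python. This is a toy matcher for a single triple pattern, not how a real SPARQL engine is implemented, and the triples (including the names CancerTypeA, CancerTypeB, CancerTypeC and OtherGene) are invented placeholders rather than ProKinO content.

```python
# Each statement is a (subject, predicate, object) triple.
# The data is invented for illustration only.
TRIPLES = [
    ("prokino:ABL1", "prokino:associatedWith", "prokino:CancerTypeA"),
    ("prokino:ABL1", "prokino:associatedWith", "prokino:CancerTypeB"),
    ("prokino:OtherGene", "prokino:associatedWith", "prokino:CancerTypeC"),
    ("prokino:ABL1", "rdf:type", "prokino:Gene"),
]


def is_variable(term):
    """SPARQL-style variables start with a question mark."""
    return term.startswith("?")


def match_pattern(pattern, triples):
    """Return one variable binding per triple that matches the pattern."""
    results = []
    for triple in triples:
        binding = {}
        for pattern_term, triple_term in zip(pattern, triple):
            if is_variable(pattern_term):
                # A repeated variable must bind to the same value each time.
                if binding.setdefault(pattern_term, triple_term) != triple_term:
                    break
            elif pattern_term != triple_term:
                break
        else:
            results.append(binding)
    return results


bindings = match_pattern(
    ("prokino:ABL1", "prokino:associatedWith", "?Disease"), TRIPLES
)
diseases = [b["?Disease"] for b in bindings]
print(diseases)
```

Only the two statements whose subject and predicate equal the fixed terms of the pattern contribute a value for ?Disease; the other two are skipped.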

2.2 Ontologies

The term ontology in Computer Science originates in metaphysics, a branch of
philosophy, where it is defined as the systematic study of existence [25]. One of
the simplest and most frequently used definitions of ontology is given by Thomas
Gruber in his extensively cited paper "Toward Principles for the Design of
Ontologies Used for Knowledge Sharing". He defines an ontology as an explicit
specification of a conceptualization [26]. The World Wide Web Consortium (W3C) has
defined ontology in the

following words: “An ontology defines the terms used to describe and represent an area

of knowledge. Ontologies are used by people, databases, and applications that need to

share domain information (….) Ontologies include computer-usable definitions of basic

concepts in the domain and the relationships among them (…). They encode knowledge in

a domain and also knowledge that spans domains. Ontologies are considered to make

that knowledge reusable” [27].

The reasons for building an ontology in a particular domain are to share a common
understanding of the domain's structure among its stakeholders, to enable reuse of
the domain knowledge, to describe the domain explicitly, to delineate the domain
knowledge from operational commitments, and to explore the domain knowledge [28].
Ontologies are used as a tool for data integration and knowledge representation in
many disciplines, such as the Semantic Web, Artificial Intelligence, Natural
Language Processing, Information Retrieval, Bioinformatics, Software Engineering,
Education, and so forth.

Building an ontology is not a trivial task, especially if the ontology has to be
truly representative of the domain of discourse. The task requires relevant data to
be extracted, formalized and then integrated into the ontology. Furthermore, the
ontology has to be accepted by the user community [29]. Ontologies are made up of
classes, properties and individuals, which together form the foundation for
representing knowledge.

• Classes in the ontologies represent the main concepts in the field of interest. The

example of a class in a biomedical ontology can be a Mutation class which captures

various mutations in genes as a concept and the specific mutations become the

instances of the Mutation class. There can be a hierarchy of further subclasses under

parent classes for more specialized representation of the domain of discourse, which

is usually described as a taxonomic hierarchy.

• The concepts in themselves are not able to fully describe the domain and we need

more information to describe the internal organization of concepts. The properties in

ontologies are used to capture the various features and characteristics of the concepts

represented in the form of classes. The properties can be used for further

characterizing the instances of classes or showing the relationships between instances

of various classes. For example, the Mutation class can have a property

hasPrimarySite to show that a particular mutation has a primary site of cancer

occurrence, such as “skin” and at the same time can have a property foundIn to show

its relationship with an instance of another class, Gene.

• Individuals are used to represent elements or entities in an ontology. Defining an

individual instance of a class requires selecting a class, forming an individual instance

of that class, and loading the instance values. For example, we can create an

individual instance carcinoma to represent a specific type of Cancer.
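The three building blocks described above can be mirrored loosely in ordinary code. The sketch below uses Python dataclasses as a stand-in for ontology classes; it is only an analogy, since OWL classes and properties carry formal semantics that plain Python objects do not. The identifiers `ExampleGene` and `example-mutation-1` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    """Class: the concept of a gene."""
    name: str


@dataclass(frozen=True)
class Cancer:
    """Class: the concept of a cancer type."""
    name: str


@dataclass(frozen=True)
class Mutation:
    """Class: the concept of a mutation, with two properties."""
    identifier: str
    has_primary_site: str  # property holding a literal value, e.g. "skin"
    found_in: Gene         # property relating a Mutation to a Gene instance


# Individuals: concrete instances of the classes above.
example_gene = Gene("ExampleGene")
carcinoma = Cancer("carcinoma")
mutation = Mutation("example-mutation-1", has_primary_site="skin",
                    found_in=example_gene)

print(mutation.found_in.name, mutation.has_primary_site, carcinoma.name)
```

Here hasPrimarySite behaves like a property with a literal value, while foundIn links one individual to another, which is the distinction drawn in the second bullet.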

Ontologies that are built using OWL-Lite or OWL-DL can be processed by a reasoner,
which is also known as a classifier. The main functionality provided by a reasoner
is to check whether or not one class is a subclass of another class. These checks
are used to compute the inferred class hierarchy of the ontology. Classifiers can
also be used to check the consistency of the ontology. Based on the conditions
described for a class, the classifier checks whether or not it is possible for the
class to have any instances. If it is not possible, the reasoner considers the
class inconsistent.
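The two reasoner services mentioned above, subclass checking and consistency checking, can be illustrated with a deliberately simplified sketch. It handles only asserted subclass links and pairwise disjointness, which is far less than what an OWL-DL reasoner supports, and the small hierarchy (including the class named Impossible) is invented for the example.

```python
# Asserted direct superclasses for each class (invented example hierarchy).
SUBCLASS_OF = {
    "Carcinoma": {"Cancer"},
    "Cancer": {"Disease"},
    "Disease": set(),
    "Gene": set(),
    "Impossible": {"Gene", "Disease"},
}

# Pairs of classes declared to share no instances.
DISJOINT = {frozenset({"Gene", "Disease"})}


def ancestors(cls):
    """All superclasses reachable through asserted subclass links."""
    seen = set()
    stack = list(SUBCLASS_OF.get(cls, ()))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(SUBCLASS_OF.get(parent, ()))
    return seen


def is_subclass(sub, sup):
    """Subsumption check: every class is also a subclass of itself."""
    return sub == sup or sup in ancestors(sub)


def can_have_instances(cls):
    """A class under two disjoint classes can never have an instance."""
    supers = ancestors(cls) | {cls}
    return not any(pair <= supers for pair in DISJOINT)


print(is_subclass("Carcinoma", "Disease"), can_have_instances("Impossible"))
```

The inferred hierarchy places Carcinoma under Disease even though only the link to Cancer is asserted, and the class Impossible is flagged because it sits under two classes declared disjoint.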

Ontologies are classified into various categories by stakeholders using different
criteria. The classification can vary in scope and description, level of hierarchy
and level of formalism. An upper ontology, also known as a top-level ontology,
foundation ontology or generic ontology, describes very general concepts that are
applicable to several domains. This type of ontology is created to support very
broad semantic interoperability between a large number of ontologies accessible
under it. The Suggested Upper Merged Ontology (SUMO) [30] and its domain ontologies
form the largest formal public ontology, which is used for research and
applications in search, linguistics and reasoning. Basic Formal Ontology (BFO) [31]
grows out of a philosophical orientation and is focused on the task of providing a
genuine upper ontology which can be used in support of domain ontologies developed
for scientific research, as for example in biomedicine within the framework of the
OBO Foundry. IDEAS [32] is a formal, higher-order, 4D (four-dimensionalist) upper
ontology which is extensional, using physical existence as its criterion for
identity. This ontology is well suited to managing change over time and
identifying elements with a degree of precision that is not possible using names
alone. Domain ontologies are developed to represent the knowledge about specific
domains, such as biomedicine. The Protein Ontology (PRO) is an example of a domain
ontology. An ontology can also be a task ontology, used to accomplish certain
tasks. Application ontologies are used to extend specific applications.

There are various criteria used for building ontologies, but all lead to two broad
approaches. One approach is to reuse existing ontologies available in the public
domain, sharing already captured knowledge through a variety of techniques. One
technique of reuse is to include one ontology in another to form a combined
ontology; another is to refine a generic ontology to fulfill the specific needs of
the new ontology. Where possible, reusing an existing relevant ontology is
preferable to duplicating the effort and increasing the complexity for the user
community. When no suitable ontology exists, the alternative approach is to build
the ontology from scratch. Because ProKinO was built to serve the specific domain
of protein kinases, which is not covered by any existing ontology, it followed the
approach of building the ontology by acquiring data from disparate sources and
representing the integrated knowledge in it.

Different ontology design principles have evolved over time and are

conventionally followed by developers to produce good quality ontologies. As ontologies

are shared within the domain communities as well as outside, these design principles, if

followed, result in more acceptable and useful ontologies. The ontologies should be

collaborative, extensible, portable, user friendly, modifiable, and well documented.


2.3 Biomedical Knowledge Management

Many knowledge experts make a distinction among data, information and knowledge in
the literature. Data alone may not carry much significance, but when the same data
is given some conceptual context and processed in a meaningful way to provide
useful results, it is termed information. Knowledge is what we achieve when some
understanding is applied to the information. Biomedical scientific data is
available in abundance, but to manage such large amounts of data we have to apply
the process of biomedical knowledge management. Knowledge management is defined as
a method of gathering, processing, organizing, storing and analyzing information
about a particular domain and then disseminating this knowledge in the form of
useful applications.

The integration of data and its representation as knowledge is an essential part
of the process of knowledge management. Data about a particular domain may be
scattered across different datasets, and knowledge may emerge only from
integrating and further analyzing that data. For example, data about a certain
mutation related to a gene may be available, but extracting knowledge about the
sequence in which this mutation occurs (which may be present in the gene’s sequence
data) may require processing and inference.

Today, computers play a far greater role in acquiring, processing, organizing,
integrating, storing and distributing knowledge than ever before. Integrating data
from different sources is not an easy task, as there can be disparities in the
schema definitions of these sources as well as in their naming conventions [33].
The tools for managing biomedical knowledge can be based on technologies used in
the Semantic Web, which are designed to deal with heterogeneous resources.
Ontologies, one of these technologies and a central component of the Semantic Web,
play a central role in integrating knowledge. Ontologies support data integration
through the Warehousing and Mediation approaches [34]. In the Warehousing approach,
data from different sources is transformed into a common format and standardized
against a common vocabulary, whereas in the Mediation approach a global schema is
designed and mapped to the local schemas of the sources [35]. How knowledge is
represented also plays a significant role in managing it. One of the more popular
approaches to representing knowledge is to use ontologies. Biomedical ontologies
are controlled vocabularies for shared use across different biological and medical
domains and can be seen as tools for data integration and knowledge representation
in managing biomedical knowledge. ProKinO is a specialized biomedical ontology in
this effort, which manages protein kinase domain knowledge by integrating data
about the domain, as well as representing the knowledge created from this
integration of data.
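The difference between the Warehousing and Mediation approaches can be sketched as follows. This is a minimal illustration under assumed inputs: the two sources, their field names and their records are all invented, and real systems involve far richer schema mappings.

```python
# Two hypothetical sources that name the same fields differently.
SOURCE_A = [{"gene_symbol": "KIN1", "aa_change": "mutA"}]
SOURCE_B = [{"Gene": "KIN2", "Mutation": "mutB"}]
SOURCES = {"A": SOURCE_A, "B": SOURCE_B}

# Mapping from global schema fields to each source's local field names.
MAPPINGS = {
    "A": {"gene": "gene_symbol", "mutation": "aa_change"},
    "B": {"gene": "Gene", "mutation": "Mutation"},
}


def to_global(row, mapping):
    """Rename a source record's fields into the global schema."""
    return {global_field: row[local_field]
            for global_field, local_field in mapping.items()}


def build_warehouse():
    """Warehousing: convert every record up front into one common store."""
    return [to_global(row, MAPPINGS[name])
            for name, rows in SOURCES.items() for row in rows]


def mediated_query(field, value):
    """Mediation: leave data in place and translate the query per source."""
    hits = []
    for name, rows in SOURCES.items():
        local_field = MAPPINGS[name][field]
        hits.extend(to_global(row, MAPPINGS[name])
                    for row in rows if row[local_field] == value)
    return hits


print(build_warehouse())
print(mediated_query("gene", "KIN2"))
```

In the warehousing function the translation happens once, before any query is asked; in the mediated function the sources are untouched and the mapping is applied each time a query arrives.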

2.4 Biomedical Ontologies

Biomedical ontologies are seen as an answer to the challenges of seamlessly
integrating data from heterogeneous sources that use different schemas and naming
methods. They are a very important tool for integrating data from basic and
clinical research and for representing knowledge that benefits biomedical
communities. The effort that began with experts developing small ontologies in
restricted domains of interest using rudimentary tools has now advanced to a point
where large, popular ontologies are being built by groups of people from various
disciplines.

Many endeavors have taken place in the field of biomedical ontologies, and the
National Center for Biomedical Ontology (NCBO) [36] is one of the efforts dedicated
to this cause. NCBO was formed in 2006 with funding from the National Institutes of
Health (NIH) [37] as a component of the National Centers for Biomedical Computing.
It has been described as a consortium of world-class researchers committed to
developing technology and infrastructure that boost biomedical research. The
stated vision of NCBO is that all biomedical knowledge and data are disseminated on
the Internet using principled ontologies in such a way that the knowledge and data
are semantically interoperable and useful for furthering biomedical science and
clinical care. Its stated mission is to create software and support services for
the application of principled ontologies in biomedical science and clinical care,
ranging from tools for application developers to software for end-users. NCBO is
involved in the development of ontologies from the OBO family. Other
ontology-related research is being conducted in both Europe and the US. The
National Center for Ontological Research (NCOR) was established by the University
at Buffalo and Stanford University in partnership with a number of institutions
drawn from academia, government, and industry; it works for the advancement of
ontological investigation and focuses on the development of tools and procedures
for quality assurance of ontologies [38]. The European Center for Ontological
Research (ECOR), founded at Saarland University in Saarbrücken, Germany, in 2004,
represents a new approach in applying ontology to a variety of problems [39].

2.5 Biomedical Ontologies Applications

Biomedical ontologies are becoming increasingly popular in the biomedical
community. They are being developed at every level, including very generic
ontologies capturing common high-level concepts, mid-level biomedical ontologies,
domain-specific ontologies, and task-specific ontologies. Examples of all these
types can be found among the well-established biomedical ontologies available in
the public domain. Many collaborative efforts in the field of biomedical
ontologies are under way to serve the community. The Open Biomedical Ontologies
Foundry collaborative effort, the NCBO BioPortal, and one of the most significant
OBO ontologies, the Gene Ontology (GO), are discussed below.

2.5.1 Open Biomedical Ontologies (OBO) Foundry

The OBO Foundry is an open, inclusive and collaborative platform for biological
researchers and biomedical community members who are involved in the design,
development and publishing of ontologies serving the domain. The goal of OBO is to
develop a set of interoperable reference ontologies, validated by humans, for all
major domains of biomedical research [40]. In 2001, an umbrella body named Open
Biology Ontologies Foundry was formed by Ashburner and Lewis to serve the community
of developers and publishers of life-science ontologies. This started as a joint
effort in which a group of OBO ontology developers agreed in advance to embrace an
emerging set of principles spelling out best practices in the development of
biomedical ontologies. These principles require that the ontologies (i) are open
and available to be used by all without any constraint, (ii) can be expressed in a
common formal language, (iii) use relations which are unambiguously defined,
(iv) provide procedures for user feedback and for identifying successive versions,
and (v) are developed in a collaborative effort [41].

As of this writing, the OBO Foundry contains 8 ontologies under the “OBO foundry
candidate ontologies” category and 86 under the “Other ontologies and terminologies
of interest” category, and the list continues to grow. OBO is supported by the
National Center for Biomedical Ontology (NCBO) through its BioPortal [42]. These
ontologies are becoming widely accepted as a reference in the biomedical domain.
The OBO Foundry makes a continuous effort to improve effectiveness and
transparency and to facilitate interactions among ontology communities, in order
to support the semantic integration of biological data from the many sources
generating it and to promote reasoning for information extraction from these data.

2.5.2 NCBO BioPortal

Recently, a very useful resource has been made available in the public domain for
accessing biomedical ontologies via Web services and Web browsers, covering
ontologies developed in OWL, RDF, OBO format and Protégé frames. This resource is
BioPortal (http://bioportal.bioontology.org) [43], funded by the National
Institutes of Health [37], which facilitates browsing, searching and visualizing
ontologies and acts as a repository of biomedical ontologies. Many Web 2.0 features
have been incorporated in it to make the system behave as a comprehensive ontology
repository. The resource is particularly useful because it also provides a web
interface to support community-based participation, which can play a greater role
in the evolution and evaluation of ontologies. BioPortal also provides features
for the community to add notes to ontology terms, mappings between terms, and
ontology reviews based on criteria such as usability, domain coverage, quality of
content, and documentation and support. BioPortal is described as ‘one-stop
shopping’ that not only provides programmatic access to biomedical ontologies but
also supports integrating data from a variety of biomedical resources.

BioPortal contains a large number of ontologies in its repository and the list is
growing rapidly. In March 2008, the repository included 72 ontologies (300,000
total classes), which almost doubled within one year to 134 ontologies (680,000
total classes). That number has further grown to 193 ontologies (as of April 2010).
BioPortal contains ontologies on a number of subjects, such as anatomy, phenotype,
experimental conditions, imaging, chemistry, health and many more. Metadata
collected for each ontology include keyword terms, text descriptions, version
information, release date, ontology author contact information and links to
documentation; the ontology content and metadata can be updated automatically or
by user submission [42]. In addition, BioPortal provides Web services and a Web
interface from which prior ontology versions can be accessed and downloaded.

2.5.3 Gene Ontology (GO)

The Gene Ontology (GO) is an immensely popular ontology in the biomedical field.
The Gene Ontology project [44] describes itself as a major bioinformatics
initiative with the aim of standardizing the representation of gene and gene
product attributes across species and databases. The GO ontology is used to capture
the

 

knowledge about biological processes, cellular components and molecular functions and
it addresses the need for consistent descriptions of gene products in different databases.

Our endeavor of developing a specialized ontology, ProKinO, also can be
considered as a bio-ontology application serving protein kinase and related communities.

2.6 Information Sources in Protein Kinase Domain

Protein Kinase: A “protein kinase” is a kinase enzyme that changes other proteins
by chemically adding phosphate groups to them and this process is called
phosphorylation, as depicted in Figure 6. The effect of this process is usually a
functional change of the target protein (substrate) by changing the enzyme activity,
cellular location, or association with other proteins [45]. The human genome contains
about 518 protein kinase genes and they constitute nearly 2% of all human genes. A large
percentage of human proteins may be modified by kinase activity (up to 30% of all) and
majority of cellular pathways may be regulated, especially those involved in signal
transduction. Signal transduction results in a change in the cell behavior.

Figure 6: Protein kinases phosphorylation (adapted from [46]).

Because protein kinases have large effects on a cell, their activity is highly
regulated. Kinases may be turned on or off by phosphorylation, by the binding of
activator proteins, inhibitor proteins or small molecules, or by controlling their
location in the cell relative to their substrates. Deregulated kinase activity is a
frequent cause of disease, particularly cancer, where kinases regulate many aspects
that control cell growth, movement and death. Drugs which inhibit specific kinases
are being developed to treat several diseases, and some are currently in clinical
use, including Gleevec (imatinib) and Iressa (gefitinib) [47].

In the public domain, there are many database resources available containing
information related to the protein kinase domain. Our ontology, ProKinO, is
populated with data extracted from a number of different publicly available
biological database resources. As mentioned earlier, these sources include
KinBase, the Catalogue of Somatic Mutations in Cancer (COSMIC), the Protein Data
Bank (PDB), the protein families database (Pfam), and the Universal Protein
Resource (UniProt). These sources contain a large amount of information about the
domain of discourse, but they store it in different formats and use different
concept specifications.

KinBase: This is an interactive kinase database available at Kinase.com, which aims
to provide a platform for a broad analysis of protein phosphorylation in normal and
disease states. KinBase organizes all kinases into a classification hierarchy of
protein kinase groups, families and subfamilies. This classification is helpful in
evaluating kinase function and growth by comparison of related kinases.

COSMIC: The Catalogue of Somatic Mutations in Cancer (COSMIC) is a database
dedicated to hosting information about somatically acquired mutations relating to
human cancers. A huge amount of information about mutations is available in the
published scientific literature, and COSMIC is a project to bring together
information about publications, samples and mutations in one place.

Pfam: The Protein Family Database (Pfam) is a comprehensive database of conserved
protein families, extensively used by researchers in the biomedical domain,
consisting of a collection of multiple sequence alignments and profile hidden
Markov models (HMMs). Different combinations of domains give rise to the wide range
of proteins found in nature. The function of proteins can be better understood by
identifying the domains within them, and Pfam HMMs are regarded as a valuable
resource for this purpose. Each Pfam HMM represents a protein family or domain.

UniProt: The Universal Protein Resource (UniProt) is considered a comprehensive
catalogue of protein sequences and functional annotations. UniProt is a freely and
easily accessible global resource for storing and interconnecting information from
voluminous and disparate sources. It greatly helps researchers and scientists in
conducting interactive and custom-tailored analyses of proteins of interest.

PDB: The Protein Data Bank (PDB) acts as a comprehensive source of
three-dimensional structures of macromolecular complexes of proteins, nucleic
acids, and other biological molecules. The knowledge about the shape of molecules
provided by PDB can be used to understand a structure's role in human health,
disease and drug development.

The information is extracted from these diverse sources and then populated into
ProKinO. This includes information about protein kinase genes (their
classification into groups, families and subfamilies, their sequences,
FastaFormat and chromosomal position) and about the mutations associated with
protein kinase genes (mutation primary sites, primary histology, mutation amino
acid, mutation description, mutation residue, and cancer types). Information about
functional features, such as the modified residue, signal peptide, topological
domain, cellular location and tissue specificity, as well as about crystal
structures and the identification and classification of functional domains in
protein kinase sequences, is also captured and represented in ProKinO.

2.7 Challenges in Integrating Protein Kinase Knowledge

Biological researchers and scientists use specific theories and models for the data
related to their work in the domain of interest and often end up applying different
descriptions to the same data [48]. There has been an exponential growth in
research and clinical data, resulting in many isolated repositories of knowledge
designed, developed and maintained by separate groups working in the same or
related areas of interest. Whenever a user needs to access information from these
disparate sources, utilizing the knowledge stored in them poses many challenges.
The researcher has to make customized arrangements for fetching data from the
required sources and integrating it to fulfill the current need, and any new query
later on will require going through this adaptation again. The protein kinase
knowledge is

 

also not available from a single unified source, and has to be collected from different
sources often in heterogeneous data formats.

For example, if we have a complex query to find the protein kinase gene
associated with a particular mutation and also information about the structure and
sequence of the gene in question is needed along with the functional domain and a
functional feature (for example, tissue specificity) of this gene. This query will require
fetching information from the COSMIC database for mutation data, from the KinBase
database for gene and sequence data, from the Pfam for functional domain and the
UniProt for functional features. The fetched information from these sources will be in
different formats that may be using different nomenclature for same entity and will
require customized processing for answering the query, as shown in Figure 7.

Figure 7: Challenges in integrating protein kinase knowledge

We can very well imagine the challenges, considering the voluminous data and
numerous queries that may have to be handled. We will see in later sections how a
well-populated ProKinO, containing knowledge from various sources, can be helpful
in dealing with these kinds of challenges.
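The customized processing that such a cross-source query demands can be sketched as follows. Everything in this example is assumed for illustration: the records, the field names and the gene spellings are invented stand-ins for exports from COSMIC, KinBase, Pfam and UniProt, and the normalization rule is a deliberately crude way of reconciling different nomenclature for the same gene.

```python
import re

# Invented placeholder records; real exports from these sources look different.
COSMIC = [{"gene_name": "KIN-1", "mutation": "mutA"}]
KINBASE = [{"Gene": "Kin1", "Sequence": "MSEQ"}]
PFAM = [{"gene": "kin_1", "domain": "ExampleDomain"}]
UNIPROT = [{"geneSymbol": "KIN1", "tissue_specificity": "example tissue"}]


def normalize(name):
    """Reduce source-specific spellings of a gene name to one comparable key."""
    return re.sub(r"[^A-Za-z0-9]", "", name).upper()


def index_by_gene(rows, gene_field):
    """Index a source's records by the normalized gene name."""
    return {normalize(row[gene_field]): row for row in rows}


def answer_query(mutation):
    """Find the gene for a mutation, then collect its data from each source."""
    kinbase = index_by_gene(KINBASE, "Gene")
    pfam = index_by_gene(PFAM, "gene")
    uniprot = index_by_gene(UNIPROT, "geneSymbol")
    for row in COSMIC:
        if row["mutation"] == mutation:
            key = normalize(row["gene_name"])
            return {
                "gene": key,
                "sequence": kinbase[key]["Sequence"],
                "domain": pfam[key]["domain"],
                "tissue_specificity": uniprot[key]["tissue_specificity"],
            }
    return None


result = answer_query("mutA")
print(result)
```

Each new kind of question would need another hand-written routine like `answer_query`, which is the repeated adaptation effort described in this section.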


CHAPTER 3

DESIGN OF PROKINO

In this chapter we discuss the design of ProKinO. We start with the heterogeneous
knowledge sources used in ProKinO development, namely KinBase, COSMIC, Protein Data
Bank, Pfam and UniProt, and then move to the ProKinO design, discussing its classes
and properties. Finally, at the end of the chapter, we describe the overall
architecture of ProKinO and all its components.

3.1 Heterogeneous Knowledge Sources

The ProKinO ontology is an attempt to provide a common vocabulary of concepts, and
of relationships between those concepts, for the protein kinase domain. Our goal
has been to make ProKinO a unified compendium of knowledge captured from disparate
knowledge sources related to the domain of protein kinases. Here we briefly discuss
the protein kinases and the sources of protein kinase related knowledge, as well as
the information they contain.

3.1.1 KinBase

One of the well-recognized sources in the domain of protein kinases is KinBase, an
interactive kinase database from Kinase.com. Kinase.com is produced and managed by
Gerard Manning's lab at the Salk Institute in California. Sucha Sudarsanam of the
biotech company Sugen developed Kinase.com in 1999 to support the publication of
Sugen's analysis of the protein kinases. Jonathan Bingham of Sugen designed the
site and enlarged it with several bioinformatics tools for using the data. The site
was further developed in 2002-2003 to support KinBase, an interactive kinase
database. The system was designed by Gerard Manning and was shaped and administered
by Glen Charydczak [49].

A non-redundant set of 518 human protein kinase genes has been identified in
KinBase through a comprehensive analysis of the human genome. This collection
draws on the published human genome sequences and other sequence databases, and
also includes directed cloning and sequencing of individual genes. The set of
protein kinase genes in KinBase includes most human members of the eukaryotic
protein kinase superfamily and many atypical kinases, and it accounts for almost
all human protein phosphorylation [50]. A popular poster of the human kinome by
KinBase is shown in Figure 8.

Figure 8: Human kinome poster (source: [51]).


The collection of eukaryotic protein kinases is categorized into ten groups. The

classification of protein kinases into groups by KinBase is depicted in Figure 9.

 

Figure 9: The Protein kinases classification in groups by KinBase (Atypical and Other groups not shown) [50].

The ten groups are classified in KinBase as:

• AGC group (including cyclic-nucleotide and calcium-phospholipid-dependent
kinases, ribosomal S6-phosphorylating kinases, G protein-coupled kinases and close
relatives of these kinases)
• Atypical group
• CAMK group (calmodulin-regulated kinases)
• CK1 group (casein kinase 1 and close relatives)
• CMGC group (including cyclin-dependent kinases, mitogen-activated protein kinases,
CDK-like kinases and glycogen synthase kinase)
• RGC group (receptor guanylate cyclase kinases)
• STE group (MAP kinase cascade kinases)
• TK group (tyrosine kinases)
• TKL group (tyrosine kinase-like kinases), a cluster of serine-threonine kinases
resembling TKs
• Other group, an extensive, miscellaneous group for those proteins that do not fit
in any of the predefined sets

These protein kinase groups are further subdivided into families and subfamilies in KinBase. This distribution into groups, families and subfamilies is shown in Table 1.

Table 1: Kinase distribution by major groups in human systems (source, with adaptation: [50])

Groups Families Sub Families Human Kinases

AGC 14 21 63

CAMK 17 33 74

CK1 3 5 12

CMGC 8 24 61

Other 37 39 83

STE 3 13 47

Tyrosine Kinase 30 30 90

Tyrosine Kinase-Like 7 13 43

RGC 1 1 5

Atypical-PDHK 1 1 5

Atypical-Alpha 1 2 6

Atypical-RIO 1 3 3

Atypical-A6 1 1 2

Atypical-Other 7 7 9

Atypical-ABC1 1 1 5

Atypical-BRD 1 1 4

Atypical-PIKK 1 6 6

Total 134 201 518
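The totals row of Table 1 can be cross-checked with a few lines of Python. This is only an illustrative sketch: the tuple layout and the names TABLE_1 and column_totals are assumptions made here and are not part of KinBase or ProKinO. The numbers are copied from Table 1.

```python
# Rows copied from Table 1: (group, families, sub families, human kinases).
TABLE_1 = [
    ("AGC", 14, 21, 63),
    ("CAMK", 17, 33, 74),
    ("CK1", 3, 5, 12),
    ("CMGC", 8, 24, 61),
    ("Other", 37, 39, 83),
    ("STE", 3, 13, 47),
    ("Tyrosine Kinase", 30, 30, 90),
    ("Tyrosine Kinase-Like", 7, 13, 43),
    ("RGC", 1, 1, 5),
    ("Atypical-PDHK", 1, 1, 5),
    ("Atypical-Alpha", 1, 2, 6),
    ("Atypical-RIO", 1, 3, 3),
    ("Atypical-A6", 1, 1, 2),
    ("Atypical-Other", 7, 7, 9),
    ("Atypical-ABC1", 1, 1, 5),
    ("Atypical-BRD", 1, 1, 4),
    ("Atypical-PIKK", 1, 6, 6),
]


def column_totals(rows):
    """Sum the three numeric columns of Table 1."""
    families = sum(row[1] for row in rows)
    subfamilies = sum(row[2] for row in rows)
    kinases = sum(row[3] for row in rows)
    return families, subfamilies, kinases


print(column_totals(TABLE_1))  # (134, 201, 518)
```

The computed sums agree with the Total row of the table.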

The sequences of all these protein kinase genes in the KinBase database are also available from kinase.com [52]. Data for ProKinO is extracted from this well-recognized, fundamental source of information on protein kinases.


3.1.2 COSMIC (Catalogue of Somatic Mutations in Cancer)

The Catalogue of Somatic Mutations in Cancer (COSMIC) is the largest resource of information on somatically acquired mutations associated with human cancer; it is used by the biomedical community and is freely available in the public domain [53]. Mutations are changes in a DNA sequence caused by copying errors in the genetic material, by cellular processes of the organism itself, by damage from mutagenic chemicals, viruses or radiation, or by errors during meiosis. In COSMIC, mutations are mapped to a single version of each gene sequence, and somatic mutation frequencies are also made available.

Since its launch in 2004, the COSMIC database has grown enormously. At present (v43, August 2009), COSMIC holds the details of 1.5 million experiments performed on 13,423 genes in almost 370,000 tumours, describing over 90,000 individual mutations. The two main sources of data in COSMIC are publications in the scientific literature and the output of large-scale analyses from the Cancer Genome Project (CGP) at the Wellcome Trust Sanger Institute, UK [54]. Most of the world’s literature on genes known to carry tumor-promoting point mutations is reported to be curated in COSMIC [54]. The curation of fusion events (in which two genes fuse together), which are also common occurrences in cancer, is continually updated as well.

An important task in cancer research is identifying the mutated genes that are implicated in cancer. The COSMIC database is integrated into ProKinO to capture mutation knowledge about protein kinase genes, so that otherwise heterogeneous information becomes available in a single integrated representation.


3.1.3 Pfam (Protein Families Database)

Pfam, a wide-ranging database of conserved protein families, originated in 1998 and has since grown with the deposition of new protein sequences. Pfam is a collection of multiple sequence alignments and profile hidden Markov models (HMMs), and it is publicly available via servers in the UK [55], Sweden [56] and the USA [57].

The latest release of Pfam contains nearly 12,000 families, each represented by multiple sequence alignments and hidden Markov models. Proteins are built from domains, which are functional regions. An entry in Pfam can be a family, which is a set of related proteins; a domain, which is a structural unit found in many protein contexts; a repeat, which is a short unit that is unstable in isolation but forms a stable structure when many copies are present; or a motif, which is a short unit found outside globular domains. A clan is a higher-level grouping of related families, defined by Pfam, that have arisen from a single evolutionary origin. The evidence for common ancestry comes from similarity in tertiary structure or, when structures are not available, from common sequence motifs.

Pfam families fall into two levels of quality: Pfam-A and Pfam-B. Pfam-A entries are derived from the core sequence database, known as Pfamseq, which is built from the most recent release of UniProtKB [58] at a given time. To cover more known proteins than the Pfam-A families alone, Pfam generates Pfam-B families using the Automatic Domain Decomposition Algorithm (ADDA) database [59]. Pfam-B families have no associated annotation or literature references and are of much lower quality than Pfam-A families, as their alignments have not been manually checked by a Pfam curator. Searching a protein kinase against the Pfam library of HMMs reveals its domain architecture, i.e., which domains it carries (an example for one protein is shown in Figure 10).

Figure 10: Pfam display of protein domain architecture (source: [57]).

Data on the identification and classification of functional domains in protein kinase sequences is extracted from this source and captured in ProKinO.

3.1.4 Protein Data Bank (PDB)

The Protein Data Bank (PDB) contained seven structures when it was founded in 1971 at the Brookhaven National Laboratory. Since then, the PDB has grown into a large repository of three-dimensional structures of macromolecular complexes of proteins, nucleic acids, and other biological molecules. Knowledge of the shape of the molecules of life helps in understanding how they work. This knowledge can go a long way toward deducing a structure's role in disease and developing drugs for these diseases.

A further step came in 2003, when the Worldwide Protein Data Bank (wwPDB) [60] was formed to maintain a single PDB archive of macromolecular structural data that is freely and openly available to the global community. The participating organizations of the wwPDB are the centers that collect, process and distribute the PDB data. The Biological Magnetic Resonance Data Bank (BMRB) [61] joined the wwPDB in 2006. The wwPDB members also provide databases and websites that offer different views and analyses of the structural data contained in the PDB archive.

The PDB offers many tools and resources for searching on annotations relating to sequence, structure and function, and for using the results in further analysis or visualization. PDB data is curated and annotated according to agreed principles, and the PDB supports a website for performing simple and complex queries on the data and for analyzing and visualizing the results. The crystal structure of the protein kinase gene product EGFR (whose PDB id is given by the property hasStructure in ProKinO) is shown in Figure 11.

Figure 11: Crystal structure of protein kinase gene product EGFR (ProKinO PDBId:

1IVO, source: [62]).


Information about the crystal structures related to protein kinases, together with the PDB database references for those structures, is extracted from the Protein Data Bank and included in ProKinO.

3.1.5 The Universal Protein Resource (UniProt)

Proteomics data has grown tremendously in the recent past, as has the output of genome sequencing. This has resulted in a wealth of protein sequences and associated data for a large number of organisms. A need was recognized for a repository that could serve as a global collection of protein sequences with comprehensive coverage and a systematic approach to protein annotation, integration and standardization of data from the various sources; UniProt was an initiative in this direction [63].

The Universal Protein Resource (UniProt) is supported by the UniProt

Consortium, a collaboration between the European Bioinformatics Institute (EBI), the

Swiss Institute of Bioinformatics (SIB) and the Protein Information Resource (PIR).

UniProt is mainly supported by the National Institutes of Health (NIH). The mission of

UniProt is to offer the scientific community a wide-ranging, high quality and freely

accessible resource of protein sequence and functional information.

The UniProt databases are the UniProt Knowledgebase (UniProtKB), the UniProt Archive (UniParc), the UniProt Metagenomic and Environmental Sequences (UniMES), and the UniProt Reference Clusters (UniRef). UniProtKB is the database of broad, curated protein information, including function, classification, and cross-references. UniProtKB consists of UniProtKB/Swiss-Prot, which is manually annotated and reviewed, and UniProtKB/TrEMBL, which is automatically annotated and not reviewed. Sequences and their identifiers are stored in UniParc. The UniRef databases provide complete coverage of sequence space at several resolutions while hiding redundant sequences, by presenting clustered sets of sequences from UniProtKB and selected UniParc records [64]. The UniMES database holds metagenomic and environmental data. These databases are shown in Figure 12.

Figure 12: The Universal Protein Resource (UniProt) databases [65].

Significantly, the UniProt consortium in 2008 completed the first draft of the

complete human proteome in UniProtKB/Swiss-Prot, thereby providing manually

annotated representation of all presently known human protein-coding genes in UniProt

release 14.0 with 20325 entries [63].


Information extracted from UniProt and represented in ProKinO includes

functional features, such as the modified residue, signal peptide, topological domain,

cellular location, tissue specificity, organism, and database cross references for

Wikipedia, PubMed, MEDLINE and UniProt.

3.2 ProKinO Design

Ontology development is a complex process involving a series of systematic and well-organized steps. It is not merely a technological process; it requires a deep understanding of the domain whose knowledge is being represented in the ontology. The design and development of ProKinO followed certain principles. One important aspect of ontology design is to keep in mind the interests of the intended user community, so that the needs of users are incorporated into the final requirements.

In the case of ProKinO, the golden rule of reusing existing public-domain ontologies, through specialization or generalization, was considered. However, we recognized the need for a knowledge repository dedicated to the specific domain of protein kinases, owing to their important role in diseases such as cancer. The domain expertise was provided by the Evolutionary Systems Biology Group (ESBG) in the Biochemistry and Molecular Biology department at The University of Georgia, which conducts extensive research on protein kinases, with the help of the protein kinase community. The technical support was provided by the Large Scale Distributed Information Systems Lab (LSDIS) in the Computer Science department at The University of Georgia. The specification and conceptualization of ProKinO was finalized after a long deliberation period, during which ProKinO underwent numerous revisions. In the subsequent sections, we discuss the current specification of the ProKinO design. The ontology may nevertheless be revised in the future as the requirements of the user community evolve.

3.2.1 ProKinO Classes

Ontologies are made up of components, namely classes (concepts), properties (relationships) and individuals (members or instances), which together capture the knowledge about a particular domain. The steps in ontology construction include developing the class hierarchy and defining the properties and relationships that characterize the classes. The main concepts in ProKinO have been defined as major classes and are shown in Figure 13.

The classes defined in ProKinO are used to represent the sequence, structure,

function, sub-domain, and mutation data. We used a combination of top-down and

bottom-up approaches to first define some main concepts and then specialize or

generalize them, depending on further needs. All the major classes and sub classes

specified in the design of ProKinO are shown in Table 2.


Figure 13: Conceptualization of ProKinO (illustrating only a part of overall ontology) 


Table 2: Classes in ProKinO.

Class Name | Sub Class Name (Level-1) | Sub Class Name (Level-2) | Sub Class Name (Level-3) | Description

DbXref | - | - | - | The DbXref class represents the various database cross references for the instances stored in the Gene class. It contains cross references for the UniProt, PDB, PubMed, MEDLINE, Wikipedia and KinBase databases.

Disease | Cancer | - | - | The Disease class represents the different diseases with which protein kinase genes are associated. These diseases are represented under further sub classes of Disease (Cancer, initially).

FunctionalDomain | - | - | - | The FunctionalDomain class represents the functional domains identified and classified in protein kinase sequences.

FunctionalFeature | ModifiedResidue, SignalPeptide, TopologicalDomain, TransmembraneRegion | - | - | The FunctionalFeature class represents data about the various functional features related to the protein kinase genes. Its sub classes ModifiedResidue, SignalPeptide, TopologicalDomain and TransmembraneRegion represent the different groups of functional features of a gene.

Gene | - | - | - | The Gene class in ProKinO represents all the protein kinase genes, which are classified into the protein kinase groups, families, and sub families.

Mutation | - | - | - | The Mutation class represents all the somatic cancer mutations that are linked with the protein kinase genes.

Organism | - | - | - | The Organism class represents all the organisms (Human, initially) for the protein kinase genes.

ProteinKinase | (groupname) group | (familyname) family | (subfamilyname) subfamily | The ProteinKinase class represents the classification of the protein kinase genes. The groups of genes are represented by the ProteinKinaseGroup sub class under it. The families of protein kinase genes are represented by the ProteinKinaseFamily sub class under the ProteinKinaseGroup class. Similarly, the sub families are represented by the ProteinKinaseSubFamily sub class under the ProteinKinaseFamily class.

Sequence | - | - | - | The Sequence class of ProKinO represents the sequences of the protein kinase genes.

Structure | - | - | - | The Structure class of ProKinO represents crystal structures of the protein kinase genes.

SubDomain | SubDomain(I to XI) | - | - | The SubDomain class represents the motif data related to the protein kinase genes, with sub classes SubDomainI to SubDomainXI under it capturing the various motifs related to the genes.
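The fixed part of the class hierarchy in Table 2 can be sketched as a simple parent map. This is a minimal illustration, not the OWL representation used by ProKinO: the PARENT dictionary and the helper functions is_subclass_of and subclasses_of are hypothetical names introduced here, and the group, family and subfamily classes under ProteinKinase are left out because their names depend on the KinBase data.

```python
# Sub classes SubDomainI ... SubDomainXI, as named in Table 2.
ROMAN = ["I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX", "X", "XI"]

# Parent of each class; top-level classes map to None.
PARENT = {name: None for name in [
    "DbXref", "Disease", "FunctionalDomain", "FunctionalFeature", "Gene",
    "Mutation", "Organism", "ProteinKinase", "Sequence", "Structure",
    "SubDomain",
]}
PARENT["Cancer"] = "Disease"
for feature in ["ModifiedResidue", "SignalPeptide",
                "TopologicalDomain", "TransmembraneRegion"]:
    PARENT[feature] = "FunctionalFeature"
for numeral in ROMAN:
    PARENT["SubDomain" + numeral] = "SubDomain"


def is_subclass_of(cls, ancestor):
    """Return True if cls equals ancestor or lies below it in the hierarchy."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = PARENT[cls]
    return False


def subclasses_of(ancestor):
    """List the direct sub classes of a class in alphabetical order."""
    return sorted(c for c, parent in PARENT.items() if parent == ancestor)


print(is_subclass_of("SignalPeptide", "FunctionalFeature"))  # True
print(subclasses_of("Disease"))  # ['Cancer']
```

In an OWL ontology the same information would be expressed with subclass axioms; the dictionary above only mirrors the table for quick inspection.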


3.2.2 ProKinO Properties

Once the major classes in the ontology have been defined, the next step is to describe the concepts in more detail by defining their relationships in the form of object and data properties. The object properties capture the relationships between the members of various classes, integrating the collected information according to the agreed conceptualization. For example, how a gene in the Gene class is related to a disease in the Disease class is formalized by the object property associatedWith. Conversely, the correlatesTo property connects a Disease member, i.e. a particular disease (cancer for our purposes), to a gene in the Gene class. These are only two of the object properties; the full list of object properties conceptualized in ProKinO is given in Table 3.

Data properties are defined in addition to object properties, because we also need information that describes the internal structure of the concepts captured in the ontology. Several characteristics are considered when defining a property, such as the type of its values, the permissible values, and its cardinality. The property hasFASTAFormat is defined as a data property in ProKinO to represent the FASTA-formatted sequences of all protein kinase genes, and it is of data type string. Similarly, another property, hasOtherName, stores the other names (synonyms) by which a gene may be known in the literature. For example, the protein kinase gene ADCK3 has four other names: CABC1, COQ8, MGC4849 and LOC56997. The full list of data properties defined in ProKinO is given in Table 4.


Table 3: Object Properties in ProKinO.

Object Property Domain (Class) Range (Class)

associatedWith Gene Disease The object property associatedWith represents the relationship between the instances of the Gene class and the instances of the Disease class. This relationship represents how the protein kinase genes are associated with the different diseases. For example, the gene “ABL1” is associated with the cancer “glioma”.

contains Sequence Mutation The object property contains represents the relationship between the instances of the Sequence class with the instances

of the Mutation class. This relationship represents which sequences of the protein kinase genes contain which

mutations represented in Mutation class. For example, the gene sequence “Seq-ABL1” (sequence of the gene “ABL1”)

contains mutation “Q252H”.

correlatesTo Disease Gene The object property correlatesTo represents the relationship between the instances of the Disease class and the instances of the Gene class. This relationship represents how a disease correlates to the protein kinase genes. For example, the cancer “glioma” correlates to the genes “ABL1”, “FYN”, “TBK1” and many others.

foundIn Mutation Gene The object property foundIn represents the relationship between the instances of the Mutation class with the instances

of the Gene class. This relationship represents what mutations are found in which protein kinase genes. For example,

mutation “L387M” (identified by mutation id 12624) is found in the gene “ABL1”.

hasFunctionalDomain Gene FunctionalDomain The object property hasFunctionalDomain represents the relationship between the instances of the Gene class with the

instances of the FunctionalDomain class. This relationship represents which protein kinase gene has which functional

domain. For example, the gene “ABL1” has the functional domains of “Pkinase” and “Pkinase_Tyr”, “F-actin_bind”

and “SH3_1”.

hasFunctionalFeature Gene FunctionalFeature The object property hasFunctionalFeature represents the relationship between the instances of the Gene class with the

instances of the FunctionalFeature class. The FunctionalFeature class further has sub classes such as

ModifiedResidue, SignalPeptide, TopologicalDomain, and TransmembraneRegion. This relationship represents which

protein kinase gene has which functional features represented in these FunctionalFeature sub classes. For example, the

gene “ABL1” has a functional feature of “phosphoserine”.

hasFunctionalRelationship FunctionalDomain Gene The object property hasFunctionalRelationship represents the relationship between the instances of the FunctionalDomain class and the instances of the Gene class. This relationship represents which functional domain has a functional relationship with which protein kinase genes. For example, the functional domain “F-actin_bind” has a functional relationship with the genes “ABL1” and “ABL2”.

hasGene ProteinKinase Gene The object property hasGene represents the relationship between the instances of the ProteinKinase class and the instances of the Gene class. This relationship represents which protein kinase genes represented in groups, families or subfamilies of protein kinases are related with the genes represented in the Gene class.

hasMEDLINEId Gene DbXref The object property hasMEDLINEId represents the relationship between the instances of the Gene class and the instances of the DbXref class related to the MEDLINE cross references. This relationship represents the protein kinase genes’ MEDLINE database cross references. For example, the gene “ABL1” has the MEDLINE ids “93101588”, “95199229” and many others.

hasMutation Gene Mutation The object property hasMutation represents the relationship between the instances of the Gene class with the instances

of the Mutation class. This relationship represents which protein kinase genes are having which mutations. For

example, the gene “ABL1” has the mutations “Q252H”, “R47G” and many others.

hasMutationDbXref Mutation DbXref The object property hasMutationDbXref represents the relationship between the instances of the Mutation class and the instances of the DbXref class. This relationship represents a mutation’s Mutation (NM) database cross reference. For example, the mutation “Q456Q” (mutation id 1110) has the Mutation database cross reference “NM_004333”.

hasPDBId Gene DbXref The object property hasPDBId represents the relationship between the instances of the Gene class and the instances of the DbXref class related to the PDB cross references. This relationship represents the protein kinase genes’ PDB database cross references. For example, the gene “ABL1” has a PDB id “1OPL”.

hasPfamId Gene DbXref The object property hasPfamId represents the relationship between the instances of the Gene class and the instances of the DbXref class related to the Pfam cross references. This relationship represents the protein kinase genes’ Pfam database cross references.

hasProteinKinaseDbXref ProteinKinase DbXref The object property hasProteinKinaseDbXref represents the relationship between the instances of the ProteinKinase class and the instances of the DbXref class related to the protein database cross references. This relationship represents the protein kinase genes’ KinBase database cross references.

hasPubMedId Gene DbXref The object property hasPubMedId represents the relationship between the instances of the Gene class and the instances of the DbXref class related to the PubMed cross references. This relationship represents the protein kinase genes’ PubMed database cross references. For example, the gene “ACK” has the PubMed ids “16641997”, “17344846”, “15951569” and many others.

hasSequence ProteinKinase Sequence The object property hasSequence represents the relationship between the instances of the ProteinKinase class with the

instances of the Sequence class. This relationship represents which protein kinase genes under the groups, families or

subfamilies of the protein kinases are having which sequences stored under the Sequence class.

hasStructure Gene Structure The object property hasStructure represents the relationship between the instances of the Gene class and the instances of the Structure class. This relationship represents which protein kinase gene has which crystal structure. For example, the gene “ABL1” has a structure “1OPL”.

hasSubDomain Gene SubDomain The object property hasSubDomain represents the relationship between the instances of the Gene class and the instances of the SubDomain class. This relationship represents which protein kinase gene has which sub domains, which are represented under the various sub classes (SubDomain I to XI) of the SubDomain class. For example, the gene “ABL1” has sub domains for all of SubDomain I to XI; for the G-helix, the sub domain “ABL1-G-helix” has a start location of 446, an end location of 460 and the subsequence “VLLWEIATYGMSPYP”.

hasUniProtId Gene DbXref The object property hasUniProtId represents the relationship between the instances of the Gene class and the instances of the DbXref class related to the UniProt cross references. This relationship represents the protein kinase genes’ UniProt database cross references. For example, the gene “ACK” has the UniProt ids “Q07912”, “Q6ZMQ0”, “Q8N6U7” and many others.

hasWikipediaId Gene DbXref The object property hasWikipediaId represents the relationship between the instances of the Gene class and the instances of the DbXref class related to the Wikipedia cross references. This relationship represents the protein kinase genes’ Wikipedia database cross references. For example, the gene “MUSK” has a Wikipedia id “Musk_protein”.

implicatedIn Mutation Disease The object property implicatedIn represents the relationship between the instances of the Mutation class with the

instances of the Disease class. This relationship represents which mutations are found implicated in which diseases

(Cancer initially). For example, the mutation “L387M” (identified by the mutation id 12624) is implicated in the

disease “haematopoietic_neoplasm”.

locatedIn Mutation Structure The object property locatedIn represents the relationship between the instances of the Mutation class with the instances of the Structure class. This relationship represents which mutations are located in which structures of the protein kinase genes. For example, the mutation “L387M” (identified by the mutation id 12624) is located in the structure “1OPL”.

occursIn Mutation Sequence The object property occursIn represents the relationship between the instances of the Mutation class with the instances

of the Sequence class. This relationship represents which mutations are found in which sequences of the protein kinase

genes. For example, the mutation “L387M” (identified by mutation id 12624) is found in the sequence “Seq_ABL1”

i.e.

"MGQQPGKVLGDQRRPSLPALHFIKGAGKKESSRHGGPHCNVFVE………………………………………………

…………………………….WEMERYTFCVSYVDSIQQMRNKFAFREAINKLENNLRELQICPATAGSGPAATQD

FSKLLSSVKEISDIVQR" (Complete sequence not shown)
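The domain and range columns of Table 3 act as constraints on which instances an object property may connect. The following is a minimal Python sketch of that idea using a toy triple store. The TripleStore class and its methods are hypothetical names introduced for illustration, and the check compares class names exactly (it ignores sub classes); ProKinO itself is expressed in OWL, where such constraints are declared on the properties. The instance names are taken from the examples in Table 3.

```python
# (domain class, range class) for a few object properties from Table 3.
OBJECT_PROPERTIES = {
    "associatedWith": ("Gene", "Disease"),
    "correlatesTo": ("Disease", "Gene"),
    "hasMutation": ("Gene", "Mutation"),
    "foundIn": ("Mutation", "Gene"),
    "implicatedIn": ("Mutation", "Disease"),
}


class TripleStore:
    """Toy store of (subject, property, object) triples with domain/range checks."""

    def __init__(self):
        self.class_of = {}    # instance name -> class name
        self.triples = set()  # (subject, property, object)

    def add_instance(self, name, cls):
        self.class_of[name] = cls

    def add(self, subject, prop, obj):
        domain, range_ = OBJECT_PROPERTIES[prop]
        if self.class_of[subject] != domain or self.class_of[obj] != range_:
            raise ValueError(f"{prop} expects {domain} -> {range_}")
        self.triples.add((subject, prop, obj))

    def objects(self, subject, prop):
        """Return all objects linked to subject by prop, sorted by name."""
        return sorted(o for s, p, o in self.triples if s == subject and p == prop)


store = TripleStore()
store.add_instance("ABL1", "Gene")
store.add_instance("Q252H", "Mutation")
store.add_instance("L387M", "Mutation")
store.add_instance("haematopoietic_neoplasm", "Disease")

store.add("ABL1", "hasMutation", "Q252H")
store.add("L387M", "foundIn", "ABL1")
store.add("L387M", "implicatedIn", "haematopoietic_neoplasm")

print(store.objects("L387M", "implicatedIn"))  # ['haematopoietic_neoplasm']
```

Attempting to add a triple whose subject or object belongs to the wrong class, for example a Gene as the subject of foundIn, raises a ValueError in this sketch.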

 

 

 

 


Table 4: Data Properties in ProKinO.

Data Property Domain (Class) Range (Literal)

chromosomalPosition Gene String The data property chromosomalPosition represents the chromosomal position of the protein kinase genes in the Gene

class and it takes the string value. For example, the gene “ABL1” has chromosomal position “9q34.2”.

hasCancerType Mutation String The data property hasCancerType represents the type of cancer associated with a particular mutation in the Mutation class and it takes the string value. For example, the mutation “S605F” (mutation id 1135) has a cancer type “malignant-melanoma”.

hasCellularLocation Gene String The data property hasCellularLocation represents the cellular location of the protein kinase genes in the Gene class

and it takes the string value. For example, the gene “ABL1” has a cellular location “Cytoplasm”.

hasChromosome Mutation String The data property hasChromosome represents the chromosome of a particular mutation in the Mutation class and it

takes the string value. For example, the mutation “S605F” (mutation id 1135) has chromosome “7”.

hasStartLocation and hasEndLocation FunctionalFeature, SubDomain, FunctionalDomain String The data properties hasStartLocation and hasEndLocation are used to represent the start and end locations of the different functional features represented in the FunctionalFeature class and the subclasses under this class. These properties are also used to represent the start and end locations of the different instances in the SubDomain class. These properties take string values. For example, the functional feature “ADCK1-signalpeptide” has a start location “1” and an end location “17”, and the sub domain “ACK-B3-strand” has a start location “173” and an end location “187”.

hasFASTAFormat Sequence String The data property hasFASTAFormat represents the FASTA format of the sequences of the protein kinase genes and it

takes the string value. For example, the gene “ABL1” has a FASTA format sequence:

"MGQQPGKVLGDQRRPSLPALHFIKGAGK…………………………………………….GPAATQDFSKLLSSVKEI

SDIVQR" (Complete sequence not shown)

hasMutationAA Mutation String The data property hasMutationAA represents the mutation Amino Acid of a particular mutation in the Mutation class

and it takes the string value. For example, the mutation “S605F” (mutation id 1135) has mutation AA as “p.S605F”.

hasMutationDescription Mutation String The data property hasMutationDescription represents the mutation description of a particular mutation in the Mutation

class and it takes the string value. For example, the mutation “S605F” (mutation id 1135) has mutation description as

“substitution-Missense”.

hasMutationId Mutation String The data property hasMutationId represents the mutation id (COSMIC) of a particular mutation in the Mutation class

and it takes the string value. For example, the mutation “S605F” has a mutation id as “1135”.


hasOrganism Gene String The data property hasOrganism represents the organism of the protein kinase genes in the Gene class and it takes the

string value. For example, the gene “ABL1” has an organism “Homo Sapiens”.

hasOtherName Gene String The data property hasOtherName represents the other names (Synonyms) of the protein kinase genes in the Gene class

and it takes the string value. For example, the gene “ABL1” has other names of “p150”, “c-ABL”, “ABL”, “JKT7” and

many others.

hasPosition FunctionalFeature String The data property hasPosition represents the position of the different functional features represented in the

FunctionalFeature class and the subclasses under this class and it takes the string value. For example, the functional

feature “ABL1-phosphoserine” (Modified Residue of the gene “ABL1”) has position “50”.

hasPrimaryName Gene String The data property hasPrimaryName represents the primary or common name (most representative name in literature)

of the protein kinase genes in the Gene class and it takes the string value. For example, the gene “ABL1” has a primary

Name “ABL1”.

hasPrimarySite Mutation String The data property hasPrimarySite represents the primary cancer site of a particular mutation in the Mutation class and

it takes the string value. For example, the mutation “S605F” (mutation id “1135”) has a primary site of ”skin”.

hasPubMedPMID Mutation String The data property hasPubMedPMID represents the PubMed PMID of a particular mutation in the Mutation class and it takes the string value. For example, the mutation “S605F” (mutation id “1135”) has the PubMed PMIDs “16773193” and “15331929”.

hasSubDomainSequence SubDomain String

The data property hasSubDomainSequence represents the subsequence of a particular sub-domain in SubDomain class

for the protein kinase genes and it takes the string value. For example, ACK-B3-strand sub domain has a subsequence

"SGKTVSVAVKCLKPD".

hasTissueSpecificity Gene String The data property hasTissueSpecificity represents the tissue specificity of the protein kinase genes in the Gene class

and it takes the string value. For example, the gene “ABL1” has tissue specificity “Widely expressed”.
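Because all data properties in Table 4 take string values, an application that consumes ProKinO has to parse them where structure is needed. The sketch below shows one way this could look in Python for the mutation “S605F” described in the table. The MutationRecord dataclass and the parse_substitution helper are hypothetical names introduced for illustration; the field names mirror the data property names, and the regular expression assumes the simple one-letter substitution form “p.S605F” and does not cover other mutation types.

```python
import re
from dataclasses import dataclass


@dataclass
class MutationRecord:
    """A few Table 4 data properties, stored as strings as in ProKinO."""
    hasMutationId: str
    hasMutationAA: str
    hasMutationDescription: str
    hasPrimarySite: str


# One-letter residue, position, one-letter residue, e.g. "p.S605F".
_SUBSTITUTION = re.compile(r"^p\.([A-Z])(\d+)([A-Z])$")


def parse_substitution(mutation_aa):
    """Split e.g. 'p.S605F' into (wild-type residue, position, mutant residue)."""
    match = _SUBSTITUTION.match(mutation_aa)
    if match is None:
        raise ValueError(f"not a simple substitution: {mutation_aa!r}")
    wild_type, position, mutant = match.groups()
    return wild_type, int(position), mutant


record = MutationRecord(
    hasMutationId="1135",
    hasMutationAA="p.S605F",
    hasMutationDescription="substitution-Missense",
    hasPrimarySite="skin",
)
print(parse_substitution(record.hasMutationAA))  # ('S', 605, 'F')
```

Keeping the stored values as strings follows the ranges given in Table 4; converting the position to an integer is done only at the point of use.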

 

 

 

 

 

 

 

3.3 Architecture of Systems Based on ProKinO

Fully developed and populated, ProKinO is intended to serve as a unified and integrated source of knowledge on Protein Kinases (initially only Human). The overall architecture of systems based on ProKinO is depicted in Figure 14. The figure shows the ontology population sources KinBase, Catalogue of Somatic Mutations in Cancer (COSMIC), Protein Data Bank (PDB), protein families database (Pfam), and The Universal Protein Resource (UniProt).

Figure 14: Architecture of systems based on ProKinO


Data is acquired from these sources automatically and is analyzed and stored in ProKinO as classes, properties, and instances. To ensure that the ontology remains up to date and correct, ProKinO will be continually updated by the automatic population process described above. Once developed and verified, ProKinO will be publicly available and accessible from our Web Portal. ProKinO will also be made available from the NCBO BioPortal once it has been included there. In addition to freely downloadable OWL files containing ProKinO and its prior versions, the portals will provide access to a live version of the ontology so that visitors can browse and explore it. Furthermore, we plan to create a Web Service providing programmatic access to ProKinO.


CHAPTER 4

PROKINO LIFE CYCLE

Ontology development is a complex task that proceeds through a series of systematic, well-organized steps. Ontology engineering is a relatively new field, but it has already attracted a great deal of research during the last decade. Ontological engineering refers to the set of activities that concern the ontology development process, the ontology life cycle, the methods and methodologies for building ontologies, and the tool suites and languages that support them [66].

Ontologies have been created by different people following different methods and methodologies. The type of ontology being developed, the domain of discourse to be captured, and the expertise of the developers all drive the selection of one method over another. The envisioned applications of the ontology, the time and resources available for the task, and other related factors also influence the choice of approach. The ontology life cycle identifies the stages through which an ontology passes during its lifetime. A certain set of activities is performed in each stage, and researchers have proposed different models formalizing how the stages are ordered. The activities performed in developing ProKinO are discussed in the subsequent sections.

4.1 Data Acquisition

The data sources of interest to the biomedical community are often large, dissimilar in structure and content, scattered, separately controlled, and rapidly changing [67]. A vital step in building an ontology is acquiring, from sources of this kind, the knowledge that is to be integrated and represented in it. This poses several challenges in information extraction and knowledge acquisition. In the development of ProKinO we faced these challenges in acquiring data from the identified sources of domain knowledge. These sources, KinBase, COSMIC, PDB, Pfam, and UniProt, provide a range of information about protein kinases, but they store it in different formats and are controlled and maintained by different groups.

We used open source tools throughout the development of ProKinO. We used the Jena API as software support to build and populate ProKinO. Jena is an open source tool that was developed by a team at HP led by Brian McBride and is widely used for ontology development. Another useful API for building ontologies, the OWL API, is also available, but we chose the Jena API because we considered it more mature and stable. The Jena Ontology API is language-neutral, and its Java class names do not mention the underlying language. For example, the OntClass Java class can represent an OWL class, an RDFS class, or a DAML class; to represent the differences between the various representations, each ontology language has a profile, which lists the permitted constructs and the names of the classes and properties [68]. Jena is easy to use and can be used for parsing, creating, and searching RDF models. The key RDF package for the application developer is com.hp.hpl.jena.rdf.model. The API is defined in terms of interfaces so that application code can work with different implementations without change. Jena includes interfaces for model, statement, resource, property, object, literal, and so on. An example of Java code that uses the Jena API to create an empty ontology model is shown in Figure 15.

    import java.io.FileNotFoundException;
    import java.io.FileOutputStream;
    import java.io.PrintStream;

    import com.hp.hpl.jena.ontology.OntModel;
    import com.hp.hpl.jena.ontology.OntModelSpec;
    import com.hp.hpl.jena.rdf.model.ModelFactory;

    public class CreateProkinoBaseModel {

        /**
         * This method is used to create an Ontology model.
         *
         * @param owlFileName The name of the file in which the OWL file has to be
         *                    stored for the model that is created.
         * @return The ontology model.
         */
        public static OntModel createBaseModel(String owlFileName) {
            // create an empty in-memory OWL model with rule-based inference
            OntModel basemodel = ModelFactory.createOntologyModel(OntModelSpec.OWL_MEM_RULE_INF);
            // try-with-resources closes the output stream even if writing fails
            try (PrintStream out = new PrintStream(new FileOutputStream(owlFileName))) {
                basemodel.write(out);
            } catch (FileNotFoundException e) {
                e.printStackTrace();
            }
            return basemodel;
        } // method createBaseModel end

    } // CreateProkinoBaseModel class end

Figure 15: Jena example to create an ontology model.

Protégé is a free, open-source software system that has become a very popular tool for constructing knowledge-based applications with ontologies, and it provides a suite of tools to its user community. Protégé implements many knowledge-modeling structures and actions that support the creation, visualization, and manipulation of ontologies in various representation formats. Protégé can be extended by way of a plug-in architecture and a Java-based Application Programming Interface (API) for building knowledge-based tools and applications [69].

The Protégé platform supports two main ways of modeling ontologies. The Protégé-Frames editor supports users in constructing and storing frame-based domain ontologies, customizing data entry forms, and entering instance data, presenting a full-fledged user interface and knowledge server. Protégé-Frames implements a knowledge model compatible with the Open Knowledge Base Connectivity (OKBC) protocol. The Protégé-OWL editor is an extension of Protégé that supports the Web Ontology Language (OWL), the standard ontology language recommended by the World Wide Web Consortium to promote the Semantic Web vision. The Protégé-OWL editor lets users load and save OWL and RDF ontologies, edit and visualize classes and properties, define logical class characteristics as OWL expressions, and execute reasoners such as description logic classifiers [69].

To acquire data from the different sources we wrote customized Java software that performs the required functions in a single, unified manner. The software first fetches the relevant data about protein kinase genes and their associated attributes from KinBase automatically and, after processing the data, populates ProKinO. This populated information then becomes the basis for acquiring and parsing data from the other sources of knowledge. The software automatically retrieves the needed information by running BLAST searches against the protein kinase data provided by ProKinO to produce pertinent data about functional domains, motifs, structures, and functional features. These data sets are processed and parsed further by our software to extract the knowledge to be stored in ProKinO. For example, the BLAST software fetches a large number of Pfam-HMM files from the Pfam database by running BLAST automatically to get the functional domain data for protein kinases. These files are parsed to create the ProKinO class, properties, and instance data defined for that part of the knowledge. The knowledge about mutations related to protein kinases is acquired by retrieving the latest mutation data dumps from COSMIC and then fetching all mutation-related data for mutations linked to the protein kinases, using the knowledge already available in ProKinO. The data acquired from these disparate sources for ProKinO is shown in Figure 16.

Figure 16: Data acquisition from ProKinO sources

For instance, from KinBase we acquire data on kinase genes and their species, their corresponding groups, families, and subfamilies, their chromosomal location, and the synonyms and acronyms used for protein kinase genes in the literature. Likewise, from PDB we acquire the PDB ID, three-dimensional coordinates, and structure abstract. From COSMIC we retrieve the mutation location, mutation type, tissue type, cancer type, and literature reference. From the Pfam database we acquire information about the identification and classification of functional domains in protein kinase sequences, and from UniProt we acquire knowledge about functional features of protein kinase genes.

4.2 Data Integration and Ontology Population

After acquiring the data relevant to ProKinO from disparate, heterogeneous sources and processing it further, our software integrates this knowledge in ProKinO and populates the ontology automatically. To integrate the various classes, i.e. the diverse forms of data in ProKinO, we have developed object properties and relationships that relate sequence, structure, function, and mutation data in meaningful ways. For example, the 'occursIn' relationship between the Mutation and Sequence classes relates a mutation entry in the Mutation class to a gene entry in the Sequence class. The other relationships among the classes further integrate these otherwise disparate sources of data with each other.

The automatic population process for the KinBase database integrates and populates the information about protein kinase genes and their species, their corresponding groups, families, and subfamilies, the other names by which they are known across the domain, and their chromosomal position. The sequences of protein kinase genes, together with their FASTA-format representations, are populated as well. The acquired knowledge becomes the basis for automatically populating the ProKinO ontology by inserting the instances of the genes for the Protein Kinase Group, Protein Kinase Family, and Protein Kinase Sub-Family classes in the Gene class, the instances of the Sequence class, and the data properties of the corresponding classes.

The automatic process that mines the COSMIC database during ProKinO population extracts information about cancer-related mutations by capturing MutationID, MutationAA, Chromosome, Primary site, Pubmed_PMID, Description, and other relevant fields from COSMIC. ProKinO is then populated with instances of the classes Mutation and Disease, along with their corresponding data properties.
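As a rough illustration of the extraction step, the sketch below turns one row of a mutation export into property/value pairs that could later become data properties of a Mutation instance. It is not the thesis software: the class name CosmicRowParser, the assumption that the export is tab-separated with a header line, and the placeholder values are all hypothetical. Only the field names are taken from the text above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: map one tab-separated export row to named fields (file layout is assumed). */
class CosmicRowParser {

    /** Pairs each header name with the cell in the same column; missing cells become "". */
    static Map<String, String> parseRow(String headerLine, String dataLine) {
        String[] names = headerLine.split("\t", -1);
        String[] cells = dataLine.split("\t", -1);
        Map<String, String> row = new LinkedHashMap<String, String>();
        for (int i = 0; i < names.length; i++) {
            row.put(names[i].trim(), i < cells.length ? cells[i].trim() : "");
        }
        return row;
    }

    public static void main(String[] args) {
        // Hypothetical header and row; the values are placeholders, not real COSMIC data.
        String header = "MutationID\tMutationAA\tChromosome\tPrimary site\tPubmed_PMID\tDescription";
        String data = "ID_PLACEHOLDER\tAA_PLACEHOLDER\tCHR_PLACEHOLDER\t"
                + "SITE_PLACEHOLDER\tPMID_PLACEHOLDER\tDESC_PLACEHOLDER";
        for (Map.Entry<String, String> field : parseRow(header, data).entrySet()) {
            System.out.println(field.getKey() + " = " + field.getValue());
        }
    }
}
```

Reading the column positions from the header, rather than hard-coding them, keeps the sketch independent of the column order, which the text does not specify.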

The information integrated and populated from UniProt covers functional features such as the modified residue, signal peptide, and topological domain; cellular location, tissue specificity, and organism; and cross references to Wikipedia, PubMed, MEDLINE, and the UniProt database. The data property hasFunctionalFeature captures the functional features of protein kinase genes, namely the modified residue, signal peptide, and topological domain. The information populated from the Protein Data Bank covers crystal structures and PDB database references. ProKinO is populated with instances of the class Structure, as well as their corresponding object and data properties.

Pfam and related Hidden Markov Models (HMM) provide information about the identification and classification of functional domains in protein kinase sequences. ProKinO is populated from the Pfam resource by integrating the information about the functional domains related to protein kinases and then populating ProKinO with instances of the class FunctionalDomain, along with its corresponding object properties. A snapshot of the populated ontology displayed in the Protégé editor is shown in Figure 17.

Figure 17: A snapshot of Protégé editor showing populated ProKinO

Example Revisited

In one of the previous sections, we discussed an example to portray the challenges faced in integrating information from disparate sources that use different formats. Now that the knowledge is integrated in the populated ProKinO, we can revisit the same example query to show the solution provided by our ontology. Assuming that we have a mutation (33750 in this example) stored as a member of the class Mutation, we can get the ProteinKinase gene (AKT1) through the foundIn object property of ProKinO, and we can further get the functional domains related to this gene through the hasFunctionalDomain object property: PH, Pkinase, Pkinase_C, and Pkinase_Tyr. The sequence and structure of the gene associated with the mutation in question can be retrieved as Seq_AKT1 and 3CTQU, respectively, through the properties hasSequence and hasStructure. The cellular location of this gene is provided by hasCellularLocation as Cytoplasm. This example of knowledge discovery is depicted in Figure 18.

Figure 18: Knowledge discovery through populated ProKinO
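The navigation in this example can be mimicked with a toy in-memory store of subject-property-object statements. The sketch below is only an illustration under that assumption: the class KnowledgeWalk and its methods are hypothetical stand-ins, not the Jena model or the ProKinO software, and it is loaded with nothing but the facts quoted in the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Toy triple store; a stand-in for the populated ontology, not the Jena model. */
class KnowledgeWalk {

    /** One subject-property-object statement. */
    static final class Triple {
        final String subject;
        final String property;
        final String object;

        Triple(String subject, String property, String object) {
            this.subject = subject;
            this.property = property;
            this.object = object;
        }
    }

    private final List<Triple> triples = new ArrayList<Triple>();

    void add(String subject, String property, String object) {
        triples.add(new Triple(subject, property, object));
    }

    /** Returns every object linked to the subject by the property, in insertion order. */
    Set<String> objects(String subject, String property) {
        Set<String> result = new LinkedHashSet<String>();
        for (Triple t : triples) {
            if (t.subject.equals(subject) && t.property.equals(property)) {
                result.add(t.object);
            }
        }
        return result;
    }

    /** Loads only the facts quoted in the example. */
    static KnowledgeWalk example() {
        KnowledgeWalk kb = new KnowledgeWalk();
        kb.add("33750", "foundIn", "AKT1");
        for (String domain : new String[] {"PH", "Pkinase", "Pkinase_C", "Pkinase_Tyr"}) {
            kb.add("AKT1", "hasFunctionalDomain", domain);
        }
        kb.add("AKT1", "hasSequence", "Seq_AKT1");
        kb.add("AKT1", "hasStructure", "3CTQU");
        kb.add("AKT1", "hasCellularLocation", "Cytoplasm");
        return kb;
    }

    public static void main(String[] args) {
        KnowledgeWalk kb = example();
        // Start at the mutation, hop to its gene, then read the gene's properties.
        for (String gene : kb.objects("33750", "foundIn")) {
            System.out.println("gene: " + gene);
            System.out.println("domains: " + kb.objects(gene, "hasFunctionalDomain"));
            System.out.println("sequence: " + kb.objects(gene, "hasSequence"));
            System.out.println("structure: " + kb.objects(gene, "hasStructure"));
            System.out.println("location: " + kb.objects(gene, "hasCellularLocation"));
        }
    }
}
```

Each hop is a single lookup by subject and property, which is the sense in which one query over the integrated ontology replaces separate, source-specific retrieval steps.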

We should keep in mind that this linked information, delivered as a unit, comes from disparate and separately formatted sources, for which we would otherwise have to use customized processes to obtain the same results. Clearly, although the required information is stored in many heterogeneous source databases, we can get specific answers to queries by navigating through the classes and properties in ProKinO. We believe that ProKinO demonstrates the usefulness of this type of knowledge integration in the form of an ontology.

4.3 Curation and Ontology Modification

Ontologies are bound to evolve over time: whenever the sources of knowledge in the domain change, suitable changes to the knowledge representation schema have to be incorporated to keep the ontology up to date and consistent.

Curation: To assure that the ontology is up to date and correct, ProKinO will be continually updated by the automatic population processes described above. Furthermore, to assure the maximum correctness of the ProKinO data, a domain expert will be designated to monitor and verify the ontology updates. The ontology curator will also be charged with introducing any needed schema corrections and extensions. Scientists working in the area of protein kinases will be able to submit relevant data for inclusion in ProKinO through a convenient graphical user interface. The curator will be charged with reviewing such submissions and adding the entries upon approval.

4.4 ProKinO Revisions

As an ontology undergoes necessary modifications, the newly updated ontologies are saved as versions. Ontology versioning is the practice of keeping multiple versions of an ontology to manage its changes and evolution. The compatibility of versions must be checked for instance-data preservation (no data is lost between versions unless explicitly warranted), ontology preservation (a query is satisfied in both versions), consequence preservation (all the facts that could be inferred from the older version can also be inferred from the new one), and consistency preservation (no logical inconsistencies) [70].
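The first of these checks, instance-data preservation, can be sketched as a set comparison between the instance identifiers of two versions. This is a minimal illustration only: the class VersionCheck, its method names, and the idea of passing an explicit set of approved removals are assumptions made here, not part of ProKinO or of the cited work.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

/** Sketch of an instance-data preservation check between two ontology versions. */
class VersionCheck {

    /** Instances present in the old version, absent from the new one, and not removed on purpose. */
    static Set<String> lostInstances(Set<String> oldIds, Set<String> newIds,
                                     Set<String> approvedRemovals) {
        Set<String> lost = new TreeSet<String>(oldIds);
        lost.removeAll(newIds);
        lost.removeAll(approvedRemovals);
        return lost;
    }

    /** True when nothing was lost except what was explicitly warranted. */
    static boolean preservesInstanceData(Set<String> oldIds, Set<String> newIds,
                                         Set<String> approvedRemovals) {
        return lostInstances(oldIds, newIds, approvedRemovals).isEmpty();
    }

    public static void main(String[] args) {
        // Identifiers reused from earlier examples purely for illustration.
        Set<String> version1 = new TreeSet<String>(Arrays.asList("AKT1", "ABL1", "Seq_AKT1"));
        Set<String> version2 = new TreeSet<String>(Arrays.asList("AKT1", "Seq_AKT1"));
        Set<String> approved = new TreeSet<String>();
        System.out.println("lost instances: " + lostInstances(version1, version2, approved));
        System.out.println("preserved: " + preservesInstanceData(version1, version2, approved));
    }
}
```

The other three checks need a reasoner or a query engine and are not covered by this sketch.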

As ProKinO is an integration of knowledge from disparate sources and the

integration is done without modifying the original sources, it has to be in consonance

with all these dynamic sources that are subject to frequent modifications. For the

knowledge integrated in the ontology to be current and consistent with the existing data

available in the parent sources and to make any changes in the conceptualization and

specification, ProKinO will be subjected to regular revisions, as agreed upon by the

community. We will be keeping the different versions of ProKinO along with the

information about the differences among these versions. The ontology lifecycle will be

tracked by a versioning system and any prior versions of ProKinO will be easily

accessible.


4.5 Ontology Dissemination and Evaluation

When ontologies are disseminated in the form of applications based on the knowledge they contain, they can be considered valuable, good-quality ontologies only if they serve their intended purposes. Looking for ontologies that can be viewed as complete is an unrealistic goal, because ontology building is an expensive and time-consuming process. Once an ontology has been developed following the selected development approach, its quality must be evaluated against defined evaluation criteria.

An ontology is only as good as the knowledge it contains, so the evaluation of a specialized ontology should be based on the specific needs of the users of its domain. The simple measures of precision and recall are not as easy to apply to ontologies as they are to other knowledge extraction methods. For specialized ontologies these metrics may take on different meanings in different kinds of applications. Specialized ontologies, such as ProKinO, can be evaluated on the basis of satisfying the specified needs of their intended users. We therefore tested and evaluated ProKinO against the requirements of the protein kinase community.

The domain experts in the Evolutionary Systems Biology Group (ESBG) lab at The University of Georgia evaluated the ontology by designing specific queries that depend on the knowledge existing in it and then checking the results manually. As this was a basic evaluation conducted by a subset of the user community, we expect the ontology to be evaluated further on the basis of its usage by the wider protein kinase community once ProKinO is available in the public domain. The ultimate measure of an ontology's quality and success is whether it is widely used and accepted by its community. Therefore, we are providing a mechanism that makes ProKinO easily available to all through our Web service. To further strengthen the process of evolution and refinement, the community will be given the means to provide feedback for further improvement.

CHAPTER 5

POTENTIAL APPLICATIONS OF PROKINO

We envision a set of semantic, ontology-based bioinformatics applications

utilizing knowledge represented in ProKinO. We plan to create an ontology browsing

visualization tool, available via a standard Web browser. Also, we are going to use

ProKinO in two major applications. First, to mine the wealth of scientific literature data

that is accumulating on protein kinases and second, to annotate the vast amounts of

sequence data generated from cancer genome sequencing studies. A variety of other

applications are possible, as well.

5.1 ProKinO Browsing

For an ontology to be fully adopted by a user community, the knowledge it contains must be accessible by simple means of browsing through its concepts and relationships. Many editors (e.g. Protégé) provide very basic browsing of the ontology constituents. However, most Web-based ontology editors use separate HTML pages not just for each entity but for each view of those entities, and this distances the user from the ontology itself [71]. Many general-purpose ontology browsing tools are available in the public domain, such as the one offered by the OwlDoc plugin (shown in Figure 19 for ProKinO). However, there is a distinct need for specialized browsers that deal with specific ontologies. A specialized ontology browser must take into account the specific relationships present in the ontology and distinguish important and often-visited concepts from the less important ones. Furthermore, the concepts should be visualized in the form most suitable to the intended ontology audience.

Figure 19: A snapshot of elementary browsing of ProKinO (using OwlDoc).

We are creating an ontology browsing and visualization tool for ProKinO that will be available via a standard Web browser. This tool will provide not only fundamental access to the ProKinO knowledge but also certain specific capabilities sought by the protein kinase community for posing knowledge extraction queries, along with general user-friendly navigational capabilities.

5.2 Text Mining

The main motivation behind text mining systems is that most of the world’s published scientific data is in unstructured or semi-structured form. This becomes more significant when researchers have to deal with the huge wealth of literature in the biomedical domain. The overwhelming volume of information makes it difficult for researchers to keep up with the relevant publications in their own discipline, let alone in related disciplines. A constant effort is therefore needed to find ways to absorb the high flow of new scientific literature. Text-mining tools are becoming indispensable for extracting information from the biomedical literature. The natural language processing field distinguishes between information retrieval, which recovers a pertinent subset of documents, and information extraction, which obtains pertinent information from documents [72].

Text mining can be defined as a knowledge extraction method that extracts useful and previously unknown information from a set of text documents by identifying facts that are inherent but not explicit in the data. Biomedical text mining applies automated text mining tools to extract the vast amount of knowledge existing in the biomedical literature. A biomedical text mining tool includes a component that extracts biomedical concepts and entities from the literature, after which the relationships between the biomedical entities are detected. Relation extraction methods range from co-occurrence (the simplest way to detect relations between biomedical entities is to collect the texts or sentences in which they co-occur), to patterns, which detect individual hypothetical instances of relations that can be aggregated over a corpus, and then to fuller parsing, which produces more elaborate syntactic information [73].
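The simplest of these methods, sentence-level co-occurrence, can be sketched in a few lines. The code below is a deliberately naive illustration and not a method from the cited work: the class CoOccurrence is hypothetical, matching is by case-insensitive substring, and the sentences are toy strings rather than text from the literature.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Sketch of sentence-level co-occurrence counting for two entity names. */
class CoOccurrence {

    /** Counts the sentences that mention both terms (case-insensitive substring match). */
    static int countCoOccurrences(List<String> sentences, String termA, String termB) {
        String a = termA.toLowerCase(Locale.ROOT);
        String b = termB.toLowerCase(Locale.ROOT);
        int count = 0;
        for (String sentence : sentences) {
            String s = sentence.toLowerCase(Locale.ROOT);
            if (s.contains(a) && s.contains(b)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Toy sentences written for this sketch; they make no scientific claim.
        List<String> sentences = Arrays.asList(
                "Toy sentence one mentions AKT1 and the PH domain.",
                "Toy sentence two mentions AKT1 only.",
                "Toy sentence three mentions the ph domain of akt1.");
        System.out.println(countCoOccurrences(sentences, "AKT1", "PH domain"));
    }
}
```

Substring matching would also count a longer name that merely contains the term, so a real system would need tokenization and synonym handling; an ontology such as ProKinO could supply the synonyms.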

Text mining systems can have a wide scope, ranging from the very simple tasks of recognizing named entities and categorizing them to the more complex tasks of summarization, question answering, and processing non-textual content. All these approaches can be used further to make literature-based discoveries that can become the foundation of new hypotheses for research. More recently, the focus of text mining systems has shifted toward addressing user-specific needs, and specialized applications are being developed to solve users' problems.

We intend to develop a text mining system which would allow scientists to

formulate advanced search queries, unlike the typical “bag of terms” queries adopted by

most search engines today. In essence, scientists would rely on our system to

automatically integrate the knowledge from several information sources while

formulating an advanced query, which otherwise would be very challenging. Such an

advanced query, represented as a graph, would include concepts (and their synonyms)

and data instances, all semantically interconnected by relevant relationships retrieved

from ProKinO. The query would be used to search for publications in which the concepts

and data items as recognized in a document section would match such a query graph, or

its significant sub-graphs.

5.3 Cancer Genome Annotation

One of our main focuses at present is the creation of an automated cancer genome annotation system based on ProKinO. Genome annotation is defined as the process of marking the genes and other biological features in a DNA sequence. Dr. Owen White was the first person to develop a software system for genome annotation, in 1995. He was a member of the team at The Institute for Genomic Research [74] that sequenced and analyzed the first genome of a free-living organism to be decoded.

Cancers arise due to the buildup of mutations in critical genes that change the normal programmes of cell proliferation, differentiation, and death [75]. Cancer research focuses on finding the mutated genes that are implicated in cancer development. A ‘census’ of cancer genes, the genes implicated in oncogenesis, indicates that mutations in more than 1% of genes contribute to human cancer and that the protein kinase domain is the domain most commonly found among known cancer genes [76]. The coding sequences of the protein kinases make a much larger sample of the cancer genome in which to look for the general patterns of somatic mutation in human cancers [77].

To consistently and accurately annotate protein kinase mutations in upcoming

cancer genome sequencing studies we introduced a class in ProKinO to provide a short

description regarding the structural location and evolutionary conservation for every

mutation. Once the annotation class is sufficiently populated, we will develop an

application that will essentially transfer information from the annotation class to a newly

identified mutation in cancer genome sequencing studies. We realize that novel mutations

that do not exist in ProKinO cannot be annotated this way. However, by frequently

updating the ontology, we will be able to address this issue. In this way we will be able to

provide a consistent annotation for protein kinase mutations discovered in cancer

genomes and allow cancer researchers to prioritize mutations for experimental studies.
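The transfer step described here amounts to a lookup keyed by the mutation. The sketch below illustrates that idea under stated assumptions: the class AnnotationTransfer, the gene-plus-identifier key, and the placeholder annotation text are hypothetical and do not reflect the actual ProKinO annotation class.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of transferring a stored annotation to a newly observed mutation. */
class AnnotationTransfer {

    private final Map<String, String> annotations = new HashMap<String, String>();

    /** Builds the lookup key from the gene name and the mutation identifier. */
    private static String key(String gene, String mutationId) {
        return gene + ":" + mutationId;
    }

    /** Stores the short description held for a known mutation. */
    void store(String gene, String mutationId, String annotation) {
        annotations.put(key(gene, mutationId), annotation);
    }

    /** Returns the stored annotation, or null when the mutation is novel. */
    String annotate(String gene, String mutationId) {
        return annotations.get(key(gene, mutationId));
    }

    public static void main(String[] args) {
        AnnotationTransfer transfer = new AnnotationTransfer();
        // Placeholder text; a real entry would describe structural location and conservation.
        transfer.store("AKT1", "33750", "placeholder annotation");

        String known = transfer.annotate("AKT1", "33750");
        String novel = transfer.annotate("AKT1", "not-in-ontology");
        System.out.println(known != null ? known : "no annotation available");
        System.out.println(novel != null ? novel : "no annotation available");
    }
}
```

A novel mutation simply yields no annotation, which is the limitation noted above and the reason frequent ontology updates matter.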


CHAPTER 6

CONCLUSION AND FUTURE WORK

Protein kinases are a large family of proteins that are implicated in many diseases, such as human cancer, and have been studied broadly from both the basic and the clinical point of view. In the public domain there are a few ontologies that serve the domains of protein families, but none of them is directly related to protein kinases, and the requirements of the protein kinase community remain unmet. To fill this need, and in view of the importance of protein kinases among protein families, we have developed a Protein Kinases Ontology (ProKinO), a comprehensive and specialized ontology for protein kinases.

The data and information about the protein kinase domain are spread across several heterogeneous resources, most of which store data in different formats following different schemas. The protein kinase community faces many challenges because of the difficulty of integrating data from these disparate sources and heterogeneous data formats, which becomes a hindrance in utilizing existing knowledge for research related to diseases like cancer. ProKinO is an endeavor to deal with this problem and to provide a framework for a detailed understanding of the relationships between sequence, structure, function, and disease in the protein kinase family.

ProKinO has been developed to capture, integrate, and represent sequence, structure, function, motif, and disease information on protein kinases and to provide a sharable and consistent vocabulary to formally specify concepts and their relationships in the domain of protein kinases. Our ProKinO population system does this by automatically extracting information from diverse sources, such as the Protein Data Bank (PDB), the Protein Families database (Pfam), KinBase, the Catalogue of Somatic Mutations in Cancer (COSMIC), and The Universal Protein Resource (UniProt), and then integrating the data to automatically populate knowledge in the ontology. ProKinO then serves as a knowledge base about the protein kinase domain, allowing the protein kinase community to navigate this specialized knowledge in one place and to build applications of interest to them.

ProKinO has become the basis for ongoing work on developing a text mining

system that mines the wealth of protein kinase literature accumulating constantly due to a

huge growth in the information about the structure, function, interaction and evolution of

protein kinases. The text mining system would allow scientists to formulate advanced

search queries, unlike the typical “bag of terms” queries adopted by most search engines

today. In essence, scientists would rely on our system to automatically integrate the

knowledge from several information sources while formulating an advanced query,

which otherwise would be very challenging.

To provide an elementary access to the ProKinO knowledge along with certain

capabilities specifically sought by protein kinase community, we are building an ontology

browsing and visualization tool for ProKinO. This browsing visualization tool will be

available via a standard Web browser. Through this utility researchers can pose queries

for knowledge extraction along with general user friendly navigational capabilities.

The integrated knowledge in ProKinO will be used to consistently and accurately annotate protein kinase mutations in upcoming cancer genome sequencing studies. In this way the protein kinase mutations discovered in cancer genomes can be given a consistent annotation, and cancer researchers can prioritize mutations for experimental studies.

The OBO Foundry establishes the main requirements for joining OBO: one has to agree to adopt and refine a set of principles that prove effective for ontology development in serving the biomedical research community. As OBO principles have been followed in the development of ProKinO, and as it serves an important domain that is not already covered by any of the available ontologies, we hope it will become a part of OBO in the near future. Efforts are underway to get ProKinO included in the Open Biomedical Ontologies Foundry.

In the future, one goal can be to integrate pathway and sequence variation data for visualization, analysis, and modeling purposes using the knowledge present in ProKinO. Presently, ProKinO has been developed for protein kinases, which are a specific family of proteins, and it focuses only on human; in the future this work can be extended to include knowledge about other important, related biomedical protein families, such as phosphatases and hydrolases, and to incorporate other organisms.

REFERENCES

1. Noy, N.F. and M. Klein, Ontology Evolution: Not the Same as Schema Evolution. Knowledge and Information Systems, 2004. 6: p. 428–440.

2. Janik, M. and K.J. Kochut, Wikipedia in Action: Ontological Knowledge in Text Categorization, in Proceedings of the 2008 IEEE International Conference on Semantic Computing. 2008, IEEE Computer Society. p. 268-275.

3. The Gene Ontology Project. 25 March 2010, date last accessed; Available from: http://www.geneontology.org/.

4. Sequence Ontology Project (SO). 25 March 2010, date last accessed; Available from: http://www.sequenceontology.org/.

5. The Open Biological and Biomedical Ontologies (OBO) Foundry. 25 March 2010, date last accessed; Available from: http://www.obofoundry.org/.

6. Natale, D.A., et al., Framework for a protein ontology. BMC Bioinformatics, 2007. 8 Suppl 9: p. S1.

7. Protein-protein Interaction Ontology in OBO Foundry. 25 March 2010, date last accessed; Available from: http://www.obofoundry.org/cgi-bin/detail.cgi?id=psi-mi.

8. Protein-modification ontology in OBO Foundry. 25 March 2010, date last accessed; Available from: http://www.obofoundry.org/cgi-bin/detail.cgi?id=psi-mod.

9. The Kinase Database at Sugen/Salk. 25 March 2010, date last accessed; Available from: http://kinase.com/kinbase/.

10. Catalogue of Somatic Mutations in Cancer (COSMIC). 25 March 2010, date last accessed; Available from: http://www.sanger.ac.uk/genetics/CGP/cosmic/.

11. RCSB Protein Data Bank: An Information Portal to Biological Macromolecular Structures. 25 March 2010, date last accessed; Available from: http://www.pdb.org/pdb/home/home.do.

12. Protein Families Database (Pfam). 25 March 2010, date last accessed; Available from: http://pfam.sanger.ac.uk/.

13. The Universal Protein Resource (Uniprot). 25 March 2010, date last accessed; Available from: http://www.uniprot.org/.

14. Hendler, J., et al., Web science: an interdisciplinary approach to understanding the web. Commun. ACM, 2008. 51(7): p. 60-69.

15. The Size of the World Wide Web. Retrieved 25 March 2010; Available from: http://www.worldwidewebsize.com/.

16. Jesse, A. and H. Nissan. The Official Google Blog: "We knew the Web was big...". Retrieved 25 July 2008; Available from: http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html.

17. Domain Counts & Internet Statistics. Retrieved 25 March 2010; Available from: http://www.domaintools.com/internet-statistics/.

18. Berners-Lee, T., J. Hendler, and O. Lassila, The Semantic Web. Scientific American, 2001.

19. World Wide Web Consortium. 25 March 2010, date last accessed; Available from: http://www.w3.org/.

20. World Wide Web Consortium: Semantic Web. 25 March 2010; Available from: http://www.w3.org/2001/sw/.

21. http://www.w3.org/2001/sw/. World Wide Web Consortium: Semantic Web Layer Cake. 25 March 2010, date last accessed; Available from: http://www.w3.org/2007/03/layerCake.png.

22. Resource Description Framework (RDF): Concepts and Abstract Syntax. 2004; Available from: http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/.

23. Feigenbaum, L. and E. Prud'hommeaux, SPARQL By Example: A Tutorial. 28 March 2010, date last accessed; Available from: http://www.cambridgesemantics.com/2008/09/sparql-by-example.

24. SPARQL endpoint. 25 March 2010, date last accessed; Available from: http://semanticweb.org/wiki/SPARQL_endpoint.

25. Gruber, T.R., A translation approach to portable ontology specifications.

Knowledge Acquisition, 1993. Vol. 5: p. 199-199. 26. Gruber, T.R., Toward principles for the design of ontologies used for knowledge

sharing. International Journal of Human-Computer Studies, 1995. Vol. 43(4-5): p. 907-928.

78  

27. W3C: Ontologies in Semantic Web Context. 25 March 2010]; Available from: http://www.w3.org/2001/sw/SW-FAQ#whont.

28. Noy, N.F. and D.L. McGuinness, Ontology Development 101: A Guide to

Creating Your First Ontology. 2001. 29. Bard, J., Ontologies: Formalising biological knowledge for bioinformatics.

Bioessays, 2003. 25(5): p. 501-6. 30. The Suggested Upper Merged Ontology (SUMO) 25 March 2010, date last

accessed]; Available from: http://www.ontologyportal.org/. 31. Basic Formal Ontology. 25 March 2010, Date last accessed]; Available from:

http://www.ifomis.org/bfo. 32. IDEAS group. 25 March 2010, date last accessed]; Available from:

http://en.wikipedia.org/wiki/IDEAS_Group. 33. Antezana, E., M. Kuiper, and V. Mironov, Biological knowledge management:

the emerging role of the Semantic Web technologies. Brief Bioinform, 2009. 10(4): p. 392-407.

34. Bodenreider, O., Biomedical ontologies in action: role in knowledge

management, data integration and decision support. Yearb Med Inform, 2008: p. 67-79.

35. Bodenreider, O., Ontology and Data Integration in Biomedicine: Success Stories

and Challenging Issues. 2008, Berlin Heidelberg New York: Springer: Proceedings of the Fifth International Workshop on Data Integration in the Life Sciences. p. 1-4.

36. The National Centre for Biomedical Ontology (NCBO). 26 March 2010];

Available from: http://bioontology.org. 37. National Institutes of Health. 1 April 2010, date last accessed; Available from:

http://www.nih.gov/. 38. The National Center for Ontological Research (NCOR). 26 March 2010, date

last accessed]; Available from: http://ncor.us/. 39. The European Centre for Ontological Research. 26 March 2010, date last

accessed]; Available from: http://www.ecor.uni-saarland.de/home.html. 40. Smith, B., et al., The OBO Foundry: coordinated evolution of ontologies to

support biomedical data integration. Nat Biotechnol, 2007. 25(11): p. 1251-5.

79  

41. The Open Biological and Biomedical Ontologies 26 March 2010, date last accessed]; Available from: http://www.obofoundry.org/.

42. Noy, N.F., et al., BioPortal: ontologies and integrated data resources at the click

of a mouse. Nucleic Acids Res, 2009. 37(Web Server issue): p. W170-3. 43. NCBO BioPortal. 1 April 2010, date last accessed]; Available from:

http://bioportal.bioontology.org/. 44. The Gene Ontology (GO). 29 March 2010, date last accessed]; Available from:

http://www.geneontology.org/. 45. Protein Kinases. 26 March 2010, date last accessed]; Available from:

http://en.wikipedia.org/wiki/Protein_kinase. 46. Wikipedia. Protein Kinase Phosphorylation. 25 March 2010, date last accessed;

Available from: http://en.wikipedia.org/wiki/Protein_kinase. 47. Protein Kinase Research. 26 March 2010, date last accessed]; Available from:

http://www.kinaseresearch.com/. 48. Sidhu, A.S., T.S. Dillon, and E. Chang, An Ontology for Protein Data Models, in

Proceedings of the 2005 IEEE Engineering in Medicine and Biology 27th Annual Conference. 2005: Shanghai, China,. p. 1-4.

49. kinase.com. 26 March 2010, date last accessed; Available from:

http://kinase.com/about/Acknowledgements.html. 50. Manning, G., et al., The protein kinase complement of the human genome.

Science, 2002. 298(5600): p. 1912-34. 51. Kinase.com. Human Kinome Poster. 24 March 2010, date last accessed];

Available from: http://kinase.com/human/kinome/. 52. Sequences of Protein Kinases. 26 March 2010, date last accessed; Available

from: http://kinase.com/kinbase/FastaFiles/Human_kinase_protein.fasta. 53. Catalogue Of Somatic Mutations In Cancer. 26 March 2010, date last accessed];

Available from: http://www.sanger.ac.uk/genetics/CGP/cosmic/. 54. Forbes, S.A., et al., COSMIC (the Catalogue of Somatic Mutations in Cancer): a

resource to investigate acquired mutations in human cancer. Nucleic Acids Res, 2010. 38(Database issue): p. D652-7.

80  

55. Protein Families (Pfam) Representation with Multiple Sequence Alignments and Hidden Markov Models. 26 March 2010, date last accessed]; Available from: http://pfam.sanger.ac.uk/.

56. Protein Families (Pfam) Representation with Multiple Sequence Alignments and

Hidden Markov Models. 27 March 2010, date last accessed]; Available from: http://pfam.sbc.su.se/.

57. Finn, R.D., et al., The Pfam protein families database. Nucleic Acids Res, 2010.

38(Database issue): p. D211-22. 58. The UniProt Knowledgebase (UniProtKB). 26 March 2010, date last accessed];

Available from: http://www.uniprot.org/help/uniprotkb. 59. Heger, A., et al., ADDA: a domain database with global coverage of the protein

universe. Nucleic Acids Res, 2005. 33(Database issue): p. D188-91. 60. The Worldwide Protein Data Bank (wwPDB). 24 March 2010, date last

accessed]; Available from: http://www.wwpdb.org/. 61. Biological Magnetic Resonance Data Bank. 26 March 2010, date last accessed];

Available from: http://www.bmrb.wisc.edu/. 62. Crystal Structure of the Complex of Human Epidermal Growth Factor and

Receptor (1IVO) Extracellular Domains. 26 March 2010, date last accessed]; Available from: http://www.pdb.org/pdb/explore/.

63. The Universal Protein Resource (UniProt) 2009. Nucleic Acids Res, 2009.

37(Database issue): p. D169-74. 64. The UniProt Reference Clusters (UniRef) 24 March 2010, date last accessed];

Available from: http://www.uniprot.org/help/uniref. 65. About Uniprot. 25 March 2010, date last accessed]; Available from:

http://www.uniprot.org/help/about. 66. Gómez-Pérez A, F.-L.M., Corcho O, Ontological Engineering. Springer–Verlag,

London, United Kingdom, 2003. 67. Vasant Honavar, C.A., Doina Caragea, Adrian Silvescu, Jaime Reinoso-Castillo,

and Drena Dobbs, Ontology-Driven Information Extraction and Knowledge Acquisition from Heterogeneous,Distributed, Autonomous Biological Data Sources. 2001.

68. Jena Ontology API. 25 March 2010, date last accessed; Available from:

http://jena.sourceforge.net/ontology/index.html.

81  

69. The Protégé Ontology Editor and Knowledge Acquisition System. 25 March 2010, date last accessed]; Available from: http://protege.stanford.edu/.

70. Natalya F. Noy, M.K., Ontology Evolution: Not the Same as SchemaEvolution.

Knowledge and Information Systems, 2004. 6: p. 428–440. 71. Christopher Brewster, H.A., An Dasmahapatra Data Driven Ontology Evaluation,

in International Conference on Language Resources and Evaluation 2004. 72. Muller, H.M., E.E. Kenny, and P.W. Sternberg, Textpresso: an ontology-based

information retrieval and extraction system for biological literature. PLoS Biol, 2004. 2(11): p. e309.

73. Zweigenbaum, P., et al., Frontiers of biomedical text mining: current progress.

Brief Bioinform, 2007. 8(5): p. 358-75. 74. The Institute for Genomic Research (TIGR) 27 March 2010, date last accessed];

Available from: http://www.jcvi.org/. 75. Davies, H., et al., Mutations of the BRAF gene in human cancer. Nature, 2002.

417(6892): p. 949-54. 76. Futreal, P.A., et al., A census of human cancer genes. Nat Rev Cancer, 2004. 4(3):

p. 177-83. 77. Greenman, C., et al., Patterns of somatic mutation in human cancer genomes.

Nature, 2007. 446(7132): p. 153-8.

