
A Study of

Differential Protein Interaction Network

Haogang ZHU

MSc Dissertation Presented to

School of Informatics

College of Science and Engineering

University of Edinburgh

July 16, 2005


Authorship declaration

I, Haogang ZHU, confirm that this dissertation and the work presented in it are my

own achievement.

1. Where I have consulted the published work of others this is always clearly

attributed;

2. Where I have quoted from the work of others the source is always given. With the

exception of such quotations this dissertation is entirely my own work;

3. I have acknowledged all main sources of help;

4. If my research follows on from previous work or is part of a larger collaborative

research project I have made clear exactly what was done by others and what I have

contributed myself;

5. I have read and understand the penalties associated with plagiarism.

Signed:

Date: 16 Aug, 2005

Matriculation no: 0455504


Acknowledgement

I would like to express my sincere gratitude to my supervisor, Dr. Douglas Armstrong,

for his support, patience, and encouragement throughout my graduate studies. It is not

often that one can find a supervisor and colleague who is always willing to listen to small problems and to provide support whenever and however he can.

My thanks also go to Dr. Andrew Pocklington for his sensible and intelligent suggestions and explanations during the investigation of the algorithms. His advice illuminated the ideas presented in this dissertation and was essential to its completion.

My parents, Jinping and Hong, receive my deepest gratitude and love for their

dedication and the many years of support during my current and previous studies that

provided the foundation for this work.

Last but not least I am deeply indebted to my girlfriend, Hui, for having the patience to

read and correct such a long technical dissertation from the other side of the world.

Without her unfailing support and encouragement I could never have completed this

work.


Abstract

Following current research on individual molecular networks from a systems biology perspective, the author seeks a way of estimating differential protein interaction networks derived from various public and commercial databases, compared against the reference established by SynProNet, which presented a molecular complex model for the organization and function of the proteome within the postsynaptic terminal. To compare different networks, two algorithms are adopted, addressing topological properties and biological significance respectively. Cluster comparison considers the molecular interaction network to consist of clusters, the important components that dominate the properties of a network and therefore capture the differences between divergent networks. The algorithm discovers the cluster structure underlying a network by divisive clustering. The concept of a concise graph is proposed to express the original network compactly without loss of information, and is utilized for efficient cluster discovery. The differential networks are finally contrasted by calculating the probability of correct clustering with respect to the reference network. Functional component comparison is then implemented to verify the conclusion from cluster comparison and to investigate a comparison method with biological significance. This algorithm predicts sets of molecules with potential cognitive phenotype for each network. The consistency between these predicted molecules and those generated from the reference network is then used as the estimation result. Data retrieval from the various databases is guided by DTDs generated by propagating functional constraints in the relational schema into the XML domain.


Table of Contents

Chapter 1 Introduction and background mining

1.1 Introduction

1.2 Related work

1.2.1 Robustness of network

1.2.2 Current network models

1.2.3 Topological property of protein interaction network

Chapter 2 Network comparison algorithms

2.1 Cluster comparison

2.1.1 Clique and concise graph

2.1.2 Calculating shortest path betweenness

2.1.3 Cluster division

2.1.4 Cluster modularity

2.1.5 Searching cluster division

2.1.6 Cluster structure comparison

2.2 Functional component comparison

2.2.1 Extrapolating probability

2.2.2 Scoring sub-network

2.2.3 Functional component searching

2.2.4 Functional component comparison

Chapter 3 Data retrieval

3.1 Data overview and constraint-driven data retrieval

3.2 Data retrieval implementation

Chapter 4 Result presentation and discussion

4.1 Cluster comparison

4.1.1 Single database estimation

4.1.2 Robustness of cluster comparison algorithm

4.1.3 Database collaboration

4.2 Functional component comparison

4.2.1 Component score and z-score

4.2.2 Single database comparison

4.2.3 Robustness of functional component comparison

4.2.4 Database collaboration

4.3 Discussion

Chapter 5 Conclusion and future work

5.1 Conclusion

5.2 Future work

Appendix A Constraint-Driven Data Retrieval

A.1 XML functional dependency and key constraint propagation

A.1.1 Formalism of DTD and XML

A.1.2 DTD cast and XML functional dependency

A.1.3 Key constraint propagation

A.2 Foreign key constraint propagation

A.2.1 Foreign-key-driven pre-mapping

A.2.2 One-to-many relationship and weak entity

A.2.3 Many-to-many relationship

A.2.4 More complex relationships

A.2.5 Formalized pre-mapping algorithm

A.2.6 Post-mapping

Appendix B Estimated databases in project

Appendix C ID mapping

Appendix D PPID and Protein Name

Bibliography


Chapter 1 Introduction and background mining

1.1 Introduction

Following the intensive study of the genome, research on proteins, especially from a systems biology perspective, has become an attractive approach to various unsolved problems. The SynProNet (Synaptic Proteome Network) project [Grant 2004] carried out innovative proteomic studies and identified about 700 proteins comprising the proteome of the postsynaptic terminal of central nervous system synapses. Beyond identifying specific proteins, the authors also presented a molecular network model for the organization and function of the proteome within the postsynaptic terminal.

The interaction network of the post-synaptic signal transduction complex MASC (MAGUK Associated Signaling Complex), the central database used in this project, forms a well-defined component within the post-synaptic density [Armstrong 2005]: it is the complex that co-precipitates with the scaffolding protein PSD-95 and the NMDA receptor sub-unit NR2A. The complex plays a central role in the processing of synaptic transmissions via the NMDA receptor, which leads to the activation of signaling pathways underlying cognitive function.

Topologically, the synaptic proteome network has a scale-free architecture that is evolutionarily conserved. Activation of channels and receptors (such as NMDA receptors or voltage-dependent calcium channels) can initiate signaling in the MASC, which then orchestrates the multiple pathways and cellular mechanisms for the expression of plasticity [Grant 2004]. It is this scale-free network architecture that gives rise to the robustness and complexity observed in molecular studies of synaptic plasticity, and allows distinct patterns of activity in the interaction network.

Most previous research on networks has focused on the topological and functional properties of a single protein interaction network. However, what happens when one faces several different protein interaction networks for one particular cell or tissue? This scenario is realistic: one who tries to retrieve a protein interaction network from several relevant databases may unfortunately get fairly different results. Faced with this dilemma, making the right choice becomes critically important for further research, because an inappropriate decision can be quite misleading.

Although accurate and complete, the curation of MASC has cost tens of thousands of pounds, spent on licenses for commercial databases as well as on staff for maintenance. Therefore, this project intends to estimate the goodness of networks automatically derived from existing public and commercial databases compared with the MASC database. The result can be used to evaluate the extent to which MASC justifies the money spent on it. Moreover, MASC could be maintained automatically if there is little difference between the computer-generated network and the manually curated MASC.

This project will therefore focus on these differential protein interaction networks and develop a systematic approach for comparing these semistructured networks. Based on the manually maintained MASC database, the investigated algorithms will be used to evaluate the protein interaction networks queried from various databases. The estimation result can further serve as a method for constructing new interaction networks from existing data sources and for evaluating the constructed networks.

1.2 Related work

So far, there is little literature on a precise semantic definition of the “difference” between protein interaction networks, or between networks in general, which makes the comparison of differential networks an ambiguous task. However, researchers have long focused on general properties shared by all sorts of networks and graphs. Recent studies of network structure have concentrated on a small number of properties that seem to be common to many networks and can be expected to affect the functioning of networked systems in a fundamental way. Among these, perhaps the best studied are the “small-world effect”, network transitivity or “clustering” [Watts 1998], and degree distributions [Barabasi 1999]. Many other properties have also been examined, including resilience to the deletion of network nodes [Albert 2000] [Cohen 2000], navigability or searchability of networks [Adamic 2001], and community structure [Girvan 2002] [Newman 2004]. Two perspectives are adopted here: the robustness and the topological properties of networks, which are summarized and related to this project in the following sections. Other properties will be introduced as they are used in the coming discussion.

1.2.1 Robustness of network

Complex protein and genetic networks are large systems of interacting molecular

elements that control the propagation and regulation of various biological signals

[Kitano 2002]. They are essential classes of biological computations which reflect

vital cellular processes such as the regulation of the cell cycle and gene expression.

Because some intracellular processes are important for the survival of the cell, they

need to be robust to variations in the environment. For example, variation in the

concentration of network components in a metabolic network may affect numerous

processes [Jeong 2000]. The complex architecture of such networks raises the

question of the stability of their functioning and topology. How can such complex

dynamical systems achieve important cellular tasks and remain stable against the

variations of their internal parameters? The answer to these questions will definitely

help us to “filter out” those most essential parts of the network and enable us to reveal

the potential difference between two networks, rather than the simple difference of nodes

and edges.

Three decades ago, [Savageau 1971] hypothesized that robustness was an essential

property of some genetic networks whose functions are preserved even if some of the

components in the networks were changed. This draws our attention to the robust parts

of the network when contrasting different networks with respect to the same tissue or

cell. More recently, several compelling theoretical studies [Yi 2000] [Albert 2000]

demonstrated that the key processes of specific intracellular networks exhibit robust behavior under variations of biochemical parameters. Reliability has emerged as a

fundamental concept for the characterization of the dynamical stability of biological

systems.


At this point, one question arises: are there universal design patterns that determine the reliability of a given class of networks? If so, it would be possible to predict and contrast the dynamical properties of these networks even without full knowledge of the molecular details [Fox 2001]. In particular, it is important to characterize dynamical reliability as the ability of a network to perform a sequence of biological tasks in the presence of perturbations. As will be seen in the next chapter, the concept of reliability plays an essential role in the network comparison algorithms: the construction of the “concise graph” from the original interaction network and the identification of clusters within a network by gradual perturbations to the interactions (edges) are both based on the notion of reliability.

1.2.2 Current network models

Specifically, a method for modeling molecular activities by a simple two-state model

was originally created for the study of the dynamics of genetic networks [Kauffman

1969], but was later applied to evolution and social models and became a prototype for

the study of dynamical systems. In this model, the network’s architecture follows a

random topology in which every element interacts, with the same probability, with K

other elements. Each network element has two functional states: active and inactive

states, as will be used in the functional component comparison algorithm in Chapter 2.

The state of a given element is determined by its interactions with other elements of

the network. An overall parameter for the whole system is the probability p that an

element of the network becomes active after interacting with other elements. This

probability can be inferred by regression model according to the reaction rates and the

relative concentration of each element. The property of the dynamical process within a

given network is determined by its topological parameter K and its biochemical

parameter p. Under the simple assumption of a random network topology, a remarkably rich

and unexpected dynamical behavior of the network was found [Kauffman 1969].

During the evolution of the system, the network elements pass through different states

until they reach a cyclic behavior. Different cycles are possible, which represent a

variety of intracellular tasks. In general, there exist two situations in network activity:

chaotic and robust. In the chaotic regime a perturbation in the state of a single element can make the system jump from one cycle to another. In the robust regime all such perturbations die out over time. Although this model is inadequate for further investigating the functioning of biological networks with heterogeneous architecture, it provides an appealing idea: comparing networks by contrasting their activated components, which is one of the algorithms used in the next chapter.
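For illustration, a minimal sketch of such a two-state (Kauffman-style) random Boolean network is given below. All names and parameters are illustrative, not from the dissertation; the regression-based inference of p mentioned above is not modeled, and p simply biases the random Boolean functions. Running the sketch with varying K and p exhibits the cyclic attractors and the chaotic versus robust regimes described above.

    import random

    def simulate_boolean_network(n=50, k=3, p=0.5, steps=200, seed=1):
        """Random two-state (Kauffman-style) network: n elements, each
        reading K randomly chosen inputs through a random Boolean function
        whose outputs are active with probability p; returns the length of
        the state cycle reached, or None if none appears within 'steps'."""
        rng = random.Random(seed)
        inputs = [rng.sample(range(n), k) for _ in range(n)]
        # One random Boolean function per element: a truth table over the
        # 2^k input combinations, each output active with probability p.
        tables = [[rng.random() < p for _ in range(2 ** k)] for _ in range(n)]
        state = tuple(rng.random() < 0.5 for _ in range(n))
        seen = {state: 0}
        for t in range(1, steps + 1):
            state = tuple(
                tables[i][sum(state[j] << b for b, j in enumerate(inputs[i]))]
                for i in range(n)
            )
            if state in seen:          # revisited a state: a cycle (attractor)
                return t - seen[state]
            seen[state] = t
        return None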

In recent years, analyses of the topology of large intracellular networks [Jeong 2000] have revealed a common architecture. In these natural networks, not all elements and links are equally important: a small but significant fraction of the elements are highly connected, while most elements are sparsely connected. This architecture is called scale-free topology and has been found to be common in social, ecological, and protein networks [Maslov 2002]. Although the effect of perturbations on the topology of these networks has been thoroughly studied [Albert 2000], their dynamical properties are still not clear [Barabasi 1999], especially when it comes to how much different perturbations influence the network.

1.2.3 Topological property of protein interaction network

There has been a series of studies focusing on identifying the biological modules in various cellular networks, ranging from metabolism [Jeong 2000] to genetic networks [Ito 2000]. This research assumes that proteins work together as a group to achieve some well-defined biological function. Previous experiments show that protein complexes acting as functional modules carry out many biological functions. From the network perspective these modules should appear as groups of nodes that are highly interconnected with each other but have only a few links to nodes outside the module (e.g. the hubs in a scale-free network). However, the scale-free topology apparently forbids the existence of independent modules in the network, because the hub proteins' ability to interact with a high fraction of each module's components makes a module's relative isolation nearly impossible. Recently, [Vazquez 2004] proposed and demonstrated experimentally that a network's scale-free topology can be reconciled with its potential modularity within the framework of hierarchical modularity, which inspired the author to utilize the hierarchical cluster structure to depict the essential properties of a network.

This means that protein interaction networks are fragmented into many distinct clusters [Schwikowski 2000]. A set of proteins in a system is always dominated by a giant cluster containing a significant fraction of all connected proteins, such that one can find a path of protein interactions between any two proteins belonging to this giant component. A small fraction of proteins, however, are completely isolated, and a common situation is that the giant component coexists with many isolated proteins. In this case, if the giant component is disregarded, the cluster size distribution follows a power law. If we continuously attack the hub proteins as well as their neighbors, the network is quickly separated and finally dispelled. This is exactly the basic idea of the “concise graph” proposed in the next chapter.

The discussion above summarizes some of the current research on network models and its relevance to this project. Other contributions will be discussed alongside the presentation of the algorithms. The next chapter will focus on the detailed methodology used in the project.


Chapter 2 Network comparison algorithms

Common sense suggests that a comparison or evaluation should end with a quantified score, such as a probability of being identical or the energy required for a transformation. Based on different hypotheses and criteria, comparisons can diverge and give rise to different conclusions. In this chapter, the author demonstrates two different approaches based on the topological and biological significance of molecular interaction networks.

2.1 Cluster comparison

The first approach is based on the topological properties of the protein interaction network. It considers the network to consist of clusters, the important components dominating the properties of the network. The algorithm originates from a method for identifying clusters within a network first proposed by Newman in 2004. The author then proposes the concept of the “concise graph”, a more compact expression of the original network that loses no connection information, and uses it to enhance computational efficiency. The discovered clusters are then used for network comparison, based on the belief that the properties of a network are, at least partly, held by its cluster structure.

The study of cluster structure in networks has a long history. Most of the research, spanning graph theory, computer science, and hierarchical clustering in sociology [Scott 2000], has made use of ideas from graph partitioning. Unfortunately, in most cases optimal network partitioning is believed to be an NP-complete problem [Garey 1979], which makes it intractable for large-scale problems like molecular interaction networks that may contain thousands of vertices. The best-known achievement is perhaps the Kernighan–Lin algorithm [Kernighan 1970], with O(n^3) time complexity on sparse graphs. This algorithm recursively exchanges sets of nodes, hill-climbing to find improvements where no single swap will improve the total net cut. However, how a given network is broken down into clusters can hardly provide any information about how many such clusters there will be or what the size of each cluster is. Furthermore, there is no reason that the number of inter-cluster edges should be minimized, because it is natural to have more such edges between large clusters than between small ones [Newman 2004].

For convenience of expression, we define the protein interaction network to be an undirected graph.

Definition: A protein interaction network G is an ordered pair <V, E>, where V is a non-empty set of proteins and E ⊆ V × V is a set of interactions; e = (u, v) ∈ E if and only if protein u interacts with protein v. Neighbor(v) is a mapping V → 2^V that returns the set of neighbors of a vertex v ∈ V, and Degree(v) returns the degree of vertex v.

Within the scope of this dissertation, we presume that the interaction network does not contain self-loops, that the adjacency relation is irreflexive, i.e. E = {(u, v) | u, v ∈ V have an interaction ∧ u ≠ v}, and that there are no parallel edges: e_1 = (u, v) ∈ E and e_2 = (u, v) ∈ E if and only if e_1 = e_2.
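As a minimal sketch of this definition (the class and method names are illustrative, not from the dissertation), such a network can be stored as adjacency sets, which automatically enforce the absence of parallel edges:

    class InteractionNetwork:
        """Undirected protein interaction network <V, E> with no
        self-loops and no parallel edges, stored as adjacency sets."""
        def __init__(self):
            self.adj = {}                      # protein -> set of neighbors

        def add_interaction(self, u, v):
            if u == v:                         # irreflexive: no self-loops
                return
            self.adj.setdefault(u, set()).add(v)   # sets forbid parallel edges
            self.adj.setdefault(v, set()).add(u)

        def neighbors(self, v):
            return self.adj.get(v, set())

        def degree(self, v):
            return len(self.adj.get(v, set()))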

The most intuitive characteristics of networks are local properties, which concern the connections among nearby vertices, and global properties, which concern the interactions among all vertices in the network. Following previous research in graph theory dealing with overall topological characteristics, the “color problem”, current and voltage in electric circuits, etc., recent work has triggered intensive research on the local properties of networks, such as cluster identification [Newman 2004] and functional clustering of proteins in a network [Andrew 2005]. This means that, alongside global characteristics, the local properties of a network, which make more sense in some circumstances, are of growing interest to researchers.

From a philosophical perspective, global and local properties are not independent. This chapter will show how the computation of cluster structure can benefit from the scale-free property of molecular interaction networks. The basic framework of the algorithm follows [Newman 2004], which attempts to find the least similar connected pairs of vertices and then remove the edges between them. By doing this repeatedly, the network is gradually divided into smaller components within which the vertices are compactly connected.

Instead of looking for the most weakly connected vertex pairs, [Newman 2004] looks for the edges in the network that are most “between” other vertices, meaning that “the edge is, in some sense, responsible for connecting many pairs of others”. To make things concrete: if two communities are joined by only a few inter-community edges, then all paths through the network from vertices in one community to vertices in the other must pass along one of those few edges [Newman 2004], which results in high betweenness values. Given a suitable set of paths, one can count how many go along each edge in the graph, and we can expect this number to be large for the inter-community edges, which provides a method for identifying them.

Therefore, the following sections concentrate on the discovery of cluster structure within a network using shortest path betweenness. Since cluster structure discovery is always computationally intensive, the author proposes a concept called the concise graph, on which betweenness can be computed efficiently.

2.1.1 Clique and concise graph

The simplest local property of a network is probably the connection between one vertex and its neighbors. This can be captured by a clique, defined as a set of centrally connected vertices in a network G <V, E>.

Definition: A vertex set C_v = {v} ∪ Neighbor(v) is defined to be a clique centralized at v ∈ V. It is represented as C_v = v[{v} ∪ Neighbor(v)]. The potential of a clique is the number of vertices within the clique.

Because of the reported scale-free property of molecular interaction networks, only a small fraction of cliques have relatively large potential. The central vertices of these cliques act as the “skeleton” of the network: these proteins dominate, directly or through intermediates, most of the interactions within the network, and therefore maintain the major properties of the network.

The network in Figure 2-1 shows a typical scale-free network. The skeleton vertices, colored black, are calculated by the algorithm discussed below. It can be seen that these vertices cover all of the interactions, and the cliques centralized on these nodes contain all vertices in the network. Consequently, no information is lost if we only know the connections within these cliques. At the same time, each clique reveals a local property: the central vertex can only reach non-neighbor vertices through its neighbors.

Figure 2-1: A small scale-free network with the skeleton vertices colored black.

The skeleton vertices are acquired by continuously eliminating the vertices with the highest degree in the network until all vertices are either removed or have degree 0. The algorithm is given below:

Initialization:

degreeArray is an array holding the degrees of all vertices in the network;

remainedVertices is a set containing the vertices which have not been removed from the network and have non-zero degree; it is initialized with the whole set of vertices;

cliqueSet is a set holding all cliques generated by the algorithm;

Loop:

While remainedVertices is not empty {

    Search for the vertex v_max with the largest degree value in degreeArray;

    cliqueSet = cliqueSet ∪ {C_vmax};

    for the vertices in Neighbor(v_max), decrease the degree value in degreeArray by 1;

    remainedVertices = remainedVertices − {v_max};

    remainedVertices = remainedVertices − {v | the degree of v in degreeArray is zero};

}

Return: cliqueSet.

Note that, instead of physically removing vertices from the graph, the attack on the hub vertices is carried out by decrementing the degree values of the corresponding vertices in degreeArray; the network itself is not modified while the algorithm runs. The returned result is a set of cliques from which a concise graph can be constructed.
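As an illustrative rendering only (not the project's actual implementation), the pseudocode above can be expressed in Python, assuming the InteractionNetwork sketch given earlier:

    def extract_cliques(network):
        """Repeatedly pick the vertex with the highest remaining degree,
        record the clique centralized on it, and 'remove' it by
        decrementing its neighbors' degree counts; the graph itself is
        never modified."""
        degree = {v: network.degree(v) for v in network.adj}
        remained = {v for v, d in degree.items() if d > 0}
        clique_set = {}
        while remained:
            v_max = max(remained, key=degree.get)
            clique_set[v_max] = {v_max} | set(network.neighbors(v_max))
            for u in network.neighbors(v_max):
                degree[u] -= 1                 # virtual removal of v_max
            remained.discard(v_max)
            remained = {v for v in remained if degree[v] > 0}
        return clique_set                      # central vertex -> clique members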

Definition: The concise graph of a network G <V, E> with clique set cliqueSet is defined to be C_G <C_V, C_E>, where C_V = cliqueSet and C_E = {(C_1, C_2) | C_1, C_2 ∈ C_V, C_1 ∩ C_2 ≠ ∅}.

As an example, a concise graph can be derived from the network in Figure 2-1 by eliminating vertices in the order 5, 11, 15, 12, 19, 10, 17, 3, 6, 16, 20. In this case, the resulting cliques are 5[5, 6, 3, 11, 2, 16], 11[11, 9, 5, 7, 10, 13], 15[15, 20, 21, 4, 19], 12[12, 9, 7, 8, 13], 19[19, 15, 20, 21, 17], 10[10, 9, 8, 11, 14], 17[17, 18, 16, 19], 3[3, 5, 7, 2], 6[6, 1, 5, 4], 16[16, 5, 17, 18], 20[20, 15, 19, 21].
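A minimal sketch of constructing the concise graph from such a clique set (assuming the extract_cliques sketch above, which maps each central vertex to its clique members) might be:

    from itertools import combinations

    def build_concise_graph(clique_set):
        """One node per clique; an edge joins two cliques whenever their
        member sets intersect (C_1 ∩ C_2 is non-empty)."""
        nodes = list(clique_set)
        edges = {(a, b) for a, b in combinations(nodes, 2)
                 if clique_set[a] & clique_set[b]}
        return nodes, edges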

The concise graph of the original network, generated by the application, is shown in Figure 2-2. The resulting concise graph clearly contains fewer vertices and edges than the original network, a consequence of the scale-free property of the network: as stated in [Albert 2000], a continuous attack on “hub” vertices will quickly dispel a scale-free network.

Figure 2-2: The concise graph constructed by attacking hub vertices. It contains fewer vertices and edges than the original graph in Figure 2-1.

The concept of the concise graph will be used to calculate shortest path betweenness efficiently in the following sections.


2.1.2 Calculating shortest path betweenness

As mentioned at the beginning of this chapter, we adopt shortest path betweenness during the process of divisive clustering. [Freeman 1977] defines a traditional calculation of vertex betweenness under the constraint that multiple shortest paths between a pair of vertices are equally important and are given equal weights summing to 1. For instance, if there are three shortest paths, each is given weight 1/3. In this dissertation, the author adopts the same definition for edge betweenness, although other definitions are possible [Goh 2001]. To calculate the fraction of paths flowing along each edge in the network given a specific source vertex, we make use of the concise graph and generalize the breadth-first search in [Newman 2004].

Because shortest path betweenness finds the shortest paths between all pairs of vertices and counts how many shortest paths run along each edge, the algorithm is repeated with every vertex as the source. For example, if there are n vertices in a network, betweenness is calculated n times, each time with a different source vertex. A large n is therefore computationally demanding and requires an efficient algorithm.

Fortunately, with the help of the concise graph, the betweenness scores of edges proposed in [Newman 2004] can be computed efficiently. Specifically, given a source vertex s, the corresponding betweenness can be calculated by propagating and retrieving (back-propagating) messages over the concise graph. Before going further into the algorithm, we define the concept of a separator on the concise graph.

Definition: The separator from clique C_u to clique C_v is the set S such that:

    S = {v}, if v ∈ C_u ∩ C_v;
    S = C_u ∩ C_v, otherwise.

It is denoted C_u →^S C_v.

The message propagation algorithm is as follows:

Initialization:

If the source vertex v_s does not form a clique centralized on it, insert C_vs into the concise graph and set up the edges between C_vs and the other cliques;

Assign a distance and a number of paths to C_vs such that C_vs.distance = 0 and C_vs.pathNumber = 1;

Add the clique C_vs into upperCliques, a set containing the cliques on the upper level during message propagation;

cliquesToBeProcessed = ∪_{C ∈ upperCliques} Neighbor(C) − upperCliques;

Push upperCliques onto backPropagationStack, a stack preserving the layered structure of the concise graph. This stack will not be used until message back propagation.

Loop:

While cliquesToBeProcessed is not empty {

    For all C_v ∈ cliquesToBeProcessed {

        C_v.distance = min over C′ ∈ Neighbor(C_v) ∩ upperCliques of
            { C′.distance + 1, if C′ →^{v} C_v; C′.distance + 2, otherwise };

        C_v = {v} ∪ (C_v − ∪_{C ∈ Neighbor(C_v) ∩ upperCliques} C);

        If C_v contains only v, delete C_v from the concise graph;

        Delete the edge (C_v, C′), for C′ ∈ Neighbor(C_v) ∩ upperCliques, if C_v.distance ≠
            { C′.distance + 1, if C′ →^{v} C_v; C′.distance + 2, otherwise };

        C_v.pathNumber = Σ_{C′ ∈ Neighbor(C_v) ∩ upperCliques} C′.pathNumber × |S − ∪_{C ∈ upperCliques} {v_C}|,
            where C′ →^S C_v, v_C denotes the central vertex of clique C, and |A| denotes the potential (number of elements) of set A;

    }

    Delete each non-central vertex v from every clique whose distance value is not the minimum among all cliques containing v;

    Push cliquesToBeProcessed onto backPropagationStack;

    tempCliquesSet = cliquesToBeProcessed;

    cliquesToBeProcessed = ∪_{C ∈ cliquesToBeProcessed} Neighbor(C) − upperCliques − cliquesToBeProcessed;

    upperCliques = tempCliquesSet;

}
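For reference, the quantities this propagation computes, the distance from the source and the number of distinct shortest paths to every vertex, can be sketched directly on the original network with a plain breadth-first search (assuming the InteractionNetwork sketch from earlier). The concise-graph version above computes the same values while traversing far fewer nodes:

    from collections import deque

    def shortest_path_counts(network, source):
        """Breadth-first pass computing, for every reachable vertex, its
        distance from 'source' and the number of distinct shortest paths
        reaching it."""
        distance = {source: 0}
        path_number = {source: 1}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in network.neighbors(u):
                if v not in distance:                  # first visit: new layer
                    distance[v] = distance[u] + 1
                    path_number[v] = path_number[u]
                    queue.append(v)
                elif distance[v] == distance[u] + 1:   # another shortest route
                    path_number[v] += path_number[u]
        return distance, path_number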

Figure 2-3 shows an example of the algorithm with source vertex 16. Starting from clique C_16 in the concise graph of Figure 2-2, the algorithm iteratively collects the direct neighbors of the cliques in upperCliques, excluding those already in upperCliques and the current cliquesToBeProcessed, and stores them as the new cliquesToBeProcessed. upperCliques is updated with the old cliquesToBeProcessed at the end of each iteration. In effect, this process separates the concise graph into several layers: the cliques in cliquesToBeProcessed are on the next level away from the source clique and will be processed in the next iteration. This can be viewed as message propagation from the source clique out to the other cliques, level by level. From this perspective, the algorithm is a breadth-first traversal of the concise graph.

Figure 2-3: The concise graph with source vertex 16. The separators are denoted by the squares on the edges, and the two numbers above each clique are the distance value and the number of paths, respectively.

Separators between the cliques in upperCliques and those in cliquesToBeProcessed are calculated according to the definition and are displayed as the squares on the edges of the concise graph in Figure 2-3. The distance and number of paths are then computed based on the property of the separators. There are two cases in total: for C_u →^S C_v, if S = {v}, which means that the central vertices of C_v and C_u are directly connected in the original network, the possible distance of C_v is C_u.distance + 1; otherwise, the possible distance of C_v is C_u.distance + 2. For instance, because C_16 →^{5} C_11, the central vertices of C_16 and C_11 are connected via vertex 5. In this case, when traversing from C_16 to C_11, the interval is 2 (16 → 5 → 11), which makes the possible distance of C_11 equal to C_16.distance + 2 = 2. A counterpart example is C_16 →^{5} C_5, in which the central vertices 16 and 5 are directly connected; therefore, the possible distance of C_5 is counted as C_16.distance + 1 = 1. Because of the shortest path criterion, only the minimum possible distance is assigned as the final distance of a clique. The distance values are shown as the first number above the corresponding cliques in Figure 2-3.

It can be seen that the distance value of each clique is exactly the length of the shortest path from the source vertex to the central vertex of the clique. In this case, the shortest distance between vertex 16 and vertex 11 is 2, and the shortest distance between vertex 16 and vertex 5 is 1. This result can be verified on the original network in Figure 2-1.

A more general example is the calculation of C_12.distance. Because C_11 →^{9,7,13} C_12 and C_3 →^{7} C_12, the two possible distances for C_12 are C_11.distance + 2 = 4 and C_3.distance + 2 = 4, where the distances of C_11 and C_3 were calculated in the previous iteration. Therefore, the minimum distance of C_12 is 4. The same computation gives the distance value of C_10, which is 3.

It is intuitive to understand the calculation of the path number of a clique: the path number of clique C_v is actually the number of shortest routes from the source vertex to the central vertex v. For C_u →^S C_v, let n(u, v) count the number of distinct shortest paths from u to v lying on shortest paths starting from the source vertex, and suppose there are n(u) shortest paths from the source vertex to u. It is easy to see that n(v) = Σ_u n(u) × n(u, v), where n(u, v) = |S − ∪_{C ∈ upperCliques} {v_C}| if there is a shortest path starting at the source vertex and passing through u to v, which is indicated by the fact that C_u formed the distance of C_v during the running of the algorithm; otherwise n(u, v) = 0.

Take the path number of C_12 as an example. Because C_11 →^{7,9,13} C_12 and C_3 →^{7} C_12, both of which form C_12.distance, the path number of C_12 is calculated as C_11.pathNumber × |{9, 7, 13}| + C_3.pathNumber × |{7}| = 1 × 3 + 1 × 1 = 4. A more interesting example is C_10. Because the separator {11} contains a central vertex in upperCliques, which is {11, 17, 5, 3, 6, 19}, the path number of C_10 is calculated as C_11.pathNumber × |{10}| + C_5.pathNumber × |{11} − {11}| = 1 × 1 + 1 × 0 = 1. The removal of 11 from the separator {11} avoids duplicated counting of the same path, because C_16 →^{5} C_11 →^{10} C_10 and C_16 →^{5} C_5 →^{11} C_10 actually represent the same path from 16 to 10, namely 16 → 5 → 11 → 10.

A more interesting phenomenon is that vertex 2 is originally contained in both cliques C_3 and C_5. Since the distance score of C_3 is 2, which is not the minimum distance value among the cliques containing vertex 2, vertex 2 is eliminated from C_3 during the algorithm.

This operation ensures that every remaining non-central vertex has its shortest path from the source vertex via the central vertex of its clique. In this case, the number of cliques containing a non-central vertex v gives the number of shortest paths from the source vertex to v.

The algorithm above, although it works on the concise graph, actually calculates the number of paths from the source vertex to all other vertices on which cliques are centralized in the concise graph. Besides, the algorithm gradually divides the concise graph into a layered structure, which is stored in the stack backPropagationStack. These path numbers are precisely what we need to calculate edge betweenness values, because if two vertices u and v are connected, with v farther than u from the source s, then the fraction of a geodesic path from s through u to v is given by u.pathNumber/v.pathNumber.

The algorithm for calculating edge betweenness in [Newman 2004] is described as:

1. Find every “leaf” vertex t, i.e., a vertex such that no paths from s to other vertices go

through t.

2. For each vertex i neighboring t assign a score to the edge from i to t of


i.pathNumber/t.pathNumber.

3. Starting with the edges that are farthest from the source vertex s, work up towards s.

For the edge from vertex i to vertex j, with j being farther from s than i, assign a score

that is 1 plus the sum of the scores on the neighboring edges immediately below it, all

multiplied by i.pathNumber/j.pathNumber.

4. Repeat step 3 until vertex s is reached.

With the help of the concise graph, the “farther” relation is given by the stack backPropagationStack, and the path numbers of vertices in the original network have been heuristically obtained from the path numbers of cliques in the concise graph: for the central vertex v of clique C_v, the path number of v is exactly the path number of C_v; otherwise, the path number of a non-central vertex u is the number of cliques containing u in the concise graph after message propagation.

As the cliques in backPropagationStack are stored in the opposite order of message propagation, the calculation of betweenness can be considered message back propagation, because the algorithm works from the “bottom” cliques up towards the source clique. The message passed here is the betweenness values of the edges on the lower level.

The algorithm for computing edge betweenness by message back propagation is given below:

Initialization:

The betweenness of every edge is initialized to 0;

cliquesToBeProcessed = pop(backPropagationStack);

Loop:

While backPropagationStack is not empty {

    upperCliques = pop(backPropagationStack);

    For all C_v ∈ cliquesToBeProcessed, in descending order of distance {

        weightSum = Σ_{u ∈ C_v − {v}} (v, u).weight, where (v, u).weight = v.pathNumber/u.pathNumber if (v, u).weight is 0;

        For all C_w ∈ Neighbor(C_v) ∩ upperCliques {

            If C_w →^{v} C_v, then (w, v).weight = (1 + weightSum) × w.pathNumber/v.pathNumber;

            else, for each r ∈ S − ∪_{C ∈ upperCliques} {v_C} such that C_w →^S C_v:

                (r, v).weight = (1 + weightSum) × r.pathNumber/v.pathNumber, and

                (w, r).weight = (w, r).weight + (r, v).weight × w.pathNumber/r.pathNumber, if (w, r).weight ≠ 0;
                (w, r).weight = (1 + (r, v).weight) × w.pathNumber/r.pathNumber, otherwise;

        }

    }

    cliquesToBeProcessed = upperCliques;

}
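Again for reference only, the same back-propagation rule can be sketched directly on the original network (assuming the shortest_path_counts helper above); this is the computation the concise-graph version performs over cliques instead of individual vertices:

    def edge_scores(network, source):
        """Back-propagate the edge scores for one source vertex: the score
        of the edge (u, v), with v farther from the source, is
        (1 + sum of the scores just below v) * n(u)/n(v)."""
        distance, n = shortest_path_counts(network, source)
        score = {}
        # Work from the farthest layer back up towards the source.
        for v in sorted(distance, key=distance.get, reverse=True):
            below = sum(score.get((v, w), 0.0)
                        for w in network.neighbors(v)
                        if distance.get(w, -1) == distance[v] + 1)
            for u in network.neighbors(v):
                if distance.get(u, -2) == distance[v] - 1:   # u one step nearer
                    score[(u, v)] = (1.0 + below) * n[u] / n[v]
        return score        # keyed by (nearer vertex, farther vertex)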

Figure 2-4 continues the demonstration by calculating the betweenness of all edges in the network of Figure 2-1. The backPropagationStack after message propagation holds {C_12, C_10}, {C_11, C_6, C_3, C_19, C_17, C_5} and {C_16}. This forms the three layers of the network indicated by the dashed lines in Figure 2-4. Starting from C_12 and C_10, the algorithm first calculates the sum of the betweenness values of the edges “below” vertex 12. Within the concise graph formed by message propagation, for a clique C_v, the edges “below” vertex v are the ones between v and the vertices in C_v − {v}. However, in the concise graph in Figure 2-3 there is no edge below vertex 12 (weightSum = 0 for C_12), which indicates that 12 is indeed a leaf vertex.

For C_11 →^{7,9,13} C_12, (7, 12).weight = (1 + 0) × 2/4 = 1/2, where 7.pathNumber is the number of cliques (C_11 and C_3) containing vertex 7, and 12.pathNumber is exactly the pathNumber of clique C_12, which is 4. Similar calculations yield (9, 12).weight = 1/4 and (13, 12).weight = 1/4. Besides, according to the algorithm, because (11, 9).weight = 0, (11, 9).weight is updated to (1 + (9, 12).weight) × 11.pathNumber/9.pathNumber = (1 + 1/4) × 1 = 5/4. The same steps give (11, 13).weight = 5/4 and (11, 7).weight = (1 + 1/2) × 1/2 = 3/4.

The betweenness values of all edges, obtained after going through all of the cliques and executing the same steps as described above, are shown in Figure 2-4.

Figure 2-4: Edge betweenness values of the original network with vertex 16 as the source vertex. The path number of each vertex is denoted by the number right above it, and the betweenness values are shown near the edges. The dashed lines indicate the division of the network into the three layers formed by message propagation in backPropagationStack. The central vertices are shaded black.

Note that the algorithm guarantees that, when it comes to C_v, the betweenness values of the edges between v and C_v − {v} have all been calculated, provided the vertex in C_v − {v} is not a leaf node. A leaf vertex t is indicated by a betweenness of 0 on the edge between v and t. For example, when processing C_19, we find that (19, 15).weight = 0, which shows that vertex 15 is a leaf node. On the other hand, when processing C_11, (11, 9).weight is not 0, so 9 is not a leaf node and (11, 9).weight was calculated before the algorithm came to C_11.

It can therefore be seen that the order of the cliques is essential for correctly calculating the betweenness values. This order is guaranteed by the “layered” structure of the network, which is maintained by backPropagationStack. Referring to the message propagation algorithm, it is easy to see that the cliques in backPropagationStack are ordered: the cliques on the top of backPropagationStack are the ones with the longest distance so far and should be processed earlier than any others in backPropagationStack.

Compared with the algorithm in [Newman 2004], which calculates the number of distinct paths from the source vertex to all other vertices on the original network, the algorithm here is carried out on the concise graph, which has far fewer vertices and edges given the assumed scale-free property of molecular interaction networks. Since message propagation and back propagation are based on a breadth-first traversal of the concise graph rather than of the original network, the algorithm above is less computationally intensive.

Besides, one obvious characteristic of the algorithms above is that the most computationally expensive procedures are achieved by set operations, which can be implemented efficiently using proper techniques such as hash sets or hash maps.

As discussed at the beginning of this section, the algorithm is repeated with every vertex of the network as the source. The overall betweenness of an edge is the sum of its weight values over all iterations.
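As a small sketch (assuming the edge_scores helper above), this aggregation over all sources might look like:

    def total_edge_betweenness(network):
        """Sum the single-source edge scores over every vertex taken as
        the source; the result is the overall shortest path betweenness."""
        total = {}
        for s in network.adj:
            for (u, v), w in edge_scores(network, s).items():
                key = frozenset((u, v))   # direction along the edge is irrelevant
                total[key] = total.get(key, 0.0) + w
        return total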

2.1.3 Cluster division

Edge betweenness depicts the importance of an edge in the network: the higher the betweenness, the more important the edge. In other words, an edge with a high betweenness value is more likely to be one lying between clusters. Consequently, if we continuously knock out the edges with the highest betweenness values, the clusters can be separated out within a short time. The general form of the network division algorithm is thus as follows [Newman 2004] (a code sketch is given after the list):

1. Calculate betweenness scores for all edges in the network according to the algorithm in the last section;

2. Find the edge with the highest betweenness and remove it from the network;

3. Recalculate betweenness for all remaining edges;

4. Repeat from step 2.
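As a sketch of this loop, the stock routines of the networkx library stand in here for the concise-graph betweenness computation developed above:

    import networkx as nx

    def girvan_newman_divisions(G):
        """Repeatedly remove the edge with the highest recomputed shortest
        path betweenness and yield the resulting connected components."""
        g = G.copy()
        while g.number_of_edges() > 0:
            betweenness = nx.edge_betweenness_centrality(g)   # steps 1 and 3
            u, v = max(betweenness, key=betweenness.get)      # step 2
            g.remove_edge(u, v)
            yield list(nx.connected_components(g))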

As stated by Newman, the recalculation step is the most important feature of the algorithm. Conversely, merely calculating the edge betweenness for all edges once and then removing edges in decreasing order of betweenness to produce the division of the network causes problems: once an edge is removed, the betweenness values of the remaining edges no longer reflect the properties of the current network.

Figure 2-5: The tree structure of clusters of the network in Figure 2-1. Some of the leaf nodes contain two vertices, because how to divide two vertices into two clusters is obvious.

This can cause unexpected problems. Take a scenario used by Newman as an example: if two clusters are joined by two edges, but most paths between the two clusters flow along just one of them, that edge will have a high betweenness score and will be removed at an early stage of the algorithm, while the second edge might not be removed until much later. The obvious division of the network into two clusters might therefore not be discovered by the algorithm. Even worse, the two clusters might be individually broken up before the division between them is discovered.

Therefore, the algorithm used here recalculates betweenness after each edge removal. As edges are removed, the cluster structure is separated out whenever the network becomes disconnected. The newly formed clusters can be further divided into sub-clusters as the algorithm moves on.

This forms a tree structure, as shown in Figure 2-5. The cluster division tree depicts how the network in Figure 2-1 is divided into clusters until each cluster contains only an individual vertex. This tree structure will be used for searching for the best cluster division in the following sections.

2.1.4 Cluster modularity

Up to this point, the shortest path betweenness of a given network has been efficiently calculated by constructing a concise graph from the original network. As mentioned earlier, a higher betweenness value means that more shortest paths pass through the corresponding edge. It is straightforward to see that if two clusters are connected by a relatively small number of edges, a large number of shortest paths will pass through those edges, resulting in high betweenness. In this case, if we remove these edges iteratively, the network is quickly divided into parts, eventually leaving only one vertex in each part. Therefore, if we stop in the middle of the algorithm, before the network is broken down into single vertices, several clusters can be retrieved. [Newman 2004] has shown that shortest path betweenness works well at recovering known clusters by cutting a cluster tree like Figure 2-5 at the proper position.

In practice, however, the algorithm will normally be used on networks in which the clusters are unknown. The algorithm can always divide a network into clusters, even a completely random network with no meaningful cluster structure. It is therefore necessary to know which divisions are the best ones for a given network and how to search for these divisions among a huge number of possibilities.

To quantify the “goodness” of the found clusters, we use the modularity measure from [Newman 2003]. For a network comprising k clusters, a k × k symmetric matrix M is defined such that element M_ij is the fraction of all edges linking vertices in cluster i to vertices in cluster j. In particular, as mentioned in [Newman 2003], each edge can only be counted once in the matrix M to avoid duplication: the same edge must not appear both above and below the diagonal of M. To keep the matrix symmetric, an edge linking clusters i and j is split half-and-half between M_ij and M_ji. Moreover, when calculating modularity, all edges in the original network are taken into account, regardless of whether they have been removed by the clustering algorithm.

Suppose that the network in Figure 2-1 is divided into the three clusters shown in Figure 2-6. The corresponding matrix M is given on the right of Figure 2-6.

Figure 2-6: Left: an example of a cluster division of the network in Figure 2-1. The three clusters are indicated by dashed circles with the index numbers outside. Right: the M matrix corresponding to the cluster division in the left figure. The total number of edges, 31, is the denominator. The fractions off the diagonal are split half-and-half symmetrically with respect to the diagonal.

It can be seen from the M matrix in Figure 2-6 that all elements of the matrix add up to 1, because the total number of edges is 31. Moreover, since there are 2 edges between clusters 1 and 2, these edges are split evenly between the two clusters, and thus contribute 1/31 to each of M_12 and M_21.

Based on the M matrix, [Newman 2004] defines two measures to quantify the quality of a clustering. The trace of the matrix M, Trace(M) = Σ_i M_ii, gives the “compactness” of the clusters in the network. Specifically, Trace(M) is the fraction of edges connecting vertices in the same cluster, so a good division into clusters should have a high value, meaning that the majority of edges are within clusters and only a small fraction lie between different clusters.

Nevertheless, Trace(M) alone is not a good estimator of the quality of a cluster division. For example, placing all vertices in a single cluster gives the maximal value Trace(M) = 1 while providing no information about cluster structure. To tackle this problem, there should be a reference network with no cluster structure against which the network under consideration can be compared.

In a network, vertices form a cluster because they are tightly connected by the edges within the cluster and loosely connected to the vertices outside it. In other words, if the fraction of connections inside and outside a cluster were the same, such a cluster would not exist. Therefore, serving as a contrast to a network with cluster structure, the reference network should have an even edge distribution both inside and outside clusters: each vertex is linked to vertices in all of the clusters with the same probability. For example, to generate a reference network with three clusters, each vertex is connected to another vertex in each cluster with probability 1/3.

Because M_ij can be interpreted as the probability that there is an edge connecting a vertex in cluster i and a vertex in cluster j, the reference network should have uniform entries in its M matrix. For example, the M matrix of a typical reference network with three clusters would be:

    ( 1/9  1/9  1/9 )
    ( 1/9  1/9  1/9 )
    ( 1/9  1/9  1/9 )

To quantify the property of the reference network, another measure, M_i = Σ_j M_ij, is defined; it represents the fraction of edges that connect to vertices in community i [Newman 2003]. In a reference network, in which edges fall between vertices without regard for the clusters they belong to, M_ij = M_i M_j should always hold. This can be verified by a simple calculation. Suppose the reference network with T clusters has an M matrix filled with the same entry a. The only constraint on the M matrix is that all elements add up to 1, i.e. T^2 a = 1, so that:

    a = 1/T^2

In this case, M_i = M_j = Σ_j M_ij = T a = 1/T, and thus M_i M_j = 1/T^2 = a, where a was defined to be M_ij.

With this property of reference network, we use the definition of modularity measure

proposed in [Newman 2003] to quantify the cluster strength:

Q =∑ −i

iiii MMM )( = ∑−ji

MMTrace,

2)(

This quantity measures the difference between the fraction of within-cluster edges in the estimated cluster division and that in the reference network with no cluster structure. As proved above, when there is no cluster structure in the network, i.e. the number of within-cluster edges is no better than random, $M_{ij} = M_i M_j$ holds and the modularity value is Q = 0. On the other hand, the maximum value Q = 1 indicates strong cluster structure. [Newman 2004] reports that modularity values for networks with clear structure typically fall in the range from about 0.3 to 0.7 in practice. In particular, according to the M matrix in Figure 2-6, the modularity value of that cluster division is 0.5286.
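To make the computation concrete, the following is a minimal Java sketch (not the project's actual implementation) that builds the M matrix from an adjacency matrix and a cluster assignment and returns Q; the toy graph in main() is an invented example, not data from this project.

import java.util.*;

public class Modularity {

    /** Builds the M matrix and returns Q = Trace(M) - sum_i M_i^2. */
    static double modularity(int[][] adj, int[] cluster, int numClusters) {
        double[][] m = new double[numClusters][numClusters];
        double edges = 0;
        for (int i = 0; i < adj.length; i++) {
            for (int j = i + 1; j < adj.length; j++) {
                if (adj[i][j] == 1) {
                    edges++;
                    // a between-cluster edge is split half-and-half,
                    // which keeps M symmetric; a within-cluster edge
                    // adds a full count to the diagonal
                    m[cluster[i]][cluster[j]] += 0.5;
                    m[cluster[j]][cluster[i]] += 0.5;
                }
            }
        }
        double q = 0;
        for (int i = 0; i < numClusters; i++) {
            double rowSum = 0;                               // M_i = sum_j M_ij
            for (int j = 0; j < numClusters; j++) rowSum += m[i][j] / edges;
            q += m[i][i] / edges - rowSum * rowSum;
        }
        return q;
    }

    public static void main(String[] args) {
        // two triangles joined by a single edge: a clear 2-cluster structure
        int[][] adj = {
            {0,1,1,0,0,0}, {1,0,1,0,0,0}, {1,1,0,1,0,0},
            {0,0,1,0,1,1}, {0,0,0,1,0,1}, {0,0,0,1,1,0}
        };
        int[] cluster = {0, 0, 0, 1, 1, 1};
        System.out.println(modularity(adj, cluster, 2));     // prints ~0.357
    }
}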

2.1.5 Searching cluster division

The modularity value defines a measure of "fitness" of a cluster division, so it can be used to search for a good division of the network. To make the search algorithm easy to interpret, the breakdown of the network is illustrated with a dendrogram [Newman 2004] in which the nodes at the bottom represent individual vertices of the network (Figure 2-7). Moving up the dendrogram joins the vertices together into larger clusters until the top is reached, where all vertices are connected together. Conversely, from top to bottom, the dendrogram describes how the network is split into smaller and smaller clusters until there is only one vertex in each cluster [Newman 2004].

The dendrogram shows the order in which the network is divided and can easily be obtained from a cluster division tree like the one in Figure 2-5. In particular, there is an additional "timer" which records the moment at which clusters are generated by removing edges. At each time step, only the one edge with the highest betweenness score among all current clusters is removed. Whenever the removal of an edge separates a cluster into sub-clusters, these sub-clusters are marked with the current "time", which is simply the number of edges removed so far. With the division times, the cluster division tree can be converted to the dendrogram by reversing the divisive process: starting from time 0 with the clusters containing only individual vertices, join the clusters in the opposite order to that in which the edges were removed. The joining procedure is achieved by the following steps:

Figure 2-7: A dendrogram for the cluster division tree in Figure 2-5. A cross-section of the tree at any level, as indicated by the dashed line, gives rise to a cluster structure. According to the algorithm for generating the dendrogram, the heights of the clusters indicate the order in which the joins (or, read top-down, the splits) take place [Newman 2004]. The height values here are for demonstration purposes only and are not as accurate as the ones computed by the program.

1. Put all clusters containing single vertices at the bottom of the dendrogram, and start the timer for joining clusters;

2. At each time step, pick out the remaining cluster with the largest division time recorded when the network was divided by removing edges, and place this cluster in the dendrogram at the height of the current time;

3. Go back to step 2 and increase the timer by 1 until all clusters are in the dendrogram.

The connections among clusters in the dendrogram are exactly the ones in the cluster division tree, so no extra bookkeeping is needed for them. The corresponding dendrogram for the cluster division tree in Figure 2-5 is depicted in Figure 2-7.

Working from the cluster division tree in Figure 2-5, the construction of the dendrogram starts from the leaf nodes [8] and [9, 10]. We do not start from [9] and [10] separately because, if there is an obvious cluster structure, it will become apparent well before the cut comes down to [9, 10] from the top of the dendrogram. This will be discussed further below.

As suggested by Newman, the modularity value is calculated for each division of the network while moving down the dendrogram. The way to identify promising cluster structure is to look for local peaks in the modularity value during this descent. [Newman 2004] showed that the cross-section of the dendrogram at a position with peak modularity typically indicates a particularly satisfactory cluster structure, and that there are usually only one or two such peaks. In cases where the cluster structure is known beforehand, they found that "the positions of these peaks correspond closely to the expected divisions".

Figure 2-8: The change of the modularity value when travelling down the dendrogram in Figure 2-7. It can be seen that there is only one peak during the process, which is consistent with [Newman 2004].

The investigation of the simple network in Figure 2-1 shows a similar result, as depicted in Figure 2-8. The cross-section of the dendrogram is obtained by specifying an increasing dendrogram depth; in particular, the depth is set to 0.5, 1.5, 2.5, ... until the algorithm reaches the bottom of the dendrogram. At each step, the cut dendrogram gives a cluster structure whose modularity value is plotted in the figure. It can be seen that the modularity reaches its peak when the depth is 3.5, the position of which is depicted by the dashed line in Figure 2-7. Moreover, as noted above, the position of the peak modularity value lies at a high level of the dendrogram and rarely comes down to the bottom level.

2.1.6 Cluster structure comparison

The result of searching for a cluster division gives a good cluster structure of the network. Meanwhile, different networks will have different underlying cluster organizations, which the algorithm can uncover. If we believe that the clusters of a network carry some special significance, such as similar function or localization of proteins, then the diverse cluster structures obtained from different networks can reflect, to some extent, the potential differences between these networks.

Therefore, it is reasonable to compare the networks by contrasting their cluster structures. Specifically, if we regard the clustering process as classification, the difference between networks can simply be judged by the rate of correct classification. Recall that the problem in this project is to estimate the correctness of the networks from other databases compared with the network from the MASC database, which we consider to be the "right" underlying network of the post-synapse. The cluster organization of a network from any database can be obtained simply by running the algorithms in the previous sections.

The problem can be formally stated as: given the reference network from MASC, denoted $G_r$, with T clusters $\{Cl_1^r, \ldots, Cl_T^r\}$ returned by the cluster structure discovery algorithm, what is the probability of correct clustering based on a network $G_x$ from one of the other databases with cluster structure $\{Cl_1, \ldots, Cl_{T'}\}$?

One way to answer this is to find the counterpart cluster in $\{Cl_1^r, \ldots, Cl_T^r\}$ for each cluster in $\{Cl_1, \ldots, Cl_{T'}\}$. Since there is no prior information about the correspondence, a brute-force approach would try all possible matches and pick the one with the highest accuracy. However, as the number of clusters grows, trying all possibilities becomes intractable and computationally intensive. Therefore, the author adopts a heuristic method to search for a good match between the two cluster structures.

Before detailing the search algorithm, suppose that $Cl_1, \ldots, Cl_N$ have been mapped to $Cl_1^r, \ldots, Cl_N^r$ respectively, where $N = \min\{T, T'\}$. The probability of correctly clustering the vertices of the reference network using the estimated network is calculated as:

$$P = \frac{\sum_{i=1}^{N} \left| Cl_i \cap Cl_i^r \right|}{\left| \bigcup_{i=1}^{T} Cl_i^r \right|}$$

In this formula, the denominator is the number of vertices in the standard MASC network and the numerator is the total number of vertices which are correctly clustered into the right cluster under the match from $\{Cl_1, \ldots, Cl_N\}$ to $\{Cl_1^r, \ldots, Cl_N^r\}$.

It can be seen that the maximum value of this probability is

$$P_{\max} = \frac{\left| \bigcup_{i=1}^{T'} Cl_i \cap \bigcup_{i=1}^{T} Cl_i^r \right|}{\left| \bigcup_{i=1}^{T} Cl_i^r \right|}$$

Therefore, P can only possibly reach 1 if the estimated network contains all of the vertices in the reference network, because $\bigcup_{i=1}^{T'} Cl_i \cap \bigcup_{i=1}^{T} Cl_i^r = \bigcup_{i=1}^{T} Cl_i^r$ in that situation.

However, as one might expect, the estimated network can hardly perform better than the standard network and normally contains far fewer vertices. This means that the estimated network can rarely achieve a high P score, as its highest attainable value is much less than 1. To avoid this limitation, we rescale the probability as:

$$P_{rescale} = \frac{\sum_{i=1}^{N} \left| Cl_i \cap Cl_i^r \right|}{\left| \bigcup_{i=1}^{T'} Cl_i \cap \bigcup_{i=1}^{T} Cl_i^r \right|}$$

which gives the probability of correctly clustering the vertices shared by the estimated network and the reference network when the cluster structure discovery algorithm is applied to the estimated network.

Both probabilities will be examined in Chapter 4 for each network, but probability P is used to search for a matching between two cluster structures because we want to take into account the vertex coverage of the estimated network. The algorithm is given below:

1. Randomly choose a cluster $Cl_i$ in $\{Cl_1, \ldots, Cl_{T'}\}$;

2. Match $Cl_i$ to $Cl_j^r$ if $|Cl_i \cap Cl_j^r|$ achieves the highest value over all $Cl_j^r \in \{Cl_1^r, \ldots, Cl_T^r\}$;

3. Delete $Cl_i$ and $Cl_j^r$ from their respective cluster structures, and repeat steps 1 and 2 until either cluster structure is empty.

The algorithm is repeated 100 times in this project in order to find a high probability value P. After matching the cluster structures, the scores P and $P_{rescale}$ are calculated as the evaluators of the estimated network.
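The following is a minimal Java sketch of this greedy matching heuristic, assuming clusters are represented as sets of protein IDs; the clusters in main() are invented for illustration only.

import java.util.*;

public class ClusterMatch {

    static int overlap(Set<String> a, Set<String> b) {
        int n = 0;
        for (String s : a) if (b.contains(s)) n++;
        return n;
    }

    /** One randomized matching pass; returns the number of correctly clustered vertices. */
    static int matchOnce(List<Set<String>> estimated, List<Set<String>> reference, Random rnd) {
        List<Set<String>> est = new ArrayList<>(estimated);
        List<Set<String>> ref = new ArrayList<>(reference);
        int correct = 0;
        while (!est.isEmpty() && !ref.isEmpty()) {
            Set<String> cl = est.remove(rnd.nextInt(est.size()));  // step 1
            int best = 0, bestIdx = 0;
            for (int j = 0; j < ref.size(); j++) {                 // step 2
                int ov = overlap(cl, ref.get(j));
                if (ov > best) { best = ov; bestIdx = j; }
            }
            correct += best;
            ref.remove(bestIdx);                                   // step 3
        }
        return correct;
    }

    public static void main(String[] args) {
        List<Set<String>> ref = List.of(
            new HashSet<>(List.of("A0001", "A0002", "A0003")),
            new HashSet<>(List.of("A0010", "A0011")));
        List<Set<String>> est = List.of(
            new HashSet<>(List.of("A0001", "A0002")),
            new HashSet<>(List.of("A0011")));
        Random rnd = new Random(42);
        int best = 0;
        for (int run = 0; run < 100; run++)        // repeated runs keep the best match
            best = Math.max(best, matchOnce(est, ref, rnd));
        int refSize = 5;   // |union of reference clusters| in this toy example
        int shared = 3;    // vertices shared by both networks in this toy example
        System.out.printf("P = %.2f, P_rescale = %.2f%n",
                          best / (double) refSize, best / (double) shared);
    }
}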

The result of cluster comparison will be presented in Chapter 4.

2.2 Functional component comparison

In the previous discussion, it was shown systematically that networks can be compared based on their cluster structure, from a topological perspective. The algorithm has been reported to discover potential cluster organization with high accuracy [Newman 2004].

However, because the algorithm was originally designed for general networks rather than specifically for protein interaction networks, there is no guarantee that the details of the algorithm carry biological significance. For example, the cluster structure emerges by continuously knocking out the edges with the highest betweenness value, which is calculated from the shortest paths between all pairs of vertices in the network. In contrast, in a molecular network there is no evidence that two proteins must interact with each other through the shortest path between them. This inconsistency between the algorithm and the biological mechanism may give rise to unexpected results that mislead the estimation of the networks, because the discovered cluster structure, which is used to judge a network, may not be the one most plausibly hidden in the molecular functions of the network.

The functionality of proteins is probably the most significant property within an interaction network. Therefore, in order to make the network estimation algorithm biologically meaningful, the author also compares the networks based on potential functional components in a molecular network.

Because the networks queried from databases are normally collected from various publications or submitted by research groups from a range of backgrounds, it is hard to distinguish functional components in the network directly. In other words, given a network, there is no direct and definite information about which proteins are involved in a specific function. Fortunately, a heuristic functional component identification algorithm for biological networks has been investigated by Andrew Pocklington [Andrew 2005]. This algorithm provides a prediction mechanism to assist the search for molecules on the basis of function and disease, guided by the topology of molecular interaction networks.

In attempting to predict the set of molecules underlying a given function or phenotype, the algorithm proposed in [Andrew 2005] assumes that cellular functions correspond to overlapping sub-networks which, taken together, comprise the entire molecular interaction network of the cell. As stated above, within this network, experimental data vary widely in terms of coverage, specificity, noise and functional correlation, all of which fill the network with various kinds of uncertainty. Therefore, it is sensible and straightforward to use a probability P(i|D) to measure how likely a protein i is to be functionally relevant given the data D.


2.2.1 Extrapolating probability

In most cases, the functionally relevant proteins are only partly known, so extrapolation is required to assign probability values to the entire network. A topology-dependent score is defined via random walks between molecules within a set.

Formally, given a subset of implicated molecules M which we are confident are functionally relevant, a way to estimate P(j|D), where j ∉ M, is to marginalize the joint probability P(j, i|D) over i ∈ M. P(j|D) can then be calculated as:

$$P(j|D) = \sum_{i \in M} P(j|i, D)\, P(i|D)$$

where $P(j|i, D)$ is the conditional probability of j being classified as functionally relevant given i and the data D. For convenience, the data D will be omitted in the following discussion.
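As a toy worked example of this marginalization, the following Java fragment sums P(j|i) P(i) over the implicated set; the PPIDs and conditional probabilities are invented numbers for illustration.

import java.util.Map;

public class Extrapolate {
    public static void main(String[] args) {
        // implicated molecules with P(i|D) = 1
        Map<String, Double> pImplicated = Map.of("A0018", 1.0, "A0011", 1.0);
        // assumed P(j|i) values for some protein j outside M
        Map<String, Double> pGiven = Map.of("A0018", 0.4, "A0011", 0.2);
        double pj = 0;
        for (String i : pImplicated.keySet())
            pj += pGiven.get(i) * pImplicated.get(i);   // P(j|D) = sum_i P(j|i) P(i)
        System.out.println("P(j|D) = " + pj);           // prints 0.6
    }
}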

$P(j|i)$ is estimated from the relationship between topology and function. In [Andrew 2005], $P(j|i)$ is interpreted as the influence on j of knocking out protein i, where the "influence" is measured by the change in vertex betweenness when protein i is removed from the network. Instead of the edge betweenness used in the cluster comparison, the betweenness $B_j$ of vertex j is defined as the expected net number of times a random walk between a pair of other vertices passes through j, averaged over all pairs, where "net" means that "if a walk passes through a vertex and then later passes back through it in the opposite direction, the two cancel out and there is no contribution to the betweenness" [Newman 2003].

The random walk can be thought of as a message passing process originating at a source vertex s on the network and heading for some target t with no idea where t is. Thus, on each step of its travel, the message moves from its current position to one of the neighbouring vertices with equal probability [Newman 2003]. In particular, suppose that a walk, starting at vertex s and making random moves around the network until it finds itself at vertex t, arrives at vertex j at some moment.

The probability that it moves to i on the next step is given by:

$$T_{ij} = \frac{A_{ij}}{\sum_k A_{kj}} \quad \text{for } j \neq t$$

where A is the adjacency matrix of the network, in which $A_{ij}$ is 1 if there is an edge between vertices i and j. The formula can be rewritten in matrix form:

$$T = A \cdot D^{-1}$$

where D is a diagonal matrix with $D_{ii} = \sum_k A_{ki}$.

Since the walk stops whenever it arrives at t, $T_{it}$ and $T_{ti}$ should be zero for all i. Instead of setting them explicitly, the target vertex is handled by removing column t and row t from T, which does not affect transitions between any other vertices. The expression above becomes:

$$T_{/t} = A_{/t} \cdot D_{/t}^{-1}$$

where the subscript "/t" denotes the matrix obtained after removing column t and row t from the corresponding matrix.

For a walk from s, the probability that it is at vertex j after r steps is given by $[T_{/t}^r]_{js}$, and the probability that it then steps to a neighbouring vertex i is $[T_{/t}^r]_{js} / \sum_k A_{kj}$. Summing over r from 0 to $\infty$, the expected number of times the walk travels from j to i, averaged over all possible walks, is $[(I - T_{/t})^{-1}]_{js} / \sum_k A_{kj}$, which can be written as a vector:

$$V = D_{/t}^{-1} (I - T_{/t})^{-1} \cdot s = (D_{/t} - A_{/t})^{-1} \cdot s$$

where s is defined to be a vector such that

$$s_i = \begin{cases} 1 & \text{if } i = s \\ -1 & \text{if } i = t \\ 0 & \text{otherwise} \end{cases}$$

The net flow of the random walk along the edge from j to i is given by the absolute difference $|V_i - V_j|$, and the net flow through vertex i is half of the sum of the flows on its adjacent edges:

$$I_i = \begin{cases} \frac{1}{2} \sum_j A_{ij} \left| V_i - V_j \right| & \text{for } i \neq s, t \\ 1 & \text{otherwise} \end{cases}$$

Let $I_i^{st}$ denote the net flow through vertex i when taking s as the source vertex and t as the target vertex. Then the vertex betweenness of i is the average of the flow over all source-target pairs:

$$B_i = \frac{\sum_{s<t} I_i^{st}}{\frac{1}{2} n (n-1)}$$

where n is the total number of vertices in the network.
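A compact Java sketch of this computation is given below, solving $(D_{/t} - A_{/t}) V = s$ by naive Gaussian elimination for each source-target pair; it is written for a small illustrative graph and is not the project's optimized code.

public class RandomWalkBetweenness {

    /** Solves M x = b by Gaussian elimination with partial pivoting. */
    static double[] solve(double[][] m, double[] b) {
        int n = b.length;
        for (int col = 0; col < n; col++) {
            int piv = col;
            for (int r = col + 1; r < n; r++)
                if (Math.abs(m[r][col]) > Math.abs(m[piv][col])) piv = r;
            double[] tmp = m[col]; m[col] = m[piv]; m[piv] = tmp;
            double tb = b[col]; b[col] = b[piv]; b[piv] = tb;
            for (int r = col + 1; r < n; r++) {
                double f = m[r][col] / m[col][col];
                for (int c = col; c < n; c++) m[r][c] -= f * m[col][c];
                b[r] -= f * b[col];
            }
        }
        double[] x = new double[n];
        for (int r = n - 1; r >= 0; r--) {
            double s = b[r];
            for (int c = r + 1; c < n; c++) s -= m[r][c] * x[c];
            x[r] = s / m[r][r];
        }
        return x;
    }

    static double[] betweenness(int[][] adj) {
        int n = adj.length;
        double[] bet = new double[n];
        for (int s = 0; s < n; s++) for (int t = s + 1; t < n; t++) {
            // build (D - A) with row and column t removed
            double[][] lap = new double[n - 1][n - 1];
            double[] src = new double[n - 1];
            for (int i = 0, ii = 0; i < n; i++) {
                if (i == t) continue;
                for (int j = 0, jj = 0; j < n; j++) {
                    if (j == t) continue;
                    if (i == j) { int d = 0; for (int k = 0; k < n; k++) d += adj[i][k]; lap[ii][jj] = d; }
                    else lap[ii][jj] = -adj[i][j];
                    jj++;
                }
                if (i == s) src[ii] = 1;   // the -1 entry at t is removed with row t
                ii++;
            }
            double[] vRed = solve(lap, src);
            double[] v = new double[n];    // put V_t = 0 back in
            for (int i = 0, ii = 0; i < n; i++) v[i] = (i == t) ? 0 : vRed[ii++];
            for (int i = 0; i < n; i++) {
                if (i == s || i == t) { bet[i] += 1; continue; }   // I_i = 1 for i = s, t
                double flow = 0;
                for (int j = 0; j < n; j++) flow += adj[i][j] * Math.abs(v[i] - v[j]);
                bet[i] += 0.5 * flow;
            }
        }
        double pairs = n * (n - 1) / 2.0;
        for (int i = 0; i < n; i++) bet[i] /= pairs;   // average over all pairs
        return bet;
    }

    public static void main(String[] args) {
        int[][] adj = { {0,1,0,0}, {1,0,1,1}, {0,1,0,1}, {0,1,1,0} };  // toy graph
        for (double x : betweenness(adj)) System.out.printf("%.3f ", x);
    }
}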

As discussed above, the conditional probability P(j|i) is calculated as the change in j's betweenness when i is removed. With the vertex betweenness at hand, this conditional probability is:

$$P(j|i) = \alpha \left| B_j - B_{j/i} \right|$$

where $B_{j/i}$ denotes the betweenness of vertex j when i is removed. The definition of P(j|i) can be thought of as measuring the influence of i on j by removing vertex i and quantifying the change caused by the removal. The parameter α determines the influence of the extrapolated probabilities and is set in this project such that the maximum extrapolated probability equals 1. For the vertices in the molecule subset M, and for any vertex j ∉ M which is disconnected from i ∈ M, we define P(i|i) = P(j|i) = 1.

2.2.2 Scoring sub-network

With P(i) computed for all molecules, a set of molecules encoding a function and forming a sub-network can be scored according to the probability of these molecules being functionally relevant [Andrew 2005].

Let s(a, b) denote the probability that the connection from a to b is functionally relevant. The score s(ς) for a set of molecules ς can then be calculated as the average of s(a, b) over all possible pairs in ς.

In a protein interaction network, each path from a to b, $\lambda = a\, \lambda_1 \ldots \lambda_n\, b$, represents a possible chain of interactions through which a and b may influence each other. Let $P(\lambda)$ denote the probability that λ is functionally relevant, and $\Phi(\lambda)$ the probability that a interacts with b through path λ. Then s(a, b) can be defined as:

$$s(a, b) = \sum_{\lambda \in \Lambda_{ab}} \Phi(\lambda)\, P(\lambda)$$

where $\Lambda_{ab}$ denotes the set of paths between a and b in which b does not appear as an intermediate vertex. Under the random walk assumption, $\Phi(\lambda)$ is defined to be the proportion of λ among all possible random-walk paths:

$$\Phi(\lambda) = \left( d(\lambda_1) \cdots d(\lambda_n)\, d(a) \right)^{-1}$$

where $d(\lambda_m)$ is the degree of vertex $\lambda_m$.

Supposing that the choice of path at one vertex is independent of the rest of the path, $P(\lambda)$ can be defined as:

$$P(\lambda) = P(a)\, P(\lambda_1) \cdots P(\lambda_n)\, P(b)$$

Therefore, s(a, b) can also be written in matrix form by defining an auxiliary matrix W and two column vectors Q and R such that:

$$W_{ij} = A_{ij} \frac{P(j)}{d(j)}, \qquad Q_i = \begin{cases} 1 & \text{if } i = a \\ 0 & \text{if } i \neq a \end{cases}, \qquad R_i = P(b)\, A_{ib}$$

where $A_{ij}$ is the adjacency matrix defined in the last section. s(a, b) is then:

$$s(a, b) = \sum_{n=0}^{\infty} R_{/b}^{T} \left[ W_{/b}^{n} \right] Q_{/b} = R_{/b}^{T} \left( I - W_{/b} \right)^{-1} Q_{/b}$$

The "/b" denotes the matrix/vector after removing row and column b, which guarantees that b does not appear as an intermediate vertex in a path [Andrew 2005].

As discussed at the beginning of this section, the score s(ς) is defined as the average of s(a, b) over all possible vertex pairs:

$$s(\varsigma) = \frac{\sum_{a \neq b \in \varsigma} s(a, b)}{n_\varsigma (n_\varsigma - 1)}$$

where $n_\varsigma$ is the number of vertices in ς.

From this final expression it can be deduced that as $n_\varsigma$ decreases towards 1, the value of s(ς) grows without bound. The s(ς) scores therefore tend to decrease as $n_\varsigma$ increases, so the algorithm will obviously prefer sub-networks with fewer molecules, which is not what we want. In order to remove the influence of sub-network size, the original s(ς) is converted to a z-score [Andrew 2005]:

$$z(\varsigma) = \frac{s(\varsigma) - \mu_n}{\delta_n}$$

where $\mu_n$ and $\delta_n$ are the mean and standard deviation for sub-networks with n molecules, estimated from randomly sampled sub-networks, and n is the number of vertices in ς.

In particular, a random sub-network is generated by creating an ordered list l of all vertices in the network:

Initialization:
    Initialize l with a randomly chosen vertex v;
Loop:
    While not all vertices have been added to l {
        Create a random subset C of the neighbours of v;
        Append all vertices in C − l to l in a random order;
        Set v to be a random vertex in C;
    }
Return: the ordered list l.
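A direct Java rendering of this sampler, assuming a connected graph given as adjacency lists, might look like the following; the toy graph in main() is invented for illustration.

import java.util.*;

public class RandomSubnet {

    static List<Integer> randomOrder(List<List<Integer>> nbrs, Random rnd) {
        int n = nbrs.size();
        List<Integer> l = new ArrayList<>();
        Set<Integer> seen = new HashSet<>();
        int v = rnd.nextInt(n);
        l.add(v); seen.add(v);
        while (l.size() < n) {
            // draw a random non-empty subset C of the neighbours of v
            List<Integer> c = new ArrayList<>();
            while (c.isEmpty())
                for (int u : nbrs.get(v)) if (rnd.nextBoolean()) c.add(u);
            List<Integer> fresh = new ArrayList<>();
            for (int u : c) if (seen.add(u)) fresh.add(u);   // C - l
            Collections.shuffle(fresh, rnd);                 // append in random order
            l.addAll(fresh);
            v = c.get(rnd.nextInt(c.size()));                // move to a random vertex in C
        }
        return l;
    }

    public static void main(String[] args) {
        List<List<Integer>> nbrs = List.of(
            List.of(1, 2), List.of(0, 2, 3), List.of(0, 1, 3), List.of(1, 2));
        System.out.println(randomOrder(nbrs, new Random(7)));
    }
}

The first n elements of the returned list then form one random sub-network sample of size n.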

Given l, the first n elements form a random sub-network with n vertices, which has a score $s(\varsigma_n)$. The score mean $\mu_n$ and standard deviation $\delta_n$ can be estimated from a sample of such sub-networks with n vertices.

To achieve a stable estimation of $\mu_n$ and $\delta_n$, starting from a sample of 100 lists l, the number of samples was doubled until the variance of every $\mu_n$ and $\delta_n$ fell below 0.1%.

The variance is estimated by the jackknife method [Efron 1979]:

1. For each of the n sampled lists, remove list $l_i$ and calculate $\rho_{/i}$, where $\rho_{/i}$ is either $\mu_n$ or $\delta_n$ computed from the remaining lists after removing $l_i$.

2. Estimate the variance by:

$$\hat{\delta}^2 = \frac{n-1}{n} \sum_{i=1}^{n} \left( \rho_{/i} - \hat{\rho} \right)^2, \qquad \hat{\rho} = \frac{1}{n} \sum_{i=1}^{n} \rho_{/i}$$
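A small Java sketch of this estimator follows; for simplicity the statistic here is the sample mean, and the scores in main() are invented numbers.

public class Jackknife {

    static double jackknifeVariance(double[] sample) {
        int n = sample.length;
        double total = 0;
        for (double x : sample) total += x;
        double[] loo = new double[n];                    // leave-one-out statistics rho_{/i}
        for (int i = 0; i < n; i++) loo[i] = (total - sample[i]) / (n - 1);
        double mean = 0;
        for (double r : loo) mean += r / n;              // rho-hat
        double var = 0;
        for (double r : loo) var += (r - mean) * (r - mean);
        return var * (n - 1) / n;                        // (n-1)/n * sum of squares
    }

    public static void main(String[] args) {
        double[] scores = { 1.2, 0.9, 1.1, 1.4, 1.0 };
        System.out.println(jackknifeVariance(scores));
    }
}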

2.2.3 Functional component searching

Recall that z(ς) scores how likely the sub-network ς is to be functionally relevant. In order to find the most probable set of molecules active in a specific function, the sub-network with the highest z(ς) score is required. Since searching all possible sub-networks is an NP-complete problem, a heuristic algorithm is needed. [Andrew 2005] suggests a Metropolis-type algorithm [Newman 1999] to search for an assignment of active/inactive states for which the set of active vertices has the maximum z(ς) score. The algorithm is described as follows:

Initialization:
    Assign a random initial state (active or inactive) to each vertex;
    Set a "temperature" value T to $T_{max}$;
    Let $\varsigma_A$ be the set of vertices in the active state;
Loop:
    While T has not decreased to $T_{min}$ {
        Select a vertex v at random and flip its state, giving a new active set $\varsigma_A^*$;
        Calculate the score change $\delta z = z(\varsigma_A^*) - z(\varsigma_A)$;
        If $\delta z \geq 0$, accept the change at v; otherwise accept it with probability $e^{\delta z / T}$;
        $T = T \times (T_{min}/T_{max})^{1/n}$, where n is the total number of iterations;
    }
    While the last 100 iterations have increased the score by less than $10^{-5}$ {
        Flip the state of a randomly chosen vertex;
        Accept the change if $\delta z > 0$, or if $\delta z = 0$ but the number of active vertices increases;
    }
Return: $\varsigma_A$.
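A Java skeleton of the cooling phase of this search is sketched below. The score() stub is an assumption standing in for the component z-score, and the toy optimum it encodes is invented; the final greedy phase is omitted for brevity.

import java.util.Random;

public class AnnealSearch {

    static double score(boolean[] active) {         // placeholder for z(activeSet)
        int k = 0;
        for (boolean a : active) if (a) k++;
        return -(k - 5) * (k - 5);                  // toy optimum: exactly 5 active vertices
    }

    public static void main(String[] args) {
        Random rnd = new Random(1);
        int n = 20, iters = 2000;
        double tMax = 1 / Math.log(2), tMin = 0.001 / Math.log(2);  // T_1/2(1), T_1/2(0.001)
        double t = tMax, cool = Math.pow(tMin / tMax, 1.0 / iters); // geometric cooling
        boolean[] active = new boolean[n];
        for (int i = 0; i < n; i++) active[i] = rnd.nextBoolean();  // random initial state
        double z = score(active);
        while (t > tMin) {
            int v = rnd.nextInt(n);
            active[v] = !active[v];                 // flip one vertex
            double dz = score(active) - z;
            if (dz >= 0 || rnd.nextDouble() < Math.exp(dz / t)) z += dz;  // Metropolis rule
            else active[v] = !active[v];            // reject: flip back
            t *= cool;                              // T <- T * (Tmin/Tmax)^(1/n)
        }
        System.out.println("final score: " + z);
    }
}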

According to [Andrew 2005], the temperature at which a decrease in score $\delta z = -x$ has a 50% chance of being accepted is given by $T_{1/2}(x) = x / \ln 2$. Therefore, $T_{max}$ is set to $T_{1/2}(1)$ and $T_{min}$ to $T_{1/2}(0.001)$.

2.2.4 Functional component comparison

The functional component identification algorithm yields the set of molecules with the highest probability of being functionally relevant. It begins with a set of implicated molecules and extrapolates to others by maximizing the probability of the active molecule set. Starting from the same implicated molecules, different functionally relevant proteins will be predicted on different networks. Since the predicted molecules derive from the properties of the network, divergence in the prediction results is also an embodiment of the variance between these networks. Moreover, the predicted molecules can be regarded as the proteins on which future research money would be spent. Therefore, if an estimated network predicts roughly the same molecules as the reference network, it can be judged to perform well.

Specifically, for a function Func and a network Net, the algorithm in the previous sections gives an active vertex set $\varsigma_A(Func, Net)$. Therefore, it is fairly straightforward to compare the reference network $Net_{ref}$ with an estimated network Net by contrasting the active molecule sets $\varsigma_A(Func, Net_{ref})$ and $\varsigma_A(Func, Net)$, and calculating the probability that the active vertices predicted from Net are consistent with those predicted from $Net_{ref}$:

$$P_{Func} = \frac{\left| \varsigma_A(Func, Net_{ref}) \cap \varsigma_A(Func, Net) \right|}{\left| \varsigma_A(Func, Net_{ref}) \right|}$$
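This score is a plain set-intersection ratio; a trivial Java sketch follows, with invented PPIDs for illustration.

import java.util.HashSet;
import java.util.Set;

public class FuncCompare {
    public static void main(String[] args) {
        Set<String> ref = new HashSet<>(Set.of("A0001", "A0002", "A0018", "A0011"));
        Set<String> est = new HashSet<>(Set.of("A0002", "A0018", "A0030"));
        Set<String> shared = new HashSet<>(ref);
        shared.retainAll(est);                                     // intersection
        System.out.println((double) shared.size() / ref.size());  // prints 0.5
    }
}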

The detailed results of the functional component comparison algorithm will be reported in Chapter 4.


Chapter 3 Data retrieval

3.1 Data overview and constraint driven data retrieval

As stated in Chapter 1, the MASC database is used as the reference against which the molecule interaction networks from other databases are estimated. The databases to be estimated in this project are DIP, MINT and NetPro; an introduction to these databases can be found in Appendix B.

Faced with so many databases, the most straightforward way to retrieve data from them is to manually set up fixed SQL statements in the applications. However, these databases are only a small fraction of the existing molecule interaction network databases, so it is almost certain that further research will aggregate many more databases into the system. The static SQL solution is infeasible in such a situation because it would make the program intractable and inflexible.

Consequently, the author uses a uniform language as the "container" of the data. An obvious choice for such a language is XML. Normally, an XML document conforms to a schema, such as a DTD, which is used both for generating XML from the relational database and for parsing the data in applications. Nevertheless, given relational schemas, the generation of an XML schema is nontrivial. In this chapter, the author proposes a way of automatically generating a DTD from relational schemas together with a set of functional dependencies and inclusion dependencies. With the produced DTD, XML can be generated and parsed with existing techniques without much difficulty. The whole process can thus be described as constraint driven data retrieval.

A relational database is normally considered to consist of schemas and instances. Therefore, most previous research on relational database publishing has focused mainly on mapping schemas and instances from the relational database to an XML schema (usually a DTD) and XML documents. However, since the relational model cannot contain "set" values because of its atomic value property, integrity constraints are used to link individual schemas and thus play important roles in the design and maintenance of relational databases. These constraints form another aspect of the "semantics" of a relational database which needs to be propagated to the XML schema during database publishing.

The data retrieval technique proposed in this project seeks principles for XML schema design given existing relational schemas with a set of key constraints and foreign key constraints. It focuses mainly on the transformation from relational schemas to an XML schema, namely a DTD, as well as on the constraint propagation from the relational schemas to the generated XML schema. Both key constraints and foreign key constraints are dealt with because these two sorts of constraints are the most essential ones in relational schemas and thus attract the most interest from researchers.

The detailed constraint driven data retrieval algorithm is presented in Appendix A. The discussion begins with a formal definition of XML functional dependency (FD), which is the counterpart of these constraints in the XML domain. A key constraint propagation algorithm is given based on the definition of XML FD. A foreign key constraint propagation algorithm follows, accomplished by pre-mapping and post-mapping steps. Both key constraints and foreign key constraints are encoded with XML functional dependencies in the generated XML schema.

3.2 Data retrieval implementation

As mentioned at the beginning of this chapter, faced with various protein interaction databases, the problem in this project is how to retrieve the data from these databases into an XML document and how to parse the retrieved XML into semantic data used by the applications.

This is an easy question if there is a guiding DTD. In Appendix A, an algorithm is investigated for propagating key constraints and foreign key constraints from relational schemas to an XML schema. With the pre-mapping and post-mapping algorithms, a DTD with a set of functional dependencies can be automatically generated from a set of relational schemas and the corresponding constraints. The algorithm was tested, although partly manually, before the summer project on a real application: a mini Bayesian network workshop with complicated foreign key constraints.

Although the functional dependencies of the XML schema are the most valuable and original result of the proposed constraint propagation algorithm, in this project the DTD is our main target and is further used to generate and parse XML for the applications. There are indeed DTD-directed database publishing techniques using Attribute Translation Grammars (ATG), a way of publishing relational data as XML conforming to a predefined DTD [Michael 2002]. However, for simplicity, the data mapping from the relational databases to XML in this project is done by matching the attribute names in the databases against the element names in the generated DTD. The data records are inserted into the XML document according to the matched DTD names.
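A simplified Java/DOM sketch of this name-matching step is given below. In the real system the rows would come from a JDBC query against one of the databases; here the rows and the element names are assumptions for illustration only.

import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class XmlExport {
    public static void main(String[] args) throws Exception {
        String[] attrs = { "ppid", "name" };                  // element names from the DTD
        String[][] rows = { { "A0001", "NR1" }, { "A0002", "NR2A" } };

        Document doc = DocumentBuilderFactory.newInstance()
                                             .newDocumentBuilder().newDocument();
        Element root = doc.createElement("proteins");
        doc.appendChild(root);
        for (String[] row : rows) {
            Element protein = doc.createElement("protein");
            for (int i = 0; i < attrs.length; i++) {
                Element e = doc.createElement(attrs[i]);      // matched DTD element name
                e.appendChild(doc.createTextNode(row[i]));    // record value as text
                protein.appendChild(e);
            }
            root.appendChild(protein);
        }
        System.out.println(root.getChildNodes().getLength() + " records exported");
    }
}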

The program is implemented in Java using a DOM (Document Object Model) parser, with some manual effort to generate the DTD by propagating the constraints from the relational schemas to the XML schema. Fortunately, the relational design of a protein interaction network is relatively simple and amounts to no more than a many-to-many relationship, as discussed in Appendix A. Admittedly, simply matching attribute names limits the generality of the data retrieval, but this approach is easy to implement and has little complexity, which is essential for an efficient data retrieval mechanism.

Moreover, because different databases use divergent types of ID as the key of their molecule entries, it is necessary to construct an ID mapping between PPID, which is the key of proteins in MASC, and the kinds of IDs used in the estimated databases. In particular, NetPro uses LocusLink, MINT uses SwissProt and DIP uses GenBank accession numbers as their IDs. The ID mapping was mined from various databases including the Ensembl EnsMart Genome Browser (http://www.ensembl.org/Multi/martview), UniProtKB/Swiss-Prot (http://www.ebi.ac.uk/swissprot/) and GenBank (http://www.ncbi.nlm.nih.gov/Genbank/GenbankSearch.html). The results of the ID mapping are presented in Appendix C.

As noted above, aside from the DTD, the set of functional dependencies is undoubtedly another important output of the constraint propagation algorithm. In a more general setting of querying and storing XML, a common scenario is: given a specific DTD and a set of relational schemas, retrieve the data in the relational database into an XML document which conforms to the DTD and preserves the key constraints and foreign key constraints of the relational schemas. In this case, the algorithm in this chapter can serve as a tool for generating an "internal" XML schema. The data as well as the constraints in the relational schemas are propagated only to this internal XML schema rather than to the predefined one. The two schemas are then matched by a schema matching [Tova 1998] or object fusion [Yannis 1996] algorithm, and the final propagation is finished by mapping the internal XML document to the XML conforming to the predefined schema. This is a nontrivial problem for further research and is outside the scope of this dissertation.


Chapter 4 Result presentation and discussion

This project is implemented in the Java programming language. In this chapter, the author presents the results of experiments on the networks retrieved from the various databases, followed by a discussion of the merits and weaknesses of each algorithm.

4.1 Cluster comparison

With the network estimation algorithms and the data retrieved from the various databases in place, it is time to experiment on real data and evaluate the results. In this section, the author first demonstrates the estimation results for the single networks queried from NetPro, MINT and DIP, compared with the reference MASC network. Furthermore, to quantify the "meaning" of the evaluation results, the cluster comparison is tested on the MASC network against a noisy version of itself produced by adding a controlled amount of random mutation. This result can then be used to interpret the estimation scores.

4.1.1 Single database estimation

The protein interaction network queried from the MASC database consists of 96 proteins connected by 222 interactions. The network automatically generated by the application from the queried data is shown in Figure 4-1.

As discussed in previous chapters, this network is used as the reference to evaluate the networks queried from the other databases. Before presenting the results of the cluster comparison, the networks to be estimated are given in Figures 4-2 and 4-3. Since nothing could be queried from the DIP database, no network is presented for it in the following discussion.

The performance of NetPro and MINT diverges at first glance. NetPro obviously contains more molecules as well as more interactions than MINT and thus outperforms both MINT and DIP. Because this project aims to quantify this kind of "goodness", the cluster comparison gives rise to a score for each estimated network.

Figure 4-1: The molecule interaction network queried from the MASC database. It contains 96 proteins connected by 222 interactions. Vertices that are highly connected to others are placed relatively near the center in order to keep the network tidy.

Figure 4-2: The interaction network queried from NetPro database. 56 proteins and 94 interactions are involved.


Figure 4-3: The molecular interaction network from the MINT database. It consists of 22 proteins with 16 interactions.

By running the cluster comparison algorithm, the cluster structures of these networks are first discovered by gradually dividing the networks and searching for the largest modularity values. Figure 4-4 shows the change of the modularity value as the networks are divided. As stated in Chapter 2, there is only one peak value when moving down the dendrogram. The difference is that the peak values of these networks do not come as early as the ones in [Newman 2004], which indicates that the cluster structures in these protein networks are not as obvious.

Figure 4-4: The modularity values of the queried networks in MASC, NetPro and MINT when moving down the dendrograms respectively. Each curve contains only one peak value indicating the most obvious cluster structure in each network.

The cluster structures discovered in these networks are shown in Figure 4-5 to Figure

4-7.


Figure 4-5: The cluster structure in MASC interaction network. There are 18 discovered clusters in which two of them dominate most of the proteins and interactions.

Figure 4-6: The cluster structure in NetPro database. There are 9 discovered clusters in which, again, two of them dominate most of the molecules as well as interactions.


Figure 4-7: The cluster structure in the MINT database. It consists of 7 clusters but performs obviously worse than NetPro.

With these cluster structures, the networks can be compared by calculating the probability that the molecules are correctly clustered, as discussed in Chapter 2. The result is shown in the table below:

Database estimation                        NetPro    MINT    DIP
Number of shared molecules with MASC         51       18      0
Number of correctly clustered molecules      23        6      0
Probability of correct clustering          24.0%     6.3%     0%

As presented in Chapter 2, the probability of correctly classifying the molecules in the MASC database using networks from other databases is calculated as the number of correctly clustered proteins divided by the number of molecules in the MASC database, which is 96 here.

According to the cluster comparison scores listed in the table above, the NetPro database clearly outperforms MINT and DIP, which is within our expectation.

4.1.2 Robustness of cluster comparison algorithm

Normally, the networks constructed from databases contain some noise, and the network comparison should be robust to this noise to some extent. This means that, even if a network differs slightly from the original MASC network, the algorithm should "ignore" the noise and judge them to be the same network. To investigate the robustness of the cluster comparison algorithm, the author compares the MASC network with a noisy version of the same network obtained by mutating the interactions.

In particular, noise is added to the network by connecting or disconnecting each pair of molecules in the MASC network with probability $P_{noise}$, depending on whether the original pair is disconnected or connected. The noisy network is then estimated with respect to the original MASC network by running the cluster comparison algorithm. The score of the noisy MASC network compared with the original, as the noise increases from 0 to 90%, is given in Figure 4-8. Each comparison score is averaged over 30 experiments with the same amount of noise.
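The perturbation itself is simple to express; a small Java sketch of this mutation, assuming the network is stored as an undirected adjacency matrix, is:

import java.util.Random;

public class PerturbNetwork {

    /** Flips each vertex pair (connect/disconnect) with probability pNoise. */
    static int[][] perturb(int[][] adj, double pNoise, Random rnd) {
        int n = adj.length;
        int[][] noisy = new int[n][n];
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++) {
                noisy[i][j] = rnd.nextDouble() < pNoise ? 1 - adj[i][j] : adj[i][j];
                noisy[j][i] = noisy[i][j];       // keep the network undirected
            }
        return noisy;
    }

    public static void main(String[] args) {
        int[][] adj = { {0,1,0}, {1,0,1}, {0,1,0} };   // toy network
        int[][] noisy = perturb(adj, 0.1, new Random(3));
        for (int[] row : noisy) System.out.println(java.util.Arrays.toString(row));
    }
}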

Because the noisy network contains exactly the same molecules as the original MASC network, the maximum comparison score is 1, indicating that the two are identical when there is no mutation.

Furthermore, it is straightforward to read from Figure 4-8 that the cluster comparison score for the perturbed MASC network decreases exponentially with increasing noise and finally converges to a low level of about 13%. The exponential decrease indicates that the cluster comparison algorithm is not particularly robust to noise within a small range near 0, because a slight mutation can result in a large change in the comparison score (a decrease from 1 to 60%). Fortunately, for a network constructed from another database, the percentage of noise normally lies outside this sensitive region (less than 1.5% noise).

Moreover, the comparison score converges to a non-zero value as the noise increases. This indicates that the cluster comparison algorithm cannot discriminate between differential networks once the score falls below a boundary of around 16%.

For the network from NetPro, the maximum comparison score is 51/96 = 0.53125, so the comparison score of 24% can be rescaled to 45%, which means that NetPro correctly clusters 45% of the 51 shared proteins. On the other hand, the cluster comparison score of MINT, 6.3%, can be rescaled to 33% according to its maximum comparison score of 18/96 = 18.8%. Therefore, both the original comparison scores and the rescaled scores, which lie within an appropriate region (neither too sensitive nor too dull), indicate that NetPro clearly outperforms MINT.

Figure 4-8: The cluster comparison score of the noisy MASC network over the percentage of noise. There are three regions as the amount of noise increases. When noise < 1.5% and the score is more than 60%, the comparison algorithm is over-sensitive and cannot be used as a robust estimator. When 1.5% < noise < 30% and the cluster comparison score is between 16% and 60%, the algorithm shows adequate ability and robustness to discriminate the noisy network, so it can be used as a good indicator of network divergence. When noise > 30% and the comparison score is smaller than 16%, there is little change in the score regardless of the noise percentage, so the score tells little about how good the network is. Note that the maximum score is 1 because the noisy network contains all vertices of the original MASC network.

4.1.3 Database collaboration

Having seen that NetPro is good at reconstructing the molecule interaction network, judged by the comparison with the MASC database, we expect that a collaboration of databases will perform even better. To explore this, the author combines NetPro and MINT into a new network and calculates its comparison score.

In this situation, there are 52 proteins shared between the combined network and the MASC network. 27 correctly clustered molecules give rise to a better comparison score of 27/96 = 28%. As before, this score can be rescaled to 52% according to the maximum cluster comparison score of this network, 52/96 = 54%.

Although MINT on its own performs poorly at reconstructing the MASC network, the combination of NetPro and MINT still shows a slight enhancement of the comparison score, which supports the idea that collaboration between databases is one possible way to construct better molecule interaction networks.

4.2 Functional component comparison

So far, the cluster comparison results have been presented and have shown reasonable performance in discriminating differential networks from various databases. As stated in the last part, the cluster comparison is only discriminative enough when the rescaled score is neither too large nor too small, since the algorithm is over-sensitive when the score is large and over-dull when it is small. Moreover, although the cluster comparison results demonstrate a certain ability to estimate different networks, there is no "correct" answer against which to verify these results. Another method is therefore tried in order to validate the previous algorithm.

The following sections present the results of the functional component comparison.

4.2.1 Component score and z-score

As mentioned in Chapter 2, the component score is calculated from the probability of each individual molecule being functionally relevant, which is computed using the topological properties (random walks) of the interaction network. Figure 4-9 depicts the histogram of this probability over the molecules (spatial learning is used as the function, as presented shortly). It is obvious that only a small fraction of the proteins have a high probability of being involved in the function, which is consistent with the biological picture that a cellular function is normally performed by a handful of proteins rather than by all of them.

According to the algorithm in Chapter 2, the functional component is found by heuristically climbing to a maximum s(ς) score, which is the probability of component ς being functionally relevant. Recall that the score s(ς) tends to decrease as the number of molecules in the component increases and thus favours components with fewer proteins. Figure 4-10 plots the component score against the number of molecules in the component. Each score is calculated as the mean of 800 randomly sampled sub-networks, as described in Chapter 2.

Figure 4-9: The histogram of the probability of being functionally relevant. Only a small fraction of the molecules have a large probability (in the right part of the figure), which is consistent with biological significance.

Since we have no prior knowledge about the size of the functional component, this preference for smaller components would mislead the functional component search.

Figure 4-10: The component score over the number of molecules in the component. Each score is calculated as the mean of 800 samples, and the component must contain at least 2 proteins or the score diverges to infinity.

Therefore, the algorithm uses the z-score to correct this problem. In particular, the mean and standard deviation used to calculate the z-score are required to have a low variance, as computed by the jackknife estimator. Figure 4-11 depicts the variance of the mean and standard deviation of the component score for different sample capacities.

Figure 4-11: Jackknife variance of component score mean (left) and standard deviation (right) over the number of molecules in component. The variance values are plotted against different sample capacity 800, 1600 and 3200.

It can be seen that, regardless of the sample capacity, the variance decreases exponentially as the number of molecules in the component increases. This is because, when there are only a few proteins in a component, the randomly sampled components vary greatly from one to another; as the number of proteins grows, the sampled components contain more overlap, which reduces the variance.

Moreover, as the sample capacity increases, the variance decreases, which is indicated by the curves moving towards the axis in Figure 4-11. This statistical fact allows the variance to be driven towards zero by increasing the sample capacity, which provides the foundation for finding stable means and standard deviations subject to a small variance.

Therefore, with a sufficient number of component samples (about 3200 or 6400 in this project), means and standard deviations with small variance can be computed. These values are then used to normalize the component score s(ς) into the z-score z(ς), which has little bias with respect to component size.


4.2.2 Single database comparison

The algorithm for finding the active vertex set with the maximum z-score is applied to the network from the MASC database as well as to the other estimated databases, NetPro and MINT.

Take the MASC network as an example. Figure 4-12 shows a typical trace of the component z-score during the search. As expected, the score keeps increasing until it reaches a local maximum.

Figure 4-12: The change of component z-score when searching for the local maximum component z-score.

The proteins implicated in schizophrenia and spatial learning in MASC have been identified by manual curation of the published literature. Therefore, with the same implicated proteins, we compare the prediction results obtained using the networks from the different databases.

For spatial learning, the implicated proteins are A0018 and A0011. The prediction using the MASC database is given in the table below. This prediction shares 13 proteins with the result reported by [Andrew 2005], which contains 20 predicted molecules; the difference is caused by using a different version of the MASC network.

PPID   A0001     A0011     A0018     A0095     A0200
Name   NR1       CAMK2A    Src       Dynamin   H-Ras
PPID   A0002     A0012     A0024     A0126     A0262
Name   NR2A      PLCg-1    SynGAP    Grb-2     NOS1
PPID   A0003     A0016     A0091     A0138     A0419
Name   NR2B      Sap102    Actin     RAF1      Myoxin
PPID   A0007     A0017     A0143     A0196     A1851
Name   ACTN      Tubulin   AKAP 150  RasGAP    CaMKII Beta
PPID   A0008     A0010     A0144
Name   Calmodulin Spectrin PKCepsilon

On the other hand, the proteins predicted using the network from the NetPro database are shown in the table below:

PPID   A0002     A0003     A0018     A0020     A0030
Name   NR2A      NR2B      Src       PTP1D     FAK2
PPID   A0095     A0126     A0138     A0181     A0265
Name   Dynamin   Grb-2     RAF1      Erk2      Erk1
PPID   A0268     A0434     A1851
Name   Rsk-2     EGF-164   CaMKII Beta

These data show that NetPro correctly recovers 7 of the functionally relevant proteins within the 51 proteins it shares with the MASC database. Therefore, according to the comparison method, the probability that the NetPro network correctly predicts the functionally relevant molecules for spatial learning is:

$$P_{NetPro}^{Spatial\ learning} = 7/23 = 30.4\%$$

In contrast, the molecules predicted using the network from the MINT database are shown in the table below:

PPID   A0002     A0011     A0012     A0018     A00124
Name   NR2A      CAMK2A    PLCg-1    Src       Src

It can be seen that there are 4 correctly predicted molecules using the MINT network, which shares 18 proteins with the MASC database. Therefore, $P_{MINT}^{Spatial\ learning} = 4/23 = 17.4\%$.


Consequently, according to the proteins predicted for the spatial learning function using the networks from the different databases, NetPro again outperforms MINT.

To verify this, the author uses schizophrenia as another function and calculates $P_{NetPro}^{Schizophrenia}$ and $P_{MINT}^{Schizophrenia}$. For schizophrenia, the implicated proteins are A0010, A0016 and A0107. The prediction result using the network from MASC is given in the table below:

PPID   A0002     A0003     A0008      A0009     A0010
Name   NR2A      NR2B      Calmodulin PRKA9     Spectrin
PPID   A0011     A0012     A0014      A0015     A0016
Name   CAMK2A    PLCg-1    Chapsyn-110 DLG1     Sap102
PPID   A0017     A0018     A0020      A0033     A0075
Name   Tubulin   Src       PTP1D      Rap2      Shank
PPID   A0081     A0091     A0095      A0107     A0123
Name   Shank2    Actin     Dynamin    GKAP      PP2B
PPID   A0138     A0143     A0144      A0266     A0292
Name   RAF1      AKAP 150  PKCepsilon MEK1      INA
PPID   A0333     A1851
Name   SNAP25    CaMKII Beta

To estimate the NetPro database, the predicted proteins are calculated using the same implicated proteins and the network from NetPro:

PPID   A0001     A0003     A0012     A0013     A0014
Name   NR1       NR2B      PLCg-1    DLG4      Chapsyn-110
PPID   A0015     A0016     A0030     A0086     A0095
Name   DLG1      Sap102    FAK2      ZO-1      Dynamin
PPID   A0107     A0112     A0138     A0144     A0168
Name   GKAP      b-catenin RAF1      PKCepsilon PKCbeta
PPID   A0266     A0268
Name   MEK1      Rsk-2

It can be seen that there are 10 correctly predicted proteins using the network from NetPro. Therefore, according to the comparison score, the probability that the NetPro network correctly predicts the functionally relevant molecules for schizophrenia is $P_{NetPro}^{Schizophrenia} = 10/27 = 37.0\%$, where 27 is the number of molecules predicted from the MASC network.

The same computation on the MINT database yields the following proteins:

PPID   A0002     A0003     A0011     A0013     A0107
Name   NR2A      NR2B      CAMK2A    DLG4      GKAP

There are 4 correctly predicted molecules using the MINT network for schizophrenia, giving $P_{MINT}^{Schizophrenia} = 4/27 = 14.8\%$.

So far, the functional component comparison scores for both functions show that NetPro performs better than MINT. As with the cluster comparison, we aim to "interpret" these scores and to see how discriminative they are between differential networks, by controlling the difference and observing the tendency of the scores.

4.2.3 Robustness of functional component comparison

The robustness of the functional component comparison is examined with the same methodology as the cluster comparison. A noisy version of the MASC network is generated by adding or deleting edges with a specified probability, and the score of this network is then calculated with respect to the original MASC network. Figure 4-13 shows the scores of the mutated networks against the percentage of noise. Each score is calculated over 60 networks with the same proportion of noise.

It can be read from Figure 4-13 that the functional comparison score tends to decrease as the noise, which stands for the controlled amount of "difference" in the network, increases. For both functions there is an abnormal score region around 0.7, probably due to the effect of local maxima. However, the comparison scores change smoothly elsewhere, which means that the functional component comparison score is appropriately discriminative between differential networks.

Figure 4-13: The functional component comparison score for the mutated network over the proportion of noise. The smoothly changing score (except in the region around 0.7) shows good discriminative ability between different networks.

4.2.4 Database collaboration

The cluster comparison method showed a slightly enhanced performance for the network generated from the combined NetPro and MINT databases. In this section, the database collaboration is tested again under the criterion of functional component comparison.

Briefly, for the spatial learning function, the combined network correctly predicts 9 proteins from the same implicated molecules, giving an increased comparison score of 9/23 = 39.1%. The prediction accuracy of the mixed network for schizophrenia is 10/27 = 37.0%, unchanged compared with NetPro alone.

Again, database combination seems to have a positive effect on the generated network. However, since MINT performs rather poorly at reconstructing the MASC network, adding it is much like adding a drop of water to a river. Therefore, because of the limited number of available databases, we cannot confidently judge the influence of database collaboration until experiments on more databases have been done.


4.3 Discussion

The results of the previous sections show that cluster comparison and functional component comparison reach a consistent verdict on the networks from the different databases, which indicates that the two methods have some resolving power over differential networks. However, as reported earlier, each algorithm has its own weaknesses.

The cluster comparison only works well within a relatively narrow score region; outside it, the method is either over-sensitive or over-dull to changes in the network. This is a direct consequence of the hierarchical clustering algorithm. Since divisive clustering continuously knocks out the edges with the highest betweenness score until the network is divided into individual vertices, a slight mistake at a high level of the dendrogram results in a tremendous change in the outcome, and noise within the network is exactly such a source of mistakes. As can be seen in Figure 4-8, less than 1% noise can cause the cluster comparison score to drop from 1 to 70% or even lower.

As the perturbation increases, the "noise" changes qualitatively into "errors" in the network. At this point the cluster comparison algorithm is quite capable of discriminating these errors through the estimation score, as indicated by the smoothly decreasing score in the middle part of Figure 4-8. Moreover, no matter how heavily the network is perturbed, it will eventually be divided into clusters which are then matched to those of the reference network. As long as the estimated network and the reference network share a certain number of vertices, the matched cluster structures will produce a non-zero score; this is why the cluster comparison score converges to a non-zero value in the end. Therefore, when the noise is too high, it is still not appropriate to use the cluster comparison algorithm, because the score is no longer discriminative enough to quantify the differences between the networks.

It can thus be concluded that the cluster comparison algorithm should only be used to estimate a network which is neither too different from nor too similar to the reference network. This rule can be governed by the scale of the cluster comparison score: from the observation of Figure 4-8 as well as the research during the project, a rescaled cluster comparison score between 0.2 and 0.6 is the most appropriate region in which to use cluster comparison as the estimator of differential networks.

As for the functional component comparison method, it is better at network discrimination because of its smoothly decreasing comparison score as the noise increases. However, a problem arises when there are only a few predicted molecules in the reference network. Taken to the extreme, if there are only 2 predicted molecules in the reference network and 1 in the estimated network, the score will be 50%, even though this score carries little information about the "goodness" of the estimated network.

Therefore, a restriction on the functional component comparison is that the number of molecules predicted from the reference network cannot be too low. To quantify this, we set a lower bound of 15 on the number of predicted proteins required in the reference network.

Moreover, the identification of the functional component uses a heuristic method to search for the set of molecules with a locally maximal z-score. As is well known, such an algorithm suffers from the problem of local maxima and requires multiple runs to find a better result, which makes it computationally demanding.

All in all, both the cluster comparison and the functional component comparison algorithms have restrictions that must be respected for them to work appropriately. They therefore naturally complement each other and can be used together to estimate a network. A promising and systematic way of combining these two algorithms should be further investigated and will be the author's future work in this area.


Chapter 5 Conclusion and future work

5.1 Conclusion

After the long chase through the detailed discussion of the algorithms and the presentation of the analysis results, the differential protein interaction networks have been constructed and contrasted based on both topological properties and biological significance.

Cluster comparison considers the molecule interaction network to consist of clusters, the important components that dominate the properties of a network and thus stand for the differences between divergent networks. The cluster structure identification algorithm looks for the edges in the network that are most "between" vertices, and continuously knocks out these edges until only individual vertices remain in the clusters. In this project, the author put forward the concept of a concise graph, a more compact expression of the original network without loss of information, which is especially useful for networks with the scale-free property. The calculation of shortest path betweenness can be accelerated by message propagation and back propagation on the concise graph, together with efficient set operations such as hash sets and hash maps.

The process of dividing the network by removing the edges with the largest betweenness can be expressed as a dendrogram, from which a cluster structure is obtained by cutting the tree at some level. The modularity value of the cluster structure is used to decide where the dendrogram should be cut.
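As an illustration, the divisive procedure can be sketched in a few lines of Python. This is a minimal sketch assuming the networkx library and a graph with at least one edge; the project's own implementation instead uses the concise graph and message propagation described above:

import networkx as nx
from networkx.algorithms.community import modularity

def divisive_clustering(G):
    # Repeatedly remove the edge with the largest shortest-path betweenness
    # and keep the partition (dendrogram cut) with the highest modularity.
    H = G.copy()
    best_partition = [set(c) for c in nx.connected_components(H)]
    best_Q = modularity(G, best_partition)
    while H.number_of_edges() > 0:
        betweenness = nx.edge_betweenness_centrality(H)
        H.remove_edge(*max(betweenness, key=betweenness.get))
        partition = [set(c) for c in nx.connected_components(H)]
        Q = modularity(G, partition)
        if Q > best_Q:
            best_Q, best_partition = Q, partition
    return best_partition, best_Q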

Using the cluster structures, and taking the MASC database as the reference network, the differential networks are evaluated by the probability of correctly clustering the molecules of the MASC network (comparison score) or the molecules shared by MASC and the estimated network (rescaled comparison score). The comparison provides an evaluation of the networks from the various databases and demonstrates that the network from NetPro outperforms those from the MINT and DIP databases.

To understand the meaning of the cluster comparison outcome, the robustness of the algorithm is tested by contrasting a noisy MASC network with the original one. The result shows that cluster comparison is only discriminative enough when the rescaled comparison score lies within the range 0.2-0.6.

Because of the limitations of the cluster comparison method, as well as the inconsistency between shortest-path betweenness and biological mechanisms, functional component comparison is adopted to verify the conclusions from cluster comparison and to investigate a comparison method based on biological significance.

The functional component identification algorithm assigns to each molecule in the network a probability of being functionally relevant. The probabilities of proteins implicated by annotation are set to 1 and are then used to calculate the marginal probabilities of the extrapolated proteins based on the topological properties of the network.

Given the probability of each molecule, a search algorithm is applied to find the component most likely to be functionally relevant. It uses a Metropolis-type algorithm, a heuristic method that searches for a local maximum of the z-score. Although computationally intensive, several runs of the search algorithm achieve reasonably robust comparison results.

The algorithm is carried out on two synaptic functions. Although rooted in biological significance, functional component comparison gives the same evaluation of NetPro, MINT and DIP as the cluster comparison based on topological properties.

Moreover, both cluster comparison and functional component comparison indicate that combining different databases may enhance the performance of the merged network. Nevertheless, more databases are required to verify this statement, because the poor performance of MINT and DIP does not provide sufficient evidence of improvement.

Aside from the two comparison algorithms based on different perspectives of the protein interaction network, the author also presents a constraint-driven data retrieval technique. The algorithm utilizes the functional dependencies (key constraints) and inclusion dependencies (foreign key constraints) of the relational schemas to construct an XML schema, a DTD in this project, together with a set of functional dependencies in the XML domain. The resulting DTD is used to guide data retrieval from the relational database as well as the parsing and validation of the retrieved XML documents.

5.2 Future work

Following the results of this project, considerable further investigation is required.

First, more databases should be investigated in order to verify the capability of the two comparison algorithms on differential networks, especially with respect to robustness, discriminative strength and computational efficiency.

It can be seen that little has been presented here about the interpretation of the cluster structures or the functional component identification results. For example, from the results of functional component identification, there is significant overlap in prediction between different phenotypes, and the likelihood of a protein having multiple predicted phenotypes increases with its degree in the network [Andrew 2005]. Although these phenomena are beyond the scope of this project, they deserve deeper research based on more abundant data.

Moreover, the only reference database in this project is the MASC database. Evaluating other databases against this sole network may give rise to bias: databases that perform well on the MASC network are not necessarily good at constructing other networks. Therefore, when comparing the networks, more constraints such as protein domain, species, interaction type, etc. should be considered in order to narrow down the scope of the comparison results.

As to the data retrieval technique, constraint propagation from relational schemas to XML schemas is currently an active research area with no uniform standard. Therefore, the algorithm in Chapter 3 should be tested with much more care and against more complicated scenarios. Besides, as mentioned in Chapter 3, the generated XML schema with its set of functional dependencies can be used as an internal schema during database publishing, by matching a given schema against this internal schema. This idea is valuable for constraint-aware database publishing, which is a demanding but underdeveloped area.


In all, this project yields a number of cogent results that can contribute to current research on the MASC network. The two differential protein interaction network comparison algorithms, together with the constraint-driven data retrieval mechanism, provide a fundamental starting point for further research to which the author will continue to devote himself.


Appendix A Constraint Driven Data Retrieval

The discussion begins with a formal definition of XML functional dependency (FD), the counterpart of relational constraints in the XML domain. A key constraint propagation algorithm is given based on the definition of XML FD. A foreign key constraint propagation algorithm is then presented, accomplished in pre-mapping and post-mapping steps. Both key constraints and foreign key constraints are encoded as XML functional dependencies in the generated XML schema.

A.1 XML functional dependency and key constraint propagation

To propagate constraints from relational schemas to XML, the counterparts of these constraints in XML first have to be set up in order to describe their semantics in the XML domain. This section mainly focuses on a definition of XML functional dependency based on previous work. The algorithm for key constraint propagation is given at the end of the section.

A.1.1 Formalism of DTD and XML

Expressing a DTD as a context-free grammar, which has a more solid theoretical foundation, has been a popular approach. In this section, however, an extended formal definition of DTD [Wenfei 2001] will be used as the essential concept underlying the definition of functional dependency.

In the first place, for convenience of explanation, the author lists some reserved

notation that will be widely used:

ELE is a set of all element names in DTD;

ATT is a set of attribute names in DTD. Each element in ATT starts with “@”;

VAL is a set of string values in attributes and elements;

ENID is a set of element identifiers which uniquely identify an element in an XML

document.


Definition: A DTD is defined as a quintuple D<E, A, TPM, BE, root> where

(1) E ⊆ ELE is a set of element names;

(2) A ⊆ ATT is a set of attributes;

(3) TPM is a type mapping from E to element type definitions, each of which is a string value or a regular expression. For each e ∈ E, either TPM(e) = string (i.e. TPM(e) ∈ VAL), or TPM(e) is a regular expression r given by r → e′ | r|r | r+r | r* | ε, where e′ ∈ E, "|" means union, "+" means string connection, "*" is the Kleene closure and ε denotes the empty element;

(4) BE is a mapping from E to A which indicates the affiliation between elements and attributes. Suppose @a ∈ A; we say that @a is defined on e ∈ E iff @a ∈ BE(e);

(5) root ∈ E is the root of the DTD.

For example, according to the definition above, the DTD segment

<!ELEMENT Course_root (Course*)>
<!ELEMENT Course (Name, Lecturer)>
<!ATTLIST Course cid CDATA #REQUIRED>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Lecturer (#PCDATA)>

can be rewritten under the formal definition of DTD as:

(1) E = {Course_root, Course, Name, Lecturer};

(2) A = {@cid};

(3) TPM(Course_root) = Course*; TPM(Course) = Name+Lecturer; TPM(Name) = string; TPM(Lecturer) = string;

(4) BE(Course) = {@cid};

(5) root = Course_root.

Based on the definition of DTD, a DTD path is used to retrieve the elements and attributes in a DTD. The DTD path is defined as below:

Definition: A DTD path for a given DTD D<E, A, TPM, BE, root> is simply a string s = s_0 s_1 ... s_n, denoted s > D, iff:

(1) s_0 = root;

(2) for 0 < i < n, s_i ∈ E and s_i is included in the regular expression sequence of TPM(s_{i-1});

(3) s_n ∈ E ∪ A ∪ string is either included in the sequence of TPM(s_{n-1}) or s_n = @att, where @att ∈ BE(s_{n-1}); the latter case indicates that s_n is an attribute defined on s_{n-1}.

In the following discussion, D* = {s | s > D} denotes the set of all paths in DTD D, while D− = {s = s_0 s_1 ... s_n | s ∈ D* ∧ s_n ∈ E} denotes the set of paths ending with elements. It follows that D* − D− is the set of paths ending with attributes or string values.

In the example above, D* = {Course_root, Course_root+Course, Course_root+Course+Name, Course_root+Course+Name.string, Course_root+Course+Lecturer, Course_root+Course+Lecturer.string, Course_root+Course+@cid}, while D− = {Course_root, Course_root+Course, Course_root+Course+Name, Course_root+Course+Lecturer}, where "+" means connection of strings.

So far, the XML schema has been defined and can be accessed by DTD paths. As an instance of an XML schema, an XML document can also be modeled as a tree-structured graph [Peter 2001].

Definition: An XML tree is defined as XT<N, ENIDM, ELEM, ATTM, xroot> where

(1) N ⊆ ENID is a set of nodes in XT; each element of N is a unique identifier of a node in the XML;

(2) ENIDM is a mapping from N to ELE, which associates the node identifiers in the XML with specific elements in the DTD;

(3) ELEM is a mapping from N to (N ∪ VAL)*, and is the counterpart of TPM in the definition of DTD;

(4) ATTM maps a node in N and an attribute to a specific value in VAL, denoted ATTM: N × ATT → VAL;

(5) xroot ∈ N is the root of XT.

In this definition, N is the whole set of elements in an XML document. For nodes n, n′ ∈ N, if n′ ∈ ELEM(n), we say that n is the parent of n′, or n′ is the child of n, in this XML.


The definition above only defines the structure of an XML document. It does not make explicit the relationship between an XML document and its corresponding DTD. An XML document conforming to a DTD could be defined directly [Wenfei 2001] without further machinery, but this is delayed until the next section, after the definition of the DTD realization: the DTD cast function.

A.1.2 DTD cast and XML functional dependency

Now, the DTD and XML have been formally defined separately. The problem is how an XML document can be generated from a DTD so that it conforms to that DTD. This is the same problem as realizing a class as an object. With this idea, the generation of XML from a given DTD can be done by a "cast" function. There has been some earlier work on this, such as [Marelo 2002], but most of these contributions are limited to DTDs generated from a single relational schema (table). The concept is generalized here to multi-schema situations by the foreign key constraint propagation in the next section.

Definition: A cast function cas of a DTD D<E, A, TPM, BE, root> is defined as a mapping from D* to ENID ∪ VAL ∪ null such that:

(1) if s ∈ D−, cas(s) ∈ ENID ∪ null;

(2) if s ∈ D* − D−, cas(s) ∈ VAL ∪ null;

(3) for s1, s2 with cas(s1), cas(s2) ∈ ENID, if cas(s1) = cas(s2) then s1 = s2; this is the same semantics as "node equality" in [Peter 2001];

(4) for s = s_0 s_1 ... s_n, if cas(s_0 ... s_{i-1}) = null then cas(s_0 ... s_i) = null, where 1 ≤ i ≤ n.

The set of all possible cast functions on DTD D is denoted by CAS(D) = {cas | ∃s (s ∈ D* ∧ cas(s) ≠ null)}.

It can be seen that each non-null cas rule in CAS(D) maps (realizes) a DTD path to an element identifier or a string value in the XML. As will be seen, the whole set of cast rules is therefore a realization of a DTD as an XML tree. Moreover, the cast function sets up the relationship between XML and DTD, which were defined separately in the previous sections.


Definition: A path s = s_0 s_1 ... s_n is legal for a DTD D<E, A, TPM, BE, root> iff cas(s) ∈ CAS(D).

According to the definition of CAS(D), cas(s_0 s_1 ... s_n) ∈ CAS(D) means that for 0 ≤ i ≤ n, cas(s_0 s_1 ... s_i) ≠ null.

From now on, CAS(D) implicitly refers to the set of cast functions on legal DTD paths of D. Moreover, the following discussion assumes that all paths used are legal.

As noted above, there has been previous work on the formalization of XML, but most of it paid attention to the data or content of XML rather than the semantics [Susan 2002] defined above. The cast function ties XML and DTD together and makes it possible to specify an XML tree that conforms to a DTD according to a given cast function.

However, the cast function defined above is only a general mapping from DTD paths to XML element identifiers or string values for elements and attributes. In order to ensure that an XML document conforms to the corresponding DTD, an XML tree based on the cast function can be derived as below:

Definition: For a given DTD D<E, A, TPM, BE, root> and a corresponding cast function cas ∈ CAS(D), the XML document denoted cas_tree(D, cas) is an XML tree XT<N, ENIDM, ELEM, ATTM, xroot> such that:

(1) N = {n | n ∈ ENID ∧ ∃s (s ∈ D* ∧ n = cas(s))};

(2) for a legal DTD path s = s_0 s_1 ... s_m and a node n ∈ N, if n = cas(s), then ENIDM(n) = s_m. This rule says that if a path is cast to an element identifier in the XML, the node is mapped by ENIDM to the last element of the DTD path; this is exactly the simple semantics of XPath;

(3) if n = cas(s), then ELEM(n) = {cas(s′) | s′ = s + I ∧ cas(s′) ∈ CAS(D), I ∈ E ∪ VAL}, where "+" means connection of DTD paths;

(4) given @att ∈ A, if n = cas(s) and cas(s+@att) ∈ CAS(D), then ATTM(n, @att) = cas(s+@att), where "+" means the connection operation.


The definition of cas_tree(D, cas) is itself the process of realizing a DTD as an XML document according to a cast function on the DTD. The cas_tree function thus acts as an "assignment" of the DTD, and the result of the assignment is an XML document.

For example, given a DTD segment D:

<!ELEMENT Books (Book*)>
<!ELEMENT Book (Title, Copys)>
<!ATTLIST Book ISBN CDATA #REQUIRED>
<!ELEMENT Title (#PCDATA)>
<!ELEMENT Copys (Copy*)>
<!ELEMENT Copy (CopyNumber, Borrowed)>
<!ELEMENT CopyNumber (#PCDATA)>
<!ELEMENT Borrowed (#PCDATA)>

and a cast function cas:

cas(Books) = n_0;
cas(Books+Book) = n_1;
cas(Books+Book+@ISBN) = '123456';
cas(Books+Book+Title) = n_2;
cas(Books+Book+Title.string) = 'Java Core';
cas(Books+Book+Copys) = n_4;
cas(Books+Book+Copys+Copy) = n_5;
cas(Books+Book+Copys+Copy+CopyNumber) = n_6;
cas(Books+Book+Copys+Copy+CopyNumber.string) = '1';
cas(Books+Book+Copys+Copy+Borrowed) = n_7;
cas(Books+Book+Copys+Copy+Borrowed.string) = 'Yes';

where string denotes the string value of the element.

Figure A-1: The XML tree generated by the cast function. The items in brackets are the element identifiers assigned by the cast function, and the names outside the brackets are the corresponding DTD elements or attributes.

The XML tree generated by cas_tree(D, cas) is shown in Figure A-1.

It can be seen that if a different cast function is applied to the same DTD, the generated tree is expanded into a bigger one. For example, another cast function is defined as:

cas(Books) = n_0;
cas(Books+Book) = n_1;
cas(Books+Book+@ISBN) = '123456';
cas(Books+Book+Title) = n_2;
cas(Books+Book+Title.string) = 'Java Core';
cas(Books+Book+Copys) = n_4;
cas(Books+Book+Copys+Copy) = n_8;
cas(Books+Book+Copys+Copy+CopyNumber) = n_9;
cas(Books+Book+Copys+Copy+CopyNumber.string) = '2';
cas(Books+Book+Copys+Copy+Borrowed) = n_10;
cas(Books+Book+Copys+Copy+Borrowed.string) = 'No';

The XML tree generated by this cast function, together with the previous one, can be expressed as Figure A-2.

Figure A-2: The XML tree generated by two different cast functions. The subtree rooted at n_8 is the extension contributed by the second cast function.

It can be seen that, as long as each cas ∈ CAS(D), the repeated use of different cast functions can produce arbitrarily complex XML documents conforming to a specific DTD D.
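To illustrate this growth, the toy Python sketch below grafts cast functions onto one tree, modelling each XML node by the pair (DTD name, assigned identifier or value); casts that agree on a prefix share those nodes and diverge where their assignments differ, as n_5 and n_8 do above. This is an informal model, not the formal cas_tree:

def graft(tree, cas):
    # Walk each cast path from the root, creating a child keyed by
    # (name, cas-assigned id or value) whenever it does not exist yet.
    # Value paths (e.g. "....string") are treated simply as extra leaves.
    for path in cas:
        steps = path.split("+")
        node = tree
        for i, step in enumerate(steps):
            key = (step, cas["+".join(steps[:i + 1])])
            node = node.setdefault(key, {})
    return tree

tree = {}
graft(tree, {"Books": "n0", "Books+Book": "n1",
             "Books+Book+Copys": "n4", "Books+Book+Copys+Copy": "n5"})
graft(tree, {"Books": "n0", "Books+Book": "n1",
             "Books+Book+Copys": "n4", "Books+Book+Copys+Copy": "n8"})
# The second cast shares n0, n1 and n4, and adds a sibling Copy node n8.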

The definition above explains how to construct an XML tree with a cast function. However, an arbitrary cast, including casts of illegal DTD paths, generally does not conform to the DTD and cannot be derived from an arbitrary XML document. A more common situation is to restrict the set of cast functions with respect to a given XML document and DTD, which is defined as follows.

Definition: Given a DTD D and an XML tree XT that conforms to D, a minimal set of cast functions CAS(D, XT) is defined as:

{cas | cas ∈ CAS(D), cas_tree(D, cas) is a subtree of XT, and ¬∃cas′ ∀s (cas′ ∈ CAS(D), s ∈ D*, (cas ≠ cas′ ∧ cas′(s) ≠ null) → cas′(s) = cas(s))}.

This means that there are no two different cast functions cas and cas′ in CAS(D, XT) that make cas′(s) = cas(s) true whenever cas′(s) ≠ null. This definition makes sure that the cast functions in CAS(D, XT) for a specific DTD D and XML document XT are not redundant: any two different cast functions will generate different elements, attributes or string values in the XML.

After this long journey through the definition of cast functions, XML functional dependency can finally be defined as below:

Definition: Given a DTD D and an XML tree XT, for S, S′ ⊆ D*, S′ is functionally dependent (FD) on S, denoted S → S′, iff

∀cas ∀cas′ (cas, cas′ ∈ CAS(D, XT), ∀s (s ∈ S ∧ cas(s) ≠ null ∧ cas(s) = cas′(s)) → ∀s′ (s′ ∈ S′ ∧ cas(s′) = cas′(s′)))

is true, where "→" in this statement is the logical operation "imply" rather than "dependency".

Because all of the discussion in the following sections is about the usage of XML

functional dependency, the examples for this definition will be delayed until the

detailed discussion of constraint propagation.
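Although domain examples are deferred, the definition can already be checked mechanically over a finite set of cast functions. In the Python sketch below, each cast is modelled as a dictionary from DTD paths to node identifiers or values, with a missing entry standing for null; this is only an illustration of the definition, not part of the propagation algorithm:

from itertools import product

def fd_holds(S, S_prime, casts):
    # S -> S' holds iff every pair of casts that agree (non-null) on all
    # paths in S also agree on all paths in S'.
    for cas, cas2 in product(casts, repeat=2):
        agree_on_S = all(cas.get(s) is not None and cas.get(s) == cas2.get(s)
                         for s in S)
        if agree_on_S and any(cas.get(t) != cas2.get(t) for t in S_prime):
            return False
    return True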

So far, functional dependency on XML schemas has been defined based on the cast function, which is a mechanism for realizing a DTD as an XML document. Fundamentally, all of these definitions are built on DTD paths and the formalized expressions of DTD and XML.

In the next section, relational functional dependency propagation is solved by re-expressing the semantics of key constraints in the XML domain as XML functional dependencies. After this, foreign key constraint propagation is discussed in more complex situations.

A.1.3 Key constraint propagation

The purpose of this section can be described generally as: given a relational schema T(A_1, A_2, ..., A_n) and corresponding key constraints Σ_RFD, find a DTD D<E, A, TPM, BE, root> together with a set of FDs Σ_FD that have the same semantics as T and Σ_RFD. The whole process is a mapping from a relational schema with key constraints to an XML schema with corresponding functional dependencies.

Given a relational schema T(A_1, ..., A_n) with key attributes {A_k1, ..., A_km} ⊆ {A_1, ..., A_n}, the corresponding DTD D<E, A, TPM, BE, root> with FDs Σ_FD can be derived by the algorithm TableDTD(T), described below:

(1) E = {T_root, T} ∪ ({A_1, ..., A_n} - {A_k1, ..., A_km});

(2) A = {@A_1, ..., @A_n};

(3) TPM(T_root) = T*;

(4) suppose {A_nk1, ..., A_nkp} = {A_1, ..., A_n} - {A_k1, ..., A_km}; then TPM(T) = A_nk1 + ... + A_nkp, and for all A_nki ∈ {A_nk1, ..., A_nkp}, TPM(A_nki) = ε;

(5) BE(T) = {@A_k1, ..., @A_km}; for all A_nki ∈ {A_nk1, ..., A_nkp}, BE(A_nki) = {@A_nki};

(6) Σ_FD = {{T_root+T+@A_k1, ..., T_root+T+@A_km} → {T_root+T+A_nk1, T_root+T+A_nk1+@A_nk1, ..., T_root+T+A_nkp, T_root+T+A_nkp+@A_nkp}}.

The process above can be denoted as (D, Σ_FD) = TableDTD(T). In particular, the root element T_root is the element corresponding to the whole table of relational schema T, while element T stands for a tuple of the relation. Moreover, it can be seen that TableDTD maps the key attributes of the relational schema to the direct attributes under element T, and maps the non-key attributes to XML attributes under the sub-elements of T.

The key constraint is re-expressed in the XML domain by (6). The FDs can be divided into two kinds: {T_root+T+@A_k1, ..., T_root+T+@A_km} → {T_root+T+A_nki} and {T_root+T+@A_k1, ..., T_root+T+@A_km} → {T_root+T+A_nki+@A_nki}. These two kinds of FDs mean that the key attributes in the XML not only uniquely identify the values of the non-key attributes but also uniquely identify the elements one level above those values. This is slightly different from the notation in the relational schema. As will be seen in the following sections, this property of XML FDs is essential for propagating key constraints as well as foreign key constraints.

As an example of key constraint propagation, the relational schema Course(cid, Name, Lecturer) can be mapped to an XML schema D<E, A, TPM, BE, root> with a set of FDs Σ_FD as:

E = {Course_root, Course, Name, Lecturer};

A = {@cid, @Name, @Lecturer};

TPM(Course_root) = Course*; TPM(Course) = Name+Lecturer; TPM(Name) = ε; TPM(Lecturer) = ε;

BE(Course) = {@cid}; BE(Name) = {@Name}; BE(Lecturer) = {@Lecturer};

Σ_FD = {{Course_root+Course+@cid} → {Course_root+Course+Name, Course_root+Course+Name+@Name, Course_root+Course+Lecturer, Course_root+Course+Lecturer+@Lecturer}}.
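For illustration, TableDTD can be sketched in Python, with dictionaries and sets standing in for the formal quintuple and FD set; the (name, attributes, keys) input format is an assumption of the sketch:

def table_dtd(name, attrs, keys):
    nonkeys = [a for a in attrs if a not in keys]
    root = name + "_root"
    E = {root, name} | set(nonkeys)                       # (1)
    A = {"@" + a for a in attrs}                          # (2)
    TPM = {root: name + "*",                              # (3)
           name: "+".join(nonkeys) if nonkeys else "ε"}   # (4)
    TPM.update({a: "ε" for a in nonkeys})
    BE = {name: {"@" + k for k in keys}}                  # (5)
    BE.update({a: {"@" + a} for a in nonkeys})
    lhs = tuple(f"{root}+{name}+@{k}" for k in keys)      # (6) key constraint
    rhs = tuple(p for a in nonkeys
                for p in (f"{root}+{name}+{a}", f"{root}+{name}+{a}+@{a}"))
    return E, A, TPM, BE, root, {lhs: rhs}

# Reproduces the Course example above:
print(table_dtd("Course", ["cid", "Name", "Lecturer"], ["cid"]))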

It can be seen that the construction of an XML schema for a single relational schema with a key constraint is relatively easy. This is partly because the key constraint is straightforward and does not involve nontrivial structures such as recursion and loops. Moreover, a key constraint is confined to a single table and thus does not interact with other constraints. In the next section, it will be seen that foreign key constraints, which are inter-schema relationships, make things messier, and some tricks are needed to solve foreign key propagation.


A.2 Foreign key constraint propagation

Each individual relational schema with its key constraint has now been converted into an XML schema with FDs. However, a relational database seldom consists of a single table; more commonly, tables are connected by foreign keys and form a set of inter-schema constraints, which give rise to harder problems when we want to keep these constraints in the generated XML schema. The following sections focus on these nontrivial situations of foreign key constraint propagation and finally arrive at a formal algorithm.

foreign key constraint propagation and finally come to a formal algorithm.

A.2.1 Foreign key driven pre-mapping

In this section, the author will firstly present a partial mapping method from relational

schemas with foreign key constraints to XML schema with FDs. This is done by

following the foreign key constraints among schemas and continuously aggregating

the schemas generated by key constraint propagation to form a bigger XML schema.

However, it will be seen that this method can not preserve all of the foreign key

constraints so that there will be an “adjusting” step after this pre-mapping.

First of all, a graphical expression will be used to describe the foreign key constraints

among relational schemas.

Definition: Given relational schemas DB(T_1, T_2, ..., T_n), where each T_i is a single relational schema, a foreign key graph GDB<TB, FORMAP, TBroot> is defined as below:

(1) TB = {T_1, T_2, ..., T_n};

(2) FORMAP(T_i) = {T_j | T_j contains a foreign key, namely the primary key of T_i, as its own primary key or part of its primary key};

(3) TBroot = {T_i | ¬∃T_j (T_i ∈ FORMAP(T_j))}.

For example, the relational schemas below are simple ones in which the foreign key is ISBN in relation Copy:

Book(ISBN, Title, Author)
Copy(ISBN, CopyNumber, Borrowed)

These schemas can be expressed as GDB<TB, FORMAP, TBroot>, where

(1) TB = {Book, Copy};

(2) FORMAP(Book) = {Copy};

(3) TBroot = {Book}.

More complex examples will be given during the discussion below.

Note that the foreign key graph does not contain the detailed information of each schema; that information has already been converted into the DTD graphs of the individual schemas by key constraint propagation.

In the following sections, the construction of the XML schema based on different kinds of foreign key graphs is discussed and formalized as the pre-mapping algorithm.

A.2.2 One to many relationship and weak entity

The most common use of a foreign key is the so-called "one to many" relationship, in which schema T_j contains only the foreign key from another schema T_i as its own primary key or part of its primary key. This is the simplest situation and can be dealt with by adding the XML schema corresponding to T_j as a child element of T_i. For example, following the key constraint propagation algorithm, the DTD graphs for the Book-Copy schemas can be designed individually as in Figure A-3.

Figure A-3: The DTDs generated by key constraint propagation from relational schemas Book and Copy.

The generated key constraints within these two schemas are:

{Book_root+Book+@ISBN} → {Book_root+Book+Title, Book_root+Book+Title+@Title, Book_root+Book+Author, Book_root+Book+Author+@Author};

{Copy_root+Copy+@ISBN, Copy_root+Copy+@CopyNumber} → {Copy_root+Copy+Borrowed, Copy_root+Copy+Borrowed+@Borrowed}.

Recall that all of the values of a relation are stored in the attributes of the XML, so the elements Title, Author and Borrowed here all have their own attributes for storing the data. These are omitted for convenience of presentation.

It can be seen that relation Copy is actually a weak entity and needs the foreign key from another entity, Book, to uniquely identify a copy of a book. The merged DTD based on this one-to-many foreign key constraint between Book and Copy is shown in Figure A-4:

Figure A-4: The resulting DTD of foreign key constraint propagation from the Book-Copy relational schemas. The foreign key attribute @ISBN has been removed because it can be derived from the one in the Book schema.

The FDs in this DTD are modified to be:

{Book_root+Book+@ISBN} → {Book_root+Book+Title, Book_root+Book+@Title, Book_root+Book+Author, Book_root+Book+Author+@Author};

{Book_root+Book+@ISBN, Book_root+Book+Copy_root+Copy+@CopyNumber} → {Book_root+Book+Copy_root+Copy+Borrowed, Book_root+Book+Copy_root+Copy+Borrowed+@Borrowed}.

What has been done here is: put the DTD D_i for relational schema T_i, which contains a foreign key from T_j, as a child of the DTD D_j of T_j, where the foreign key comes from; extend the paths in D_i's FDs with the corresponding paths of D_j; and rewrite all FDs in D_i with the new paths. Moreover, the foreign key attributes in T_i from T_j are knocked out because they can be derived from the parent T_j. By doing this, the foreign key constraint is converted into a set of FDs in the DTD. For example, the previous foreign key constraint:

{Copy_root+Copy+@ISBN, Copy_root+Copy+@CopyNumber} → {Copy_root+Copy+Borrowed, Copy_root+Copy+Borrowed+@Borrowed}

in the example above can be written as:

{Book_root+Book+@ISBN, Book_root+Book+Copy_root+Copy+@CopyNumber} → {Book_root+Book+Copy_root+Copy+Borrowed, Book_root+Book+Copy_root+Copy+Borrowed+@Borrowed}

where Book_root+Book+Copy_root+Copy+@ISBN has been knocked out of the XML schema for relational schema Copy and is substituted by Book_root+Book+@ISBN, which is the key of relational schema Book.

An interesting result emerges: the foreign key constraints of the relational schemas are transformed into and expressed by functional dependencies in XML. The reason is that "set" values, which are forbidden in relational schemas, can easily be realized in XML thanks to its semistructured nature.

From this perspective, propagating foreign key constraints in this sort of one-to-many relationship seems easy. However, things become tricky when more schemas are involved.

A.2.3 Many to many relationship

Another common sort of foreign key constraint among schemas is the many-to-many relationship, in which a schema contains two or more foreign keys from other schemas. This is depicted in the Student-Enroll-Course example:

Figure A-5: The DTD generated by key constraint propagation from relational schemas Course, Student and Enroll.

These schemas say that Enroll has foreign keys from both Student (Student.sid) and Course (Course.cid). Because Student-Enroll and Course-Enroll are each one-to-many relationships, if the same approach as in the one-to-many situation is tried, a problem arises: which DTD should be chosen to contain Enroll as a child? This problem can be described by the foreign key graph in Figure A-6. The solid arrows mean that Enroll has foreign keys from both Student and Course. These two arrows clash at relation Enroll, which causes the problem just described.

Figure A-6: The foreign key graph of the Student-Enroll-Course relational schemas. The solid arrows show that relation Enroll contains foreign keys from Student and Course. The clashing arrows are resolved by reversing one of them, depicted by the dashed line.

The solution proposed here for this embarrassment is straightforward: break the clash by reversing one of the arrows, as shown by the dashed arrow in Figure A-6. In this case, Enroll is put into Course as a child, and Student is put into Enroll as a child. The problem seems to be solved.

However, reversing the arrow changes the semantics of the foreign key constraint. Obviously, it is abnormal for schema Student to contain a foreign key from Enroll. To see the problem, let us explore the example above further.

Before merging the XML schemas into one schema, the original key constraints of the three individual schemas can be written as:

{Course_root+Course+@cid} → {Course_root+Course+Name, Course_root+Course+@Name, Course_root+Course+Lecturer, Course_root+Course+Lecturer+@Lecturer};

{Student_root+Student+@sid} → {Student_root+Student+Name, Student_root+Student+Name+@Name}

{Enroll_root+Enroll+@cid,Enroll_root+Enroll+@sid} → {Enroll_root+Enroll+Grade, Enroll_root+Enroll+Grade+@Grade}.

To see the problem caused by using the simple one-to-many approach, the author constructs the DTD and an XML document conforming to it by adopting the rules of the one-to-many situation, except that the key of Student is conserved while @sid and @cid are knocked out of Enroll, which contains multiple foreign keys.

In this new DTD, according to the algorithm for the one-to-many relationship, the original key constraint of Student should be rewritten as:

{Course_root+Course+Enroll_root+Enroll+Student_root+Student+@sid} → {Course_root+Course+Enroll_root+Enroll+Student_root+Student+Name, Course_root+Course+Enroll_root+Enroll+Student_root+Student+Name+@Name}

Figure A-7: Left: the DTD constructed by reversing the arrow in the foreign key graph. Right: the corresponding XML conforming to the DTD. The circles demonstrate the problem of reversing the arrow: the key attributes of Enroll cannot uniquely identify the Name elements E1 and E2.

However, is this really true? From the XML on the right of Figure A-7, it can be seen that for the path Course_root+Course+Enroll_root+Enroll+Student_root+Student+@sid, any two cast functions having the same value '00001' on it will have the same cast result, 'Sb', for the path Course_root+Course+Enroll_root+Enroll+Student_root+Student+Name+@Name. However, the casts for the path Course_root+Course+Enroll_root+Enroll+Student_root+Student+Name can be either E1 or E2, as denoted in Figure A-7. This says that the functional dependency:

{Course_root+Course+Enroll_root+Enroll+Student_root+Student+@sid} → {Course_root+Course+Enroll_root+Enroll+Student_root+Student+Name}

no longer holds after the inversion of the arrow in the foreign key graph of Figure A-6.

This exception causes serious problems when we update or delete data in an XML document conforming to this sort of DTD. For example, if we modify the attribute value in the Name element denoted E1, the element E2 does not know what has happened. The resulting situation is that two students with the same student ID have different names, which is obviously inconsistent with the original key constraint of Student.

For the reasons above, the semantics of the generated DTD are not consistent with the original constraints of the relational schemas. The solution here is simply to tolerate the inconsistency in the pre-mapping step and to record it through the missing FDs in the newly generated DTD. The inconsistency is then resolved in the post-mapping step, described later.

Because the functional dependency:

{Course_root+Course+Enroll_root+Enroll+Student_root+Student+@sid} → {Course_root+Course+Enroll_root+Enroll+Student_root+Student+Name}

no longer holds, it is deleted as "evidence" of the problem described above. All of the operations above result in the following FDs in the generated DTD:

{Course_root+Course+@cid} → {Course_root+Course+Name, Course_root+Course+Name+@Name, Course_root+Course+Lecturer, Course_root+Course+Lecturer+@Lecturer};

{Course_root+Course+@cid, Course_root+Course+Enroll_root+Enroll+Student_root+Student+@sid} → {Course_root+Course+Enroll_root+Enroll+Grade, Course_root+Course+Enroll_root+Enroll+Grade+@Grade};

{Course_root+Course+Enroll_root+Enroll+Student_root+Student+@sid} → {Course_root+Course+Enroll_root+Enroll+Student_root+Student+Name+@Name}.

Generally, for a schema that contains two or more foreign keys from other schemas, as shown in the foreign key graph of Figure A-8, the solution is similar: keep one of the arrows and reverse the others.

Figure A-8: A more general situation of the many-to-many relationship, in which D contains several foreign keys from other relations. The pre-mapping solution maintains one arrow and reverses all of the others.

Figure A-9 gives the resulting structure. The details of each schema are not specified; they are exactly the same as in the Student-Enroll-Course example.


Figure A-9: A brief expression of DTD after reversing the arrows in Figure A-8.

As a simple explanation, the reversed arrows in the foreign key graph put the XML schemas for A and C as children of D. The reversion problem is denoted by the ill FDs in A and C, similar to the Student-Enroll-Course example.

A.2.4 More complex relationship

Up to this point, the author has discussed the two most common situations of foreign key constraint propagation. However, things are usually more complex than this. Consider the scenario in Figure A-10.

Figure A-10: The foreign key graph of the Researcher-Model-Experiment relational schemas. Experiment has foreign keys from both Researcher and Model, which forms a many-to-many relationship, and Model has a foreign key from Researcher, which is a one-to-many relationship.

It says that a researcher can set up several models, so Model contains the key of Researcher as part of its primary key and becomes a weak entity. This situation is nothing more than a one-to-many relationship, which has been solved before. However, when we come to the relation Experiment, things become complicated.

Relation Experiment contains a foreign key from Researcher, which indicates who carried out the experiment, and another foreign key from Model, which shows the model the experiment is based on. If one tries the many-to-many solution on Researcher and Model, it turns out to be impossible to demote either Researcher or Model to be the child of Experiment. In particular, if one simply reverses the arrow from Researcher to Experiment, the foreign key graph becomes recursive (a loop); while if one reverses the arrow from Model to Experiment, the same problem that Experiment had comes up for relation Model.

The way to solve this problem is to "split" relation Researcher into two copies and break the circular structure in the foreign key graph of Figure A-10, and then to reverse the arrow towards the replicated schema (the new Researcher). The foreign key graph is thereby transformed into two one-to-many relationships, as depicted in Figure A-11.

Figure A-11: The transformed foreign key graph by duplicating Researcher and reversing the arrow from Experiment to new Researcher.

Generally, this process contains two steps: reverse the arrow from Researcher to

Experiment, and then break the circle by splitting Researcher into two copies.

Before going on to the construction of FDs in the generated XML schema, let’s look at

a more general situation which is shown in Figure A-12.

Figure A-12: The foreign key graph of a more general situation. The relations in A series and B series propagate the foreign key constraints to the same relation C.

It can be seen that the relations in the A series and the B series propagate their foreign key constraints to the same relation C. This is a more general situation, hybridizing the many-to-many relationship and the complex relationship discussed above.

No matter what happens to the foreign key constraints on the way from A0 to C, the solution is the same: keep one arrow to C unchanged, reverse the other arrows to C, and then split the schemas pointed to by the reversed arrows. The result can be expressed as Figure A-13.

Figure A-13: The resulting foreign key graph after keeping one arrow to C unchanged, reversing the other arrows to C, and splitting the schemas pointed to by the reversed arrows.

The result of the solution above is obviously inconsistent with the original semantics of the foreign key constraints, especially where nodes are split into two copies. The lost information, in this case, is recorded by the ill FDs in the newly generated DTD. Specifically, for the Researcher-Model-Experiment example, suppose the relational schemas are:

Researcher(rid, rname)
Model(rid, mid, mname)
Experiment(rid1, rid, mid, result)

where rid1 is the foreign key from Researcher, and rid, mid together are the foreign key from Model. The FDs in the DTD after pre-mapping the Researcher-Model-Experiment schemas should be:

{Researcher_root+Researcher+@rid} → {Researcher_root+Researcher+rname, Researcher_root+Researcher+rname+@rname};

{Researcher_root+Researcher+@rid, Researcher_root+Researcher+Model_root+Model+@mid} → {Researcher_root+Researcher+Model_root+Model+mname, Researcher_root+Researcher+Model_root+Model+mname+@mname};

After schema Experiment is added, according to the one-to-many solution, the FDs for schemas Researcher and Model in the DTD are unchanged. The FDs for Experiment and the duplicated Researcher are:

{Researcher_root+Researcher+@rid, Researcher_root+Researcher+Model_root+Model+@mid, Researcher_root+Researcher+Model_root+Model+Experiment_root+Experiment+Researcher1_root+Researcher1+@rid} → {Researcher_root+Researcher+Model_root+Model+Experiment_root+Experiment+result, Researcher_root+Researcher+Model_root+Model+Experiment_root+Experiment+result+@result}


{Researcher_root+Researcher+Model_root+Model+Experiment_root+Experiment+Researcher1_root+Researcher1+@rid} → {Researcher_root+Researcher+Model_root+Model+Experiment_root+Experiment+Researcher1_root+Researcher1+rname+@rname}

where Researcher1 is the duplicate of Researcher.

Note that the constraint

{Researcher_root+Researcher+Model_root+Model+Experiment_root+Experiment+Researcher1_root+Researcher1+@rid} → {Researcher_root+Researcher+Model_root+Model+Experiment_root+Experiment+Researcher1_root+Researcher1+rname}

is eliminated, forming the ill FD, for the same reason as in the many-to-many relationship.

Because of the duplication and arrow inversion in the foreign key graph, pre-mapping causes the same ill FDs as in the many-to-many relationship and gives rise to insertion and updating problems. These are solved in the post-mapping step, after the next section, which formalizes the pre-mapping algorithm.

A.2.5 Formalized pre-mapping algorithm

Before proceeding to the formal algorithm, the author defines the hierarchy of a foreign key graph, which will be convenient in the following discussion.

Definition: Given a foreign key graph GDB<TB, FORMAP, TBroot> and T ∈ TB, the level of T, denoted level(T), is defined as the length of the longest path starting from a node in TBroot and ending at T. Besides, level(GDB, i) is defined as {T | level(T) = i}.

For example, in the Researcher-Model-Experiment example, level(Researcher)=0,

level(Model)=1, level(Experiment)=2.

Definition: Given a foreign key graph GDB<TB, FORMAP, TBroot>, the diameter of

GDB is defined as diam(GDB)=max{level(T)|T∈TB}.

For example, the diameter of the Researcher-Model-Experiment example is 2, the length of the longest path (from Researcher to Experiment) in the foreign key graph.

Definition: Given a foreign key graph GDB<TB, FORMAP, TBroot> and T ∈ TB, the function pre(T) is defined as {T′ | T ∈ FORMAP(T′)}.

For instance, in the Researcher-Model-Experiment example, pre(Researcher) = null, pre(Model) = {Researcher}, and pre(Experiment) = {Researcher, Model}, where null means the empty set.
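These hierarchy functions are straightforward to realize. The Python sketch below models FORMAP as a dictionary of sets for the Researcher-Model-Experiment example and assumes, as the definitions implicitly do, an acyclic graph:

TB = {"Researcher", "Model", "Experiment"}
FORMAP = {"Researcher": {"Model", "Experiment"},   # Researcher's key flows to both
          "Model": {"Experiment"},
          "Experiment": set()}
TBroot = {t for t in TB if all(t not in FORMAP[u] for u in TB)}

def pre(t):
    # pre(T) = {T' | T ∈ FORMAP(T')}
    return {u for u in TB if t in FORMAP[u]}

def level(t):
    # Length of the longest path from a root to t (0 for roots).
    parents = pre(t)
    return 0 if not parents else 1 + max(level(p) for p in parents)

diam = max(level(t) for t in TB)
print(TBroot, level("Model"), level("Experiment"), diam)
# {'Researcher'} 1 2 2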

Definition: Given two DTDs D<E, A, TPM, BE, root> with FDs Σ_FD and D′<E′, A′, TPM′, BE′, root′> with FDs Σ′_FD, an extension of D by D′ at node e ∈ E with DTD path s_e ∈ D−, denoted D ∪_e D′, is defined as a new DTD D_new<E_new, A_new, TPM_new, BE_new, root_new> with FDs Σ_FDnew where:

(1) E_new = E ∪ E′;

(2) A_new = A ∪ A′;

(3) TPM(e) = TPM(e) + root′; TPM_new = TPM ∪ TPM′, where "+" means connection of regular expressions;

(4) BE_new = BE ∪ BE′;

(5) root_new = root;

(6) every DTD path s′ in Σ′_FD is rewritten as s′ = s_e + s′; Σ_FDnew = Σ_FD ∪ Σ′_FD, where "+" stands for path connection.

This is exactly the formalized description of how to add a DTD D′ as a child of another DTD D and modify the corresponding DTD paths in the FD set. All of the examples in the previous sections were effectively constructed with this approach and can thus serve as good illustrations.
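The FD-rewriting step (6) can be illustrated on the Book-Copy example. In the Python sketch below, FDs are modelled as (left-hand side, right-hand side) pairs of path tuples, an assumption of the sketch; only the prefixing of the child DTD's FD paths with s_e is performed:

def extend_fds(fds, child_fds, s_e):
    # Step (6): every path of the grafted child's FDs gains s_e as a prefix.
    prefixed = [(tuple(s_e + "+" + p for p in lhs),
                 tuple(s_e + "+" + p for p in rhs))
                for lhs, rhs in child_fds]
    return fds + prefixed

copy_fds = [(("Copy_root+Copy+@ISBN", "Copy_root+Copy+@CopyNumber"),
             ("Copy_root+Copy+Borrowed", "Copy_root+Copy+Borrowed+@Borrowed"))]
print(extend_fds([], copy_fds, "Book_root+Book"))
# Each Copy path becomes Book_root+Book+Copy_root+Copy+...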

The following algorithm concludes the process of DTD pre-mapping with a formal description. Given a foreign key graph GDB<TB, FORMAP, TBroot>, the DTD D<E, A, TPM, BE, root> with FDs Σ_FD for GDB can be constructed as follows:

Initialization:

E, A, TPM, BE, root are all initialized to be empty or null.

Loop:

(1) Establish the root node DB: E = E ∪ {DB};

(2) For all T_k ∈ TBroot: D = D ∪_DB TableDTD(T_k);

(3) For i = 1 to diam(GDB) {

    Let T = level(GDB, i); // collect all relational schemas on level i

    For all T_k ∈ T {

        Let parent = pre(T_k);

        if parent has exactly one element {

            Let p ∈ parent; // p is the only item in parent

            Let D′ = the DTD segment corresponding to p in D;

            Let e = the element under the root element of D′; // only one element under the root

            Let tempDTD = TableDTD(T_k);

            fka = {a | a is a foreign key attribute of T_k from p};

            tempDTD.A = tempDTD.A - fka; // delete the foreign key attributes

            D = D ∪_e tempDTD;

            // change the DTD paths for the foreign key of T_k from p to the paths for the key of p

            In Σ_FD, substitute all paths for attributes in fka with the paths of the key of p;

        }

        else {

            // insert the schema which has more than one arrow pointing to it

            choose one p ∈ parent;

            Let D′ = the DTD segment corresponding to p in D;

            Let e = the element under the root element of D′;

            Let tempDTD = TableDTD(T_k);

            fka = {a | a is a foreign key attribute of T_k from p};

            tempDTD.A = tempDTD.A - fka;

            D = D ∪_e tempDTD;

            In Σ_FD, substitute all paths for attributes in fka with the paths of the key of p;

            // now insert the schemas to which the reversed arrows point into D as children of T_k's DTD

            Let D′ = the DTD segment corresponding to T_k in D;

            Let e = the element under the root element of D′;

            for all p′ ∈ parent - {p} {

                // duplicate p′

                D = D ∪_e TableDTD(p′);

                fka = {a | a is a foreign key attribute of T_k from p′};

                In Σ_FD, substitute all paths for attributes in fka with the paths of the key of p′;

                // mark the problem caused by arrow reversion and relation duplication with ill FDs

                nfka = {a | a is a non-key attribute of p′};

                for all a ∈ nfka {

                    // denote the ill FDs

                    In Σ_FD, eliminate all paths containing s+a, keeping only s+a+@a;

                }

            }

        }

    }

}

Although the algorithm looks complex and tricky, it is nothing more than the discussion of the previous sections. The pre-mapping algorithm works through the relational schemas in increasing order of level value (the longest path from the root relations) and gradually aggregates all relational schemas into one XML schema. During this process, arrow inversion and relational schema replication cause semantic inconsistencies between the relational schemas and the XML schema, which in turn cause deletion and updating problems. These problems have been marked by the ill FDs in the XML schema and will be disposed of in the post-mapping step.


A.2.6 Post-mapping

Several times we have seen that the reversal of arrows and the duplication of relational schemas in the foreign key graph cause inconsistencies in constraint semantics, together with deletion and updating problems, in the pre-mapping step. Fortunately, these problems have been recorded in the functional dependency set of the generated DTD. This section focuses on a method for eliminating the inconsistencies caused by pre-mapping.

The inconsistency problem for a DTD D<E, A, TPM, BE, root> with FDs Σ_FD is denoted by the statement: S ⊆ D*, s′ ∈ D*, S → s′+X+@X ∈ Σ_FD but S → s′+X ∉ Σ_FD. Such inconsistencies cause both update and deletion failures, as described in the Course-Enroll-Student example.

The solution can be summarized as transforming a DTD D with FDs Σ_FD into a new DTD D′ with new FDs Σ′_FD that do not contain the ill FDs.

Recall that in relational database design, update and deletion problems are solved by splitting one schema into multiple schemas and removing the redundant attributes from a relational schema. The situation in XML is different. There is only one XML schema for the whole set of relational schemas, so it is impossible to split the XML schema into several pieces. However, an XML schema contains branches. Is it then possible to extract those sub-trees and put them on other branches within the XML? The situation below is an example.

In the Course-Enroll-Student example, to capture the semantics of the key of Student, namely that each Name element and each attribute value of Name (@Name) should be uniquely identified by @sid, the Student branch is split off into another subtree, named "Student_Info_root". At the same time, all non-key attributes and elements in the Student branch are discarded, and only the key attributes are conserved, in order to avoid redundancy and eliminate the ill FDs. The resulting DTD can be expressed as the DTD graph in Figure A-14.

The Student_Info_root subtree is established before Student according to the pre-mapping algorithm, because relational schema Student has level value 0 and is one of the root schemas (the other is Course). The Student subtree is in fact the duplicate of relational schema Student. Besides, the root DB is added by the formalized pre-mapping algorithm. This is why the DTD in Figure A-14 is slightly different from the one in Figure A-7.

Figure A-14: The DTD without ill FDs after post-mapping. A branch called Student_Info_root conserves all attributes of relational schema Student, while the non-key attributes in Student branch are eliminated so that the ill FD is deleted.

It can be seen that the redundant information, Name, in Student has been knocked out

so that the ill FD:

{DB+Course_root+Course+Enroll_root+Enroll+Student_root+Student+@sid} → { DB+Course_root+Course+Enroll_root+Enroll+Student_root+Student+Name+@Name}

without

{DB+Course_root+Course+Enroll_root+Enroll+Student_root+Student+@sid} → { DB+Course_root+Course+Enroll_root+Enroll+Student_root+Student+Name}

disappears. The original key constraint of relational schema Student is represented by the FD in the new branch Student_Info in the XML:

{DB+Student_Info_root+Student_Info+@sid} → {DB+Student_Info_root+ Student_Info+Name+@Name, DB+Student_Info_root+Student_Info+Name}.

This again captures the semantics of the key constraint of Student in the relational schema. Besides, the key attribute @sid in Student and the one in Student_Info actually come from the same attribute of Student in the relational schema. To denote this fact, a new constraint is established:

{DB+Student_Info_root+Student_Info+@sid} → {DB+Course_root+Course+Enroll_root+Enroll+Student_root+Student+@sid}.

The more general many-to-many example of the previous section is converted by the post-mapping step into a new DTD, shown as the simplified DTD graph in Figure A-15. The non-key attributes in A and C have been discarded, and the new FDs, which indicate the relationship between the key attributes under A_Info_root and C_Info_root and the retained key attributes in A and C, are denoted by the dashed arrows. Again, A_Info_root, C_Info_root and DB were established by the pre-mapping algorithm before A and C were reached.

Figure A-15: The simplified DTD of the general many-to-many relational schema after post-mapping step. The non-key attributes in A and C are discarded and the new FDs, which indicate the relationship between the key attributes of A_Info_root, C_Info_root and the retained key attributes in A and C, are denoted by the dashed arrows.

Moreover, in the Researcher-Model-Experiment example, the DTD with ill FDs is

converted to a new DTD in Figure A-16. Again, the non-key attributes in the replicated

Researcher have been deleted and the new FD is expressed by the dashed arrow from

the key attributes of original Researcher_Info to the key attributes of replicated

Researcher. The root elements are all omitted for expression convenience.

The examples above show the post-mapping step for various sorts of relational design.

The formal algorithm will be given in the following discussion.

According to the pre-mapping algorithm, because the relational schemas are added into the generated DTD in increasing order of level value (the longest path from the root relational schemas), it can be shown that when the DTD segment of a particular relational table is constructed, all of the DTD segments of its parent relational schemas in the foreign key graph have already been constructed. The post-mapping algorithm is therefore really a process of combining the elements of the duplicated subtrees. The algorithm is described below:

Figure A-16: The simplified DTD of Researcher-Model-Experiment relational schema after post-mapping step. The non-key attributes in replicated Researcher have been deleted and the new FD is expressed by the dashed arrow from the key attributes of original Researcher_Info to the key attributes of replicated Researcher.

Given a DTD D<E, A, TPM, BE, root> with FDs Σ_FD, for S ⊆ D*, s ∈ D*, a ∈ E and @a ∈ A, suppose there is an ill FD S → {s+a+@a} ∈ Σ_FD but S → {s+a} ∉ Σ_FD. According to the pre-mapping algorithm, there must then be two further FDs S′ → {s′+a+@a}, S′ → {s′+a} ∈ Σ_FD, where s and s′ are the paths of the duplicated subtree and the original subtree for the same relational schema, respectively. The post-mapping algorithm consists of two parts: D = D/(S → {s+a+@a}) and Σ_FD = Σ_FD/(S → {s+a+@a}), where D/(S → {s+a+@a}) is a new DTD D′<E′, A′, TPM′, BE′, root′> such that:

(1) E′ = E;

(2) A′ = A;

(3) TPM′ = TPM, except that TPM′(s) = TPM′(s) - a; // "-" means deleting the element named a from the type definition of s

(4) BE′ = BE;

(5) root′ = root;

and Σ_FD/(S → {s+a+@a}) = (Σ_FD - {S → {s+a+@a}}) ∪ {S′ → S}.

With the definitions of D/(S → s+a+@a) and Σ_FD/(S → s+a+@a), post-mapping can be defined simply as:


post-mapping(D, Σ_FD) {

    for each ill FD S → s+a+@a ∈ Σ_FD with S → s+a ∉ Σ_FD {

        D = D/(S → s+a+@a);

        Σ_FD = Σ_FD/(S → s+a+@a);

    }

}

Up to now, foreign key propagation has been solved by the pre-mapping and post-mapping algorithms. The input of the whole algorithm is a set of relational schemas together with a set of foreign key constraints and key constraints; the output is a DTD with an FD set carrying semantics equivalent to the foreign key and key constraints of the relational schemas.

The DTD is further used as guidance for generating XML documents from relational databases and for parsing and validating these documents for applications.


Appendix B Estimated databases in project

NetPro. The introduction is from: http://www.biobase.de/pages/products/netpro.html.

NetPro, the proprietary protein interaction database covers more than 100,000 expert

curated and annotated protein-protein interactions. NetPro has been built using

interaction data extracted with the proprietary information extraction engine,

M-CHIPS™ and have been cross validated through manual curation. All the

interactions are from peer reviewed published scientific literature and have gone

through significant quality checks in terms of expert cross-checking by Molecular

Connections' in-house scientific team.

The NetPro database is organized in standard relational database format with built in

front-end to effectively navigate and analyze the data. NetPro data is linked to public

Ids, LocusLink, facilitating integration of interactions information into proprietary drug

discovery databases. The protein-protein interactions in NetPro are complemented with

annotations of the profiled proteins with scientific literature information on a variety of

important subjects, including: Cellular Localization, Biological Pathways, Species,

Experimental Technique, Diseases implicated, Gene Ontology.

MINT. The introduction is from http://mint.bio.uniroma2.it/mint/ and [Zanzoni

2002]

MINT is a relational database designed to store interactions between biological

molecules. Beyond cataloguing binary complexes, MINT was conceived to store other

types of functional interactions, including enzymatic modifications of one of the

partners. Both direct and indirect relationships are considered [Zanzoni 2002].

Furthermore, MINT aims at being exhaustive in the description of the interaction and,

whenever available, information about kinetic and binding constants and about the

domains participating in the interaction is included in the entry. MINT consists of

entries extracted from the scientific literature by expert curators. The curated data can

be analyzed in the context of the high throughput data. Presently, MINT focuses on

experimentally verified protein interactions with special emphasis on proteomes from

mammalian organisms.

DIP. The introduction is from http://dip.doe-mbi.ucla.edu/.


The DIP database catalogs experimentally determined interactions between proteins. It

combines information from a variety of sources to create a single, consistent set of

protein-protein interactions. The data stored within the DIP database were curated, both,

manually by expert curators and also automatically using computational approaches

that utilize the knowledge about the protein-protein interaction networks extracted from

the most reliable, core subset of the DIP data.


Appendix C ID mapping

Because the MASC (PPID), NetPro (LocusLink), MINT (SwissProt) and DIP (GenBank) databases use different IDs, it is necessary to set up mappings among these IDs in order to construct, from the evaluated databases, networks using the PPIDs employed in the MASC reference network. The ID mappings were mined from various databases including the Ensembl EnsMart Genome Browser (http://www.ensembl.org/Multi/martview), UniProtKB/Swiss-Prot (http://www.ebi.ac.uk/swissprot/), GenBank (http://www.ncbi.nlm.nih.gov/Genbank/GenbankSearch.html), etc. The results are listed below. Only the PPIDs appearing in the MASC network are given here. Note that PPIDs are the five-character identifiers beginning with "A".

ID mapping from PPID to LocusLink:

A0001 2902 A0001 14810 A0002 2903 A0002 14811 A0003 2904 A0003 14812 A0007 88 A0007 11472 A0008 801 A0008 805 A0008 808 A0008 12314 A0008 12315 A0008 12313 A0009 10142 A0009 100986 A0010 6712 A0010 20742 A0011 815 A0011 12322 A0012 5335 A0012 18803 A0013 1742 A0013 13385 A0014 1740 A0014 23859 A0015 1739 A0015 13383 A0016 1741 A0016 53310 A0017 10376

A0018 6714 A0018 20779 A0020 5781 A0020 19247 A0022 2898 A0022 14806 A0024 8831 A0024 221504 A0030 2185 A0030 19229 A0033 5911 A0033 76108 A0037 5864 A0039 6857 A0039 20979 A0040 6804 A0040 20907 A0075 84258 A0075 50944 A0081 22941 A0083 2017 A0083 13043 A0086 7082 A0086 21872 A0091 71 A0095 1759 A0095 13429 A0103 2911 A0103 14816 A0104 9456 A0104 26556

A0106 18783 A0107 9229 A0112 1499 A0112 12387 A0114 493 A0117 4905 A0117 18195 A0118 84867 A0118 19259 A0119 1213 A0123 5530 A0126 2885 A0126 14784 A0132 5515 A0134 4133 A0134 17756 A0138 5894 A0138 110157 A0143 9495 A0144 5581 A0144 18754 A0145 5566 A0145 18747 A0149 4644 A0149 17918 A0152 5901 A0153 5290 A0153 18706 A0162 9479 A0162 19099 A0164 26400

A0166 27347 A0166 53416 A0168 5579 A0181 5594 A0188 26060 A0188 72993 A0194 5879 A0196 4763 A0196 18015 A0200 3265 A0200 15461 A0203 5898 A0204 2316 A0214 8997 A0215 2596 A0215 14432 A0216 7871 A0216 83997 A0217 2521 A0217 233908 A0219 1000 A0219 12558 A0222 2932 A0222 56637 A0231 10093 A0231 26140 A0231 68089 A0231 101100 A0232 10094 A0232 56378 A0233 10109


A0238 6812 A0238 20910 A0259 5582 A0260 5170 A0260 18607 A0262 4842 A0265 5595 A0265 26417 A0266 5604 A0266 26396 A0266 26395 A0267 5605 A0267 26396 A0267 26395 A0268 6197 A0268 110651 A0276 11113 A0276 12704 A0277 23237 A0277 11838 A0285 8927 A0285 12217 A0286 7158 A0287 1846 A0287 319520 A0288 9145 A0288 20972 A0289 5536 A0289 19060 A0290 3897 A0290 16728 A0291 1828 A0292 9118 A0292 226180 A0298 94104 A0327 2915 A0330 23236 A0330 18795 A0331 2597 A0331 14433 A0331 407972 A0333 6616 A0333 20614 A0340 476 A0340 232975 A0340 11928 A0342 478 A0350 4155 A0350 17196 A0351 4340 A0351 17441

A0352 5354 A0352 18823 A0363 5862 A0363 59021 A0364 5870 A0364 19346 A0365 326624 A0365 58222 A0366 6129 A0366 90193 A0366 19989 A0367 6137 A0367 270106 A0368 23521 A0368 22121 A0369 3945 A0369 106557 A0369 16832 A0369 16828 A0370 3939 A0370 106557 A0370 16832 A0370 16828 A0371 5052 A0371 18477 A0372 7001 A0372 21672 A0373 53616 A0375 4747 A0375 18039 A0376 4741 A0376 18040 A0377 498 A0377 11946 A0379 509 A0379 11949 A0395 291 A0396 292 A0396 11740 A0417 830 A0417 12343 A0418 832 A0418 12345 A0419 2934 A0419 227753 A0421 377 A0421 11842 A0422 6506 A0422 20511 A0423 2806 A0423 14719

A0424 2752 A0424 14645 A0425 7416 A0425 22333 A0426 7417 A0426 440866 A0426 22334 A0427 5313 A0427 18770 A0428 230 A0429 50 A0429 11429 A0430 4001 A0430 16906 A0431 4729 A0432 9588 A0432 11758 A0433 7167 A0433 21991 A0434 7422 A0434 22339 A0435 4356 A0435 13384 A0436 4355 A0436 50997 A0437 12308 A0437 12307 A0438 1627 A0438 56320 A0439 12034 A0440 9211 A0441 2171 A0441 16592 A0442 1808 A0442 12934 A0443 8604 A0443 78830 A0444 10919 A0444 110147 A0445 6252 A0445 104001 A0446 1434 A0446 110750 A0466 208 A0466 11652 A0470 121512 A0470 224014 A0473 10369 A0473 54376 A0473 12300 A0476 572

A0476 12015 A0484 3667 A0484 16367 A0488 405 A0488 11863 A0489 5602 A0489 26414 A0491 5606 A0491 26397 A0605 1737 A1851 816 A1851 12323 A1925 5211 A1925 18641 A2007 15482 A2084 55054 A2084 77040 A2331 4629 A2331 77579 A2331 71960 A2331 17880 A2343 11474 A2344 81 A2344 60595 A3166 4627 A3166 17880 A3166 17886 A3202 4625 A3202 17888 A3202 140781 A3773 79751 A3773 68267 A3958 4430 A3958 17912 A3961 4628 A3961 17880 A3961 17886 A3968 64837 A3968 16594 A3976 3983 A3976 226251 A4014 11005 A4015 167691 A4015 75782 A4048 103910 A4048 67938 A4049 4637 A4049 17904


ID mapping from PPID to SwissProt:

A0001 Q05586 A0001 P35439 A0001 P35438 A0002 P35436 A0002 Q12879 A0002 Q00959 A0003 Q01097 A0003 Q13224 A0003 Q00960 A0007 Q9JI91 A0007 P35609 A0008 P02593 A0009 Q9JHE0 A0009 Q99P24 A0009 Q99996 A0010 Q62261 A0010 Q01082 A0010 Q9QWN8 A0011 P11275 A0011 Q9UL21 A0011 P11798 A0012 Q62077 A0012 P19174 A0012 P10686 A0013 Q62108 A0013 P31016 A0013 P78352 A0014 Q63622 A0014 Q91XM9 A0014 Q15700 A0015 Q62696 A0015 Q12959 A0015 Q62402 A0016 Q62936 A0016 P70175 A0016 Q92796 A0017 P04687 A0017 P02551 A0018 P05480 A0018 P12931 A0018 Q9WUD9 A0020 P41499 A0020 P35235 A0020 Q06124 A0022 P39087 A0022 P42260 A0022 Q13002 A0024 Q9UGE2 A0024 Q9QUH6 A0030 Q9QVP9

A0030 P70600 A0030 Q14289 A0033 Q9D3D5 A0033 P10114 A0037 P20336 A0037 P05713 A0039 P21707 A0039 P21579 A0039 P46096 A0040 P32851 A0040 Q16623 A0040 O35526 A0075 Q9WV48 A0075 Q9Y566 A0081 Q9WUV9 A0081 Q9P1 A0083 O70420 A0083 Q60598 A0083 Q14247 A0086 Q07157 A0086 P39447 A0091 P02571 A0095 P39053 A0095 Q05193 A0095 P21575 A0103 Q9EPV6 A0103 Q13255 A0103 P23385 A0104 O96003 A0104 Q8K3E1 A0104 Q9QUJ8 A0106 P50393 A0106 P47713 A0106 P47712 A0107 Q9D415 A0107 P97836 A0107 O14490 A0112 Q02248 A0112 Q9WU82 A0112 P35222 A0114 Q64542 A0114 P23634 A0117 P46460 A0117 P46459 A0117 Q9QUL6 A0118 P54830 A0118 P54829 A0118 P35234 A0119 P11442 A0119 Q00610

A0123 P20652 A0123 Q9WUV7 A0123 Q08209 A0124 P25388 A0126 Q60631 A0126 P29354 A0132 P13353 A0132 P05323 A0134 P15146 A0134 P20357 A0134 P11137 A0138 Q99N57 A0138 P04049 A0138 P11345 A0143 P24588 A0143 P70593 A0144 Q02156 A0144 P09216 A0144 P16054 A0145 P05132 A0145 P17612 A0145 P27791 A0147 P08129 A0149 Q9QYF3 A0149 Q9Y4I1 A0149 Q99104 A0152 P17080 A0153 Q9Z1L0 A0153 P42336 A0153 P42337 A0162 Q9WVI9 A0162 Q9R237 A0162 Q9UQF2 A0164 O35406 A0164 O14733 A0166 O88506 A0166 Q9Z1W9 A0166 Q9UEW8 A0168 P04410 A0168 P04411 A0168 P05771 A0181 P27703 A0181 P28482 A0188 AAM55531 A0188 Q9G1 A0194 P15154 A0194 Q923X0 A0196 Q04690 A0196 P21359 A0196 P97526

A0200 P01112 A0200 Q61411 A0200 P20171 A0203 P11233 A0203 P05810 A0204 P21333 A0204 Q9JJ38 A0214 P97924 A0214 O60229 A0215 P07936 A0215 P06837 A0215 P17677 A0216 Q8VC86 A0216 Q9HCH1 A0217 P35637 A0217 P56959 A0219 P15116 A0219 Q9Z1Y3 A0219 P19022 A0222 P18266 A0222 P49841 A0222 Q9WV60 A0231 O15509 A0232 Q9JM76 A0232 O15145 A0233 Q9CVB6 A0233 O15144 A0238 O08599 A0238 Q64320 A0259 P05697 A0259 P05129 A0260 O55173 A0260 O15530 A0260 Q9Z2A0 A0262 P29476 A0262 P29475 A0262 Q9Z0J4 A0265 Q91YW5 A0265 P21708 A0265 P27361 A0266 Q01986 A0266 P31938 A0266 Q02750 A0267 Q63932 A0267 P36506 A0267 P36507 A0268 P18654 A0268 P51812 A0276 Q9QX19 A0276 O88938


A0276 O14578 A0277 Q9UJW6 A0277 Q63053 A0277 Q9WV31 A0281 Q9Y438 A0281 Q8K2R1 A0285 O43161 A0285 O88778 A0285 O88737 A0286 P70399 A0286 Q12888 A0287 Q13115 A0287 Q62767 A0288 O55100 A0288 O43759 A0288 Q62876 A0289 O35299 A0289 P53041 A0289 P53042 A0290 P32004 A0290 Q05695 A0290 P11627 A0291 Q61495 A0291 Q02413 A0292 Q16352 A0292 P23565 A0292 P46660 A0298 Q9Y5B6 A0298 P58501 A0327 P31424 A0327 P41594 A0330 P10687 A0330 Q9NQ66 A0330 Q9Z1B3 A0331 P04797 A0331 P04406 A0331 P16858 A0333 P13795 A0340 P06685 A0340 P05023 A0340 Q8VDN2 A0342 P13637 A0342 AAH37206 A0342 P06687 A0350 P02688 A0350 P04370 A0350 P02686 A0351 Q63345 A0351 Q16653 A0351 Q61885 A0352 P06905 A0363 P05712 A0363 P08886

A0363 P53994 A0364 Q9WVB1 A0364 P35279 A0364 P20340 A0365 Q9JKM7 A0365 Q96AX2 A0366 P05426 A0366 P14148 A0366 P18124 A0367 P41123 A0367 P47963 A0367 P26373 A0368 P35427 A0368 P19253 A0368 P40429 A0369 P42123 A0369 P16125 A0369 P07195 A0370 P04642 A0370 P06151 A0370 P00338 A0371 Q63716 A0371 P35700 A0371 Q06830 A0372 P35704 A0372 Q61171 A0372 P32119 A0373 Q9P0K1 A0373 Q9R1V6 A0375 P19527 A0375 P08551 A0375 P07196 A0376 P08553 A0376 P12839 A0376 P07197 A0377 P15999 A0377 Q03265 A0377 P25705 A0379 Q9ERA8 A0379 P35435 A0379 P36542 A0395 P48962 A0395 P12235 A0395 Q05962 A0396 P51881 A0396 P05141 A0396 Q09073 A0417 P47755 A0417 P47754 A0418 P47756 A0418 P47757 A0419 P13020 A0419 P06396

A0421 P16587 A0422 P43004 A0422 P43006 A0422 P31596 A0423 P00507 A0423 P05202 A0423 P00505 A0424 P09606 A0424 P15105 A0424 P15104 A0425 Q9Z2L0 A0425 Q60932 A0425 P21796 A0426 P81155 A0426 Q60930 A0426 P45880 A0427 P12928 A0427 P53657 A0427 P30613 A0428 P09117 A0428 Q9DBA4 A0428 P09972 A0429 Q9ER34 A0429 Q99KI0 A0429 Q99798 A0430 P70615 A0430 P14733 A0430 P20700 A0431 P19234 A0431 Q9D6J6 A0431 P19404 A0432 O35244 A0432 P30041 A0432 O08709 A0433 P48500 A0433 P00938 A0433 P17751 A0434 P16612 A0434 P15692 A0434 Q00731 A0435 O88954 A0435 Q13368 A0435 O88910 A0436 Q9WV34 A0436 O88953 A0436 Q14168 A0437 P47728 A0437 Q96BK4 A0437 Q08331 A0438 Q07266 A0438 Q16643 A0438 Q9QXS6 A0439 Q99623

A0439 Q61336 A0440 O95970 A0440 Q9JIA1 A0441 P55053 A0441 Q05816 A0441 Q01469 A0442 O08553 A0442 P47942 A0442 Q16555 A0443 O75746 A0444 Q9UQL8 A0444 Q9Z148 A0445 Q64548 A0445 Q16799 A0446 P55060 A0446 Q9ERK4 A0466 P47197 A0466 Q60823 A0466 P31751 A0470 Q96M96 A0470 O88387 A0470 Q91ZT5 A0473 Q9Y698 A0473 Q99PR9 A0473 O88602 A0476 O35147 A0476 Q92934 A0476 Q61337 A0484 P35569 A0484 P35568 A0484 P35570 A0488 P41739 A0488 P27540 A0488 P53762 A0489 Q61831 A0489 P49187 A0489 P53779 A0491 P46734 A0491 O09110 A0605 P08461 A0605 P10515 A0605 Q8R339 A1851 P08413 A1851 Q13554 A1851 P28652 A1925 P30835 A1925 P17858 A1925 P12382 A2007 P08107 A2007 Q07439 A2007 P17879 A2084 Q96JV5 A2084 Q9DB63


A2331 P35749 A2331 Q63862 A2331 O08638 A2343 Q8R4I6 A2343 O88990 A2343 Q08043 A2344 Q9QXQ0 A2344 O43707 A2344 P57780 A3166 Q8VDD5

A3166 Q62812 A3166 P35579 A3202 P02564 A3202 Q91Z83 A3202 P12883 A3773 Q9D6M3 A3773 Q9H936 A3899 Q9Z252 A3899 O88951 A3899 Q9HAP6

A3958 Q05096 A3958 P46735 A3958 O43795 A3961 P35580 A3961 Q9JLT0 A3961 Q61879 A3968 O88448 A3968 Q9H0B6 A3976 Q8K4G5 A4014 Q9D6R9

A4014 Q9NQ38 A4015 Q9BWX7 A4015 Q9D5J9 A4048 Q13182 A4048 Q9CQL8 A4048 Q63781 A4049 P16475 A4049 AAH26760 A4049 Q64119

Because the DIP database used in this project covers only three species (Caenorhabditis elegans, Drosophila melanogaster and yeast), it contains no network corresponding to MASC, and its ID mapping is therefore not listed here.



Appendix D PPID and Protein Name

For convenience of reading, the PPIDs and the corresponding protein names used in the MASC network are listed here:

A0001 NR1 A0002 NR2A A0003 NR2B A0007 ACTN A0008 CALM A0009 PRKA9 A0010 SPNB A0011 CAMK2A A0012 PLCg-1 A0013 DLG4 A0014 DLG2 A0015 DLG1 A0016 DLG3 A0017 Tubulin A0018 Src A0020 PTP1D A0022 GRIK2 A0024 SynGAP A0030 FAK2 A0033 Rap2 A0037 Rab3 A0039 SYT1 A0040 STX A0041 A0075 SPANK1 A0081 CortBP-1 A0083 CTTN A0086 ZO-1 A0091 ACT A0095 DNM1 A0103 mGluR1a A0104 HOMER1 A0106 cPLA2 A0107 DLGAP1 A0112 b-catenin A0114 ATP2B4 A0117 NSF A0118 PTPN5 A0119 CLTC A0123 PP2B A0124 RACK-1 A0126 GRB2 A0132 PPP2CA

A0134 MTAP2 A0138 RAF1 A0143 AKAP5 A0144 PKCepsilon A0145 PRKACA A0147 PPP1CA A0149 Myosin (V) A0152 RAN A0153 A0162 Jip-1 A0164 MKK7 A0166 HPK1 A0168 PKCbeta A0181 Erk2 A0188 APPL A0194 Rac1 A0196 NF-1 A0200 H-Ras A0203 RalA A0204 Filamin A0214 HAPIP A0215 GAP43 A0217 FUS A0219 N-cadherin A0222 GSK3 beta A0231 A0232 A0233 A0238 STXBP1 A0259 PKCgamma A0260 PDK-1 A0262 nNOS A0265 Erk1 A0266 MEK1 A0267 MEK2 A0268 Rsk-2 A0276 CIT A0277 Arg3.1 A0285 Bassoon A0286 p53BP1 A0287 MKP2 A0288 Synaptogyrin A0289 PP5


A0290 L1CAM A0291 DSG A0292 INA A0327 mGluR5 A0330 PLCb A0331 GAPDH A0333 SNAP25 A0340 ATP1A1 A0342 ATP1A3 A0351 MOG A0352 PLP1 A0363 Rab2 A0364 RAB6A A0365 Rab37 A0366 RPL7 A0367 RPL13 A0369 LDHB A0370 A0371 TDPX2 A0373 ADAM22 A0375 Neurofilament triplet L protein A0376 Neurofilament Triplet M A0377 ATP5A1 A0379 ATP5C A0395 SLC25A4 A0396 SLC25A5 A0417 CAPZ alpha A0418 CAPZ beta A0419 Gelsolin A0421 ARF3 A0422 SLC1A2 A0423 Asp aminotransfersase A0424 Gln synthetase A0425 MVDAC-1 A0426 VDAC-2 A0428 ALDOC A0429 Aconitase A0430 A0431 NADH-ubiquinone

oxidoreductase 24 kDa subunit, mitochondrial A0433 Triosephosphate Isomerase A0434 EGF-164 A0435 DLGH3 A0436 DLGH2 A0437 Calretinin A0438 DBN1 A0439 D-Prohibitin A0440 Leucine-rich Glioma-inactivated 1 protein A0441 E-FABP A0442 DPYSL2 A0443 SLC25A12 A0444 G9A A0445 S-REX A0446 A0466 AKT2 A0470 Frabin A0473 Stargazin A0476 Bad A0484 IRS-1 A0488 HIF-1 A0489 MAPKp49 A0491 MAP2K3 A0605 DLAT A0900 A1851 CaMKII beta A1925 Phosphofructokinase B A2007 HSPA1 A2331 MYH11 A2344 ACTN4 A3166 MYH9 A3773 GC1 A3899 VELI2 A3958 MYO1B A3961 MYH10 A3968 KLC2 A3976 ABLIM1


Bibliography

[Adamic 2001] L. A. Adamic, R. M. Lukose, A. R. Puniyani, and B. A. Huberman. (2001). Search in power-law networks. Phys. Rev. E 64, 046135.

[Albert 2000] Albert, R., Jeong, H. & Barabasi, A. L. (2000). Error and attack tolerance of complex networks. Nature 406, 378–382.

[Andrew 2005] Andrew Pocklington. (2005). Identifying functional components in biological networks. To be published.

[Armstrong 2005] J.D. Armstrong, H. Husi, M. Cumiskey, T.J. O’Dell, P.M. Visscher, R. Emes, A.J. Pocklington, W. Blackstock, J. Choudhary and S.G.N. Grant. (2005). Complex and network analysis of synapse proteomes. To be published.

[Barabasiasi 1999] Barabasi, A. L. & Albert, R. (1999). Emergence of scaling in random networks. Science 286, 509–512.

[Cohen 2000] R. Cohen, K. Erez, D. ben-Avraham, and S. Havlin. (2000). Resilience of the Internet to random breakdowns. Phys. Rev. Lett. 85, 4626–4628.

[Diego 2005] D. di Bernardo et al. (2005). Chemogenomic profiling on a genome-wide scale using reverse-engineered gene networks. Nature Biotechnology 23.

[Efron 1979] B. Efron. (1979). Computers and the theory of statistics: thinking the unthinkable. SIAM Review 21, 460–480.

[Fox 2001] Fox, J. J. & Hill, C. C. (2001) From topology to dynamics in biochemical networks. Chaos 11, 809–815.

[Freeman 1977] L. Freeman, A set of measures of centrality based upon betweenness. Sociometry 40, 35–41 (1977).

[Garey 1979] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, San Francisco (1979).

[Girvan 2002] M. Girvan and M. E. J. Newman. (2002). Community structure in social and biological networks. Proc. Natl. Acad. Sci. USA 99, 8271–8276.

[Goh 2001] K.-I. Goh, B. Kahng, and D. Kim, Universal behavior of load distribution in scale-free networks. Phys. Rev. Lett. 87, 278701 (2001).

[Grant 2004] S.G.N. Grant, H. Husi, J. Choudhary, M. Cumiskey, W. Blackstock and J.D. Armstrong. (2004). The organization and integrative function of the post-synaptic proteome. Excitatory-Inhibitory Balance: Synapses, Circuits, Systems, 13–44.

[Ito 2000] Ito, T., Tashiro, K., Muta, S., Ozawa, R., Chiba, T. et al. (2000). Toward a protein-protein interaction map of the budding yeast: a comprehensive system to examine two-hybrid interactions in all possible combinations between the yeast proteins. Proc. Natl. Acad. Sci. USA 97, 1143–1147.


[Jeong 2001] Jeong, H., Mason, S. P., Barabási, A.-L. & Oltvai, Z. N. (2001). Lethality and centrality in protein networks. Nature 411, 41–42.

[Jeong 2000] Jeong, H. et al. (2000). The large-scale organization of metabolic networks. Nature 407, 651–654.

[Kauffman 1969] Kauffman, S. A. (1969). Homeostasis and differentiation in random genetic control networks. Nature 224, 177–178.

[Kernighan 1970] B. W. Kernighan and S. Lin, An efficient heuristic procedure for partitioning graphs. Bell System Technical Journal 49, 291–307 (1970).

[Kitano 2002] Kitano, H. (2002). Systems biology: a brief overview, Science 295, 1662–1664.

[Krapivsky 2001] P. L. Krapivsky and S. Redner. (2001). Organization of growing random networks. Phys. Rev. E 63, 066123.

[Marelo 2002] Marcelo Arenas and Leonid Libkin. (2002). A normal form for XML documents. In Proceedings of PODS 2002, ACM.

[Maslov 2002] Maslov, S., Sneppen, K. (2002). Specificity and stability in topology of protein networks, Science 296, 910–913.

[Michael 2002] Michael Benedikt, Chee Yong Chan, Wenfei Fan et al. (2002). DTD-directed publishing with attribute translation grammars. In Proceedings of VLDB 2002.

[Newman 2004] M. E. J. Newman and M. Girvan. (2004). Finding and evaluating community structure in networks. Phys. Rev. E 69, 026113.

[Newman 2003] M. E. J. Newman, Mixing patterns in networks. Phys. Rev. E 67, 026126 (2003).

[Newman 2003]2 M. E. J. Newman. A measure of betweenness centrality based on random walks. cond-mat/0309045, 2003.

[Newman 1999] M. E. J. Newman and G. T. Barkema. (1999). Monte Carlo Methods in Statistical Physics. Oxford University Press.

[Peter 2001] Peter Buneman, Susan Davidson, Wenfei Fan et al. (2001). Keys for XML. In Proceedings of the 10th International World Wide Web Conference (WWW 2001), ACM.

[Savageau 1971] Savageau, M. A. (1971). Parameter sensitivity as a criterion for evaluating and comparing the performance of biochemical systems. Nature 229, 542–544.

[Schwikowski 2000] Schwikowski, B., Uetz, P. and Fields, S. (2000). A network of protein-protein interactions in yeast. Nature Biotech. 18: 1257-1261.

[Scott 2000] J. Scott, Social Network Analysis: A Handbook. Sage Publications, London, 2nd edition (2000).

[Susan 2002] Susan Davidson, Wenfei Fan et al. (2003). Propagating XML constraints to relations. In Proceedings of the 19th International Conference on Data Engineering (ICDE 2003).


[Tova 1998] Tova Milo, Sagit Zohar. Using Schema Matching to Simplify Heterogeneous Data Translation. VLDB 1998.

[Vazquez 2004] Vazquez, A., Dobrin, R., Sergi, D., Eckmann, J.-P., Oltvai, Z. N. and Barabasi, A.-L. (2004). The topological relationship between the large-scale attributes and local interaction patterns of complex networks. Proc. Natl. Acad. Sci. USA.

[Watts 1998] D. J. Watts and S. H. Strogatz. (1998). Collective dynamics of ‘small-world’ networks. Nature 393, 440–442.

[Wenfei 2001] Wenfei Fan and Leonid Libkin. (2001). On XML integrity constraints in the presence of DTDs. In Proceedings of PODS 2001, ACM.

[Yannis 1996] Yannis Papakonstantinou, Serge Abiteboul, Hector Garcia-Molina. Object Fusion in Mediator Systems. VLDB 1996.

[Yi 2000] Yi, T. M., Huang, Y., Simon, M. I. & Doyle, J. (2000). Robust perfect adaptation in bacterial chemotaxis through integral feedback control. Proc. Natl. Acad. Sci. USA 97, 4649–4653.

[Zanzoni 2002] Zanzoni, A., Montecchi-Palazzi, L. et al. (2002). MINT: a Molecular INTeraction database. FEBS Lett. 513(1), 135–140.

