A Study of
Differential Protein Interaction Network
Haogang ZHU
MSc Dissertation Presented to
School of Informatics
College of Science and Engineering
University of Edinburgh
July 16 2005
Authorship declaration
I, Haogang ZHU, confirm that this dissertation and the work presented in it are my
own achievement.
1. Where I have consulted the published work of others this is always clearly
attributed;
2. Where I have quoted from the work of others the source is always given. With the
exception of such quotations this dissertation is entirely my own work;
3. I have acknowledged all main sources of help;
4. If my research follows on from previous work or is part of a larger collaborative
research project I have made clear exactly what was done by others and what I have
contributed myself;
5. I have read and understand the penalties associated with plagiarism.
Signed:
Date: 16 Aug, 2005
Matriculation no: 0455504
Acknowledgement
I would like to express my sincere gratitude to my supervisor, Dr. Douglas Armstrong,
for his support, patience, and encouragement throughout my graduate studies. It is not
often that one finds a supervisor and colleague who is always willing to listen to the
small problems and to provide support in whatever way he can.
My thanks also go to Dr. Andrew Pocklington for his sensible and intelligent
suggestions and explanations during the investigation of the algorithms. His advice
shaped the ideas presented here and was essential to the completion of this
dissertation.
My parents, Jinping and Hong, receive my deepest gratitude and love for their
dedication and the many years of support during my current and previous studies that
provided the foundation for this work.
Last but not least I am deeply indebted to my girlfriend, Hui, for having the patience to
read and correct such a long technical dissertation from the other side of the world.
Without her unfailing support and encouragement I could never have completed this
work.
Abstract
Following current research that treats individual molecular networks from a systems
biology perspective, this dissertation seeks a way of estimating differential protein
interaction networks derived from various public and commercial databases, compared
against the reference established by SynProNet, which presented a molecular network
model for the organization and function of the proteome within the postsynaptic
terminal. To compare different networks, two algorithms are adopted, addressing
topological properties and biological significance respectively. Cluster comparison
considers a molecular interaction network to consist of clusters: the important
components that dominate the properties of a network and therefore characterize the
difference between divergent networks. The algorithm discovers the cluster structure
underlying a network by divisive clustering. The concept of a concise graph is
proposed to express the original network compactly without loss of information, and
is utilized for efficient cluster discovery. The differential networks are finally
contrasted by calculating the probability of correct clustering with respect to the
reference network. Functional component comparison is then implemented to verify the
conclusion from cluster comparison and to investigate a comparison method with
biological significance. This algorithm predicts sets of molecules with a potential
cognitive phenotype for each network. The consistency between these predicted
molecules and those generated in the reference network is then used as an estimate.
Data retrieval from the various databases is guided by DTDs generated by propagating
functional constraints in the relational schema into the XML domain.
Table of Contents
Chapter 1 Introduction and background mining ................................................. 1
1.1 Introduction........................................................................................................ 1
1.2 Related work ...................................................................................................... 2
1.2.1 Robustness of network ............................................................................. 3
1.2.2 Current network models........................................................................... 4
1.2.3 Topological property of protein interaction network ............................... 5
Chapter 2 Network comparison algorithms.......................................................... 7
2.1 Cluster comparison............................................................................................. 7
2.1.1 Clique and concise graph ............................................................................ 9
2.1.2 Calculating shortest path betweenness...................................................... 12
2.1.3 Cluster division ......................................................................................... 20
2.1.4 Cluster modularity..................................................................................... 22
2.1.5 Searching cluster division ......................................................................... 25
2.1.6 Cluster structure comparison..................................................................... 28
2.2 Functional component comparison .................................................................. 30
2.2.1 Extrapolating probability .......................................................................... 32
2.2.2 Scoring sub-network ................................................................................. 34
2.2.3 Functional component searching .............................................................. 37
2.2.4 Functional component comparison ........................................................... 38
Chapter 3 Data retrieve......................................................................................... 41
3.1 Data overview and constraint driven data retrieve........................................... 41
3.2 Data retrieve implementation ........................................................................... 42
Chapter 4 Result presentation and discussion .................................................... 45
4.1 Cluster comparison........................................................................................... 45
4.1.1 Single database estimation ........................................................................ 45
4.1.2 Robustness of cluster comparison algorithm ............................................ 49
4.1.3 Databases collaboration............................................................................. 51
4.2 Functional component comparison .................................................................. 52
4.2.1 Component score and z-score ................................................................... 52
4.2.2 Single database comparison ...................................................................... 55
4.2.3 Robustness of functional component comparison..................................... 58
4.2.4. Database collaboration ............................................................................. 59
4.3 Discussion ........................................................................................................ 60
Chapter 5 Conclusion and future work ............................................................... 63
5.1 Conclusion........................................................................................................ 63
5.2 Future work ...................................................................................................... 65
Appendix A Constraint Driven Data Retrieve .................................................... 67
A.1 XML functional dependency and key constraint propagation ........................ 67
A.1.1 Formalism of DTD and XML .................................................................. 67
A.1.2 DTD cast and XML functional dependency............................................. 70
A.1.3 Key constraint propagation ...................................................................... 75
A.2 Foreign key constraint propagation................................................................. 77
A.2.1 Foreign key driven pre-mapping .............................................................. 77
A.2.2 One to many relationship and weak entity ............................................... 78
A.2.3 Many to many relationship....................................................................... 80
A.2.4 More complex relationship ...................................................................... 84
A.2.5 Formalized pre-mapping algorithm.......................................................... 87
A.2.6 Post-mapping............................................................................................ 91
Appendix B Estimated databases in project ....................................................... 97
Appendix C ID mapping ....................................................................................... 99
Appendix D PPID and Protein Name ................................................................ 105
Bibliography ........................................................................................................... 107
Chapter 1 Introduction and background mining
1.1 Introduction
Following the enthusiastic study of the genome, research on proteins, especially from
a systems biology perspective, has become an attractive approach to various unsolved
problems. The SynProNet (Synaptic Proteome Network) project [Grant 2004] carried out
innovative proteomic studies and identified about 700 proteins comprising the
proteome of the postsynaptic terminal of central nervous system synapses. Beyond
identifying specific proteins, the authors also presented a molecular network model
for the organization and function of the proteome within the postsynaptic terminal.
The interaction network of the post-synaptic signal transduction complex MASC
(MAGUK Associated Signaling Complex), the central database used in this project,
forms a well-defined component within the post-synaptic density [Armstrong 2005]: the
fraction that co-precipitates with the scaffolding protein PSD-95 and the NMDA
receptor sub-unit NR2A. The complex plays a central role in the processing of
synaptic transmission via the NMDA receptor, which leads to the activation of
signaling pathways underlying cognitive functions.
Topologically, the synaptic proteome network has a scale-free architecture that is
evolutionarily conserved. Activation of channels and receptors (such as NMDA
receptors or voltage-dependent calcium channels) can initiate signaling in the MASC,
which then orchestrates the multiple pathways and cellular mechanisms for the
expression of plasticity [Grant 2004]. It is this scale-free network architecture
that gives rise to the robustness and complexity observed in molecular studies of
synaptic plasticity, and that allows distinct patterns of activity in the interaction
network.
Most previous research on networks of this kind has focused on the topological and
functional properties of a single protein interaction network. But what happens when
one faces several differing protein interaction networks for the same cell or tissue?
This situation is quite practical: anyone who tries to retrieve a protein interaction
network from several relevant databases may well obtain fairly different results.
Faced with this dilemma, the choice of network becomes critically important for
further research, because an inappropriate decision can prove quite misleading.
Although accurate and complete, the curation of MASC has cost tens of thousands of
pounds, spent on licenses for commercial databases as well as on staff for
maintenance. This project therefore intends to estimate the quality of networks
automatically derived from existing public and commercial databases, compared with
the MASC database. The result can be used to evaluate the extent to which MASC
deserves the money spent on it. Moreover, MASC could be maintained automatically if
there is little difference between a computer-generated network and the manually
curated MASC.
This project will therefore focus on these differential protein interaction networks
and develop a systematic approach for comparing them. Based on the manually
maintained MASC database, the investigated algorithms will be used to evaluate the
protein interaction networks queried from various databases. The estimation results
can further be used to construct new interaction networks from the existing data
sources and to evaluate the networks so constructed.
1.2 Related work
So far, there is little literature on a precise semantic definition of the
"difference" between protein interaction networks, or between networks in general,
which makes the comparison of differential networks an ambiguous task. Researchers
have, however, long focused on general properties shared by all sorts of networks and
graphs. Recent studies of network structure have concentrated on a small number of
properties that seem to be common to many networks and can be expected to affect the
functioning of networked systems in a fundamental way. Among these, perhaps the best
studied are the "small-world effect", network transitivity or "clustering" [Watts
1998], and degree distributions [Barabasi 1999]. Many other properties, however, have
also been examined; examples include resilience to the deletion of network nodes
[Albert 2000] [Cohen 2000], navigability or searchability of networks [Adamic 2001],
and community structure [Girvan 2002] [Newman 2004]. Two perspectives are adopted
here, the robustness and the topological properties of networks, which are summarized
and related to this project in the following sections. Other properties will be
introduced as they arise in the discussion.
1.2.1 Robustness of network
Complex protein and genetic networks are large systems of interacting molecular
elements that control the propagation and regulation of various biological signals
[Kitano 2002]. They perform essential classes of biological computation, reflecting
vital cellular processes such as the regulation of the cell cycle and gene expression.
Because some intracellular processes are important for the survival of the cell, they
need to be robust to variations in the environment. For example, variation in the
concentration of network components in a metabolic network may affect numerous
processes [Jeong 2000]. The complex architecture of such networks raises the question
of the stability of their functioning and topology. How can such complex dynamical
systems achieve important cellular tasks and remain stable against variations of
their internal parameters? The answers to these questions help to "filter out" the
most essential parts of a network, and enable us to reveal the substantive difference
between two networks rather than a simple difference of nodes and edges.
Three decades ago, [Savageau 1971] hypothesized that robustness is an essential
property of some genetic networks, whose functions are preserved even when some of
their components are changed. This draws our attention to the robust parts of a
network when contrasting different networks with respect to the same tissue or cell.
More recently, several compelling theoretical studies [Yi 2000] [Albert 2000]
demonstrated that the key processes of specific intracellular networks are robust to
variations of biochemical parameters. Reliability has thus emerged as a fundamental
concept for characterizing the dynamical stability of biological systems.
At this point, one question arises: are there universal design patterns that
determine the reliability of a given class of networks? If so, it would be possible
to predict and contrast the dynamical properties of these networks even without full
knowledge of the molecular details [Fox 2001]. In particular, it is important to
characterize dynamical reliability as the ability of a network to perform a sequence
of biological tasks in the presence of perturbations. As will be seen in the next
chapter, the concept of reliability plays an essential role in the network comparison
algorithm: the construction of the "concise graph" from the original interaction
network, and the identification of clusters within a network by gradual perturbation
of the interactions (edges), are both based on this notion of reliability.
1.2.2 Current network models
Specifically, a method for modeling molecular activities by a simple two-state model
was originally created for the study of the dynamics of genetic networks [Kauffman
1969], but was later applied to evolution and social models and became a prototype for
the study of dynamical systems. In this model, the network’s architecture follows a
random topology in which every element interacts, with the same probability, with K
other elements. Each network element has two functional states: active and inactive
states as it will be used in functional component comparison algorithm in Chapter 2.
The state of a given element is determined by its interactions with other elements of
the network. An overall parameter for the whole system is the probability p that an
element of the network becomes active after interacting with other elements. This
probability can be inferred by regression model according to the reaction rates and the
relative concentration of each element. The property of the dynamical process within a
given network is determined by its topological parameter K and its biochemical
parameter p. Under the simple assumption of a random network topology, a pretty rich
and unexpected dynamical behavior of the network was found [Kauffman 1969].
During the evolution of the system, the network elements pass through different states
until they reach a cyclic behavior. Different cycles are possible, which represent a
variety of intracellular tasks. In general, there exist two situations in network activity:
chaotic and robust. In the chaotic status a perturbation in the state of a single element
can make the system jump from one cycle to another. In the robust status all such
perturbations die out over time. This model is inadequate for further investigating
the functioning of biological networks with heterogeneous architecture, but it
suggests an appealing idea: compare networks by contrasting their activated
components, which is one of the algorithms used in the next chapter.
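To make the two-state dynamics concrete, the model can be sketched as a minimal random Boolean network simulation. This is an illustrative Python sketch under simplifying assumptions; the parameterization and the cycle-detection bookkeeping are the sketch's own, not taken from [Kauffman 1969]. Each of n elements reads k random inputs through a random Boolean function that outputs the active state with probability p, and the trajectory is followed until a state repeats, i.e. a cycle is reached.

```python
import random

def simulate_boolean_network(n=8, k=2, p=0.5, seed=1):
    """Kauffman-style random Boolean network: n elements, each wired to
    k random inputs with a random Boolean function that is 'active'
    with probability p. Returns the length of the cycle the dynamics
    settle into (guaranteed, since the state space is finite)."""
    rng = random.Random(seed)
    inputs = [rng.sample(range(n), k) for _ in range(n)]
    # one random truth table per element: input pattern -> next state
    tables = [{pattern: int(rng.random() < p) for pattern in range(2 ** k)}
              for _ in range(n)]
    state = tuple(rng.randint(0, 1) for _ in range(n))
    seen = {state: 0}
    for t in range(1, 2 ** n + 1):       # a repeat must occur by then
        state = tuple(
            tables[i][sum(state[j] << b for b, j in enumerate(inputs[i]))]
            for i in range(n))
        if state in seen:                # trajectory re-entered a state:
            return t - seen[state]       # cycle length
        seen[state] = t

cycle_length = simulate_boolean_network()
```

In the robust regime most single-element perturbations return the trajectory to the same cycle; in the chaotic regime they can switch it to another.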
In recent years, analyses of the topology of large intracellular networks [Jeong
2000] have revealed a common architecture. In these natural networks, not all
elements and links are equally important: a small but significant fraction of the
elements are highly connected, while most elements are sparsely connected. This
architecture is called scale-free topology and has been found to be common in social,
ecological, and protein networks [Maslov 2002]. Although the effect of perturbations
on the topology of these networks has been thoroughly studied [Albert 2000], their
dynamical properties are still not clear [Barabasi 1999], especially with regard to
how much different perturbations influence the network.
1.2.3 Topological property of protein interaction network
A series of studies has focused on identifying biological modules in various cellular
networks, ranging from metabolism [Jeong 2000] to genetic networks [Ito 2000]. These
studies assume that proteins work together as a group to achieve some well-defined
biological function. Previous experiments show that protein complexes acting as
functional modules carry out many biological functions. From the network perspective
these modules should appear as groups of nodes that are highly interconnected with
each other but have only a few links to nodes outside the module (e.g. the hubs in a
scale-free network). However, scale-free topology apparently forbids the existence of
independent modules in the network, because the hub proteins' ability to interact
with a high fraction of each module's components makes a module's relative isolation
nearly impossible. Recently, [Vazquez 2004] proposed, and demonstrated
experimentally, that a network's scale-free topology can be reconciled with its
potential modularity within the framework of hierarchical modularity, which led the
author to utilize a hierarchical cluster structure to depict the essential properties
of a network.
This means that protein interaction networks are fragmented into many distinct
clusters [Schwikowski 2000]. A set of proteins in a system is always dominated by a
giant cluster containing a significant fraction of all connected proteins, such that
one can find a path of protein interactions between any two proteins belonging to
this giant component. A small fraction of proteins, however, are completely isolated,
so a common situation is that the giant component coexists with many isolated
proteins. In this case, if the giant component is disregarded, the cluster size
distribution follows a power law. If the hub proteins and their neighbors are
continuously attacked, the network quickly separates and is finally dismantled. This
is exactly the basic idea behind the "concise graph" proposed in the next chapter.
The discussion above summarizes some of the current research on network models and
its relevance to this project. Other contributions will be discussed alongside the
presentation of the algorithms. The next chapter focuses on the detailed methodology
used in the project.
Chapter 2 Network comparison algorithms
Naturally, a comparison or evaluation should end with a quantified score, such as a
probability of being identical or the energy required for a transformation. Based on
different hypotheses and criteria, comparisons can diverge and give rise to different
conclusions. This chapter demonstrates two approaches, based respectively on the
topological and the biological significance of molecular interaction networks.
2.1 Cluster comparison
The first approach is based on the topological properties of a protein interaction
network. It considers the network to consist of clusters: the important components
that dominate the properties of the network. The algorithm originates from a method
for identifying clusters within a network first proposed by Newman in 2004. The
author then proposes the concept of a "concise graph", a more compact expression of
the original network that loses no connection information, and uses it to improve
computational efficiency. The discovered clusters are then used for network
comparison, based on the belief that the properties of a network are, at least
partly, captured by its cluster structure.
The study of cluster structure in networks has a long history. Most of the research,
spanning graph theory, computer science, and hierarchical clustering in sociology
[Scott 2000], has made use of ideas from graph partitioning. Unfortunately, in most
formulations, optimal network partitioning is an NP-complete problem [Garey 1979],
which makes it intractable for large-scale problems such as molecular interaction
networks, which may contain thousands of vertices. The best known heuristic is
perhaps the Kernighan–Lin algorithm [Kernighan 1970], with O(n³) time complexity on
sparse graphs. This algorithm recursively exchanges sets of nodes, hill-climbing to
find improvements where no single swap would improve the total net cut. However,
partitioning a given network in this way can hardly reveal how many clusters there
should be or what size each cluster should be. Furthermore, there is no reason that
the number of inter-cluster edges should be minimized, because it is natural to have
more such edges between large clusters than between small ones [Newman 2004].
For convenience of exposition, we define a protein interaction network to be an
undirected graph.

Definition: A protein interaction network G is an ordered pair <V, E>, where V is a
non-empty set of proteins and E ⊆ V × V is a set of interactions; e = (u, v) ∈ E if
and only if protein u interacts with protein v. Neighbor(v) is a mapping from V to
subsets of V that returns the set of neighbors of vertex v ∈ V. Degree(v) returns the
degree of vertex v.

Within the scope of this dissertation, we presume that the interaction network does
not contain self-loops and that adjacency is irreflexive, meaning that
E = {(u, v) | u, v ∈ V have an interaction ∧ u ≠ v}, and that E contains no multiple
edges: e1 = (u, v) ∈ E and e2 = (u, v) ∈ E if and only if e1 = e2.
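As an illustration (a Python sketch, not part of the dissertation's implementation), this definition maps directly onto an adjacency-set structure; the protein names below are placeholders:

```python
class InteractionNetwork:
    """Undirected interaction network per the definition above:
    irreflexive (no self-loops) and with no multiple edges."""

    def __init__(self):
        self.adj = {}                      # vertex -> set of neighbours

    def add_edge(self, u, v):
        if u == v:                         # irreflexive: drop self-loops
            return
        self.adj.setdefault(u, set()).add(v)
        self.adj.setdefault(v, set()).add(u)

    def neighbor(self, v):                 # Neighbor(v)
        return self.adj.get(v, set())

    def degree(self, v):                   # Degree(v)
        return len(self.adj.get(v, set()))

g = InteractionNetwork()
g.add_edge("PSD-95", "NR2A")
g.add_edge("PSD-95", "NR2A")               # duplicate: the set keeps one edge
g.add_edge("PSD-95", "PSD-95")             # self-loop: ignored
```

Storing each edge once in each endpoint's neighbour set makes Neighbor and Degree O(1) lookups, at the cost of holding every edge twice.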
The most intuitive characteristics of networks are local properties, which concern
the connections among nearby vertices, and global properties, which concern the
interactions among all vertices in the network. Following previous research in graph
theory dealing with overall topological characteristics, such as the "coloring
problem" and current and voltage in electric circuits, recent work has triggered
intensive research on local properties of networks, such as cluster identification
[Newman 2004] and functional clustering of proteins in a network [Andrew 2005]. This
means that, beyond global characteristics, the local properties of a network, which
make more sense in some circumstances, are of increasing interest to researchers.
From a philosophical perspective, global and local properties are not independent.
This chapter will show how the computation of cluster structure can benefit from the
scale-free property of molecular interaction networks. The basic framework of the
algorithm follows [Newman 2004], which attempts to find the least similar connected
pairs of vertices and then remove the edges between them. Done repeatedly, this
gradually divides the network into smaller components within which the vertices are
compactly connected.
Instead of looking for the most weakly connected vertex pairs, [Newman 2004] looks
for the edges in the network that are most "between" other vertices, meaning that
"the edge is, in some sense, responsible for connecting many pairs of others".
Concretely, if two communities are joined by only a few inter-community edges, then
all paths through the network from vertices in one community to vertices in the other
must pass along one of those few edges [Newman 2004], giving those edges a high
betweenness value. Given a suitable set of paths, one can count how many go along
each edge in the graph; this number can be expected to be large for the
inter-community edges, which provides a method for identifying them.
The following sections therefore concentrate on the discovery of cluster structure
within a network using shortest-path betweenness. Since cluster structure discovery
is always computationally intensive, the author proposes a concept called the concise
graph, on which betweenness can be computed efficiently.
2.1.1 Clique and concise graph
The simplest local property of a network is probably the connection between one
vertex and its neighbors. This can be defined as a clique, a set of centrally
connected vertices in a network G <V, E>.

Definition: A vertex set C_v = {v} ∪ Neighbor(v) is defined to be a clique
centralized at v ∈ V. It is represented as C_v = v[{v} ∪ Neighbor(v)]. The potential
of a clique is the number of vertices within it.
Because of the reported scale-free property of molecular interaction networks, only a
small fraction of cliques have relatively large potential. The central vertices of
these cliques act as the "skeleton" of the network: these proteins dominate, directly
or through intermediaries, most of the interactions within the network, and therefore
maintain its major properties.
The network in Figure 2-1 shows a typical scale-free network. The skeleton vertices,
drawn in black, are identified by the algorithm to be discussed shortly. It can be
seen that these vertices take up all of the interactions, and the cliques centralized
on these nodes contain all vertices in the network. Consequently, no information is
lost if only the connections within these cliques are known. At the same time, each
clique reveals the local property that the central vertex can reach non-neighbor
vertices only through its neighbors.
Figure 2-1: A small scale-free network with the skeleton vertices colored black.
The skeleton vertices are acquired by continuously eliminating the vertices with the
highest degree in the network until all vertices are either removed or have degree 0.
The algorithm is given below:
Initialization:
degreeArray is an array holding the degrees of all vertices in the network;
remainedVertices is a set containing the vertices that have not been removed from the
network and have non-zero degree; it is initialized with the whole set of vertices;
cliqueSet is a set holding all cliques generated by the algorithm.
Loop:
While remainedVertices is not empty {
    Search for the vertex v_max with the largest degree value in degreeArray;
    cliqueSet = cliqueSet ∪ {C_v_max};
    for each vertex in Neighbor(v_max), decrease its degree value in degreeArray by 1;
    remainedVertices = remainedVertices − {v_max};
    remainedVertices = remainedVertices − {v | the degree of v in degreeArray is zero}
}
Return: cliqueSet.
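The procedure above can be rendered as a short Python sketch (illustrative only; the dict-of-sets input format and the pair representation of cliques are the sketch's assumptions, not the project's actual data structures):

```python
def extract_cliques(adj):
    """Repeatedly pick the vertex with the highest remaining degree,
    record the clique centralized on it, and discount its neighbours'
    degrees, until every vertex is removed or has degree 0.
    adj: dict mapping each vertex to its set of neighbours."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remained = {v for v, d in degree.items() if d > 0}
    clique_set = []
    while remained:
        v_max = max(remained, key=lambda v: degree[v])
        clique_set.append((v_max, {v_max} | adj[v_max]))  # centre, members
        for u in adj[v_max]:
            degree[u] -= 1               # attack: discount degrees, never
        remained.discard(v_max)          # modify the network itself
        remained -= {v for v in remained if degree[v] <= 0}
    return clique_set

cliques = extract_cliques({"a": {"b", "c", "d"},
                           "b": {"a", "c"},
                           "c": {"a", "b"},
                           "d": {"a"}})   # hub "a" is picked first
```

On this toy graph the hub "a" is removed first, after which one more clique covers the remaining triangle, so every vertex appears in some recorded clique.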
Note that, instead of physically removing vertices from the graph, the attack on the
hub vertices is performed by subtracting from the degree values of the corresponding
vertices in degreeArray; the network itself is not modified while the algorithm runs.
The returned result is a set of cliques from which a concise graph can be
constructed.
Definition: A concise graph of a network G <V, E> with clique set cliqueSet is
defined to be C_G <C_V, C_E>, where C_V = cliqueSet and
C_E = {(C1, C2) | C1, C2 ∈ C_V, C1 ∩ C2 ≠ ∅}.
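The construction follows directly from the intersection test in the definition; a Python sketch (illustrative; cliques are assumed to be given as (centre, member-set) pairs) might look like this:

```python
from itertools import combinations

def build_concise_graph(cliques):
    """One concise-graph vertex per clique; an edge joins two cliques
    whenever their member sets intersect (C1 ∩ C2 ≠ ∅)."""
    vertices = [centre for centre, _ in cliques]
    edges = set()
    for (c1, m1), (c2, m2) in combinations(cliques, 2):
        if m1 & m2:                      # shared member => connect cliques
            edges.add((c1, c2))
    return vertices, edges

verts, edges = build_concise_graph([("a", {"a", "b", "c"}),
                                    ("d", {"d", "e"}),
                                    ("b", {"b", "d"})])
```

Here the first and third cliques share vertex b, and the second and third share vertex d, so the concise graph has exactly those two edges.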
As an example, a concise graph can be derived from the network in Figure 2-1 by
eliminating vertices in the order 5, 11, 15, 12, 19, 10, 17, 3, 6, 16, 20. In this
case, the resulting cliques are 5[5, 6, 3, 11, 2, 16], 11[11, 9, 5, 7, 10, 13],
15[15, 20, 21, 4, 19], 12[12, 9, 7, 8, 13], 19[19, 15, 20, 21, 17],
10[10, 9, 8, 11, 14], 17[17, 18, 16, 19], 3[3, 5, 7, 2], 6[6, 1, 5, 4],
16[16, 5, 17, 18], 20[20, 15, 19, 21].
The concise graph of the original network, generated by the application, is shown in
Figure 2-2. The resulting concise graph clearly contains fewer vertices and edges
than the original network, a consequence of the scale-free property of the network:
as stated in [Albert 2000], a continuous attack on "hub" vertices quickly dismantles
a scale-free network.
Figure 2-2: The concise graph constructed by attacking hub vertices. It contains noticeably fewer vertices and edges than the original graph in Figure 2-1.
The concept of concise graph will be used to efficiently calculate the shortest path
betweenness in the following sections.
2.1.2 Calculating shortest path betweenness
As mentioned at the beginning of this chapter, we adopt the shortest path betweenness during the process of divisive clustering. [Freeman 1977] defines a traditional calculation for vertex betweenness based on the constraint that multiple shortest paths between a pair of vertices are equally important and are given equal weights summing to 1. For instance, if there are three shortest paths, each is given weight 1/3. In this dissertation, the author adopts the same definition for edge betweenness, although other definitions are possible [Goh 2001]. To calculate the fraction of paths flowing along each edge of the network for a given source vertex, we make use of the concise graph and generalize the breadth-first search of [Newman 2004].
Because shortest path betweenness finds the shortest paths between all pairs of vertices and counts how many shortest paths run along each edge, the algorithm must be repeated with every vertex as the source. For example, if there are n vertices in a network, betweenness is calculated n times, each time with a different vertex as the source. A large n is therefore computationally demanding and requires an efficient algorithm.
Fortunately, with the help of the concise graph, the betweenness scores of edges proposed in [Newman 2004] can be computed efficiently. Specifically, given a source vertex s, the corresponding betweenness can be calculated by propagating and retrieving (back propagating) messages over the concise graph. Before detailing the algorithm, we shall define the concept of a separator on the concise graph.
Definition: A separator from clique C_u to C_v is defined to be the set

    S = {v}          if v ∈ C_u ∩ C_v
    S = C_u ∩ C_v    otherwise

It is denoted as C_u -S-> C_v.
The message propagation algorithm is as follows:

Initialization:
If the source vertex v_s does not form a clique centralized on it, insert C_vs into the concise graph and set up the edges between C_vs and the other cliques;
Assign a distance and a path number to C_vs such that C_vs.distance = 0 and C_vs.pathNumber = 1;
Add clique C_vs into upperCliques, a set containing the cliques at the upper level during the message propagation;
cliquesToBeProcessed = ∪_{C ∈ upperCliques} Neighbor(C) − upperCliques;
Push upperCliques onto backPropagationStack, a stack preserving the layered structure of the concise graph. This stack is not used until message back propagation.
Loop:
While cliquesToBeProcessed is not empty {
    For all C_v ∈ cliquesToBeProcessed {
        C_v.distance = min over C' ∈ Neighbor(C_v) ∩ upperCliques of
            C'.distance + 1    if C' -{v}-> C_v
            C'.distance + 2    otherwise;
        C_v = {v} ∪ (C_v − ∪_{C ∈ Neighbor(C_v) ∩ upperCliques} C);
        If C_v contains only v, delete C_v from the concise graph;
        Delete edge (C_v, C') if C' ∈ Neighbor(C_v) ∩ upperCliques and
            C_v.distance ≠ C'.distance + 1 (when C' -{v}-> C_v) or
            C_v.distance ≠ C'.distance + 2 (otherwise);
        C_v.pathNumber = Σ_{C' ∈ Neighbor(C_v) ∩ upperCliques} C'.pathNumber × |S − ∪_{C ∈ upperCliques} {v_C}|,
            where C' -S-> C_v, v_C denotes the central vertex of clique C, and |A| denotes the cardinality (number of elements) of set A;
    }
    Delete each non-central vertex v from every clique whose distance value is not the minimum among all cliques containing v;
    Push cliquesToBeProcessed onto backPropagationStack;
    tempCliquesSet = cliquesToBeProcessed;
    cliquesToBeProcessed = ∪_{C ∈ cliquesToBeProcessed} Neighbor(C) − upperCliques − cliquesToBeProcessed;
    upperCliques = tempCliquesSet;
}
Figure 2-3 shows an example of the algorithm with source vertex 16. Starting from clique C_16 in the concise graph in Figure 2-2, the algorithm iteratively collects the direct neighbours of the cliques in upperCliques, excluding those already in upperCliques or in the current cliquesToBeProcessed, and stores them as the new cliquesToBeProcessed. The upperCliques set is updated with the old cliquesToBeProcessed at the end of each iteration. In effect, this process separates the concise graph into a layered structure. The cliques in cliquesToBeProcessed lie on the next level away from the source clique and are processed in the next iteration. This can be viewed as message propagation from the source clique outwards to the other cliques, level by level. From this perspective, the algorithm is a breadth-first traversal of the concise graph.
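The layering described above can be sketched as follows. Here `neighbors` maps each clique (identified by its central vertex) to its adjacent cliques in the concise graph; the adjacency shown is a small hypothetical example, not the graph of Figure 2-2:

```python
def layer_cliques(neighbors, source):
    """Breadth-first layering of a concise graph: each iteration takes the
    neighbours of the current layer, minus cliques already seen, and pushes
    every layer onto a stack for the later back propagation."""
    back_propagation_stack = [{source}]
    upper = {source}
    to_process = set().union(*(neighbors[c] for c in upper)) - upper
    while to_process:
        back_propagation_stack.append(to_process)
        nxt = (set().union(*(neighbors[c] for c in to_process))
               - upper - to_process)
        upper, to_process = to_process, nxt
    return back_propagation_stack

# Hypothetical concise-graph adjacency, layered from clique 16:
neighbors = {16: {5, 17}, 5: {16, 11, 3}, 17: {16, 19},
             11: {5, 12}, 3: {5, 12}, 19: {17}, 12: {11, 3}}
print([sorted(layer) for layer in layer_cliques(neighbors, 16)])
# [[16], [5, 17], [3, 11, 19], [12]]
```

Note that neighbours of layer k can only fall in layers k-1, k, or k+1, so subtracting `upper` and `to_process` is enough to isolate the next layer.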
Figure 2-3: The concise graph with source vertex 16. The separators are denoted as the squares on edges and the two numbers above each clique are the distance value and number of paths respectively.
Separators between the cliques in upperCliques and those in cliquesToBeProcessed
are calculated according to the definition and are displayed as the squares on the edges
of the concise graph in Figure 2-3. The distance and number of paths are then
computed based on the property of the separators. There are two cases: for C_u -S-> C_v, if S = {v}, meaning that the central vertices of C_v and C_u are directly connected in the original network, the possible distance of C_v is C_u.distance + 1; otherwise, the possible distance of C_v is C_u.distance + 2. For instance, because C_16 -{5}-> C_11, the central vertices of C_16 and C_11 are connected via vertex 5. In this case, when traversing from C_16 to C_11, the interval is 2 (16 → 5 → 11), which makes the possible distance of C_11 equal to C_16.distance + 2 = 2. A counterpart example is C_16 -{5}-> C_5, in which the central vertices 16 and 5 are directly connected; therefore, the possible distance of C_5 is C_16.distance + 1 = 1. Because of the shortest-path criterion, only the minimum possible distance is assigned as the final distance of a clique. The distance values are shown as the first number above the corresponding cliques in Figure 2-3.
It can be seen that the distance value of each clique is actually the length of the shortest
path from the source vertex to the central vertex of the clique. In this case, it can be
said that the shortest distance between vertex 16 and vertex 11 is 2, and the shortest
distance between vertex 16 and vertex 5 is 1. This result can be verified in the original
network in Figure 2-1.
A more general example is the calculation of C_12.distance. Because C_11 -{9,7,13}-> C_12 and C_3 -{7}-> C_12, the two possible distances for C_12 are C_11.distance + 2 = 4 and C_3.distance + 2 = 4, where the distances of C_11 and C_3 were calculated in the previous iteration. Therefore, the minimum distance of C_12 is 4. The same computation gives the distance value of C_10, which is 3.
It is intuitive to understand the calculation of the path number of a clique. The path number of clique C_v is the number of shortest routes from the source vertex to the central vertex v. For C_u -S-> C_v, let n(u, v) count the number of different shortest paths from u to v lying on shortest paths that start from the source vertex, and suppose there are n(u) shortest paths from the source vertex to u. It is easy to see that n(v) = Σ_u n(u) × n(u, v), where n(u, v) = |S − ∪_{C ∈ upperCliques} {v_C}| if there is a shortest path starting at the source vertex and passing through u to v (indicated by the fact that C_u can form the distance of C_v during the run of the algorithm), and n(u, v) = 0 otherwise.
Take the path number of C_12 as an example. Because C_11 -{9,7,13}-> C_12 and C_3 -{7}-> C_12, both of which yield C_12.distance, the path number of C_12 is calculated as C_11.pathNumber × |{9, 7, 13}| + C_3.pathNumber × |{7}| = 1 × 3 + 1 × 1 = 4. A more interesting example is C_10. Because separator {11} contains a central vertex of upperCliques, whose central vertices are {11, 17, 5, 3, 6, 19}, the path number of C_10 is calculated as C_11.pathNumber × |{10}| + C_5.pathNumber × |{11} − {11}| = 1 × 1 + 1 × 0 = 1. The removal of 11 from separator {11} avoids counting the same path twice, because C_16 -{5}-> C_11 -{10}-> C_10 and C_16 -{5}-> C_5 -{11}-> C_10 actually represent the same path from 16 to 10, namely 16 → 5 → 11 → 10.
A more interesting phenomenon is that vertex 2 is originally contained in both cliques C_3 and C_5. Since the distance score of C_3 is 2, which is not the minimum distance value among the cliques containing vertex 2, vertex 2 is eliminated from C_3 during the algorithm. This operation ensures that every remaining non-central vertex has its shortest path from the source vertex via the central vertex of the clique. In this case, the number of cliques containing a non-central vertex v equals the number of shortest paths from the source vertex to v.
Although the algorithm above works on the concise graph, it actually calculates the number of paths from the source vertex to all other vertices on which cliques are centralized in the concise graph. Besides, the algorithm gradually divides the concise graph into a layered structure, which is stored in the stack backPropagationStack. These path numbers are precisely what we need to calculate edge betweenness values: if two vertices u and v are connected, with v farther than u from the source s, then the fraction of geodesic paths from s through u to v is given by u.pathNumber/v.pathNumber.
The algorithm of calculating edge betweenness in [Newman 2004] is described as:
1. Find every “leaf” vertex t, i.e., a vertex such that no paths from s to other vertices go
through t.
2. For each vertex i neighboring t, assign to the edge from i to t a score of i.pathNumber/t.pathNumber.
3. Starting with the edges that are farthest from the source vertex s, work up towards s.
For the edge from vertex i to vertex j, with j being farther from s than i, assign a score
that is 1 plus the sum of the scores on the neighboring edges immediately below it, all
multiplied by i.pathNumber/j.pathNumber.
4. Repeat step 3 until vertex s is reached.
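For reference, the four steps above can be sketched directly on the original network, i.e. without the concise-graph speed-up this section develops; the adjacency is a toy example invented for the demonstration:

```python
from collections import deque

def single_source_scores(adj, s):
    """Newman's per-source edge scores: a BFS records each vertex's
    distance and path number, then scores are accumulated from the
    farthest vertices back towards the source s (steps 1-4)."""
    dist, paths, order = {s: 0}, {s: 1}, [s]
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v], paths[v] = dist[u] + 1, 0
                order.append(v)
                queue.append(v)
            if dist[v] == dist[u] + 1:
                paths[v] += paths[u]  # shortest paths reaching v via u
    below = {v: 0.0 for v in dist}    # sum of scores on edges below v
    score = {}
    for v in reversed(order):         # leaves first, then up towards s
        for u in adj[v]:
            if dist[u] == dist[v] - 1:
                w = (1 + below[v]) * paths[u] / paths[v]
                score[frozenset((u, v))] = w
                below[u] += w
    return score

# A square 1-2-4-3-1: the two geodesics from 1 to 4 split the flow evenly.
adj = {1: {2, 3}, 2: {1, 4}, 3: {1, 4}, 4: {2, 3}}
print(single_source_scores(adj, 1)[frozenset((2, 4))])  # 0.5
```

A leaf vertex never appears as a predecessor of anything farther away, so it simply contributes `1 + 0` to the edge above it, matching step 2.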
With the help of the concise graph, the "farther" relation is given by the stack backPropagationStack, and the path numbers of vertices in the original network have already been calculated from the path numbers of cliques in the concise graph: for the central vertex v of clique C_v, the path number of v is exactly the path number of C_v; otherwise, the path number of a non-central vertex u is the number of cliques containing u in the concise graph after message propagation.
As the cliques in backPropagationStack are stored in the opposite order of message propagation, the calculation of betweenness can be considered message back propagation: the algorithm works from the "bottom" cliques up towards the source clique. The message passed here is the betweenness values of the edges at the lower level. The algorithm for computing edge betweenness by message back propagation is given below:
Initialization:
The betweenness of all edges is initialized to 0;
cliquesToBeProcessed = pop(backPropagationStack);

Loop:
While backPropagationStack is not empty {
    upperCliques = pop(backPropagationStack);
    For all C_v ∈ cliquesToBeProcessed, in descending order of distance {
        weightSum = Σ_{u ∈ C_v − {v}} (v, u).weight, where (v, u).weight = u.pathNumber/v.pathNumber if (v, u).weight is 0;
        For all C_w ∈ Neighbor(C_v) ∩ upperCliques {
            If C_w -{v}-> C_v, (w, v).weight = (1 + weightSum) × w.pathNumber/v.pathNumber;
            else, for each r ∈ S − ∪_{C ∈ upperCliques} {v_C} such that C_w -S-> C_v,
                (r, v).weight = (1 + weightSum) × r.pathNumber/v.pathNumber, and
                (w, r).weight = (w, r).weight + (r, v).weight × w.pathNumber/r.pathNumber    if (w, r).weight ≠ 0
                (w, r).weight = (1 + (r, v).weight) × w.pathNumber/r.pathNumber              otherwise;
        }
    }
    cliquesToBeProcessed = upperCliques;
}
Figure 2-4 continues the demonstration by calculating the betweenness of all edges in the network of Figure 2-1. The backPropagationStack after message propagation contains {C_12, C_10}, {C_11, C_6, C_3, C_19, C_17, C_5} and {C_16}. This forms the three layers of the network indicated by the dashed lines in Figure 2-4. Starting from C_12 and C_10, the algorithm first calculates the sum of the betweenness values of the edges "below" vertex 12. Within the concise graph formed by message propagation, for clique C_v, the edges "below" vertex v are the ones between v and the vertices in C_v − {v}. However, in the concise graph in Figure 2-3 there is no edge below vertex 12 (weightSum = 0 for C_12), which indicates that 12 is indeed a leaf vertex.
For C_11 -{9,7,13}-> C_12: (7, 12).weight = (1 + 0) × 2/4 = 1/2, where 7.pathNumber is the number of cliques (C_11 and C_3) containing vertex 7 and 12.pathNumber is exactly the pathNumber of clique C_12, which is 4. Similar calculations yield (9, 12).weight = 1/4 and (13, 12).weight = 1/4. Besides, according to the algorithm, because (11, 9).weight = 0, (11, 9).weight is updated to (1 + (9, 12).weight) × 11.pathNumber/9.pathNumber = (1 + 1/4) × 1 = 5/4. The same steps give (11, 13).weight = 5/4 and (11, 7).weight = (1 + 1/2) × 1/2 = 3/4.
The resulting betweenness values of all edges, obtained after going through all the cliques and executing the same steps as described above, are shown in Figure 2-4.
Figure 2-4: Edge betweenness values of the original network with vertex 16 as the source vertex. The path number of each vertex is denoted by the number right above and the betweenness values are shown near edges. The dashed line indicates the division of network into three layers which are formed by message propagation in backPropagationStack. The central vertices are shaded with black.
Note that the algorithm guarantees that, when it comes to C_v, the betweenness values of the edges between v and C_v − {v} have already been calculated for every vertex in C_v − {v} that is not a leaf node. A leaf vertex t is indicated by a betweenness of 0 on the edge between v and t. For example, when processing C_19, we find that (19, 15).weight = 0, which shows that vertex 15 is a leaf node. On the other hand, when processing C_11, (11, 9).weight is not 0, so 9 is not a leaf node and (11, 9).weight has been calculated before the algorithm comes to C_11.
Therefore, the order of the cliques is essential for correctly calculating the betweenness values. This order is guaranteed by the "layered" structure of the network maintained by backPropagationStack. Referring to the message propagation algorithm, it is easy to see that the cliques in backPropagationStack are ordered: the cliques on top of backPropagationStack are the ones with the longest distance so far and should be processed earlier than any others in backPropagationStack.
Compared with the algorithm in [Newman 2004], which calculates the number of distinct paths from the source vertex to all other vertices, the algorithm here is carried out on the concise graph, which has far fewer vertices and edges under the assumption that the molecular interaction network is scale-free. Since the message propagation and back propagation are based on a breadth-first traversal of the concise graph rather than of the original network, the algorithm above is less computationally intensive. Besides, one notable characteristic of the algorithms above is that the most computationally expensive procedures are expressed as set operations, which can be implemented efficiently using appropriate techniques such as hash sets or hash maps.
As discussed at the beginning of this section, the algorithm above is repeated with every vertex of the network as the source. The overall betweenness of an edge is the sum of its weight values over all iterations.
2.1.3 Cluster division
Edge betweenness depicts the importance of an edge in the network: the higher the betweenness, the more important the edge. In other words, an edge with a high betweenness value is more likely to be one lying between clusters. Consequently, if we continuously knock out the edges with the highest betweenness values, the clusters can be separated out quickly. The general form of the network division algorithm is as follows [Newman 2004]:
1. Calculate betweenness scores for all edges in the network according to the algorithm in the last section;
2. Find the edge with the highest betweenness and remove it from the network;
3. Recalculate betweenness for all remaining edges;
4. Repeat from step 2.
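The loop structure of these steps can be sketched as below. The betweenness routine is pluggable; the degree-product scorer used in the demonstration is only a stand-in so the example stays short, not the shortest-path betweenness of Section 2.1.2:

```python
def components(adj):
    """Connected components of an undirected adjacency dict."""
    seen, comps = set(), []
    for s in adj:
        if s not in seen:
            comp, stack = set(), [s]
            while stack:
                u = stack.pop()
                if u not in comp:
                    comp.add(u)
                    stack.extend(adj[u])
            seen |= comp
            comps.append(comp)
    return comps

def divide_once(adj, betweenness):
    """Steps 1-4: repeatedly score all edges, remove the single
    highest-scoring edge, and stop once the network splits."""
    n0 = len(components(adj))
    while len(components(adj)) == n0:
        scores = betweenness(adj)           # step 1 / step 3 (recalculate)
        u, v = max(scores, key=scores.get)  # step 2
        adj[u].remove(v)
        adj[v].remove(u)
    return components(adj)

# Two triangles joined by the bridge (3, 4); even the crude stand-in
# scorer ranks the bridge highest here, so one removal splits the network.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
degree_product = lambda a: {frozenset((u, v)): len(a[u]) * len(a[v])
                            for u in a for v in a[u]}
print(divide_once(adj, degree_product))  # [{1, 2, 3}, {4, 5, 6}]
```

The recalculation inside the loop is the feature Newman stresses: scores are recomputed from the current, partially dismantled network before every removal.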
As stated by Newman, the recalculation step is the most important feature of the algorithm. Conversely, merely calculating the edge betweenness for all edges in the network and then removing edges in decreasing order of betweenness to produce the division of the network causes problems. In particular, once an edge in the network is removed, the betweenness values of the remaining edges no longer reflect the properties of the current network.
Figure 2-5: The tree structure of clusters of the network in Figure 2-1. Some of the leaf nodes have two vertices because it is obvious how to divide two vertices into two clusters.
This can cause unexpected problems. Take a scenario used by Newman as an example. If two clusters are joined by two edges, but most paths between the two clusters flow along just one of those edges, that edge will have a high betweenness score and will be removed at an early stage of the algorithm, while the second edge might not be removed until much later. This gives rise to the problem that the obvious division of the network into two clusters might not be discovered by the algorithm. What is even worse, these two clusters might be individually broken up before the division between them is discovered.
Therefore, the algorithm used here recalculates betweenness after each edge removal. As edges are removed, the cluster structure is separated out whenever the network becomes disconnected. The newly formed clusters can be further divided into sub-clusters as the algorithm moves on. This forms a tree structure, as shown in Figure 2-5. The cluster division tree depicts how the network in Figure 2-1 is divided into clusters until each cluster contains only an individual vertex. This tree structure will be used for searching for the best cluster division in the following sections.
2.1.4 Cluster modularity
Up to this point, the shortest path betweenness of a given network has been efficiently calculated by constructing a concise graph from the original network. As mentioned earlier, a higher betweenness value means that more shortest paths pass through the corresponding edge. It is straightforward to see that if two clusters are connected by relatively few edges, a large number of shortest paths will pass through these edges, resulting in high betweenness values. In this case, if we remove these edges iteratively, the network is quickly divided into parts until there is only one vertex in each part. Therefore, if we stop in the middle of the algorithm, before the network is broken down into single vertices, several clusters can be retrieved. [Newman 2004] has shown that the shortest path betweenness works well at recovering known clusters by cutting a cluster tree like that of Figure 2-5 at the proper position.
However, in practical situations, the algorithm will normally be used on networks in which the clusters are unknown. Generally, the algorithm can always divide the network into clusters, even in completely random networks with no meaningful cluster structure. In this case, it is necessary to know which divisions are the best ones for a given network and how to search for these divisions among the huge number of possibilities.
To quantify the "goodness" of the found clusters, we use the modularity measure of [Newman 2003]. For a network comprising k clusters, a k×k symmetric matrix M is defined such that element M_ij is the fraction of all edges linking vertices in cluster i to vertices in cluster j. In particular, as mentioned in [Newman 2003], each edge can only be counted once in the matrix M to avoid duplication; the same edge should not appear both above and below the diagonal of M. To keep the matrix symmetric, an edge linking clusters i and j is split half-and-half between M_ij and M_ji. Moreover, when calculating modularity, all edges in the original network are taken into account, regardless of whether they have been removed by the clustering algorithm.
Suppose that the network in Figure 2-1 can be divided into three clusters as it is shown
in Figure 2-6. The corresponding matrix M is given on the right of Figure 2-6.
Figure 2-6: Left: an example of cluster division of network in Figure 2-1. The three clusters are indicated by the dashed circle with the index number outside. Right: the M matrix corresponding to the cluster division in the left figure. The total number of edges is 31 which is the denominator. The fraction number off the diagonal is split into half-and-half symmetrically with respect to diagonal.
It can be seen from the M matrix in Figure 2-6 that all elements of the matrix add up to 1, because the total number of edges is 31. Moreover, since there are 2 edges between clusters 1 and 2, these edges are split evenly between the two clusters, contributing 1/31 to each of M_21 and M_12.
Based on the M matrix, [Newman 2004] defines two measures to quantify the quality of a clustering. The trace of matrix M, Trace(M) = Σ_i M_ii, gives the "compactness" of the clusters in the network. Specifically, Trace(M) is the fraction of edges connecting vertices within the same cluster, so a good division into clusters should have a high value, meaning that the majority of edges lie within clusters and only a small part of them lie between different clusters.
Nevertheless, Trace(M) is not always a good estimator of the quality of a cluster division. For example, placing all vertices in a single cluster gives the maximal value Trace(M) = 1 with no information about cluster structure. To tackle this problem, there should be a reference network that has no cluster structure, so that we can compare it with the network we are looking at.
In a network, vertices form a cluster because they are tightly connected by the edges within the cluster and loosely connected to the vertices outside it. In other words, if the fraction of connections inside and outside the cluster were the same, such a cluster would not exist. Therefore, serving as a contrast to a network with cluster structure, the reference network should have an average edge distribution both inside and outside clusters. "Average" means that each vertex is linked to vertices in all of the clusters with the same probability. For example, to generate a reference network with three clusters, each vertex is connected to another vertex in each cluster with probability 1/3.
Because M_ij can be interpreted as the probability that there is an edge connecting a vertex in cluster i and a vertex in cluster j, the reference network should have the same entries in its M matrix. For example, the M matrix of a typical reference network with three clusters should be:

    ( 1/9  1/9  1/9 )
    ( 1/9  1/9  1/9 )
    ( 1/9  1/9  1/9 )
To quantify this property of the reference network, another measure, M_i = Σ_j M_ij, is defined; it represents the fraction of edges that connect to vertices in community i [Newman 2003]. In a reference network in which edges fall between vertices without regard to the clusters they belong to, M_ij = M_i M_j should always hold. This can be verified by a simple calculation. Suppose the reference network with T clusters has an M matrix filled with the same entry a. The only constraint on the M matrix is that all elements add up to 1: T²a = 1, so that

    a = 1/T²

In this case, M_i = M_j = Σ_j M_ij = Ta = 1/T, and thus M_i M_j = 1/T² = a, where a was defined to be M_ij.
With this property of the reference network, we use the modularity measure proposed in [Newman 2003] to quantify the cluster strength:

    Q = Σ_i (M_ii − M_i²) = Trace(M) − ||M²||

where ||M²|| denotes the sum of the elements of the matrix M².
This quantity measures the difference in the fraction of within-cluster edges between the estimated network and a reference network with no cluster structure. As shown above, when there is no cluster structure in the network, meaning that the number of within-cluster edges is no better than random, M_ij = M_i M_j holds and the modularity value is Q = 0. On the other hand, the maximum value of Q, which is 1, indicates strong cluster structure. [Newman 2004] reports that modularity values for typical networks fall in the range from about 0.3 to 0.7 in practice. In particular, according to the M matrix in Figure 2-6, the modularity value of the cluster division is 0.5286.
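The construction of M and the value of Q can be sketched as follows; the edge list and cluster assignment below are a small made-up example, not the network of Figure 2-6:

```python
def modularity(edges, cluster_of):
    """Q = Trace(M) - sum_i M_i^2, where M_ij is the fraction of edges
    linking clusters i and j (off-diagonal entries split half-and-half)."""
    k = len(set(cluster_of.values()))
    m = [[0.0] * k for _ in range(k)]
    for u, v in edges:
        i, j = cluster_of[u], cluster_of[v]
        if i == j:
            m[i][i] += 1.0 / len(edges)
        else:                               # split between M_ij and M_ji
            m[i][j] += 0.5 / len(edges)
            m[j][i] += 0.5 / len(edges)
    trace = sum(m[i][i] for i in range(k))
    row_sums = [sum(row) for row in m]      # M_i = sum_j M_ij
    return trace - sum(r * r for r in row_sums)

# Two triangles joined by one edge, assigned to two clusters:
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
clusters = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
print(round(modularity(edges, clusters), 4))  # 0.3571
```

Note that the edge list passed in is the full original edge set, matching the requirement that removed edges still count towards M.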
2.1.5 Searching cluster division
The modularity value defines a measure of the "fitness" of a cluster division of a network, so it can be used to search for a good cluster division. To make the search algorithm simple and interpretable, the breakdown of the network is illustrated with a dendrogram [Newman 2004], in which the nodes at the bottom represent individual vertices of the network (Figure 2-7). Moving up the dendrogram joins the vertices together into larger and larger clusters until the top is reached, where all vertices are connected together. Conversely, from top to bottom, the dendrogram describes how the network is split into smaller clusters until there is only one vertex in each cluster [Newman 2004].
The dendrogram shows the order in which the network is divided and can easily be obtained from a cluster division tree like the one in Figure 2-5. In particular, there is an additional "timer" that records the moment at which clusters are generated by removing edges. At each time step, among all current clusters, only the one edge with the highest betweenness score is removed. Whenever the removal of edges gives rise to the separation of a cluster into sub-clusters, these sub-clusters are marked with the current "time", which is simply the number of edges removed so far. With the "time" of each cluster division, the cluster division tree can easily be converted to the dendrogram by reversing the divisive process: from time 0, working from the clusters containing only individual vertices, join the clusters in the opposite order to that in which the edges were removed. The joining procedure is achieved by the following steps:
Figure 2-7: A dendrogram for the cluster division tree in Figure 2-5. A cross-section of the tree at any level, as indicated by the dashed line, gives rise to a cluster structure. According to the algorithm for generating the dendrogram, the heights of the clusters indicate the order in which the joins (or, indirectly, the splits) take place [Newman 2004]. The height values here are for demonstration purposes only and are not as accurate as the ones computed by the program.
1. Put all clusters containing single vertices at the bottom of the dendrogram, and start the timer for joining clusters;
2. At each time step, pick out the cluster with the largest division time recorded when the network was divided by removing edges, and place this cluster in the dendrogram at the height of the current time;
3. Go back to step 2 and increase the timer by 1 until all clusters are in the dendrogram.
The connections among clusters in the dendrogram are exactly the ones in the cluster division tree, so they need no further treatment. The corresponding dendrogram for the cluster division tree in Figure 2-5 is depicted in Figure 2-7. Working from the cluster division tree in Figure 2-5, the construction of the dendrogram starts from the leaf nodes [8] and [9, 10]. We did not start from [9] and [10] because, if there is an obvious cluster structure, it will become apparent well before the traversal comes down to [9, 10] from the top of the dendrogram. This will be mentioned in the following discussion.
As suggested by Newman, the modularity value is calculated for each division of the network while moving down the dendrogram. The way to identify promising cluster structure is to look for local peaks in the modularity values. [Newman 2004] showed that the cross-section of the dendrogram at a position with a peak modularity value typically indicates a particularly satisfactory cluster structure, and that there are usually only one or two such peaks. In cases where the cluster structure is known beforehand, they found that "the positions of these peaks correspond closely to the expected divisions".
Figure 2-8: The change of Modularity value when traveling down the dendrogram in Figure 2-7. It can be seen that there is only one peak during the process, which is the same result as [Newman 2004].
The investigation of the simple network in Figure 2-1 shows a similar result, as depicted in Figure 2-8. The cross-sections of the dendrogram are obtained by specifying increasing dendrogram depths; in particular, the depth is set to 0.5, 1.5, 2.5, ... until the algorithm reaches the bottom of the dendrogram. At each step, the cut dendrogram gives a cluster structure whose modularity value is plotted in the figure. It can be seen that the modularity value reaches a peak when the depth is 3.5, a position depicted by the dashed line in Figure 2-7. Moreover, as noted above, the position of the peak modularity value is at a high level of the dendrogram and rarely comes near the bottom level.
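The peak search along the dendrogram can be sketched as a scan over the modularity values of successive cuts; the value sequence used in the demonstration is invented for illustration:

```python
def modularity_peaks(q_values):
    """Return the indices (cut depths) where the modularity sequence has
    a local peak, i.e. a value strictly above both of its neighbours;
    endpoints only need to exceed their single neighbour."""
    peaks = []
    for i, q in enumerate(q_values):
        left = q_values[i - 1] if i > 0 else float("-inf")
        right = q_values[i + 1] if i + 1 < len(q_values) else float("-inf")
        if q > left and q > right:
            peaks.append(i)
    return peaks

# Modularity measured at depths 0.5, 1.5, 2.5, ... (made-up values):
print(modularity_peaks([0.0, 0.21, 0.38, 0.53, 0.41, 0.12]))  # [3]
```

With one or two peaks expected in practice, a single linear scan of this kind is sufficient to locate the candidate cuts.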
2.1.6 Cluster structure comparison
The result of searching for a cluster division is a good cluster structure of the network. Meanwhile, different networks will have different underlying cluster organizations, which can be recovered by the algorithm. If we believe that the clusters of a network have some special significance, such as similar function or localization of proteins, the diverse cluster structures obtained from different networks can, to some extent, reflect potential differences between these networks. Therefore, it is reasonable to compare the networks by contrasting their cluster structures. Specifically, if we consider the process of clustering as classification, the difference between networks can simply be judged by the rate of correct classification. Recall that the problem in this project is to estimate the correctness of the networks from other databases compared with the network from the MASC database, which we consider to be the "right" underlying network of the post-synapse. The cluster organization of the network from each database can be obtained by simply running the algorithms of the previous sections.
The problem can be formally stated as: given the reference network from MASC, denoted G_r, with T clusters {Cl_1^r, ..., Cl_T^r} returned by the cluster structure discovery algorithm, what is the probability of correct clustering based on a network G_x from one of the other databases, with cluster structure {Cl_1, ..., Cl_T'}? One way to answer this is to find the counterpart cluster in {Cl_1^r, ..., Cl_T^r} for each cluster in {Cl_1, ..., Cl_T'}. Since there is no prior information about the correspondence, a theoretical approach is to try every possible matching and pick the one with the highest accuracy. However, as the number of clusters grows, trying all possibilities becomes intractable and computationally intensive. Therefore, the author adopts a heuristic method to search for a good match between the two cluster structures.
Before detailing the search algorithm, suppose Cl_1, ..., Cl_N have been mapped to Cl_1^r, ..., Cl_N^r respectively, where N = min{T, T'}. The probability of correctly clustering the vertices of the reference network using the estimated network is calculated by:

    P = ( Σ_{i=1}^{N} |Cl_i ∩ Cl_i^r| ) / | ∪_{i=1}^{T} Cl_i^r |

In this formula, the denominator is the number of vertices in the standard MASC network and the numerator is the total number of vertices that are correctly clustered into the right cluster under the match from {Cl_1, ..., Cl_N} to {Cl_1^r, ..., Cl_N^r}.
It can be seen that the maximum value of this probability is

    | (∪_{i=1}^{T} Cl_i^r) ∩ (∪_{i=1}^{T'} Cl_i) | / | ∪_{i=1}^{T} Cl_i^r |

Therefore, P can only reach 1 if the estimated network contains all of the vertices of the reference network, because in that situation (∪_{i=1}^{T} Cl_i^r) ∩ (∪_{i=1}^{T'} Cl_i) = ∪_{i=1}^{T} Cl_i^r.

However, as one might expect, the estimated network hardly ever performs better than the standard network and normally contains far fewer vertices. This means that the estimated network can rarely achieve a high P score, since its highest possible value is much less than 1. To avoid this limitation, we rescale the probability as:

    P_rescale = ( Σ_{i=1}^{N} |Cl_i ∩ Cl_i^r| ) / | (∪_{i=1}^{T} Cl_i^r) ∩ (∪_{i=1}^{T'} Cl_i) |
which depicts the probability of correctly clustering the vertices shared by the estimated and the reference networks when the cluster structure discovery algorithm is applied to the estimated network.
Both probabilities will be examined in Chapter 4 for each network, but probability $P$ is used to search for a matching between two cluster structures because we want to take into account the coverage of vertices in the estimated network. The algorithm is given below:
1. Randomly choose a cluster $Cl_i$ in $\{Cl'_1, \dots, Cl'_{T'}\}$;
2. Match $Cl_i$ to $Cl^r_j$ if $|Cl_i \cap Cl^r_j|$ achieves the highest value over all $Cl^r_j \in \{Cl^r_1, \dots, Cl^r_T\}$;
3. Delete $Cl_i$ and $Cl^r_j$ from their respective cluster structures, and repeat steps 1 and 2 until either cluster structure is empty.
In this project the algorithm is repeated 100 times in order to find a matching with a high probability value $P$.
After matching the cluster structures, the scores $P$ and $P_{rescale}$ are calculated as evaluators of the estimated network.
The result of cluster comparison will be presented in Chapter 4.
2.2 Functional component comparison
In previous discussion, it has been systematically shown that the networks can be
compared based on cluster structure from the topological perspective. The algorithm
has been reported to be able to discover potential cluster organization with high
accuracy [Newman 2004].
However, because the algorithm was originally designed for general networks rather than specifically for protein interaction networks, it is not guaranteed that its details have biological significance. For example, the cluster structure emerges by continuously knocking out the edges with the highest betweenness value, which is calculated from the shortest paths between pairs of vertices in the network. In a molecular network, by contrast, there is no evidence that two proteins must interact with each other through the shortest path between them. This inconsistency between the algorithm and the biological mechanism may give rise to unexpected results that mislead the estimation of the networks, because the discovered cluster structure used to judge a network may not be the most plausible one hidden in the molecular functions of the network.
The functionality of proteins is probably the most significant property of an interaction network. Therefore, in order to make the network estimation algorithm biologically significant, the author also compares the networks based on potential functional components of the molecular network.
Because the networks queried from databases are normally collected from various literature sources or submitted by research groups from a range of backgrounds, it is hard to identify functional components from a network directly. In other words, given a network, there is no direct and definite information about which proteins are involved in a specific function. Fortunately, a heuristic functional component identification algorithm for biological networks has been investigated by Andrew Pocklington [Andrew 2005]. This algorithm provides a prediction mechanism to assist the search for molecules on the basis of function and disease, guided by the topology of molecular interaction networks.
In attempting to predict the set of molecules underlying a given function or
phenotype, the algorithm proposed by [Andrew 2005] assumes that cellular
functions correspond to overlapping sub-networks which, when taken together,
comprise the entire molecular interaction network of the cell. As stated above, within this network, experimental data vary widely in terms of coverage, specificity, noise and functional correlation, which together fill the network with various kinds of uncertainty. It is therefore natural to use a probability P(i|D) to measure the possibility that a protein i is functionally relevant given the data D.
2.2.1 Extrapolating probability
In most cases, the functionally relevant proteins are only partly known, so extrapolation is required to assign probability values across the entire network. A topology-dependent score is defined via random walks between molecules within a set.
Formally, given a subset of implicated molecules M which we are confident are functionally relevant, a way to estimate P(j|D) for j∉M is to marginalise the joint probability P(j, i|D) over i∈M. Expanding the joint probability, P(j|D) can be calculated as:
$$P(j|D) = \sum_{i \in M} P(j|i, D)\, P(i|D)$$
where $P(j|i, D)$ is the conditional probability of j being classified as functionally relevant given i and the data D. For convenience, the data D will be omitted in the following discussion.
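The marginalisation is a one-line computation once the conditional and prior probabilities are available; a sketch with hypothetical lookup tables (`p_given` holding P(j|i) and `prior` holding P(i), both invented for illustration):

```python
def extrapolate(j, M, p_given, prior):
    """P(j|D) = sum over i in M of P(j|i, D) * P(i|D)."""
    return sum(p_given[(j, i)] * prior[i] for i in M)
```

For example, with M = {a, b}, P(x|a) = 0.5, P(x|b) = 0.2, P(a) = 0.8 and P(b) = 0.5, the extrapolated probability for x is 0.5·0.8 + 0.2·0.5 = 0.5.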
$P(j|i)$ is estimated from the relationship between topology and function. In [Andrew 2005], $P(j|i)$ is taken to be the influence on j of knocking out protein i, where the "influence" is measured by the change in vertex betweenness when protein i is removed from the network. Instead of the edge betweenness used in the cluster comparison, the betweenness $B_j$ of vertex j is defined as the expected net number of times a random walk between a pair of other vertices passes through j, averaged over all pairs, where "net" means that "if a walk passes through a vertex and then later passes back through it in the opposite direction, the two cancel out and there is no contribution to the betweenness" [Newman 2003].
The random walk can be thought of as a message originating at a source vertex s on the network and heading for some target t with no idea where t is. On each step of its travel, the message moves from its current position to one of the neighbouring vertices with equal probability [Newman 2003].
In particular, suppose that a walk, starting at vertex s and making random moves around the network until it finds itself at vertex t, arrives at vertex j at some moment. The probability that it moves to i on the next step is given by:

$$T_{ij} = \frac{A_{ij}}{\sum_k A_{kj}} \quad \text{for } j \neq t$$

where A is the adjacency matrix of the network, in which $A_{ij}$ is 1 if there is an edge between vertices i and j and 0 otherwise.
The formula can be rewritten in matrix form:

$$T = A \cdot D^{-1}$$

where D is a diagonal matrix with $D_{ii} = \sum_k A_{ki}$.
Since the walk stops whenever it arrives at t, $T_{it}$ and $T_{ti}$ should be zero for all i. This is achieved by removing column t and row t from T, which does not affect the transitions between any other vertices. The expression above becomes:

$$T_{/t} = A_{/t} \cdot D_{/t}^{-1}$$

where the subscript "/t" denotes the matrix obtained by removing column t and row t from the corresponding matrix.
For a walk from s, the probability that it is at vertex j after r steps is given by $[T_{/t}^{\,r}]_{js}$, and the probability that the next step takes it to a given neighbouring vertex i is $[T_{/t}^{\,r}]_{js} / \sum_k A_{kj}$. Summing over r from 0 to $\infty$, the expected number of times the walk travels from j to i, averaged over all possible walks, is $[(I - T_{/t})^{-1}]_{js} / \sum_k A_{kj}$, which can be written as a vector:

$$V = D_{/t}^{-1} \cdot (I - T_{/t})^{-1} \cdot s = (D_{/t} - A_{/t})^{-1} \cdot s$$

where s is a source vector such that

$$s_i = \begin{cases} 1 & \text{if } i = s \\ -1 & \text{if } i = t \\ 0 & \text{otherwise} \end{cases}$$
The net flow of the random walk along the edge from j to i is given by the absolute difference $|V_j - V_i|$, and the net flow through vertex i is half of the sum of the flows on its adjacent edges:

$$I_i = \begin{cases} \frac{1}{2} \sum_j A_{ij} \left| V_i - V_j \right| & \text{for } i \neq s, t \\ 1 & \text{otherwise} \end{cases}$$
Let $I_i^{st}$ denote the net flow through vertex i when taking s as the source vertex and t as the target vertex. The vertex betweenness of i is then the average of this flow over all source-target pairs:

$$B_i = \frac{\sum_{s < t} I_i^{st}}{\frac{1}{2}\, n (n - 1)}$$

where n is the total number of vertices in the network.
As discussed above, the conditional probability P(j|i) is calculated from the change in j's betweenness when i is removed. With the vertex betweenness in hand, this conditional probability is:

$$P(j|i) = \alpha \left| B_j - B_{j/i} \right|$$
where $B_{j/i}$ denotes the betweenness of vertex j when i is removed. The definition of P(j|i) can thus be thought of as measuring the influence of i on j by removing vertex i and quantifying the change caused by the removal. The parameter α determines the influence of the extrapolated probabilities; in this project it is set such that the maximum extrapolated probability equals 1. For vertices in the implicated molecule set M, and for any vertex j∉M which is disconnected from i∈M, define P(i|i) = P(j|i) = 1.
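The random-walk betweenness construction above can be sketched at toy scale as follows. This is an illustration, not the project's Java implementation; `solve` is a plain Gaussian elimination stand-in for a proper linear algebra routine, and the star graph used as the example is invented:

```python
def solve(M, b):
    """Solve M x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(M)
    aug = [row[:] + [b[i]] for i, row in enumerate(M)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(n):
            if r != c:
                f = aug[r][c] / aug[c][c]
                for k in range(c, n + 1):
                    aug[r][k] -= f * aug[c][k]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def rw_betweenness(A):
    """B_i = sum over s<t of the net flow I_i^st, divided by n(n-1)/2,
    following V = (D_/t - A_/t)^{-1} s with the target's potential at 0."""
    n = len(A)
    deg = [sum(row) for row in A]
    B = [0.0] * n
    for s in range(n):
        for t in range(s + 1, n):
            keep = [v for v in range(n) if v != t]
            # Laplacian D - A with row and column t removed
            L = [[(deg[i] if i == j else 0) - A[i][j] for j in keep] for i in keep]
            rhs = [1.0 if v == s else 0.0 for v in keep]
            Vt = solve(L, rhs)
            V = [0.0] * n                     # absorbed target has potential 0
            for pos, v in enumerate(keep):
                V[v] = Vt[pos]
            for i in range(n):
                if i in (s, t):
                    B[i] += 1.0               # source and target count as 1
                else:
                    B[i] += 0.5 * sum(A[i][j] * abs(V[i] - V[j])
                                      for j in range(n))
    return [b / (n * (n - 1) / 2) for b in B]
```

On a 4-vertex star, every walk between two leaves must pass through the centre, so the centre's betweenness comes out as 1 while each leaf scores 0.5; recomputing after deleting a vertex then gives the betweenness change used in P(j|i).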
2.2.2 Scoring sub-network
With P(i) computed for all molecules, a set of molecules encoding a function and forming a sub-network can be scored according to the probability that these molecules are functionally relevant [Andrew 2005].
Let s(a, b) denote the probability that the connection from a to b is functionally relevant. The score $s(\varsigma)$ for a set of molecules ς can then be calculated as the average of s(a, b) over all possible pairs in ς.
In a protein interaction network, each path $\lambda = a\, \lambda_1 \dots \lambda_n\, b$ from a to b represents a possible chain of interactions through which a and b may influence each other. Let $P(\lambda)$ denote the probability that λ is functionally relevant, and $\Phi(\lambda)$ the probability that a interacts with b through path λ. Then s(a, b) can be defined as:

$$s(a, b) = \sum_{\lambda \in \Lambda_{ab}} \Phi(\lambda)\, P(\lambda)$$

where $\Lambda_{ab}$ denotes the set of paths between a and b in which b does not appear as an intermediate vertex.
Under the random-walk assumption, $\Phi(\lambda)$ is defined as the proportion of λ among all possible random walks:

$$\Phi(\lambda) = \left( d(a)\, d(\lambda_1) \cdots d(\lambda_n) \right)^{-1}$$

where $d(\lambda_m)$ is the degree of vertex $\lambda_m$.
Supposing that the functional relevance of each vertex on a path is independent of the others, $P(\lambda)$ can be defined as:

$$P(\lambda) = P(a)\, P(\lambda_1) \cdots P(\lambda_n)\, P(b)$$
Therefore, s(a, b) can also be written in matrix form by defining an auxiliary matrix W and two column vectors Q and R such that

$$W_{ij} = A_{ij}\, \frac{P(j)}{d(j)}, \qquad Q_i = \begin{cases} 1 & \text{if } i = a \\ 0 & \text{otherwise} \end{cases}, \qquad R_i = P(b)\, A_{ib}$$

where $A_{ij}$ is the adjacency matrix defined in the last section. Summing the geometric series over path lengths, s(a, b) is written as:

$$s(a, b) = \frac{P(a)}{d(a)} \sum_{n=0}^{\infty} Q_{/b}^{T}\, W_{/b}^{\,n}\, R_{/b} = \frac{P(a)}{d(a)}\; Q_{/b}^{T} \left( I - W_{/b} \right)^{-1} R_{/b}$$

The subscript "/b" denotes the matrix or vector after removing row and column b, which guarantees that b does not appear as an intermediate vertex in a path [Andrew 2005].
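A minimal numeric check of this matrix form on a three-vertex path a-m-b (the toy adjacency and P values are invented for illustration; `solve` is a plain Gaussian elimination stand-in for a linear algebra routine):

```python
def solve(M, b):
    """Solve M x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(M)
    aug = [row[:] + [b[i]] for i, row in enumerate(M)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(n):
            if r != c:
                f = aug[r][c] / aug[c][c]
                for k in range(c, n + 1):
                    aug[r][k] -= f * aug[c][k]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def score_ab(A, P, a, b):
    """s(a,b) = (P(a)/d(a)) * Q^T (I - W_/b)^{-1} R with vertex b removed,
    where W_ij = A_ij * P(j) / d(j)."""
    n = len(A)
    deg = [sum(row) for row in A]
    keep = [v for v in range(n) if v != b]
    m = len(keep)
    IW = [[(1.0 if i == j else 0.0)
           - A[keep[i]][keep[j]] * P[keep[j]] / deg[keep[j]]
           for j in range(m)] for i in range(m)]
    R = [P[b] * A[keep[i]][b] for i in range(m)]
    x = solve(IW, R)                      # x = (I - W_/b)^{-1} R
    return P[a] / deg[a] * x[keep.index(a)]  # Q picks out the row of a
```

On the path 0-1-2 with P = (0.9, 0.5, 0.8), the only direct path contributes Φ·P = 0.18, and walks that bounce back and forth between 0 and 1 form a geometric series with ratio 0.225, so the matrix form should give 0.18 / (1 − 0.225) = 0.18 / 0.775.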
As discussed at the beginning of this section, the score $s(\varsigma)$ is defined as the average of s(a, b) over all possible vertex pairs:

$$s(\varsigma) = \frac{\sum_{a \neq b \in \varsigma} s(a, b)}{n_\varsigma (n_\varsigma - 1)}$$

where $n_\varsigma$ is the number of vertices in ς.
From this final expression it can be deduced that as $n_\varsigma$ decreases towards 1, the value of $s(\varsigma)$ diverges. The $s(\varsigma)$ scores therefore tend to decrease as $n_\varsigma$ increases, so the algorithm would prefer sub-networks with fewer molecules, which is not what we want. In order to remove this influence of sub-network size, the original $s(\varsigma)$ is converted to a z-score [Andrew 2005]:
$$z(\varsigma) = \frac{s(\varsigma) - \mu_n}{\delta_n}$$

where $\mu_n$ and $\delta_n$ are the mean and standard deviation of the score for sub-networks with n molecules, estimated from randomly sampled sub-networks, and n is the number of vertices in ς.
In particular, a random sub-network is generated by creating an ordered list l of all vertices in the network:

Initialization:
    Initialize l with a randomly chosen vertex v;
Loop:
    While not all vertices have been added to l {
        Create a random subset C of the neighbours of v;
        Append all vertices in C - l to l in a random order;
        Set v to be a random vertex in C;
    }
Return: the ordered list l.
Given l, the first n elements form a random sub-network with n vertices, which has a score $s(\varsigma_n)$. The score mean $\mu_n$ and standard deviation $\delta_n$ can be estimated from a sample of such sub-networks with n vertices.
To achieve a stable estimate of $\mu_n$ and $\delta_n$, starting from a sample of 100 lists l, the number of samples was doubled until the variance of every $\mu_n$ and $\delta_n$ was less than 0.1%.
The variance is estimated by the jackknife method [Efron 1979]:

1. For each of the n sampled lists, remove list $l_i$ and calculate $\rho_{/i}$, where $\rho_{/i}$ is either $\mu_n$ or $\delta_n$ computed from the remaining lists.
2. Estimate the variance by:

$$\hat{\delta}^2 = \frac{n-1}{n} \sum_{i=1}^{n} \left( \rho_{/i} - \hat{\rho} \right)^2$$

where $\hat{\rho} = \frac{1}{n} \sum_{i=1}^{n} \rho_{/i}$.
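The jackknife estimate can be sketched with a generic helper (illustrative, not the project's Java code). For the mean statistic, the jackknife variance coincides with the usual variance of the sample mean, which provides a convenient check:

```python
def jackknife_variance(samples, stat):
    """Jackknife variance of a statistic: recompute it with each sample
    left out in turn, then combine as (n-1)/n * sum((rho_i - rho_bar)^2)."""
    n = len(samples)
    leave_one_out = [stat(samples[:i] + samples[i + 1:]) for i in range(n)]
    rho_bar = sum(leave_one_out) / n
    return (n - 1) / n * sum((r - rho_bar) ** 2 for r in leave_one_out)
```

For samples [1, 2, 3, 4] and the mean, the jackknife variance equals s²/n = (5/3)/4 = 5/12.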
2.2.3 Functional component searching
Recall that $z(\varsigma)$ scores how likely the sub-network ς is to be functionally relevant. In order to find the most probable set of molecules active in a specific function, the sub-network with the highest $z(\varsigma)$ score is required. Since exhaustively searching all possible sub-networks is intractable, a heuristic algorithm is needed. [Andrew 2005] suggests a Metropolis-type simulated annealing algorithm [Newman 1999] to search for the assignment of active or inactive states for which the set of active vertices has the maximum $z(\varsigma)$ score. The algorithm is described as follows:
Initialization:
    Assign a random initial state (active or inactive) to each vertex;
    Set a "temperature" T to $T_{max}$;
    Let $\varsigma_A$ be the set of vertices in the active state;
Loop:
    While T has not decreased to $T_{min}$ {
        Select a vertex v at random and flip its state; the resulting active vertex set is $\varsigma_A^*$;
        Calculate the score change $\delta z = z(\varsigma_A^*) - z(\varsigma_A)$;
        If $\delta z \geq 0$, accept the change at v; otherwise accept it with probability $e^{\delta z / T}$;
        $T = T \times (T_{min}/T_{max})^{1/n}$, where n is the total number of iterations;
    }
    While the last 100 iterations have increased the score by less than $10^{-5}$ {
        Flip the state of a randomly chosen vertex;
        Accept the change if $\delta z > 0$, or if $\delta z = 0$ and the number of active vertices increases;
    }
Return: $\varsigma_A$.
According to [Andrew 2005], the temperature at which a decrease in score $\delta z = -x$ has a 50% chance of being accepted is $T_{1/2}(x) = x / \ln 2$. Therefore, $T_{max}$ is set to $T_{1/2}(1)$ and $T_{min}$ to $T_{1/2}(0.001)$.
2.2.4 Functional component comparison
The functional component identification algorithm yields the set of molecules with the highest probability of being functionally relevant. It begins with a set of implicated molecules and extrapolates to others by maximising the probability of the active molecule set. Starting from the same implicated molecules, different functionally relevant proteins will be predicted on different networks. Since the predicted molecules are derived from the properties of the network, divergence in the prediction results also reflects the variance between these networks. Moreover, the predicted molecules may act as the proteins on which future research effort is spent. Therefore, if an estimated network predicts roughly the same molecules as the reference network, it can be judged to perform well.
Specifically, for a function Func and a network Net, the algorithm in the previous sections yields an active vertex set $\varsigma_A(Func, Net)$. It is therefore straightforward to compare the reference network $Net_{ref}$ with an estimated network Net by contrasting the active molecule sets $\varsigma_A(Func, Net_{ref})$ and $\varsigma_A(Func, Net)$, and calculating the probability that the active vertices predicted in Net are consistent with those in $Net_{ref}$:

$$P_{Func} = \frac{\left| \varsigma_A(Func, Net) \cap \varsigma_A(Func, Net_{ref}) \right|}{\left| \varsigma_A(Func, Net_{ref}) \right|}$$
The detailed results of the functional component comparison will be reported in Chapter 4.
Chapter 3 Data retrieval
3.1 Data overview and constraint-driven data retrieval
As stated in Chapter 1, the MASC database is used as the reference against which the molecular interaction networks from other databases are estimated. The databases to be estimated in this project are DIP, MINT and NetPro. An introduction to these databases can be found in Appendix B.
Faced with so many databases, the most straightforward way to retrieve their data is to hard-code fixed SQL statements in the application. However, these databases are only a small fraction of the molecular interaction network databases in existence, so further research will almost certainly aggregate many more databases into the system. The static SQL solution is infeasible in such a situation because it makes the program unwieldy and inflexible.
Consequently, the author uses a uniform language as the "container" of the data. An obvious choice is XML. Normally, an XML document conforms to a schema, such as a DTD, which is used both for generating XML from a relational database and for parsing the data in applications. Nevertheless, given relational schemas, the generation of an XML schema is nontrivial. In this chapter, the author proposes a way of automatically generating a DTD from relational schemas together with a set of functional dependencies and inclusion dependencies. With the produced DTD, XML can be generated and parsed with existing techniques without much difficulty. The whole process can thus be described as constraint-driven data retrieval.
A relational database is normally thought of as consisting of schemas and instances. Therefore, most previous research on relational database publishing has focused on mapping schemas and instances from the relational database to an XML schema (usually a DTD) and XML documents. However, because the relational model cannot contain "set" values owing to its atomic-value property, integrity constraints are used to link individual schemas and thus play an important role in the design and maintenance of relational databases. These constraints form another aspect of the "semantics" of a relational database, which needs to be propagated to the XML schema during database publishing.
The data retrieval technique proposed in this project seeks principles for XML schema design given existing relational schemas with a set of key constraints and foreign key constraints. It focuses mainly on the transformation from relational schemas to an XML schema, namely a DTD, as well as on the propagation of constraints from the relational schemas to the generated XML schema. Both key constraints and foreign key constraints are dealt with, because these two sorts of constraints are the most essential ones in relational schemas and have attracted the most interest from researchers.
The detailed constraint-driven data retrieval algorithm is presented in Appendix A. The discussion begins with a formal definition of XML functional dependency (FD), the counterpart of relational constraints in the XML domain. A key constraint propagation algorithm is given based on this definition. A foreign key constraint propagation algorithm then follows, accomplished by pre-mapping and post-mapping steps. Both key constraints and foreign key constraints are encoded as XML functional dependencies in the generated XML schema.
3.2 Data retrieval implementation
As mentioned at the beginning of this chapter, given the various protein interaction databases, the problem in this project is how to retrieve the data from these databases into an XML document, and how to parse the retrieved XML into semantic data used by applications.
This is easy if a guiding DTD exists. In Appendix A, an algorithm is investigated for propagating key constraints and foreign key constraints from relational schemas to an XML schema. With the pre-mapping and post-mapping algorithms, a DTD with a set of functional dependencies can be generated automatically from a set of relational schemas and their corresponding constraints. The algorithm had been tested, albeit partly manually, before the summer project on a real application: a small Bayesian network workshop with complicated foreign key constraints.
43
Although the functional dependencies of the XML schema are the most valuable and original result of the proposed constraint propagation algorithm, in this project the DTD is the main target and is further used to generate and parse XML for applications. There exist DTD-directed database publishing techniques using Attribute Translation Grammars (ATGs), a way of publishing relational data as XML conforming to a predefined DTD [Michael 2002]. However, for simplicity, the data mapping from relational databases to XML in this project is done by matching the attribute names in the databases with the element names in the generated DTD. Data records are inserted into the XML document according to the matched DTD names.
The program is implemented in Java using a DOM (Document Object Model) parser, with some manual effort to generate the DTD by propagating the constraints from the relational schemas to the XML schema. Fortunately, the relational design of a protein interaction network is relatively simple, being no more than a many-to-many relationship, as discussed in Appendix A. Admittedly, simply matching attribute names limits the generality of the data retrieval, but this approach is easy to implement and has little complexity, which is essential for building an efficient data retrieval mechanism.
Moreover, because different databases use divergent types of ID as the key for molecule entries, it is necessary to construct an ID mapping between PPID, the protein key in MASC, and the kinds of ID used in the estimated databases. In particular, NetPro uses LocusLink, MINT uses SwissProt and DIP uses GenBank accession numbers as their IDs. The ID mapping is mined from various databases, including the Ensembl EnsMart Genome Browser (http://www.ensembl.org/Multi/martview), UniProtKB/Swiss-Prot (http://www.ebi.ac.uk/swissprot/) and GenBank (http://www.ncbi.nlm.nih.gov/Genbank/GenbankSearch.html). The results of the ID mapping are presented in Appendix C.
As noted above, aside from the DTD, the set of functional dependencies is another important output of the constraint propagation algorithm. In a more general setting for querying and storing XML, a common scenario is: given a specific DTD and a set of relational schemas, retrieve the data in the relational database into an XML document which conforms to the DTD and preserves the key constraints and foreign key constraints of the relational schemas. In this case, the algorithm in this chapter can serve as a tool for generating an "internal" XML schema. The data and the constraints of the relational schemas are propagated only to this internal XML schema rather than to the predefined one. The two schemas are then matched by a schema matching [Tova 1998] or object fusion [Yannis 1996] algorithm, and the propagation is completed by mapping the internal XML document to XML conforming to the predefined schema. This is a harder problem for future research and is outside the scope of this dissertation.
Chapter 4 Result presentation and discussion
This project is implemented in the Java programming language. In this chapter, the author presents the results of experiments on the networks retrieved from the various databases, followed by a discussion of the merits and weaknesses of each algorithm.
4.1 Cluster comparison
With the network estimation algorithms and the data retrieved from the various databases, it is time to experiment on real data and evaluate the results. In this section, the author first demonstrates the estimation results for the single networks queried from NetPro, MINT and DIP, compared with the reference MASC network. Furthermore, to quantify the "meaning" of the evaluation results, the cluster comparison is tested on the MASC network against a noisy version of itself obtained by adding controlled random mutations. This result can be used to interpret the estimation results.
4.1.1 Single database estimation
The protein interaction network queried from the MASC database consists of 96 proteins connected by 222 interactions. The network automatically generated by the application from the queried data is shown in Figure 4-1.
As discussed in previous chapters, this network is used as the reference to evaluate the networks queried from the other databases. Before the cluster comparison results are presented, the networks to be estimated are given in Figures 4-2 and 4-3. Since nothing could be queried from the DIP database, it is omitted from the following discussion.
The performance of NetPro and MINT diverges at first glance. NetPro obviously contains more molecules and more interactions than MINT, and thus outperforms both MINT and DIP. Because this project aims to quantify this kind of "goodness", the cluster comparison will yield a score for each estimated network.
Figure 4-1: The molecular interaction network queried from the MASC database. It contains 96 proteins connected by 222 interactions. Vertices highly connected to others are placed relatively near the centre in order to keep the network tidy.
Figure 4-2: The interaction network queried from the NetPro database. 56 proteins and 94 interactions are involved.
Figure 4-3: The molecular interaction network from the MINT database. It consists of 22 proteins with 16 interactions.
By running the cluster comparison algorithm, the cluster structures of these networks are first discovered by gradually dividing the networks and searching for the largest modularity values. Figure 4-4 shows the change of the modularity value as the networks are divided. As stated in Chapter 2, there is only one peak value when moving down the dendrogram. The difference is that the peak values for these networks do not come as early as those in [Newman 2004], which indicates that the cluster structures in these protein networks are less pronounced.
Figure 4-4: The modularity values of the networks queried from MASC, NetPro and MINT as their dendrograms are descended. Each curve contains a single peak value, indicating the most pronounced cluster structure in each network.
The cluster structures discovered in these networks are shown in Figure 4-5 to Figure
4-7.
Figure 4-5: The cluster structure of the MASC interaction network. There are 18 discovered clusters, two of which dominate most of the proteins and interactions.
Figure 4-6: The cluster structure of the NetPro network. There are 9 discovered clusters of which, again, two dominate most of the molecules and interactions.
Figure 4-7: The cluster structure of the MINT network. It consists of 7 clusters and performs obviously worse than NetPro.
With these cluster structures, the networks can be compared by calculating the probability that the molecules are correctly clustered, as discussed in Chapter 2. The result is shown in the table below:
Database                                   NetPro   MINT   DIP
Number of shared molecules with MASC       51       18     0
Number of correctly clustered molecules    23       6      0
Probability of correct clustering          24.0%    6.3%   0%

As presented in Chapter 2, the probability of correctly classifying the molecules in the MASC database using networks from other databases is calculated as the number of correctly clustered proteins divided by the number of molecules in the MASC database, which is 96 here.
According to the calculated cluster comparison scores listed in the table above, the
NetPro database obviously outperforms MINT and DIP, which is within our
expectation.
4.1.2 Robustness of cluster comparison algorithm
Normally, the constructed networks from databases contain some noise. The network
comparison should be robust to the noise to some extends. This means that, even
though the network is slightly different from the original MASC network, the
algorithm should “ignore” the noise and judge them as the same network. To
50
investigate the robustness of cluster comparison algorithm, the author compares the
MASC network with a noisy version of the same network by mutating the
interactions.
In particular, noise is added to the network by connecting or disconnecting pairs of molecules in the MASC network with probability $P_{noise}$, flipping the original edge state in either direction. The noisy network is then estimated against the original MASC network by running the cluster comparison algorithm. The score of the noisy MASC network compared with the original, as the noise increases from 0 to 90%, is given in Figure 4-8. Each comparison score is averaged over 30 experiments with the same amount of noise.
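The mutation step can be sketched as follows (illustrative Python operating on a 0/1 adjacency matrix; the function name is invented):

```python
import random

def perturb(A, p_noise, rng=random):
    """Flip each vertex pair's edge state (connect <-> disconnect)
    independently with probability p_noise, as in the robustness test."""
    n = len(A)
    B = [row[:] for row in A]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p_noise:
                B[i][j] = B[j][i] = 1 - B[i][j]  # keep the matrix symmetric
    return B
```

With `p_noise = 0` the network is returned unchanged, and with `p_noise = 1` every pair's state is inverted, so intermediate values interpolate between the original network and its complement.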
Because the noisy network contains exactly the same molecules as the original MASC network, the maximum comparison score is 1, indicating that the two are the same network when there is no mutation.
Furthermore, Figure 4-8 shows that the cluster comparison score for the perturbed MASC network decreases exponentially as the noise increases, finally converging to a low level at about 13%. The exponential decrease indicates that the cluster comparison algorithm is not robust to noise within a small range near 0, because a slight mutation can cause a huge change in the comparison score (a decrease from 1 to 60%). Fortunately, for a network constructed from another database, the percentage of noise normally lies outside this sensitive region (less than 1.5% noise).
Moreover, the comparison score converges to a non-zero value as the noise increases. This indicates that the cluster comparison algorithm cannot discriminate differential networks once the score falls below a boundary of around 16%.
For the network from NetPro, the maximum comparison score is 51/96 = 53.1%, so its comparison score of 24% can be rescaled to 45%, meaning that NetPro correctly clusters 45% of the 51 shared proteins. On the other hand, the cluster comparison score of MINT, 6.3%, can be rescaled to 33% according to its maximum comparison score of 18/96 = 18.8%. Therefore, both the original comparison scores and the rescaled scores, which lie within the discriminative region (neither too sensitive nor too dull), indicate that NetPro clearly outperforms MINT.
Figure 4-8: The cluster comparison score of the noisy MASC network against the percentage of noise. There are three regions as the amount of noise increases. When noise < 1.5% and the score is more than 60%, the comparison algorithm is over-sensitive and cannot be used as a robust estimator. When 1.5% < noise < 30% and the cluster comparison score is between 16% and 60%, the algorithm shows adequate ability and robustness to discriminate the noisy network, so it can be used as a good indicator of network divergence. When noise > 30% and the comparison score is smaller than 16%, the score changes little regardless of the noise percentage and therefore tells little about how good the network is. Note that the maximum score is 1 because the noisy network contains all vertices of the original MASC network.
4.1.3 Database collaboration
Having seen that NetPro is good at reconstructing the molecular interaction network, judged by comparison with the MASC database, we may expect a collaboration of databases to perform even better. To explore this, the author combines NetPro and MINT into a single new network and calculates its comparison score.
In this situation, the combined network shares 52 proteins with the MASC network, and 27 correctly clustered molecules give a better comparison score of 27/96 = 28%. As before, this score can be rescaled to 52% according to the maximum cluster comparison score of this network, 52/96 = 54%.
Although MINT on its own performs poorly at reconstructing the MASC network, the combination of NetPro and MINT still shows a slight enhancement of the comparison score, which suggests that the collaboration of databases is a possible way to construct better molecular interaction networks.
4.2 Functional component comparison
So far, the cluster comparison results have been presented and have shown reasonable performance in discriminating differential networks from various databases. As stated in the last part, the cluster comparison is only discriminative when the rescaled score is neither too large nor too small: the algorithm is over-sensitive when the score is large and too dull when it is small. Moreover, although the cluster comparison results demonstrate a certain ability to estimate different networks, there is no "correct" answer against which to verify them. Therefore, another method is tried in order to validate the previous algorithm.
The following sections will present the result of functional component comparison.
4.2.1 Component score and z-score
As mentioned in Chapter 2, the component score is calculated from the probability of each molecule being functionally relevant, which is computed using a topological property (random walk) of the interaction network. Figure 4-9 depicts the histogram of these probabilities (spatial learning is used as the function, as will be presented shortly). Only a small proportion of proteins have a high probability of being involved in the function, which is consistent with the biological observation that a cellular function is normally performed by a subset of proteins rather than by all of them.
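The random-walk computation itself is specified in Chapter 2 and is not reproduced here; purely as an illustration, a random walk with restart from the implicated proteins is one standard way to derive such relevance probabilities. The function, parameter values and final rescaling below are all assumptions, not the dissertation's exact formulation.

```python
import numpy as np

def relevance_probabilities(adj, implicated, restart=0.3, iters=200):
    """Illustrative relevance scores from a random walk with restart.

    adj        -- symmetric 0/1 adjacency matrix of the interaction network
    implicated -- indices of annotated proteins (relevance pinned near 1)
    """
    adj = np.asarray(adj, dtype=float)
    # Column-normalise the adjacency matrix into transition probabilities.
    col_sums = adj.sum(axis=0)
    col_sums[col_sums == 0] = 1.0
    W = adj / col_sums
    # Restart vector concentrated on the implicated proteins.
    r = np.zeros(adj.shape[0])
    r[list(implicated)] = 1.0 / len(implicated)
    p = r.copy()
    for _ in range(iters):
        p = (1 - restart) * W.dot(p) + restart * r
    return p / p.max()   # rescale so the implicated end of the scale is 1
```

On a small path network, the probability decays with distance from the implicated protein, reproducing the skewed histogram of Figure 4-9 in miniature.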
According to the algorithm in Chapter 2, the functional component is found by heuristically climbing to a maximum score s(ς), which is the probability of component ς being functionally relevant. Recall that the score s(ς) tends to decrease as the number of molecules in ς increases, and thus favours components with fewer proteins. Figure 4-10 depicts the component score against the number of molecules in the component. Each score is calculated as the mean over 800 randomly sampled sub-networks, as described in Chapter 2.
Figure 4-9: The histogram of the probability of being functionally relevant. Only a small proportion of molecules have a large probability (the right part of the figure), which is consistent with biological significance.
Since we have no prior knowledge about the size of the functional component, this preference for small components will mislead the functional component search.
Figure 4-10: The component score against the number of molecules in the component. Each score is calculated as the mean of 800 samples; each component contains at least 2 proteins, otherwise the score diverges to infinity.
Therefore, the algorithm uses the z-score to correct for this bias. In particular, the mean and standard deviation used to calculate the z-score must themselves have low variance, which is estimated with a jackknife estimator. Figure 4-11 depicts the variance of the mean and standard deviation of the component score for different sample capacities.
Figure 4-11: Jackknife variance of the component score mean (left) and standard deviation (right) against the number of molecules in the component. The variance values are plotted for sample capacities of 800, 1600 and 3200.
It can be seen that, regardless of sample capacity, the variance decreases exponentially as the number of molecules in the component increases. This is because, when a component contains only a few proteins, randomly sampled components vary greatly from one to another; as the number of proteins grows, the sampled components overlap more, which reduces the variance.
Moreover, as the sample capacity increases, the variance decreases, indicated by the curves moving towards the axis in Figure 4-11. This statistical fact allows the variance to be driven towards zero by increasing the sample capacity, which provides the foundation for obtaining stable means and standard deviations subject to small variance.
Therefore, with a sufficient number of component samples (about 3200 or 6400 in this project), a mean and standard deviation with small variance can be computed. These values are then used to normalise the component score s(ς) into the z-score z(ς), which has little bias with respect to component size.
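The two estimation steps above, the jackknife variance that certifies the sampled mean, and the normalisation of s(ς) into z(ς), can be sketched as follows; the sampling scheme and score function are illustrative placeholders for the dissertation's definitions.

```python
import random
import statistics

def jackknife_variance_of_mean(values):
    """Leave-one-out (jackknife) variance of the sample mean."""
    n = len(values)
    total = sum(values)
    loo = [(total - v) / (n - 1) for v in values]   # leave-one-out means
    grand = sum(loo) / n
    return (n - 1) / n * sum((m - grand) ** 2 for m in loo)

def make_zscore(score_fn, nodes, size, samples=3200, rng=None):
    """Return a z(.) normaliser for components of a given size (a sketch).

    Random components of `size` molecules are sampled, their scores give a
    mean and standard deviation, and any component score s is normalised to
    z = (s - mean) / std. `score_fn` stands in for the dissertation's s(.).
    """
    rng = rng or random.Random(0)
    nodes = list(nodes)
    scores = [score_fn(rng.sample(nodes, size)) for _ in range(samples)]
    mu = statistics.fmean(scores)
    sigma = statistics.pstdev(scores)
    return lambda s: (s - mu) / sigma
```

With enough samples the estimated mean and deviation stabilise, so z-scores for components of different sizes become directly comparable.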
55
4.2.2 Single database comparison
The algorithm for finding the active vertex set with maximum z-score is applied to the network from the MASC database as well as to the estimated databases, NetPro and MINT.
Take the MASC network as an example. Figure 4-12 shows a typical trace of the component z-score during the search. As expected, the score keeps increasing until it reaches a local maximum.
Figure 4-12: The change of the component z-score during the search for a local maximum.
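The greedy core of such a search can be sketched as follows; the scoring callable and the neighbourhood moves are illustrative assumptions, and the dissertation's Metropolis-type variant (Chapter 2) additionally accepts some score-decreasing moves to escape poor local maxima.

```python
def search_component(neighbours, zscore, seed, max_iters=1000):
    """Greedy hill-climb on the component z-score (an illustrative sketch).

    From a seed set, repeatedly try adding an adjacent molecule or dropping
    a member, accept the first move that increases z(component), and stop
    when no move improves the score (a local maximum).
    """
    component = set(seed)
    best = zscore(component)
    for _ in range(max_iters):
        frontier = {m for v in component for m in neighbours(v)} - component
        candidates = [component | {m} for m in sorted(frontier)]
        if len(component) > 2:
            candidates += [component - {v} for v in sorted(component)]
        improved = False
        for cand in candidates:
            z = zscore(cand)
            if z > best:
                component, best, improved = cand, z, True
                break
        if not improved:          # no neighbouring component scores higher
            break
    return component, best
```

On a toy path network with a score rewarding "relevant" members and penalising size, the search grows from a single seed to the relevant block and stops there.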
The proteins in MASC implicated in schizophrenia and spatial learning have been identified by manual curation of the published literature. Therefore, using the same implicated proteins, we compare the predictions produced by the networks from different databases.
For spatial learning, the implicated proteins are A0018 and A0011. The prediction using the MASC database is given in the table below. This prediction shares 13 proteins with the result reported by [Andrew 2005], which contains 20 predicted molecules; the difference is caused by using a different version of the MASC network.
PPID   A0001      A0002    A0003       A0007    A0008
Name   NR1        NR2A     NR2B        ACTN     Calmodulin

PPID   A0010      A0011    A0012       A0016    A0017
Name   Spectrin   CAMK2A   PLCg-1      Sap102   Tubulin

PPID   A0018      A0024    A0091       A0095    A0126
Name   Src        SynGAP   Actin       Dynamin  Grb-2

PPID   A0138      A0143    A0144       A0196    A0200
Name   RAF1       AKAP150  PKCepsilon  RasGAP   H-Ras

PPID   A0262      A0419    A1851
Name   NOS1       Myoxin   CaMKII Beta
On the other hand, the prediction using the network generated from the NetPro database is shown in the table below:
PPID   A0002     A0003    A0018   A0020   A0030
Name   NR2A      NR2B     Src     PTP1D   FAK2

PPID   A0095     A0126    A0138   A0181   A0265
Name   Dynamin   Grb-2    RAF1    Erk2    Erk1

PPID   A0268     A0434     A1851
Name   Rsk-2     EGF-164   CaMKII Beta
These data show that NetPro correctly recovers 7 functionally relevant proteins among the 51 proteins it shares with the MASC database. Therefore, according to the comparison method, the probability of the NetPro network correctly predicting the functionally relevant molecules for spatial learning is P_NetPro(Spatial learning) = 7/23 = 30.4%.
In contrast, the molecules predicted using the network from the MINT database are shown in the table below:
PPID   A0002   A0011    A0012    A0018   A00124
Name   NR2A    CAMK2A   PLCg-1   Src     Src
It can be seen that 4 molecules are correctly predicted using the MINT network, which shares 18 proteins with the MASC database. Therefore, P_MINT(Spatial learning) = 4/23 = 17.4%.
Consequently, according to the proteins predicted for the spatial learning function using the networks from different databases, NetPro again outperforms MINT.
To verify this, the author uses schizophrenia as a second function and calculates P_NetPro(Schizophrenia) and P_MINT(Schizophrenia). For schizophrenia, the implicated proteins are A0010, A0016 and A0107. The prediction using the network from MASC is given in the table below:
PPID   A0002     A0003    A0008        A0009    A0010
Name   NR2A      NR2B     Calmodulin   PRKA9    Spectrin

PPID   A0011     A0012    A0014        A0015    A0016
Name   CAMK2A    PLCg-1   Chapsyn-110  DLG1     Sap102

PPID   A0017     A0018    A0020        A0033    A0075
Name   Tubulin   Src      PTP1D        Rap2     Shank

PPID   A0081     A0091    A0095        A0107    A0123
Name   Shank2    Actin    Dynamin      GKAP     PP2B

PPID   A0138     A0143    A0144        A0266    A0292
Name   RAF1      AKAP150  PKCepsilon   MEK1     INA

PPID   A0333     A1851
Name   SNAP25    CaMKII Beta
To estimate the NetPro database, the prediction is calculated using the same implicated proteins and the network from NetPro:
PPID   A0001   A0003      A0012   A0013        A0014
Name   NR1     NR2B       PLCg-1  DLG4         Chapsyn-110

PPID   A0015   A0016      A0030   A0086        A0095
Name   DLG1    Sap102     FAK2    ZO-1         Dynamin

PPID   A0107   A0112      A0138   A0144        A0168
Name   GKAP    b-catenin  RAF1    PKCepsilon   PKCbeta

PPID   A0266   A0268
Name   MEK1    Rsk-2
It can be seen that 10 proteins are correctly predicted using the network from NetPro. Therefore, according to the comparison score, the probability of the NetPro network correctly predicting the functionally relevant molecules for schizophrenia is P_NetPro(Schizophrenia) = 10/27 = 37.0%, where 27 is the number of predicted molecules in the MASC network.
The same computation on the MINT database yields the following proteins:
PPID   A0002   A0003   A0011    A0013   A0107
Name   NR2A    NR2B    CAMK2A   DLG4    GKAP
There are 4 correctly predicted molecules using the MINT network for schizophrenia. Therefore, P_MINT(Schizophrenia) = 4/27 = 14.8%.
So far, the functional component comparison scores for both functions have shown that NetPro performs better than MINT. As with cluster comparison, we aim to "interpret" these scores and to see how discriminative they are with respect to differential networks, by controlling the difference and observing the tendency of the scores.
4.2.3 Robustness of functional component comparison
The robustness of functional component comparison is investigated with the same methodology as for cluster comparison. A noisy version of the MASC network is generated by adding or deleting edges with a specified probability. Then the score of this network is calculated with respect to the original MASC network. Figure 4-13 shows the scores of the mutated networks against the percentage of noise. Each score is calculated over 60 networks with the same proportion of noise.
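The mutation step can be sketched as follows; treating each possible edge as flipped independently with probability `noise` is an assumption, since the text only says that edges are added or deleted with a specified probability.

```python
import random

def mutate_network(edges, nodes, noise, rng=None):
    """Generate a noisy copy of a network for the robustness test (sketch).

    With probability `noise`, each existing edge is deleted and each absent
    edge is inserted; otherwise the edge is left as it is.
    """
    rng = rng or random.Random(0)
    edges = {frozenset(e) for e in edges}
    mutated = set()
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            e = frozenset((u, v))
            present = e in edges
            if rng.random() < noise:
                present = not present      # flip: delete or insert the edge
            if present:
                mutated.add(e)
    return mutated
```

With noise 0 the network is unchanged, and with noise 1 it becomes its complement; intermediate values give the graded "difference" plotted in Figure 4-13.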
It can be read from Figure 4-13 that the functional comparison score tends to decrease as the noise, which stands for the controlled quantity of "difference" in the network, increases. For both functions there is an abnormal score region from around 0.7, probably due to the effect of local maxima. Elsewhere, however, the comparison scores change smoothly, which means that the functional component comparison score is appropriately discriminative for differential networks.
Figure 4-13: The functional component comparison score for the mutated network against the proportion of noise. The smoothly changing score (except in the region around 0.7) shows good discriminative ability between different networks.
4.2.4 Database collaboration
The cluster comparison method showed slightly enhanced performance for the network generated from the combined NetPro and MINT databases. In this section, database collaboration is tested again under the criterion of functional component comparison.
Briefly, for the spatial learning function, the combined network correctly predicts 9 proteins from the same implicated molecules, giving an increased comparison score of 9/23 = 39.1%. The prediction accuracy of the combined network for schizophrenia is 10/27 = 37.0%, unchanged from NetPro alone.
Again, database combination seems to have a positive effect on the generated network. However, since MINT performs fairly poorly at constructing the MASC network, adding it is much like adding a drop of water to a river. Therefore, given the limited number of available databases, we cannot judge the influence of database collaboration conclusively until experiments on more databases are done.
4.3 Discussion
The results of the previous sections show that cluster comparison and functional component comparison reach a consistent estimate of the networks from different databases, which indicates that the two methods have some resolving power over differential networks. However, as reported earlier, each algorithm has its own weaknesses.
Cluster comparison only works well within a relatively narrow score region; outside it, the method is either over-sensitive or insensitive to changes in the network. This is a direct consequence of the hierarchical clustering algorithm. Since divisive clustering repeatedly removes the edges with the highest betweenness score until the network is divided into individual vertices, a slight mistake at a higher level of the dendrogram results in a tremendous change in the outcome, and noise within the network is exactly such a source of mistakes. As can be seen in Figure 4-8, less than 1% noise is enough to cause the cluster comparison score to drop from 1 to 70% or even lower.
As the permutation increases, the "noise" qualitatively becomes "errors" in the network. At this point the cluster comparison algorithm is quite capable of discriminating these errors, as indicated by the smoothly decreasing score in the middle part of Figure 4-8.
Moreover, no matter how the network is permuted, it will eventually be divided into clusters which are then matched to those in the reference network. As long as the estimated and reference networks share a certain number of vertices, the matched cluster structure will produce a non-zero score; this is why the cluster comparison score converges to a non-zero value in the end. Therefore, when the noise is too high, it is still not appropriate to use the cluster comparison algorithm, because the score is not discriminative enough to quantify the difference between the networks.
It can therefore be concluded that the cluster comparison algorithm should only be used to estimate networks that are neither too different from nor too similar to the reference network. This rule can be governed by the scale of the cluster comparison score: from the observation of Figure 4-8, as well as experience gained during the project, a rescaled cluster comparison score between 0.2 and 0.6 marks the most appropriate region in which to use the cluster comparison algorithm as an estimator of differential networks.
As for the functional component comparison method, it is better at network discrimination because its comparison score decreases smoothly as the noise increases. However, a problem arises when there are only a few predicted molecules in the reference network. In the extreme, if there are only 2 predicted molecules in the reference network and 1 in the estimated network, the score will be 50%, although this score carries little information about the "goodness" of the estimated network.
Therefore, a restriction of functional component comparison is that the number of molecules predicted from the reference network cannot be too low. To quantify this, we set a lower bound of 15 on the number of predicted proteins required in the reference network.
Moreover, the identification of the functional component uses a heuristic method to search for the molecules with a local maximum z-score. As is well known, such an algorithm suffers from the local-maximum problem and requires multiple runs to find a better result, which makes it computationally demanding.
All in all, both the cluster comparison and functional component comparison algorithms have restrictions that must be respected for them to work appropriately. They therefore naturally complement each other and can be used together to estimate a network. A systematic method of combining the two algorithms should be investigated further and will be my future work in this area.
Chapter 5 Conclusion and future work
5.1 Conclusion
After the detailed discussion of the algorithms and the presentation of the analysis results, differential protein interaction networks have been constructed and contrasted on the basis of both topological properties and biological significance.
Cluster comparison considers the molecular interaction network to consist of clusters, the important components that dominate the properties of a network and thus characterise the differences between divergent networks. The cluster structure identification algorithm looks for the edges in the network that are most "between" vertices, and repeatedly removes these edges until each cluster contains only an individual vertex. In this project, the author introduced the concept of a concise graph, a more compact expression of the original network without loss of information, which is especially useful for networks with the scale-free property. The calculation of shortest-path betweenness is accelerated by message propagation and back-propagation on the concise graph, as well as by efficient set operations such as hash sets and hash maps.
The process of dividing the network by removing the edges with the largest betweenness can be expressed as a dendrogram, from which a cluster structure is obtained by cutting the tree at some level. The modularity value of the cluster structure is used to decide the position at which the dendrogram should be cut.
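The modularity criterion for choosing the cut can be sketched as follows; this computes the standard Newman-Girvan Q for a given partition, with an edge-list representation chosen purely for illustration.

```python
def modularity(edges, communities):
    """Modularity Q of a partition: Q = sum over communities of e_ii - a_i^2,
    where e_ii is the fraction of edges inside community i and a_i the
    fraction of edge endpoints attached to community i."""
    m = len(edges)
    label = {v: i for i, comm in enumerate(communities) for v in comm}
    internal = [0.0] * len(communities)   # e_ii per community
    ends = [0.0] * len(communities)       # a_i per community
    for u, v in edges:
        if label[u] == label[v]:
            internal[label[u]] += 1 / m
        ends[label[u]] += 0.5 / m         # each endpoint contributes 1/(2m)
        ends[label[v]] += 0.5 / m
    return sum(e - a * a for e, a in zip(internal, ends))
```

For two triangles joined by a single bridge edge, splitting at the bridge gives Q = 5/14, whereas leaving everything in one cluster gives Q = 0, so the cut at the bridge is preferred.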
Using the cluster structures, and taking the MASC network as the reference, differential networks are estimated by the probability of correctly clustering the molecules in the MASC network (comparison score) or the molecules shared by MASC and the estimated network (rescaled comparison score). The comparison gives an estimate of the networks from various databases and demonstrates that the network from NetPro outperforms those from the MINT and DIP databases.
To understand the meaning of the cluster comparison outcome, the robustness of the algorithm was tested by contrasting a noisy MASC network with the original one. The result shows that cluster comparison is only discriminative enough when the rescaled comparison score is within a certain range (0.2-0.6).
Because of the limitations of the cluster comparison method, as well as the inconsistency between shortest-path betweenness and biological mechanism, functional component comparison was adopted to verify the conclusions from cluster comparison and to investigate a contrast method based on biological significance.
The functional component identification algorithm assigns to each molecule in the network a probability of being functionally relevant. The probabilities of the proteins implicated in the annotation are set to 1 and are then used to calculate the marginal probabilities of the extrapolated proteins based on the topological properties of the network.
Given the probability of each molecule, a search algorithm is applied to find the component with the highest probability of being functionally relevant. It uses a Metropolis-type algorithm, a heuristic method searching for a local maximum z-score. Although computationally intensive, several runs of the search algorithm achieve reasonable and robust comparison results.
The algorithm was carried out on two synaptic functions. Although rooted in biological significance, functional component comparison achieves the same ranking of NetPro, MINT and DIP as cluster comparison, which is based on topological properties.
Moreover, both cluster comparison and functional component comparison indicate that collaboration between different databases has the potential to enhance the performance of the combined network. Nevertheless, more databases are required to verify this statement, because the poor performance of MINT and DIP does not provide enough evidence of improvement.
Aside from the two comparison algorithms based on different perspectives of the protein interaction network, the author also presents a constraint-driven data retrieval technique. The algorithm uses the functional dependencies (key constraints) and inclusion dependencies (foreign key constraints) in a relational schema to construct an XML schema (a DTD in this project) together with a set of functional dependencies in the XML domain. The resulting DTD is used to guide data retrieval from the relational database as well as the parsing and validation of the retrieved XML document.
5.2 Future work
Following the results of this project, a great deal of further investigation is required.
First, more databases should be investigated in order to verify the capability of the two comparison algorithms on differential networks, especially their robustness, discriminative strength and computational efficiency.
Little has been presented about the interpretation of the cluster structure or of the functional component identification results. For example, in the functional component identification results there is significant overlap in prediction between different phenotypes, and the likelihood of a protein having multiple predicted phenotypes increases with its degree in the network [Andrew 2005]. Although these phenomena are beyond the scope of this project, they deserve deeper research based on more abundant data.
Moreover, the only reference network in this project is the one from the MASC database. Estimating other databases against this single network introduces bias: databases that perform well on the MASC network are not necessarily good at constructing other networks. Therefore, when comparing networks, more constraints such as protein domain, species, interaction type, etc. should be considered in order to narrow down the scope of the comparison.
As for the data retrieval technique, constraint propagation from relational schemas to XML schemas is currently an intensive research area with no uniform standard. Therefore, the algorithm in Chapter 3 should be tested with much more care and against more complicated scenarios. Besides, as mentioned in Chapter 3, the generated XML schema with its set of functional dependencies can be used as an internal schema during database publishing, by matching the given schema against this internal schema. This idea has value for database publishing with constraints, which is a demanding but underdeveloped area.
In all, the investigation in this project produced a set of cogent results which can contribute to current research on the MASC network. The two differential protein interaction network comparison algorithms, together with the constraint-driven data retrieval mechanism, provide a starting point for further research to which the author will continue to devote himself.
Appendix A Constraint-Driven Data Retrieval
The discussion begins with a formal definition of XML functional dependency (FD), the counterpart of relational constraints in the XML domain. A key constraint propagation algorithm is given based on this definition. Then a foreign key constraint propagation algorithm is presented, accomplished by pre-mapping and post-mapping steps. Both key constraints and foreign key constraints are encoded as XML functional dependencies in the generated XML schema.
A.1 XML functional dependency and key constraint propagation
To propagate constraints from a relational schema to XML, the counterparts of these constraints in XML have to be set up first, in order to describe the semantics of the constraints in the XML domain. This section focuses on a definition of XML functional dependency based on previous work. The algorithm for key constraint propagation is given at the end of the section.
A.1.1 Formalism of DTD and XML
It has been a popular approach to express a DTD as a context-free grammar, which has a solid theoretical foundation. In this section, however, an extended formal definition of DTD [Wenfei 2001] is used as the essential concept underlying the definition of functional dependency.
First, for convenience of explanation, the author lists some reserved notation that will be used throughout:
ELE is the set of all element names in a DTD;
ATT is the set of attribute names in a DTD; each element of ATT starts with "@";
VAL is the set of string values of attributes and elements;
ENID is the set of element identifiers, which uniquely identify elements in an XML document.
Definition: A DTD is defined as a quintuple D<E, A, TPM, BE, root> where:
(1) E ⊆ ELE is a set of element names;
(2) A ⊆ ATT is a set of attributes;
(3) TPM is a type mapping from E to element type definitions, which are string values or regular expressions. For each e ∈ E, either TPM(e) = string (a value in VAL), or TPM(e) is a regular expression of the form r ::= e′ | r|r | r+r | r* | ε, where e′ ∈ E, "|" denotes union, "+" denotes concatenation, "*" is the Kleene closure and ε denotes the empty element;
(4) BE is a mapping from E to A which records the affiliation between elements and attributes. For @a ∈ A, we say that @a is defined on e ∈ E iff @a ∈ BE(e);
(5) root ∈ E is the root of the DTD.
For example, according to the definition above, the DTD segment below can be rewritten in this formal notation:

<!ELEMENT Course_root (Course*)>
<!ELEMENT Course (Name, Lecturer)>
<!ATTLIST Course cid CDATA #REQUIRED>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Lecturer (#PCDATA)>

(1) E = {Course_root, Course, Name, Lecturer};
(2) A = {@cid};
(3) TPM(Course_root) = Course*; TPM(Course) = Name+Lecturer; TPM(Name) = string; TPM(Lecturer) = string;
(4) BE(Course) = {@cid};
(5) root = Course_root.

Based on the definition of DTD, a DTD path is used to retrieve the elements and attributes in a DTD. The DTD path is defined as follows:

Definition: A DTD path for a given DTD D<E, A, TPM, BE, root> is a string s = s_0 s_1 ... s_n, denoted s > D, iff:
(1) s_0 = root;
(2) for 0 < i < n, s_i ∈ E ∪ {string} is included in the regular expression TPM(s_{i-1});
(3) s_n ∈ E ∪ A ∪ {string} is either included in the regular expression TPM(s_{n-1}), or s_n = @att where @att ∈ BE(s_{n-1}); the latter case indicates that s_n is an attribute defined on s_{n-1}.
In the following discussion, D* = {s | s > D} denotes the set of all paths in DTD D, while D⁻ = {s = s_0 s_1 ... s_n | s ∈ D* ∧ s_n ∈ E} denotes the set of paths ending with elements. It follows that D* − D⁻ is the set of paths ending with attributes or string values.
In the example above, D* = {Course_root, Course_root+Course, Course_root+Course+Name, Course_root+Course+Name.string, Course_root+Course+Lecturer, Course_root+Course+Lecturer.string, Course_root+Course+@cid}, while D⁻ = {Course_root, Course_root+Course, Course_root+Course+Name, Course_root+Course+Lecturer}, where "+" denotes concatenation of strings.
So far, the XML schema has been defined and can be accessed by DTD paths. As an instance of the XML schema, an XML document can itself be modelled as a tree-structured graph [Peter 2001].
Definition: An XML tree is defined as XT<N, ENIDM, ELEM, ATTM, xroot> where:
(1) N ⊆ ENID is the set of nodes in XT; each element of N is a unique identifier of a node in the XML document;
(2) ENIDM is a mapping from N to ELE, which associates the node identifiers in the XML document with specific elements in the DTD;
(3) ELEM is a mapping from N to N* ∪ VAL, and is the counterpart of TPM in the definition of DTD;
(4) ATTM maps a node in N and an attribute to a specific value in VAL, denoted ATTM: N × ATT → VAL;
(5) xroot ∈ N is the root of XT.
In this definition, N is the whole set of elements of an XML document. For nodes n, n′ ∈ N, if n′ ∈ ELEM(n), we say that n is the parent of n′, or n′ is the child of n, in this XML document.
The definition above only describes the structure of an XML document; it does not make explicit the relationship between an XML document and its DTD. The definition of an XML document conforming to a DTD could be given directly [Wenfei 2001], but it is delayed until the next section, after the definition of the DTD realization, the DTD cast function.
A.1.2 DTD cast and XML functional dependency
Now that DTD and XML have been formally defined separately, the problem becomes how an XML document can be generated from, and made to conform to, a DTD. This is the same problem as the realization of a class as an object. In the same spirit, the generation of an XML document from a given DTD can be done by a "cast" function. There has been some original work on this, such as [Marelo 2002], but most contributions are limited to DTDs generated from a single relational schema (table). The concept is generalized here to multiple schemas by foreign key constraint propagation in the next section.
Definition: A cast function cas of a DTD D<E, A, TPM, BE, root> is defined as a mapping from D* to ENID ∪ VAL ∪ {null} such that:
(1) if s ∈ D⁻, then cas(s) ∈ ENID ∪ {null};
(2) if s ∈ D* − D⁻, then cas(s) ∈ VAL ∪ {null};
(3) for s1, s2 ∈ D⁻ with cas(s1), cas(s2) ∈ ENID, if cas(s1) = cas(s2) then s1 = s2; this is the same semantics as "node equality" in [Peter 2001];
(4) for s = s_0 s_1 ... s_n, if cas(s_0 ... s_{i-1}) = null, then cas(s_0 ... s_i) = null, where 1 ≤ i ≤ n.
The set of all possible cast functions on a DTD D is denoted CAS(D) = {cas | ∃s (s ∈ D* ∧ cas(s) ≠ null)}.
It can be seen that each non-null cast value maps (realizes) a DTD path to an element identifier or a string value in an XML document. As will be seen, a whole set of cast rules is then a realization of a DTD as an XML tree. Moreover, the cast function establishes the relationship between XML and DTD, which were defined separately in the previous section.
Definition: A path s = s_0 s_1 ... s_n is legal for a DTD D<E, A, TPM, BE, root> iff cas(s) ≠ null for some cas ∈ CAS(D).
According to the definition of CAS(D), cas(s_0 s_1 ... s_n) ≠ null means that cas(s_0 s_1 ... s_i) ≠ null for 0 ≤ i ≤ n.
From now on, CAS(D) implicitly refers to the set of cast functions on legal DTD paths of D, and the following discussion assumes that all paths used are legal.
As noted above, there has been previous work on the formalization of XML, but most of it concentrates on the data or content of XML rather than on the semantics [Susan 2002] defined above. The cast function ties XML and DTD together and provides a way to specify an XML tree conforming to a DTD according to a cast function.
However, the cast function defined above is only a general mapping from DTD paths to XML element identifiers or string values. To ensure that an XML document conforms to the corresponding DTD, an XML tree based on the cast function can be derived as follows:
Definition: For a given DTD D<E, A, TPM, BE, root> and a corresponding cast function cas ∈ CAS(D), the XML tree denoted cas_tree(D, cas) is the XML tree XT<N, ENIDM, ELEM, ATTM, xroot> such that:
(1) N = {n | n ∈ ENID ∧ ∃s (s ∈ D* ∧ n = cas(s))};
(2) for a legal DTD path s = s_0 s_1 ... s_m and a node n ∈ N, if n = cas(s), then ENIDM(n) = s_m. This rule states that if a path is cast to an element identifier in the XML document, the node is mapped by ENIDM to the last element of the DTD path; this is exactly the simple semantics of XPath;
(3) if n = cas(s), then ELEM(n) = {cas(s′) | s′ = s+I ∧ cas(s′) ≠ null, I ∈ E ∪ VAL}, where "+" denotes concatenation of DTD paths;
(4) given @att ∈ A, if n = cas(s) and cas(s+@att) ≠ null, then ATTM(n, @att) = cas(s+@att), where "+" again denotes concatenation.
The definition of cas_tree(D, cas) is itself the process of realizing a DTD as an XML document according to a cast function on the DTD. In this sense, the cas_tree function acts as an "assignment" to a DTD, and the result of the assignment is an XML document.
For example, given a DTD segment D:

<!ELEMENT Books (Book*)>
<!ELEMENT Book (Title, Copys)>
<!ATTLIST Book ISBN CDATA #REQUIRED>
<!ELEMENT Title (#PCDATA)>
<!ELEMENT Copys (Copy*)>
<!ELEMENT Copy (CopyNumber, Borrowed)>
<!ELEMENT CopyNumber (#PCDATA)>
<!ELEMENT Borrowed (#PCDATA)>

and a cast function cas:

cas(Books) = n0;
cas(Books+Book) = n1;
cas(Books+Book+@ISBN) = '123456';
cas(Books+Book+Title) = n2;
cas(Books+Book+Title.string) = 'Java Core';
cas(Books+Book+Copys) = n4;
cas(Books+Book+Copys+Copy) = n5;
cas(Books+Book+Copys+Copy+CopyNumber) = n6;
cas(Books+Book+Copys+Copy+CopyNumber.string) = '1';
cas(Books+Book+Copys+Copy+Borrowed) = n7;
cas(Books+Book+Copys+Copy+Borrowed.string) = 'Yes';

where string denotes the string value of the element.
Figure A-1: The XML tree generated by the cast function. The items in brackets are element identifiers assigned by the cast function, and the elements outside the brackets are the corresponding DTD elements or attributes.
The XML tree generated by cas_tree(D, cas) is depicted in Figure A-1.
If a different cast function is applied to the same DTD, the generated tree is expanded into a bigger one. For example, another cast function is defined as:

cas(Books) = n0;
cas(Books+Book) = n1;
cas(Books+Book+@ISBN) = '123456';
cas(Books+Book+Title) = n2;
cas(Books+Book+Title.string) = 'Java Core';
cas(Books+Book+Copys) = n4;
cas(Books+Book+Copys+Copy) = n8;
cas(Books+Book+Copys+Copy+CopyNumber) = n9;
cas(Books+Book+Copys+Copy+CopyNumber.string) = '2';
cas(Books+Book+Copys+Copy+Borrowed) = n10;
cas(Books+Book+Copys+Copy+Borrowed.string) = 'No';
The XML tree generated by this cast function, together with the previous one, is shown
in Figure A-2.
Figure A-2: The XML tree generated by two different cast functions. The subtree rooted at n8 is extended by the second cast function.
It can be seen that, as long as cas ∈ CAS(D), the repeated use of different cast
functions can produce arbitrarily complex XML conforming to a specific DTD D.
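The realization process above can be sketched in code. The following Python fragment is an illustrative sketch, not part of the dissertation's implementation: a cast function is represented as a dictionary from DTD paths (in the appendix's '+' notation, with '@' for attributes and '.string' for text content) to node identifiers or values, and is realized as an XML tree.

```python
import xml.etree.ElementTree as ET

def cas_tree(cas):
    """Realize a cast function (DTD path -> node id or value) as an XML tree.

    Paths use '+' as the step separator, '@name' steps denote attributes,
    and a '.string' suffix gives an element's text content, mirroring the
    notation of the appendix. Parents must appear before their children.
    """
    root = None
    nodes = {}  # DTD path -> the Element created for it
    for path, value in cas.items():
        steps = path.split("+")
        if steps[-1].startswith("@"):               # attribute cast
            nodes["+".join(steps[:-1])].set(steps[-1][1:], value)
        elif steps[-1].endswith(".string"):         # string-value cast
            nodes[path[: -len(".string")]].text = value
        elif len(steps) == 1:                       # root element cast
            root = nodes[path] = ET.Element(steps[-1], id=value)
        else:                                       # inner element cast
            parent = nodes["+".join(steps[:-1])]
            nodes[path] = ET.SubElement(parent, steps[-1], id=value)
    return root

cas = {
    "Books": "n0",
    "Books+Book": "n1",
    "Books+Book+@ISBN": "123456",
    "Books+Book+Title": "n2",
    "Books+Book+Title.string": "Java Core",
    "Books+Book+Copys": "n4",
    "Books+Book+Copys+Copy": "n5",
    "Books+Book+Copys+Copy+CopyNumber": "n6",
    "Books+Book+Copys+Copy+CopyNumber.string": "1",
    "Books+Book+Copys+Copy+Borrowed": "n7",
    "Books+Book+Copys+Copy+Borrowed.string": "Yes",
}
tree = cas_tree(cas)
```

Running cas_tree on the first cast function above reproduces the tree of Figure A-1, with each element carrying its node identifier as an id attribute.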
The definition above explains how to construct an XML tree with a cast function.
However, an arbitrary cast, including casts of illegal DTD paths, does not usually
conform to a DTD, nor can it be derived from an arbitrary XML document. A more common
situation is to restrict a set of cast functions to a given XML document and DTD, which
is defined as follows.
Definition: Given a DTD D and an XML tree XT that conforms to D, a minimal set of cast
functions CAS(D,XT) is defined as:
{cas | cas ∈ CAS(D), cas_tree(D, cas) is a subtree of XT, ¬∃cas′ ∀s (cas′ ∈ CAS(D),
s ∈ D*, (cas ≠ cas′ ∧ cas′(s) ≠ null) → cas′(s) = cas(s))}.
This means that there are no two different cast functions cas and cas′ in CAS(D,XT)
such that cas′(s) = cas(s) whenever cas′(s) ≠ null. This definition ensures that the
cast functions in CAS(D,XT) for a specific DTD D and XML document XT are not
redundant: any two different cast functions will generate different elements,
attributes or string values in the XML.
After this long journey through the definition of the cast function, XML functional
dependency can finally be defined:
Definition: Given a DTD D and an XML tree XT, for S, S′ ⊆ D*, S′ is functionally
dependent (FD) on S, denoted S → S′, iff:
∀cas ∀cas′ (cas, cas′ ∈ CAS(D,XT), ∀s (s ∈ S ∧ cas(s) ≠ null ∧ cas(s) = cas′(s))
→ ∀s′ (s′ ∈ S′ ∧ cas(s′) = cas′(s′))) is true, where "→" in this statement is the
logical operation "implies" rather than "dependency".
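This quantified definition can be checked mechanically over a finite set of cast functions: whenever two of them agree (with non-null values) on every path in S, they must also agree on every path in S′. A minimal Python sketch, with cast functions represented as dictionaries and abbreviated hypothetical paths:

```python
from itertools import combinations

def satisfies_fd(cast_functions, S, S_prime):
    """Check the XML FD S -> S' over a set of cast functions.

    Each cast function is a dict mapping DTD paths to values; a missing
    path plays the role of null in the definition.
    """
    for cas, cas2 in combinations(cast_functions, 2):
        agree_on_S = all(
            cas.get(s) is not None and cas.get(s) == cas2.get(s) for s in S
        )
        if agree_on_S and any(cas.get(t) != cas2.get(t) for t in S_prime):
            return False
    return True

# Two casts sharing @sid but realizing Name as different elements (E1 vs E2),
# anticipating the Student example discussed later in the appendix.
c1 = {"Student+@sid": "00001", "Student+Name": "E1", "Student+Name+@Name": "Sb"}
c2 = {"Student+@sid": "00001", "Student+Name": "E2", "Student+Name+@Name": "Sb"}

satisfies_fd([c1, c2], {"Student+@sid"}, {"Student+Name+@Name"})  # holds
satisfies_fd([c1, c2], {"Student+@sid"}, {"Student+Name"})        # violated
```

The second call fails because the two cast functions agree on @sid but assign different element identifiers to the Name path, which is exactly the "ill FD" situation analysed in the following sections.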
Because the discussion in the following sections concerns the usage of XML functional
dependency, examples for this definition are delayed until the detailed discussion of
constraint propagation.
So far, functional dependency on an XML schema has been defined based on the cast
function, which is a mechanism for realizing a DTD as an XML document. Fundamentally,
all these definitions are built on DTD paths and the formalized expressions of DTD
and XML.
In the next section, relational functional dependency propagation will be solved by
re-expressing the same semantics of the key constraint in the XML domain: XML
functional dependency. After this, foreign key constraint propagation will be
discussed in more complex situations.
A.1.3 Key constraint propagation
The purpose of this section can be described as follows: given a relational schema
T(A_1, A_2, ..., A_n) and a corresponding key constraint Σ_RFD, find a DTD
D<E, A, TPM, BE, root> as well as a set of FDs Σ_FD which have the same semantics as
T and Σ_RFD. The whole process is a function mapping a relational schema with a key
constraint to an XML schema with corresponding functional dependencies.
Given a relational schema T(A_1, ..., A_n) with key attributes {A_k1, ..., A_km} ⊆
{A_1, ..., A_n}, the corresponding DTD D<E, A, TPM, BE, root> with FDs Σ_FD can be
derived by the algorithm TableDTD(T), described below:
(1) E = {T_root, T} ∪ ({A_1, ..., A_n} − {A_k1, ..., A_km});
(2) A = {@A_1, ..., @A_n};
(3) TPM(T_root) = T*;
(4) Suppose {A_nk1, ..., A_nkp} = {A_1, ..., A_n} − {A_k1, ..., A_km}; then
TPM(T) = A_nk1 + ... + A_nkp, and for all A_nki ∈ {A_nk1, ..., A_nkp}, TPM(A_nki) = ε;
(5) BE(T) = {@A_k1, ..., @A_km}; for all A_nki ∈ {A_nk1, ..., A_nkp},
BE(A_nki) = {@A_nki};
(6) Σ_FD = {{T_root+T+@A_k1, ..., T_root+T+@A_km} → {T_root+T+A_nk1,
T_root+T+A_nk1+@A_nk1, ..., T_root+T+A_nkp, T_root+T+A_nkp+@A_nkp}};
The process above can be denoted as (D, Σ_FD) = TableDTD(T). In particular, the root
element T_root corresponds to the whole table of relational schema T, and element T
stands for a tuple in the relation. Moreover, TableDTD maps the key attributes of the
relational schema to attributes directly under element T, and maps the non-key
attributes to XML attributes under the sub-elements of T.
The key constraint is re-expressed in the XML domain by (6). The FDs can be divided
into two parts: {T_root+T+@A_k1, ..., T_root+T+@A_km} → {T_root+T+A_nki} and
{T_root+T+@A_k1, ..., T_root+T+@A_km} → {T_root+T+A_nki+@A_nki}. These two kinds of
FDs mean that the key attributes in an XML document not only uniquely identify the
values of the non-key attributes but also uniquely identify the elements one level
above those non-key attributes. This is slightly different from the notation in
relational schema. As will be seen in the following sections, this property of XML
FDs is essential for propagating the key constraints as well as the foreign key
constraints.
As an example of key constraint propagation, the relational schema Course(cid, Name,
Lecturer) can be mapped to an XML schema D<E, A, TPM, BE, root> with a set of FDs
Σ_FD as:
E = {Course_root, Course, Name, Lecturer};
A = {@cid, @Name, @Lecturer};
TPM(Course_root) = Course*; TPM(Course) = Name + Lecturer;
TPM(Name) = ε; TPM(Lecturer) = ε;
BE(Course) = {@cid}; BE(Name) = {@Name}; BE(Lecturer) = {@Lecturer};
Σ_FD = {{Course_root+Course+@cid} → {Course_root+Course+Name,
Course_root+Course+Name+@Name, Course_root+Course+Lecturer,
Course_root+Course+Lecturer+@Lecturer}}.
It can be seen that the construction of an XML schema for a single relational schema
with a key constraint is relatively easy. This is partly because the key constraint is
straightforward and does not involve nontrivial structures such as recursion and
loops. Moreover, the key constraint is confined to a single table and thus does not
interact with other constraints. In the next section, it will be seen that when
dealing with the foreign key constraint, which is an inter-schema relationship, things
get messy, and some tricks are needed to solve foreign key propagation.
A.2 Foreign key constraint propagation
So far, each individual relational schema with a key constraint has been converted
into an XML schema with FDs. However, a relational database seldom contains a single
table; more commonly, tables are connected by foreign keys and form a set of
inter-schema constraints, which give rise to harder problems when we want to keep
these constraints in the generated XML schema. The following sections focus on these
nontrivial situations of foreign key constraint propagation and finally arrive at a
formal algorithm.
A.2.1 Foreign key driven pre-mapping
In this section, the author first presents a partial mapping method from relational
schemas with foreign key constraints to an XML schema with FDs. This is done by
following the foreign key constraints among schemas and continuously aggregating the
schemas generated by key constraint propagation into a bigger XML schema. However, it
will be seen that this method cannot preserve all of the foreign key constraints, so
an "adjusting" step is needed after this pre-mapping.
First of all, a graphical expression will be used to describe the foreign key constraints
among relational schemas.
Definition: Given relational schemas DB(T_1, T_2, ..., T_n), where each T_i is a
single relational schema, a foreign key graph GDB<TB, FORMAP, TBroot> is defined as
below:
(1) TB = {T_1, T_2, ..., T_n};
(2) FORMAP(T_i) = {T_j | T_j contains a foreign key, which is the primary key of T_i,
as T_j's own primary key or part of its primary key};
(3) TBroot = {T_i | ¬∃T_j (T_i ∈ FORMAP(T_j))}.
For example, the relational schemas below are simple ones where the foreign key is
ISBN in relation Copy:
Book(ISBN, Title, Author)
Copy(ISBN, CopyNumber, Borrowed)
These schemas can be expressed as GDB<TB, FORMAP, TBroot>, where
(1) TB={Book, Copy};
(2) FORMAP(Book)={Copy};
(3) TBroot={Book}.
More complex examples will be given during the discussion below.
Note that the foreign key graph does not contain the detailed information of each
schema; that information is held in the DTD graphs generated from the individual
schemas by key constraint propagation.
In the following sections, the construction of XML schemas based on different kinds of
foreign key graphs will be discussed and formalized into the pre-mapping algorithm.
A.2.2 One to many relationship and weak entity
The most common use of a foreign key is the so-called "one-to-many" relationship, in
which schema T_j contains only the foreign key from another schema T_i as its own
primary key or part of its primary key. This is the simplest situation and can be
dealt with by adding the corresponding XML schema of T_j as a child element of T_i's.
For example, the DTD graphs for the Book-Copy schemas can be individually derived as
in Figure A-3 by the key constraint propagation algorithm.
Figure A-3: The DTDs generated by key constraint propagation from relational schemas Book and Copy.
The generated key constraints within these two schemas are:
{Book_root+Book+@ISBN} → {Book_root+Book+Title, Book_root+Book+Title+@Title,
Book_root+Book+Author, Book_root+Book+Author+@Author};
{Copy_root+Copy+@ISBN, Copy_root+Copy+@CopyNumber} → {Copy_root+Copy+Borrowed,
Copy_root+Copy+Borrowed+@Borrowed}.
Recall that all of the values in a relation are stored in the attributes of the XML,
so the elements Title, Author and Borrowed here all have their own attributes for
storing the data. They are omitted for convenience of presentation.
It can be seen that relation Copy is actually a weak entity and needs the foreign key
from another entity, Book, to uniquely identify a copy of a book. The merged DTD based
on this one-to-many foreign key constraint between Book and Copy is shown in Figure
A-4:
Figure A-4: The resulting DTD of foreign key constraint propagation from the Book-Copy relational schemas. The foreign key attribute @ISBN has been removed because it can be derived from the one in the Book schema.
The FDs in this DTD are modified to:
{Book_root+Book+@ISBN} → {Book_root+Book+Title, Book_root+Book+Title+@Title,
Book_root+Book+Author, Book_root+Book+Author+@Author};
{Book_root+Book+@ISBN, Book_root+Book+Copy_root+Copy+@CopyNumber} →
{Book_root+Book+Copy_root+Copy+Borrowed,
Book_root+Book+Copy_root+Copy+Borrowed+@Borrowed}.
What has been done here is the following: put the DTD D_i for relational schema T_i,
which contains the foreign key from T_j, as a child of the DTD D_j of T_j, from which
the foreign key comes; extend the paths in D_i's FDs with the corresponding paths of
D_j; and rewrite the FDs in D_i with the new paths. Moreover, the foreign key
attribute in T_i from T_j has been knocked out because it can be derived from its
parent T_j. By doing this, the foreign key constraint is converted into a set of FDs
in the DTD.
For example, the previous foreign key constraint:
{Copy_root+Copy+@ISBN, Copy_root+Copy+@CopyNumber} → {Copy_root+Copy+Borrowed,
Copy_root+Copy+Borrowed+@Borrowed}
in the example above can be written as:
{Book_root+Book+@ISBN, Book_root+Book+Copy_root+Copy+@CopyNumber} →
{Book_root+Book+Copy_root+Copy+Borrowed,
Book_root+Book+Copy_root+Copy+Borrowed+@Borrowed}
where Book_root+Book+Copy_root+Copy+@ISBN has been knocked out from the XML schema
for relational schema Copy and is substituted by Book_root+Book+@ISBN, which is the
key of relational schema Book.
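The path rewriting performed during this merge can be sketched as follows. This is an illustrative Python fragment; the helper name and the dictionary-based FD representation are assumptions, not the dissertation's code.

```python
def merge_one_to_many(child_fds, parent_path, fk_subst):
    """Rewrite a child schema's FDs when it is nested under a parent.

    Every path is prefixed with the parent's path; paths of knocked-out
    foreign key attributes are substituted by the parent's key paths
    (fk_subst maps prefixed foreign-key paths to the parent key paths).
    """
    def rewrite(path):
        full = parent_path + "+" + path
        return fk_subst.get(full, full)

    return [([rewrite(p) for p in lhs], [rewrite(p) for p in rhs])
            for lhs, rhs in child_fds]

# The Copy FD before merging, and the substitution knocking out @ISBN:
copy_fds = [(["Copy_root+Copy+@ISBN", "Copy_root+Copy+@CopyNumber"],
             ["Copy_root+Copy+Borrowed", "Copy_root+Copy+Borrowed+@Borrowed"])]
merged = merge_one_to_many(
    copy_fds,
    parent_path="Book_root+Book",
    fk_subst={"Book_root+Book+Copy_root+Copy+@ISBN": "Book_root+Book+@ISBN"},
)
```

The result is exactly the rewritten Book-Copy FD shown above: the left side becomes {Book_root+Book+@ISBN, Book_root+Book+Copy_root+Copy+@CopyNumber}.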
An interesting result emerges: the foreign key constraints of the relational schema
are now expressed as functional dependencies in XML. The reason is that the "set"
value, which is forbidden in relational schemas, can easily be realized in XML
because of its semistructured nature.
From this perspective, it seems easy to propagate foreign key constraints in this
sort of "one-to-many" relationship. However, things get tricky when more schemas are
involved.
A.2.3 Many to many relationship
Another common foreign key constraint among schemas is the many-to-many relationship,
where a schema contains two or more foreign keys from other schemas. This is depicted
in the Student-Enroll-Course example:
Figure A-5: The DTD generated by key constraint propagation from relational schemas Course, Student and Enroll.
These schemas show that schema Enroll has foreign keys from both Student (Student.sid)
and Course (Course.cid). Because Student-Enroll and Course-Enroll are each one-to-many
relationships, if the same approach is tried as in the one-to-many situation, a
problem arises: which DTD should contain Enroll as a child? This problem can be
described by the foreign key graph in Figure A-6. The solid arrows mean that Enroll
has foreign keys from both Student and Course. These two arrows clash at relation
Enroll, which causes the problem just described.
Figure A-6: The foreign key graph of the Student-Enroll-Course relational schema. The solid arrows show that relation Enroll contains foreign keys from Student and Course. The clash is resolved by reversing one of the arrows, depicted by the dashed line.
The proposed solution to this dilemma is straightforward: break the clash by reversing
one of the arrows to the other direction, as shown by the dashed arrow in Figure A-6.
In this case, Enroll is put into Course as a child, and Student is put into Enroll as
a child. The problem seems to be solved.
However, the reversal of the arrow changes the semantics of the foreign key
constraint. Obviously, it is quite abnormal for schema Student to contain a foreign
key from Enroll. To see the problem, let's explore the example further.
Before merging the XML schemas into one schema, the original key constraints of the
three individual schemas can be written as:
{Course_root+Course+@cid} → {Course_root+Course+Name, Course_root+Course+Name+@Name,
Course_root+Course+Lecturer, Course_root+Course+Lecturer+@Lecturer};
{Student_root+Student+@sid} → {Student_root+Student+Name,
Student_root+Student+Name+@Name};
{Enroll_root+Enroll+@cid, Enroll_root+Enroll+@sid} → {Enroll_root+Enroll+Grade,
Enroll_root+Enroll+Grade+@Grade}.
To see the problem caused by using the simple one-to-many approach, the author
constructs the DTD, and an XML document conforming to it, by adopting the rules of the
one-to-many situation, except that the key in Student is conserved while @sid and @cid
are knocked out of Enroll, which contains multiple foreign keys.
In this new DTD, according to the algorithm for the one-to-many relationship, the
original key constraint in Student should be rewritten as:
{Course_root+Course+Enroll_root+Enroll+Student_root+Student+@sid} →
{Course_root+Course+Enroll_root+Enroll+Student_root+Student+Name,
Course_root+Course+Enroll_root+Enroll+Student_root+Student+Name+@Name}
Figure A-7: Left: the DTD constructed by reversing the arrow in the foreign key graph. Right: the corresponding XML document conforming to the DTD. The circles demonstrate the problem of reversing the arrow: the key attributes of Enroll cannot uniquely identify the Name elements E1 and E2.
However, is this really true? From the XML on the right of Figure A-7, it can be seen
that, for the path Course_root+Course+Enroll_root+Enroll+Student_root+Student+@sid,
any two cast functions having the same value '00001' on it will have the same cast
result 'Sb' for the path Course_root+Course+Enroll_root+Enroll+Student_root+Student+
Name+@Name. However, their casts for the path Course_root+Course+Enroll_root+Enroll+
Student_root+Student+Name can be either E1 or E2, as denoted in Figure A-7. This means
that the functional dependency:
{Course_root+Course+Enroll_root+Enroll+Student_root+Student+@sid} →
{Course_root+Course+Enroll_root+Enroll+Student_root+Student+Name}
no longer holds after the inversion of the arrow in the foreign key graph of Figure
A-6.
This exception causes serious problems when we update or delete data in an XML
document conforming to this sort of DTD. For example, if we modify the attribute value
in the Name element denoted E1, the element E2 does not know what has happened. The
result: two students with the same student ID have different names, which is
obviously inconsistent with the original key constraint in Student.
For the reasons above, the semantics of the generated DTD are not consistent with the
original constraint of the relational schema. The solution here is simply to preserve
the inconsistency and record it via the missing FDs in the newly generated DTD during
the pre-mapping step. The inconsistency is then resolved in the post-mapping step,
described later.
Because the foreign key constraint:
{Course_root+Course+Enroll_root+Enroll+Student_root+Student+@sid} →
{Course_root+Course+Enroll_root+Enroll+Student_root+Student+Name}
no longer holds, it is deleted as "evidence" of the problem described above. All of
the operations above result in the following FDs in the generated DTD:
{Course_root+Course+@cid} → {Course_root+Course+Name, Course_root+Course+Name+@Name,
Course_root+Course+Lecturer, Course_root+Course+Lecturer+@Lecturer};
{Course_root+Course+@cid, Course_root+Course+Enroll_root+Enroll+
Student_root+Student+@sid} → {Course_root+Course+Enroll_root+Enroll+Grade,
Course_root+Course+Enroll_root+Enroll+Grade+@Grade};
{Course_root+Course+Enroll_root+Enroll+Student_root+Student+@sid} →
{Course_root+Course+Enroll_root+Enroll+Student_root+Student+Name+@Name}.
Generally, for a schema that contains more than two foreign keys from other schemas,
as shown in the foreign key graph of Figure A-8, the solution is similar: keep one of
the arrows and reverse the others.
Figure A-8: A more general situation of the many-to-many relationship, in which D contains several foreign keys from other relations. The pre-mapping solution keeps one arrow and reverses all of the others.
Figure A-9 shows the resulting structure. The details within each schema are not shown
and are exactly the same as in the Student-Enroll-Course example.
Figure A-9: A brief expression of the DTD after reversing the arrows in Figure A-8.
As a brief explanation, the reversed arrows in the foreign key graph put the XML
schemas for A and C as children of D. The reversal problem is denoted by the ill FDs
in A and C, similar to the Student-Enroll-Course example.
A.2.4 More complex relationship
Up to this point, the author has discussed the two most common situations of foreign
key constraint propagation. However, things are usually more complex than this.
Consider the scenario in Figure A-10.
Figure A-10: The foreign key graph of the Researcher-Model-Experiment relational schema. Experiment has foreign keys from both Researcher and Model, forming a many-to-many relationship, and Model has a foreign key from Researcher, a one-to-many relationship.
It says that a researcher can set up several models; thus Model contains the key of
Researcher as part of its primary key and becomes a weak entity. This situation is
nothing more than a one-to-many relationship, which has been solved before. However,
when we come to relation Experiment, things become complicated.
Relation Experiment contains a foreign key from Researcher, which indicates who
carried out the experiment, and another foreign key from Model, which indicates the
model the experiment is based on. If one tries the many-to-many solution on Researcher
and Model, it turns out to be impossible to demote either Researcher or Model to be a
child of Experiment. Specifically, if one simply reverses the arrow from Researcher to
Experiment, the foreign key graph becomes recursive (a loop); while if one reverses
the arrow from Model to Experiment, the same problem that Experiment had arises with
relation Model.
The way to solve this problem is to "split" relation Researcher into two copies,
breaking the cycle in the foreign key graph of Figure A-10, and then reverse the arrow
toward the replicated schema (the new Researcher). The foreign key graph is thus
transformed into two one-to-many relationships, as depicted in Figure A-11.
Figure A-11: The transformed foreign key graph, obtained by duplicating Researcher and reversing the arrow from Experiment to the new Researcher.
Generally, this process has two steps: reverse the arrow from Researcher to
Experiment, and then break the cycle by splitting Researcher into two copies.
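The two steps can be sketched as a transformation on a dictionary-based foreign key graph. This is an illustrative Python fragment; the function name and the convention of naming the duplicate by appending "1" are assumptions made for the sketch.

```python
def reverse_and_split(formap, target, keep):
    """Keep one arrow into `target`; for every other source, drop its arrow,
    duplicate the source schema, and add a reversed arrow target -> copy.

    `formap` is an adjacency dict (schema -> list of schemas containing
    its key as a foreign key); the input graph is left unmodified.
    """
    new = {t: list(ts) for t, ts in formap.items()}
    sources = [s for s, ts in formap.items() if target in ts and s != keep]
    for s in sources:
        new[s].remove(target)                    # drop the arrow s -> target
        copy = s + "1"                           # duplicate the source schema
        new.setdefault(copy, [])
        new.setdefault(target, []).append(copy)  # reversed arrow target -> copy
    return new

FORMAP = {"Researcher": ["Model", "Experiment"], "Model": ["Experiment"]}
g = reverse_and_split(FORMAP, target="Experiment", keep="Model")
```

On the Researcher-Model-Experiment graph this yields exactly the shape of Figure A-11: Researcher -> Model -> Experiment -> Researcher1, two one-to-many chains with no cycle.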
Before going on to the construction of FDs in the generated XML schema, let's look at
a more general situation, shown in Figure A-12.
Figure A-12: The foreign key graph of a more general situation. The relations in A series and B series propagate the foreign key constraints to the same relation C.
It can be seen that the relations in the A series and B series propagate their foreign
key constraints to the same relation C. This is a more general situation, a hybrid of
the many-to-many relationship and the complex relationship discussed above.
No matter what has happened to the foreign key constraints on the way from A0 to C,
the solution is the same: keep one arrow into C unchanged and reverse the other arrows
into C; then split the schemas pointed to by the reversed arrows. The result can be
expressed as Figure A-13.
Figure A-13: The resulting foreign key graph after keeping one arrow into C unchanged, reversing the other arrows into C, and splitting the schemas pointed to by the reversed arrows.
The result of the solution above is obviously inconsistent with the original semantics
of the foreign key constraints, especially when nodes are split into two copies. The
lost information, in this case, is recorded by the ill FDs in the newly generated DTD.
Specifically, for the Researcher-Model-Experiment example, suppose the relational
schemas are:
Researcher(rid, rname)
Model(rid, mid, mname)
Experiment(rid1, rid, mid, result)
where rid1 is the foreign key from Researcher, and (rid, mid) is the foreign key from
Model. The FDs in the DTD after pre-mapping the Researcher-Model-Experiment schemas
should be:
{Researcher_root+Researcher+@rid} → {Researcher_root+Researcher+rname,
Researcher_root+Researcher+rname+@rname};
{Researcher_root+Researcher+@rid, Researcher_root+Researcher+Model_root+Model+@mid}
→ {Researcher_root+Researcher+Model_root+Model+mname,
Researcher_root+Researcher+Model_root+Model+mname+@mname};
After schema Experiment is added, according to the one-to-many solution, the FDs for
schemas Researcher and Model in the DTD are unchanged. The FDs for Experiment and the
duplicated Researcher are:
{Researcher_root+Researcher+@rid, Researcher_root+Researcher+Model_root+Model+@mid,
Researcher_root+Researcher+Model_root+Model+Experiment_root+Experiment+
Researcher1_root+Researcher1+@rid} →
{Researcher_root+Researcher+Model_root+Model+Experiment_root+Experiment+result,
Researcher_root+Researcher+Model_root+Model+Experiment_root+Experiment+result+@result};
{Researcher_root+Researcher+Model_root+Model+Experiment_root+Experiment+
Researcher1_root+Researcher1+@rid} →
{Researcher_root+Researcher+Model_root+Model+Experiment_root+Experiment+
Researcher1_root+Researcher1+rname+@rname}
where Researcher1 is the duplicate of Researcher.
Note that the constraint:
{Researcher_root+Researcher+Model_root+Model+Experiment_root+Experiment+
Researcher1_root+Researcher1+@rid} →
{Researcher_root+Researcher+Model_root+Model+Experiment_root+Experiment+
Researcher1_root+Researcher1+rname}
is eliminated, for the same reason as in the many-to-many relationship, to form the
ill FD.
Because of the duplication and arrow reversal in the foreign key graph, pre-mapping
causes the same ill FDs as those in the many-to-many relationship and gives rise to
insertion and updating problems. These will be solved in the post-mapping step, after
the next section formalizes the pre-mapping algorithm.
A.2.5 Formalized pre-mapping algorithm
Before proceeding to the formal algorithm, the author defines the hierarchy of the
foreign key graph, which will be convenient in the following discussion.
Definition: Given a foreign key graph GDB<TB, FORMAP, TBroot> and T ∈ TB, the level of
T, denoted level(T), is defined as the length of the longest path starting from a node
in TBroot to T. Besides, level(GDB, i) is defined as {T | level(T) = i}.
For example, in the Researcher-Model-Experiment example, level(Researcher)=0,
level(Model)=1, level(Experiment)=2.
Definition: Given a foreign key graph GDB<TB, FORMAP, TBroot>, the diameter of GDB is
defined as diam(GDB) = max{level(T) | T ∈ TB}.
For example, the diameter of the Researcher-Model-Experiment example is 2, the length
of the longest path from Researcher to Experiment in the foreign key graph.
Definition: Given a foreign key graph GDB<TB, FORMAP, TBroot> and T ∈ TB, the function
pre(T) is defined as {T′ | T ∈ FORMAP(T′)}.
For instance, in the Researcher-Model-Experiment example, pre(Researcher) = null,
pre(Model) = {Researcher}, pre(Experiment) = {Researcher, Model}, where null denotes
the empty set.
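Representing FORMAP as an adjacency dictionary, the three functions level, diam and pre can be computed directly. A Python sketch for the Researcher-Model-Experiment example, assuming (as the level definition requires) that the graph is acyclic:

```python
# FORMAP as an adjacency dict: T -> schemas that contain T's key as a foreign key
FORMAP = {"Researcher": ["Model", "Experiment"],
          "Model": ["Experiment"],
          "Experiment": []}

# TBroot: schemas that appear in no FORMAP set
TBroot = [t for t in FORMAP
          if not any(t in targets for targets in FORMAP.values())]

def pre(t):
    """pre(T): the schemas whose FORMAP sets contain T."""
    return {s for s, targets in FORMAP.items() if t in targets}

def level(t):
    """Length of the longest path from a root to t."""
    preds = pre(t)
    return 0 if not preds else 1 + max(level(p) for p in preds)

def diam():
    """diam(GDB): the maximum level over all schemas."""
    return max(level(t) for t in FORMAP)
```

On this graph the functions reproduce the values quoted above: level(Researcher) = 0, level(Model) = 1, level(Experiment) = 2, and diam(GDB) = 2.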
Definition: Given two DTDs D<E, A, TPM, BE, root> with FDs Σ_FD and
D′<E′, A′, TPM′, BE′, root′> with FDs Σ′_FD, the extension of D by D′ at node e ∈ E
with DTD path s_e ∈ D*, denoted D ∪_e D′, is defined as a new DTD
D_new<E_new, A_new, TPM_new, BE_new, root_new> with FDs Σ_FDnew where:
(1) E_new = E ∪ E′;
(2) A_new = A ∪ A′;
(3) TPM(e) = TPM(e) + root′; TPM_new = TPM ∪ TPM′, where "+" means the concatenation
of regular expressions;
(4) BE_new = BE ∪ BE′;
(5) root_new = root;
(6) for every DTD path s′ in Σ′_FD, s′ = s_e + s′; Σ_FDnew = Σ_FD ∪ Σ′_FD, where "+"
stands for path concatenation.
This is precisely the formalized description of how to add a DTD D′ as a child of
another DTD D and modify the corresponding DTD paths in the FD set. All of the
examples in the previous sections were effectively carried out with this approach and
thus serve as good illustrations.
The following algorithm summarizes the pre-mapping process in a formal description.
Given a foreign key graph GDB<TB, FORMAP, TBroot>, the DTD D<E, A, TPM, BE, root>
with FDs Σ_FD for GDB is derived as follows:
Initialization:
E, A, TPM, BE, root are all initialized to be empty or null.
Loop:
(1) Establish root node DB: E = E ∪ {DB};
(2) For all T_k ∈ TBroot, D = D ∪_DB TableDTD(T_k);
(3) For i = 1 to diam(GDB) {
    Let T = level(GDB, i); // collect all relational schemas on level i
    For all T_k ∈ T {
        Let parent = pre(T_k);
        if parent contains exactly one element {
            Let p ∈ parent; // p is the only item in parent
            Let D′ = the DTD segment corresponding to p in D;
            Let e = the element under the root element of D′; // only one element under root
            Let tempDTD = TableDTD(T_k);
            fka = {a | a is a foreign key attribute of T_k from p};
            tempDTD.A = tempDTD.A − fka; // delete the foreign key attributes
            D = D ∪_e tempDTD;
            // change the DTD paths for the foreign key in T_k from p to the paths for the key of p
            In Σ_FD, substitute all paths for attributes in fka with the paths of the key of p;
        }
        else {
            // insert the schema which has more than one arrow pointing to it
            choose one p ∈ parent;
            Let D′ = the DTD segment corresponding to p in D;
            Let e = the element under the root element of D′;
            Let tempDTD = TableDTD(T_k);
            fka = {a | a is a foreign key attribute of T_k from p};
            tempDTD.A = tempDTD.A − fka;
            D = D ∪_e tempDTD;
            In Σ_FD, substitute all paths for attributes in fka with the paths of the key of p;
            // now insert the schemas pointed to by the reversed arrows into D as children of T_k's DTD
            Let D′ = the DTD segment corresponding to T_k in D;
            Let e = the element under the root element of D′;
            for all p′ ∈ parent − {p} {
                // duplicate p′
                tempDTD = TableDTD(p′);
                D = D ∪_e tempDTD;
                fka = {a | a is a foreign key attribute of T_k from p′};
                In Σ_FD, substitute all paths for attributes in fka with the paths of the key of p′;
                // mark the problem caused by arrow reversal and relation duplication with ill FDs
                nfka = {a | a is a non-key attribute of p′};
                for all a ∈ nfka {
                    // denote the ill FDs
                    In Σ_FD, eliminate all paths containing s+a, and keep only s+a+@a;
                }
            }
        }
    }
}
Although the algorithm seems complex and tricky, it is nothing more than the
discussion in the previous sections. The pre-mapping algorithm works through the
relational schemas in increasing order of level value (the longest path from the root
relations), gradually aggregating all relational schemas into one XML schema. During
the process, the reversal of arrows and the replication of relational schemas cause
inconsistency between the semantics of the relational schema and the XML schema,
which in turn causes deletion and updating problems. These problems have been marked
by the ill FDs in the XML schema and will be dealt with shortly in the post-mapping
step.
A.2.6 Post-mapping
Several times, we have seen that the reversal of arrows and the duplication of
relational schemas in the foreign key graph cause inconsistent constraint semantics
and deletion and updating problems in the pre-mapping step. Fortunately, these
problems have been recorded in the functional dependency set of the generated DTD.
This section focuses on a method that eliminates the inconsistency caused by
pre-mapping.
The inconsistency problem for a DTD D<E, A, TPM, BE, root> with FDs Σ_FD is denoted by
the statement: S ⊆ D*, s′ ∈ D*, S → s′+X+@X ∈ Σ_FD, but S → s′+X ∉ Σ_FD. The
inconsistency causes both update and deletion failures, as described in the
Course-Enroll example.
The solution to the problem can be summarized as transforming a DTD D with FDs Σ_FD
into a new DTD D′ with new FDs Σ′_FD that does not contain the ill FDs.
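Locating the ill FDs named by this statement is mechanical: scan each FD for a right-hand path of the form s′+X+@X whose element path s′+X appears neither on the right-hand side nor on the left-hand side. A minimal Python sketch with abbreviated hypothetical paths:

```python
def find_ill_fds(fds):
    """Return (lhs, path) pairs where the right side contains s'+X+@X but
    neither side contains the element path s'+X (the ill-FD pattern)."""
    ill = []
    for lhs, rhs in fds:
        for path in rhs:
            steps = path.split("+")
            if steps[-1].startswith("@"):
                elem_path = "+".join(steps[:-1])
                if elem_path not in set(rhs) and elem_path not in lhs:
                    ill.append((tuple(lhs), path))
    return ill

fds = [
    # healthy FD: both the Name element and Name+@Name are determined
    (["Student+@sid"], ["Student+Name", "Student+Name+@Name"]),
    # ill FD: Name+@Name is determined but the element path Name is not
    (["Course+Enroll+Student+@sid"], ["Course+Enroll+Student+Name+@Name"]),
]
ill = find_ill_fds(fds)
```

Only the second FD is flagged, matching the pattern left behind by the pre-mapping step; post-mapping then rewrites the DTD so that no such pair remains.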
Recall that in relational database design, update and deletion problems are solved by
splitting one schema into multiple schemas and removing the redundant attributes from
a relational schema. The situation in XML is different. There is only one XML schema
for a set of relational schemas, so it is impossible to split an XML schema into
several pieces. However, an XML schema contains branches. In this case, is it possible
to extract those subtrees and put them on other branches within the XML? The situation
below is an example:
In the Course-Enroll-Student example, to capture the semantics of the key of Student,
namely that each Name element and the attribute value of Name (@Name) should be
uniquely identified by @sid, the Student branch is split off into another subtree,
named "Student_Info_root". At the same time, all non-key attributes and elements in
the Student branch are discarded, and only the key attributes are conserved, in order
to avoid redundancy and eliminate the ill FDs. The resulting DTD is shown as the DTD
graph in Figure A-14.
The Student_Info_root subtree is established before Student by the pre-mapping
algorithm because relational schema Student has level value 0 and is one of the root
schemas (the other is Course). The Student branch is in fact the duplicate of
relational schema Student. Besides, the root DB is added by the formalized pre-mapping
algorithm. This is why the DTD in Figure A-14 differs slightly from the one in Figure
A-7.
Figure A-14: The DTD without ill FDs after post-mapping. A branch called Student_Info_root conserves all attributes of relational schema Student, while the non-key attributes in the Student branch are eliminated, so the ill FD is deleted.
It can be seen that the redundant information, Name, in Student has been knocked out,
so that the ill FD:
{DB+Course_root+Course+Enroll_root+Enroll+Student_root+Student+@sid} →
{DB+Course_root+Course+Enroll_root+Enroll+Student_root+Student+Name+@Name}
without
{DB+Course_root+Course+Enroll_root+Enroll+Student_root+Student+@sid} →
{DB+Course_root+Course+Enroll_root+Enroll+Student_root+Student+Name}
disappears. The original key constraint of relational schema Student is represented by
the FD in the new Student_Info branch of the XML:
{DB+Student_Info_root+Student_Info+@sid} → {DB+Student_Info_root+Student_Info+Name,
DB+Student_Info_root+Student_Info+Name+@Name}.
This again captures the semantics of the key constraint of Student in the relational
schema. Besides, the key attribute @sid in Student and the one in Student_Info
actually come from the same attribute of Student in the relational schema. To denote
this fact, a new constraint is established:
{DB+Student_Info_root+Student_Info+@sid} →
{DB+Course_root+Course+Enroll_root+Enroll+Student_root+Student+@sid}.
The more general many-to-many example in the previous section is converted to a new
DTD in the post-mapping step, shown as the simplified DTD graph in Figure A-15.
The non-key attributes in A and C have been discarded, and the new FDs, which
indicate the relationship between the key attributes in A_Info_root and C_Info_root
and the retained key attributes in A and C, are denoted by the dashed arrows. Again,
A_Info_root, C_Info_root and DB were established by the pre-mapping algorithm
before A and C were reached.
Figure A-15: The simplified DTD of the general many-to-many relational schema after the post-mapping step. The non-key attributes in A and C are discarded, and the new FDs, which indicate the relationship between the key attributes of A_Info_root, C_Info_root and the retained key attributes in A and C, are denoted by the dashed arrows.
Similarly, in the Researcher-Model-Experiment example, the DTD with ill FDs is
converted to the new DTD in Figure A-16. Again, the non-key attributes in the
replicated Researcher have been deleted, and the new FD is expressed by the dashed
arrow from the key attributes of the original Researcher_Info to the key attributes of
the replicated Researcher. The root elements are all omitted for convenience of
presentation.
The examples above show the post-mapping step for various kinds of relational design.
The formal algorithm is given in the following discussion.

Figure A-16: The simplified DTD of the Researcher-Model-Experiment relational schema after the post-mapping step. The non-key attributes in the replicated Researcher have been deleted, and the new FD is expressed by the dashed arrow from the key attributes of the original Researcher_Info to the key attributes of the replicated Researcher.

According to the pre-mapping algorithm, the relational schemas are added to the
generated DTD in increasing order of level value, i.e. the length of the longest path
from the root relational schemas. It can therefore be proved that when the DTD
segment of a particular relational table is constructed, the DTD segments of all its
parent relational schemas in the foreign key graph have already been constructed. In
this case, the post-mapping algorithm is essentially a process of combining the
elements of the duplicated subtrees. The algorithm is described below:
Given a DTD D<E, A, TPM, BE, root> with an FD set FDΣ, for S ⊆ D*, s ∈ D*, a ∈ E and
@a ∈ A, if there is an ill FD S → {s+a+@a} ∈ FDΣ but S → {s+a} ∉ FDΣ, then according to
the pre-mapping algorithm there must be two further FDs S' → {s'+a+@a},
S' → {s'+a} ∈ FDΣ, where s and s' are the paths of the duplicated subtree and the original
subtree of the same relational schema, respectively. The post-mapping algorithm consists of
two parts: D = D/(S → {s+a+@a}) and FDΣ = FDΣ/(S → {s+a+@a}), where
D/(S → {s+a+@a}) is a new DTD D'<E', A', TPM', BE', root'> such that:
(1) E' = E;
(2) A' = A;
(3) TPM' = TPM, except that TPM'(s) = TPM(s) − a; // "−" means deleting the element named a from the type definition of s
(4) BE' = BE;
(5) root' = root;
and FDΣ/(S → {s+a+@a}) = (FDΣ − {S → {s+a+@a}}) ∪ {S' → S}.
With the definitions of D/(S → {s+a+@a}) and FDΣ/(S → {s+a+@a}), post-mapping
can be defined as:

post-mapping(D, FDΣ) {
    for each ill FD S → {s+a+@a} ∈ FDΣ with S → {s+a} ∉ FDΣ {
        D = D/(S → {s+a+@a});
        FDΣ = FDΣ/(S → {s+a+@a});
    }
}
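As an illustration only, one iteration of the loop above can be sketched in Python on the Student example. The data structures and the helper name below are hypothetical (the dissertation defines the operators abstractly); the sketch mirrors just the effect of D/(S → {s+a+@a}) and FDΣ/(S → {s+a+@a}): prune the element from TPM, drop the ill FD, and add the new FD S' → S.

```python
# Hypothetical sketch of one post-mapping step on the Student example.
# A path is a tuple of names; an FD is a (lhs, rhs) pair of frozensets
# of paths; tpm maps an element path to its list of child names.

def remove_ill_fd(tpm, fds, ill_fd, s, a, s_prime_key):
    """Apply D/(S -> {s+a+@a}) and FD/(S -> {s+a+@a}): delete element
    a from the type of path s, discard the ill FD, and add S' -> S."""
    lhs, _rhs = ill_fd
    tpm[s] = [child for child in tpm[s] if child != a]  # prune a from TPM(s)
    fds.discard(ill_fd)                                 # FD set minus the ill FD
    fds.add((s_prime_key, lhs))                         # new FD S' -> S
    return tpm, fds

# The duplicated Student path under Enroll, as in the FDs above.
student = ("DB", "Course_root", "Course", "Enroll_root", "Enroll",
           "Student_root", "Student")
tpm = {student: ["@sid", "Name"]}
ill = (frozenset({student + ("@sid",)}),
       frozenset({student + ("Name", "@Name")}))
fds = {ill}
# S': the key of the original Student_Info subtree.
s_prime = frozenset({("DB", "Student_Info_root", "Student_Info", "@sid")})

tpm, fds = remove_ill_fd(tpm, fds, ill, student, "Name", s_prime)
```

After the call, Name is gone from the duplicated Student's type and the only remaining FD is S' → S, matching the constraint derived above.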
At this point, the foreign key propagation problem has been solved by the pre-mapping
and post-mapping algorithms. The input of the whole algorithm is a set of relational
schemas together with a set of foreign key constraints and key constraints; the output
is a DTD with an FD set that carries semantics equivalent to the foreign key
constraints and key constraints of the relational schemas.
The DTD is then used as a guide to generate XML documents from relational
databases and to parse these documents for applications.
Appendix B Databases evaluated in the project
NetPro. The following description is from: http://www.biobase.de/pages/products/netpro.html.
NetPro, a proprietary protein interaction database, covers more than 100,000 expert-
curated and annotated protein-protein interactions. NetPro was built using interaction
data extracted with the proprietary information extraction engine M-CHIPS™ and has
been cross-validated through manual curation. All the interactions come from
peer-reviewed published scientific literature and have gone through significant quality
checks in the form of expert cross-checking by Molecular Connections' in-house
scientific team.
The NetPro database is organized in a standard relational database format with a
built-in front-end to navigate and analyze the data effectively. NetPro data is linked to
public IDs (LocusLink), facilitating the integration of interaction information into
proprietary drug discovery databases. The protein-protein interactions in NetPro are
complemented by annotations of the profiled proteins with scientific literature
information on a variety of important subjects, including: cellular localization,
biological pathways, species, experimental technique, implicated diseases, and Gene
Ontology.
MINT. The following description is from http://mint.bio.uniroma2.it/mint/ and [Zanzoni 2002].
MINT is a relational database designed to store interactions between biological
molecules. Beyond cataloguing binary complexes, MINT was conceived to store other
types of functional interactions, including enzymatic modifications of one of the
partners. Both direct and indirect relationships are considered [Zanzoni 2002].
Furthermore, MINT aims at being exhaustive in the description of the interaction and,
whenever available, information about kinetic and binding constants and about the
domains participating in the interaction is included in the entry. MINT consists of
entries extracted from the scientific literature by expert curators. The curated data can
be analyzed in the context of the high throughput data. Presently, MINT focuses on
experimentally verified protein interactions with special emphasis on proteomes from
mammalian organisms.
DIP. The following description is from http://dip.doe-mbi.ucla.edu/.
The DIP database catalogs experimentally determined interactions between proteins. It
combines information from a variety of sources to create a single, consistent set of
protein-protein interactions. The data stored within the DIP database were curated both
manually, by expert curators, and automatically, using computational approaches that
utilize knowledge about protein-protein interaction networks extracted from the most
reliable, core subset of the DIP data.
Appendix C ID mapping
Because the MASC (PPID), NetPro (LocusLink), MINT (SwissProt) and DIP (GenBank)
databases use different IDs, it is necessary to set up a mapping among these IDs in order to
construct the corresponding networks from the evaluated databases using the PPIDs used in
the MASC reference network. The ID mapping was mined from various databases including
the Ensembl EnsMart Genome Browser (http://www.ensembl.org/Multi/martview),
UniProtKB/Swiss-Prot (http://www.ebi.ac.uk/swissprot/), GenBank
(http://www.ncbi.nlm.nih.gov/Genbank/GenbankSearch.html), etc. The results are listed
below. Only those PPIDs appearing in the MASC network are included. Note that PPIDs
are the five-character identifiers beginning with "A".
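Because a single PPID can map to several LocusLink IDs (typically the human and mouse orthologues), the mapping is one-to-many. The sketch below shows one way such a table could be loaded and used for translation; the pairs are taken from the first entries of the table that follows, and the variable names are illustrative only:

```python
from collections import defaultdict

# One-to-many ID mapping: each PPID may have several LocusLink IDs
# (e.g. human and mouse entries). Pairs from the table below.
pairs = [("A0001", "2902"), ("A0001", "14810"),
         ("A0002", "2903"), ("A0002", "14811")]

ppid_to_locuslink = defaultdict(set)
locuslink_to_ppid = {}
for ppid, locuslink in pairs:
    ppid_to_locuslink[ppid].add(locuslink)
    locuslink_to_ppid[locuslink] = ppid  # reverse lookup for network building

# Translate an interaction reported with LocusLink IDs back to PPIDs,
# so it can be compared against the MASC reference network.
interaction = ("2902", "14811")
as_ppids = tuple(locuslink_to_ppid[i] for i in interaction)
```

The reverse dictionary is what allows interactions mined from NetPro to be expressed in the PPID vocabulary of the MASC network.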
ID mapping from PPID to LocusLink:
A0001 2902 A0001 14810 A0002 2903 A0002 14811 A0003 2904 A0003 14812 A0007 88 A0007 11472 A0008 801 A0008 805 A0008 808 A0008 12314 A0008 12315 A0008 12313 A0009 10142 A0009 100986 A0010 6712 A0010 20742 A0011 815 A0011 12322 A0012 5335 A0012 18803 A0013 1742 A0013 13385 A0014 1740 A0014 23859 A0015 1739 A0015 13383 A0016 1741 A0016 53310 A0017 10376
A0018 6714 A0018 20779 A0020 5781 A0020 19247 A0022 2898 A0022 14806 A0024 8831 A0024 221504 A0030 2185 A0030 19229 A0033 5911 A0033 76108 A0037 5864 A0039 6857 A0039 20979 A0040 6804 A0040 20907 A0075 84258 A0075 50944 A0081 22941 A0083 2017 A0083 13043 A0086 7082 A0086 21872 A0091 71 A0095 1759 A0095 13429 A0103 2911 A0103 14816 A0104 9456 A0104 26556
A0106 18783 A0107 9229 A0112 1499 A0112 12387 A0114 493 A0117 4905 A0117 18195 A0118 84867 A0118 19259 A0119 1213 A0123 5530 A0126 2885 A0126 14784 A0132 5515 A0134 4133 A0134 17756 A0138 5894 A0138 110157 A0143 9495 A0144 5581 A0144 18754 A0145 5566 A0145 18747 A0149 4644 A0149 17918 A0152 5901 A0153 5290 A0153 18706 A0162 9479 A0162 19099 A0164 26400
A0166 27347 A0166 53416 A0168 5579 A0181 5594 A0188 26060 A0188 72993 A0194 5879 A0196 4763 A0196 18015 A0200 3265 A0200 15461 A0203 5898 A0204 2316 A0214 8997 A0215 2596 A0215 14432 A0216 7871 A0216 83997 A0217 2521 A0217 233908 A0219 1000 A0219 12558 A0222 2932 A0222 56637 A0231 10093 A0231 26140 A0231 68089 A0231 101100 A0232 10094 A0232 56378 A0233 10109
A0238 6812 A0238 20910 A0259 5582 A0260 5170 A0260 18607 A0262 4842 A0265 5595 A0265 26417 A0266 5604 A0266 26396 A0266 26395 A0267 5605 A0267 26396 A0267 26395 A0268 6197 A0268 110651 A0276 11113 A0276 12704 A0277 23237 A0277 11838 A0285 8927 A0285 12217 A0286 7158 A0287 1846 A0287 319520 A0288 9145 A0288 20972 A0289 5536 A0289 19060 A0290 3897 A0290 16728 A0291 1828 A0292 9118 A0292 226180 A0298 94104 A0327 2915 A0330 23236 A0330 18795 A0331 2597 A0331 14433 A0331 407972 A0333 6616 A0333 20614 A0340 476 A0340 232975 A0340 11928 A0342 478 A0350 4155 A0350 17196 A0351 4340 A0351 17441
A0352 5354 A0352 18823 A0363 5862 A0363 59021 A0364 5870 A0364 19346 A0365 326624 A0365 58222 A0366 6129 A0366 90193 A0366 19989 A0367 6137 A0367 270106 A0368 23521 A0368 22121 A0369 3945 A0369 106557 A0369 16832 A0369 16828 A0370 3939 A0370 106557 A0370 16832 A0370 16828 A0371 5052 A0371 18477 A0372 7001 A0372 21672 A0373 53616 A0375 4747 A0375 18039 A0376 4741 A0376 18040 A0377 498 A0377 11946 A0379 509 A0379 11949 A0395 291 A0396 292 A0396 11740 A0417 830 A0417 12343 A0418 832 A0418 12345 A0419 2934 A0419 227753 A0421 377 A0421 11842 A0422 6506 A0422 20511 A0423 2806 A0423 14719
A0424 2752 A0424 14645 A0425 7416 A0425 22333 A0426 7417 A0426 440866 A0426 22334 A0427 5313 A0427 18770 A0428 230 A0429 50 A0429 11429 A0430 4001 A0430 16906 A0431 4729 A0432 9588 A0432 11758 A0433 7167 A0433 21991 A0434 7422 A0434 22339 A0435 4356 A0435 13384 A0436 4355 A0436 50997 A0437 12308 A0437 12307 A0438 1627 A0438 56320 A0439 12034 A0440 9211 A0441 2171 A0441 16592 A0442 1808 A0442 12934 A0443 8604 A0443 78830 A0444 10919 A0444 110147 A0445 6252 A0445 104001 A0446 1434 A0446 110750 A0466 208 A0466 11652 A0470 121512 A0470 224014 A0473 10369 A0473 54376 A0473 12300 A0476 572
A0476 12015 A0484 3667 A0484 16367 A0488 405 A0488 11863 A0489 5602 A0489 26414 A0491 5606 A0491 26397 A0605 1737 A1851 816 A1851 12323 A1925 5211 A1925 18641 A2007 15482 A2084 55054 A2084 77040 A2331 4629 A2331 77579 A2331 71960 A2331 17880 A2343 11474 A2344 81 A2344 60595 A3166 4627 A3166 17880 A3166 17886 A3202 4625 A3202 17888 A3202 140781 A3773 79751 A3773 68267 A3958 4430 A3958 17912 A3961 4628 A3961 17880 A3961 17886 A3968 64837 A3968 16594 A3976 3983 A3976 226251 A4014 11005 A4015 167691 A4015 75782 A4048 103910 A4048 67938 A4049 4637 A4049 17904
ID mapping from PPID to SwissProt:
A0001 Q05586 A0001 P35439 A0001 P35438 A0002 P35436 A0002 Q12879 A0002 Q00959 A0003 Q01097 A0003 Q13224 A0003 Q00960 A0007 Q9JI91 A0007 P35609 A0008 P02593 A0009 Q9JHE0 A0009 Q99P24 A0009 Q99996 A0010 Q62261 A0010 Q01082 A0010 Q9QWN8 A0011 P11275 A0011 Q9UL21 A0011 P11798 A0012 Q62077 A0012 P19174 A0012 P10686 A0013 Q62108 A0013 P31016 A0013 P78352 A0014 Q63622 A0014 Q91XM9 A0014 Q15700 A0015 Q62696 A0015 Q12959 A0015 Q62402 A0016 Q62936 A0016 P70175 A0016 Q92796 A0017 P04687 A0017 P02551 A0018 P05480 A0018 P12931 A0018 Q9WUD9 A0020 P41499 A0020 P35235 A0020 Q06124 A0022 P39087 A0022 P42260 A0022 Q13002 A0024 Q9UGE2 A0024 Q9QUH6 A0030 Q9QVP9
A0030 P70600 A0030 Q14289 A0033 Q9D3D5 A0033 P10114 A0037 P20336 A0037 P05713 A0039 P21707 A0039 P21579 A0039 P46096 A0040 P32851 A0040 Q16623 A0040 O35526 A0075 Q9WV48 A0075 Q9Y566 A0081 Q9WUV9 A0081 Q9P1 A0083 O70420 A0083 Q60598 A0083 Q14247 A0086 Q07157 A0086 P39447 A0091 P02571 A0095 P39053 A0095 Q05193 A0095 P21575 A0103 Q9EPV6 A0103 Q13255 A0103 P23385 A0104 O96003 A0104 Q8K3E1 A0104 Q9QUJ8 A0106 P50393 A0106 P47713 A0106 P47712 A0107 Q9D415 A0107 P97836 A0107 O14490 A0112 Q02248 A0112 Q9WU82 A0112 P35222 A0114 Q64542 A0114 P23634 A0117 P46460 A0117 P46459 A0117 Q9QUL6 A0118 P54830 A0118 P54829 A0118 P35234 A0119 P11442 A0119 Q00610
A0123 P20652 A0123 Q9WUV7 A0123 Q08209 A0124 P25388 A0126 Q60631 A0126 P29354 A0132 P13353 A0132 P05323 A0134 P15146 A0134 P20357 A0134 P11137 A0138 Q99N57 A0138 P04049 A0138 P11345 A0143 P24588 A0143 P70593 A0144 Q02156 A0144 P09216 A0144 P16054 A0145 P05132 A0145 P17612 A0145 P27791 A0147 P08129 A0149 Q9QYF3 A0149 Q9Y4I1 A0149 Q99104 A0152 P17080 A0153 Q9Z1L0 A0153 P42336 A0153 P42337 A0162 Q9WVI9 A0162 Q9R237 A0162 Q9UQF2 A0164 O35406 A0164 O14733 A0166 O88506 A0166 Q9Z1W9 A0166 Q9UEW8 A0168 P04410 A0168 P04411 A0168 P05771 A0181 P27703 A0181 P28482 A0188 AAM55531 A0188 Q9G1 A0194 P15154 A0194 Q923X0 A0196 Q04690 A0196 P21359 A0196 P97526
A0200 P01112 A0200 Q61411 A0200 P20171 A0203 P11233 A0203 P05810 A0204 P21333 A0204 Q9JJ38 A0214 P97924 A0214 O60229 A0215 P07936 A0215 P06837 A0215 P17677 A0216 Q8VC86 A0216 Q9HCH1 A0217 P35637 A0217 P56959 A0219 P15116 A0219 Q9Z1Y3 A0219 P19022 A0222 P18266 A0222 P49841 A0222 Q9WV60 A0231 O15509 A0232 Q9JM76 A0232 O15145 A0233 Q9CVB6 A0233 O15144 A0238 O08599 A0238 Q64320 A0259 P05697 A0259 P05129 A0260 O55173 A0260 O15530 A0260 Q9Z2A0 A0262 P29476 A0262 P29475 A0262 Q9Z0J4 A0265 Q91YW5 A0265 P21708 A0265 P27361 A0266 Q01986 A0266 P31938 A0266 Q02750 A0267 Q63932 A0267 P36506 A0267 P36507 A0268 P18654 A0268 P51812 A0276 Q9QX19 A0276 O88938
A0276 O14578 A0277 Q9UJW6 A0277 Q63053 A0277 Q9WV31 A0281 Q9Y438 A0281 Q8K2R1 A0285 O43161 A0285 O88778 A0285 O88737 A0286 P70399 A0286 Q12888 A0287 Q13115 A0287 Q62767 A0288 O55100 A0288 O43759 A0288 Q62876 A0289 O35299 A0289 P53041 A0289 P53042 A0290 P32004 A0290 Q05695 A0290 P11627 A0291 Q61495 A0291 Q02413 A0292 Q16352 A0292 P23565 A0292 P46660 A0298 Q9Y5B6 A0298 P58501 A0327 P31424 A0327 P41594 A0330 P10687 A0330 Q9NQ66 A0330 Q9Z1B3 A0331 P04797 A0331 P04406 A0331 P16858 A0333 P13795 A0340 P06685 A0340 P05023 A0340 Q8VDN2 A0342 P13637 A0342 AAH37206 A0342 P06687 A0350 P02688 A0350 P04370 A0350 P02686 A0351 Q63345 A0351 Q16653 A0351 Q61885 A0352 P06905 A0363 P05712 A0363 P08886
A0363 P53994 A0364 Q9WVB1 A0364 P35279 A0364 P20340 A0365 Q9JKM7 A0365 Q96AX2 A0366 P05426 A0366 P14148 A0366 P18124 A0367 P41123 A0367 P47963 A0367 P26373 A0368 P35427 A0368 P19253 A0368 P40429 A0369 P42123 A0369 P16125 A0369 P07195 A0370 P04642 A0370 P06151 A0370 P00338 A0371 Q63716 A0371 P35700 A0371 Q06830 A0372 P35704 A0372 Q61171 A0372 P32119 A0373 Q9P0K1 A0373 Q9R1V6 A0375 P19527 A0375 P08551 A0375 P07196 A0376 P08553 A0376 P12839 A0376 P07197 A0377 P15999 A0377 Q03265 A0377 P25705 A0379 Q9ERA8 A0379 P35435 A0379 P36542 A0395 P48962 A0395 P12235 A0395 Q05962 A0396 P51881 A0396 P05141 A0396 Q09073 A0417 P47755 A0417 P47754 A0418 P47756 A0418 P47757 A0419 P13020 A0419 P06396
A0421 P16587 A0422 P43004 A0422 P43006 A0422 P31596 A0423 P00507 A0423 P05202 A0423 P00505 A0424 P09606 A0424 P15105 A0424 P15104 A0425 Q9Z2L0 A0425 Q60932 A0425 P21796 A0426 P81155 A0426 Q60930 A0426 P45880 A0427 P12928 A0427 P53657 A0427 P30613 A0428 P09117 A0428 Q9DBA4 A0428 P09972 A0429 Q9ER34 A0429 Q99KI0 A0429 Q99798 A0430 P70615 A0430 P14733 A0430 P20700 A0431 P19234 A0431 Q9D6J6 A0431 P19404 A0432 O35244 A0432 P30041 A0432 O08709 A0433 P48500 A0433 P00938 A0433 P17751 A0434 P16612 A0434 P15692 A0434 Q00731 A0435 O88954 A0435 Q13368 A0435 O88910 A0436 Q9WV34 A0436 O88953 A0436 Q14168 A0437 P47728 A0437 Q96BK4 A0437 Q08331 A0438 Q07266 A0438 Q16643 A0438 Q9QXS6 A0439 Q99623
A0439 Q61336 A0440 O95970 A0440 Q9JIA1 A0441 P55053 A0441 Q05816 A0441 Q01469 A0442 O08553 A0442 P47942 A0442 Q16555 A0443 O75746 A0444 Q9UQL8 A0444 Q9Z148 A0445 Q64548 A0445 Q16799 A0446 P55060 A0446 Q9ERK4 A0466 P47197 A0466 Q60823 A0466 P31751 A0470 Q96M96 A0470 O88387 A0470 Q91ZT5 A0473 Q9Y698 A0473 Q99PR9 A0473 O88602 A0476 O35147 A0476 Q92934 A0476 Q61337 A0484 P35569 A0484 P35568 A0484 P35570 A0488 P41739 A0488 P27540 A0488 P53762 A0489 Q61831 A0489 P49187 A0489 P53779 A0491 P46734 A0491 O09110 A0605 P08461 A0605 P10515 A0605 Q8R339 A1851 P08413 A1851 Q13554 A1851 P28652 A1925 P30835 A1925 P17858 A1925 P12382 A2007 P08107 A2007 Q07439 A2007 P17879 A2084 Q96JV5 A2084 Q9DB63
A2331 P35749 A2331 Q63862 A2331 O08638 A2343 Q8R4I6 A2343 O88990 A2343 Q08043 A2344 Q9QXQ0 A2344 O43707 A2344 P57780 A3166 Q8VDD5
A3166 Q62812 A3166 P35579 A3202 P02564 A3202 Q91Z83 A3202 P12883 A3773 Q9D6M3 A3773 Q9H936 A3899 Q9Z252 A3899 O88951 A3899 Q9HAP6
A3958 Q05096 A3958 P46735 A3958 O43795 A3961 P35580 A3961 Q9JLT0 A3961 Q61879 A3968 O88448 A3968 Q9H0B6 A3976 Q8K4G5 A4014 Q9D6R9
A4014 Q9NQ38 A4015 Q9BWX7 A4015 Q9D5J9 A4048 Q13182 A4048 Q9CQL8 A4048 Q63781 A4049 P16475 A4049 AAH26760 A4049 Q64119
Because the DIP database used in this project contains only three species
(Caenorhabditis elegans, Drosophila melanogaster and yeast), there is no
corresponding MASC network within it, so its ID mapping is not listed here.
Appendix D PPID and Protein Name
For ease of reference, the PPIDs and the corresponding protein names used in the
MASC network are listed here:
A0001 NR1 A0002 NR2A A0003 NR2B A0007 ACTN A0008 CALM A0009 PRKA9 A0010 SPNB A0011 CAMK2A A0012 PLCg-1 A0013 DLG4 A0014 DLG2 A0015 DLG1 A0016 DLG3 A0017 Tubulin A0018 Src A0020 PTP1D A0022 GRIK2 A0024 SynGAP A0030 FAK2 A0033 Rap2 A0037 Rab3 A0039 SYT1 A0040 STX A0041 A0075 SPANK1 A0081 CortBP-1 A0083 CTTN A0086 ZO-1 A0091 ACT A0095 DNM1 A0103 mGluR1a A0104 HOMER1 A0106 cPLA2 A0107 DLGAP1 A0112 b-catenin A0114 ATP2B4 A0117 NSF A0118 PTPN5 A0119 CLTC A0123 PP2B A0124 RACK-1 A0126 GRB2 A0132 PPP2CA
A0134 MTAP2 A0138 RAF1 A0143 AKAP5 A0144 PKCepsilon A0145 PRKACA A0147 PPP1CA A0149 Myosin (V) A0152 RAN A0153 A0162 Jip-1 A0164 MKK7 A0166 HPK1 A0168 PKCbeta A0181 Erk2 A0188 APPL A0194 Rac1 A0196 NF-1 A0200 H-Ras A0203 RalA A0204 Filamin A0214 HAPIP A0215 GAP43 A0217 FUS A0219 N-cadherin A0222 GSK3 beta A0231 A0232 A0233 A0238 STXBP1 A0259 PKCgamma A0260 PDK-1 A0262 nNOS A0265 Erk1 A0266 MEK1 A0267 MEK2 A0268 Rsk-2 A0276 CIT A0277 Arg3.1 A0285 Bassoon A0286 p53BP1 A0287 MKP2 A0288 Synaptogyrin A0289 PP5
A0290 L1CAM A0291 DSG A0292 INA A0327 mGluR5 A0330 PLCb A0331 GAPDH A0333 SNAP25 A0340 ATP1A1 A0342 ATP1A3 A0351 MOG A0352 PLP1 A0363 Rab2 A0364 RAB6A A0365 Rab37 A0366 RPL7 A0367 RPL13 A0369 LDHB A0370 A0371 TDPX2 A0373 ADAM22 A0375 Neurofilament triplet L protein A0376 Neurofilament Triplet M A0377 ATP5A1 A0379 ATP5C A0395 SLC25A4 A0396 SLC25A5 A0417 CAPZ alpha A0418 CAPZ beta A0419 Gelsolin A0421 ARF3 A0422 SLC1A2 A0423 Asp aminotransfersase A0424 Gln synthetase A0425 MVDAC-1 A0426 VDAC-2 A0428 ALDOC A0429 Aconitase A0430 A0431 NADH-ubiquinone
oxidoreductase 24 kDa subunit, mitochondrial A0433 Triosephosphate Isomerase A0434 EGF-164 A0435 DLGH3 A0436 DLGH2 A0437 Calretinin A0438 DBN1 A0439 D-Prohibitin A0440 Leucine-rich Glioma-inactivated 1 protein A0441 E-FABP A0442 DPYSL2 A0443 SLC25A12 A0444 G9A A0445 S-REX A0446 A0466 AKT2 A0470 Frabin A0473 Stargazin A0476 Bad A0484 IRS-1 A0488 HIF-1 A0489 MAPKp49 A0491 MAP2K3 A0605 DLAT A0900 A1851 CaMKII beta A1925 Phosphofructokinase B A2007 HSPA1 A2331 MYH11 A2344 ACTN4 A3166 MYH9 A3773 GC1 A3899 VELI2 A3958 MYO1B A3961 MYH10 A3968 KLC2 A3976 ABLIM1
Bibliography
[Adamic 2001] L. A. Adamic, R. M. Lukose, A. R. Puniyani, and B. A. Huberman. (2001). Search in power-law networks. Phys. Rev. E 64, 046135.
[Albert 2000] Albert, R., Jeong, H. & Barabasi, A. L. (2000). Error and attack tolerance of complex networks. Nature 406, 378–382.
[Andrew 2005] Andrew Pocklington. (2005). Identifying functional components in biological networks. To be published.
[Armstrong 2005] J.D. Armstrong, H. Husi, M. Cumiskey, T.J. O'Dell, P.M. Visscher, R. Emes, A.J. Pocklington, W. Blackstock, J. Choudhary, and S.G.N. Grant. (2005). Complex and network analysis of synapse proteomes. To be published.
[Barabasi 1999] Barabasi, A. L. & Albert, R. (1999). Emergence of Scaling in Random Networks. Science 286, 509-512.
[Cohen 2000] R. Cohen, K. Erez, D. ben-Avraham, and S. Havlin. (2000). Resilience of the Internet to random breakdowns. Phys. Rev. Lett. 85, 4626–4628.
[Diego 2005] Diego di Bernardo et al. (2005). Chemogenomic profiling on a genome-wide scale using reverse-engineered gene networks. Nature Biotechnology 23.
[Efron 1979] B. Efron. Computers and the theory of statistics: Thinking the unthinkable. SIAM Review, 21:460-480, 1979.
[Fox 2001] Fox, J. J. & Hill, C. C. (2001) From topology to dynamics in biochemical networks. Chaos 11, 809–815.
[Freeman 1977] L. Freeman, A set of measures of centrality based upon betweenness. Sociometry 40, 35–41 (1977).
[Garey 1979] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, San Francisco (1979).
[Girvan 2002] M. Girvan and M. E. J. Newman. (2002). Community structure in social and biological networks. Proc. Natl. Acad. Sci. USA 99, 8271–8276.
[Goh 2001] K.-I. Goh, B. Kahng, and D. Kim, Universal behavior of load distribution in scale-free networks. Phys. Rev. Lett. 87, 278701 (2001).
[Grant 2004] S.G.N. Grant, H. Husi, J. Choudhary, M. Cumiskey, W. Blackstock and J.D. Armstrong. (2004). The organization and integrative function of the post-synaptic proteome. Excitatory-Inhibitory Balance: Synapses, Circuits, Systems, 13-44.
[Ito 2000] Ito, T., Muta, K. S., Ozawa, R., Chiba, T. et al. (2000), Toward a protein-protein interaction map of the budding yeast: A comprehensive system to examine two-hybrid interactions in all possible combinations between the yeast proteins. Proc. Natl. Acad. Sci. USA 97, 1143-1147.
[Jeong 2001] Jeong, H., Mason, S. P., Barabási, A.-L., Oltvai, Z. N., Lethality and centrality in protein networks. Nature 2001, 411, 41–42.
[Jeong 2000] Jeong et al. (2000), The large-scale organisation of metabolic networks. Nature 407, 651-654.
[Kauffman 1969] Kauffman, S. A. (1969). Homeostasis and differentiation in random genetic control networks. Nature 224, 177–178.
[Kernighan 1970] B. W. Kernighan and S. Lin, An efficient heuristic procedure for partitioning graphs. Bell System Technical Journal 49, 291–307 (1970).
[Kitano 2002] Kitano, H. (2002). Systems biology: a brief overview, Science 295, 1662–1664.
[Krapivsky 2001] P. L. Krapivsky, S. Redner. (2001). Organization of Growing Random Networks. Phys. Rev. E 63, 066123.
[Marelo 2002] Marcelo Arenas, Leonid Libkin. (2002). A Normal Form for XML Documents. ACM, 1-58113-507-6/02/06.
[Maslov 2002] Maslov, S., Sneppen, K. (2002). Specificity and stability in topology of protein networks, Science 296, 910–913.
[Michael 2002] Michael Benedikt, Chee Yong Chan, Wenfei Fan, et al. DTD-Directed Publishing with Attribute Translation Grammars. VLDB 2002.
[Newman 2004] M. E. J. Newman and M. Girvan. (2004). Finding and evaluating community structure in networks. Phys. Rev. E 69, 026113.
[Newman 2003] M. E. J. Newman, Mixing patterns in networks. Phys. Rev. E 67, 026126 (2003).
[Newman 2003b] M. E. J. Newman. A measure of betweenness centrality based on random walks. cond-mat/0309045, 2003.
[Newman 1999] M. E. J. Newman and G. T. Barkema. Monte Carlo Methods in Statistical Physics. Oxford University Press, 1999.
[Peter 2001] Peter Buneman, Susan Davidson, Wenfei Fan, et al. (2001). Keys for XML. ACM, 1-58113-348-0/01/0005.
[Savageau 1971] Savageau, M. A. (1971). Parameter sensitivity as a criterion for evaluating and comparing the performance of biochemical systems. Nature 229, 542–544.
[Schwikowski 2000] Schwikowski, B., Uetz, P. and Fields, S. (2000). A network of protein-protein interactions in yeast. Nature Biotech. 18: 1257-1261.
[Scott 2000] J. Scott, Social Network Analysis: A Handbook. Sage Publications, London, 2nd edition (2000).
[Susan 2002] Susan Davidson, Wenfei Fan, et al. (2002). Propagating XML Constraints to Relations. 19th International Conference on Data Engineering, ICDE 2003.
[Tova 1998] Tova Milo, Sagit Zohar. Using Schema Matching to Simplify Heterogeneous Data Translation. VLDB 1998.
[Vazquez 2004] Vazquez A, Dobrin R, Sergi D, Eckmann JP, Oltvai ZN, Barabasi AL. (2004). The topological relationship between the large-scale attributes and local interaction patterns of complex networks. Proc. Natl. Acad. Sci. USA.
[Watts 1998] D. J. Watts and S. H. Strogatz. (1998). Collective dynamics of ‘small-world’ networks. Nature 393, 440–442.
[Wenfei 2001] Wenfei Fan, Leonid Libkin, 2001. On XML integrity constraints in the presence of DTDs. ACM, 1-58113-361-8/01/05.
[Yannis 1996] Yannis Papakonstantinou, Serge Abiteboul, Hector Garcia-Molina. Object Fusion in Mediator Systems. VLDB 1996.
[Yi 2000] Yi, T. M., Huang, Y., Simon, M. I.&Doyle, J., (2000). Robust perfect adaptation in bacterial chemotaxis through integral feedback control, Proc. Natl. Acad. Sci. USA 97, 4649–4653.
[Zanzoni 2002] Zanzoni A, Montecchi-Palazzi L, et al. (2002). MINT: a Molecular INTeraction database. FEBS Lett. 513(1), 135-140.