From a Narrative Language for biology to
Bio-PEPA
Anastasios Andreas Georgoulas
Master of Science
School of Informatics
University of Edinburgh
2011
Abstract
We present an algorithm for the translation of biological models from the Narrative
Language into the process algebra Bio-PEPA. The aim is to allow biologists to use
modelling methods and language familiar to them, while at the same time enjoying the
benefits of formal modelling languages, circumventing the obstacle that is their syntax.
We also describe our implementation of the algorithm and present some preliminary
results from successfully testing it on two examples. Finally, we suggest potential
improvements and extensions to this work, which may increase both the usefulness of
this project and the reach of process algebras in biology, beyond modelling experts.
Acknowledgements
I would first like to thank my supervisor, Maria Luisa Guerriero, for her guidance and
help during the entire time I was working on this project. I also wish to thank Stephen
Gilmore for suggesting the use of Xtext in the project, and Allan Clarke for giving me
directions on plugin development. Finally, I must extend my thanks to Jane Hillston
for her feedback and advice during the last year.
Declaration
I declare that this thesis was composed by myself, that the work contained herein is
my own except where explicitly stated otherwise in the text, and that this work has not
been submitted for any other degree or professional qualification except as specified.
(Anastasios Andreas Georgoulas)
Contents
1 Introduction
  1.1 Achievements
  1.2 Structure of the dissertation
2 Background
  2.1 Modelling in biology and formal modelling
  2.2 Process algebras and Bio-PEPA
  2.3 The Narrative Language
  2.4 Other modelling languages
3 Translation Algorithm
  3.1 Validation
  3.2 Processing
    3.2.1 Getting the species involved
    3.2.2 Applying the event
    3.2.3 Defining the reactions
  3.3 Output
  3.4 Analysis: combinatorial explosion
  3.5 Optimizations
4 Implementation
  4.1 Parsing
  4.2 Processing
  4.3 Integration with the plugin
  4.4 Design issues and decisions
    4.4.1 Treatment of bindings and complexes
5 Evaluation
  5.1 Test cases and procedure
  5.2 Results and comments
6 Conclusions
  6.1 Future work
Bibliography
A Narrative Language Syntax
List of Figures
3.1 State transition graph for the variant-pruning optimization example
4.1 Activity diagram of the system
4.2 (Simplified) class diagram of the system
4.3 The different event classes. Class names have been shortened from
    “ProcessedEvent” etc. for the sake of presentation.
5.1 Simulation results (1000 replications) on the original Bio-PEPA enzymatic
    model (axes are molecule count vs. time)
5.2 Simulation results (1000 replications) on the translated Bio-PEPA enzymatic
    model (axes as above)
List of Tables
3.1 Affected and involved components by event type
3.2 Substitutions for state-changing events
3.3 Output by event type
Chapter 1
Introduction
Recent years have seen biology requiring, and benefitting from, the use of computational
approaches, partly dictated by the need to deal with large volumes of data.
Not completely unrelated to this, computer scientists have been viewing biology as an
interesting field for researching new applications, inspired by the problems it offers.
The interface between the two sciences is a rich area for exploration, and indeed much
research has been taking place to investigate what collaborations it offers.
One particular area which has been extensively researched is that of modelling bio-
logical systems. The benefits of developing and using a good model are very attractive,
including saving time and funds. Moreover, a wide variety of modelling methods are
already known and well-established from previous work in other fields, and the task of
applying them to biology has proven very successful, further encouraging work in this
area.
The interdisciplinary character of the field means that experts from very different
backgrounds often need to collaborate. While this certainly offers important advan-
tages, such as permitting a problem to be considered and addressed from different
perspectives, it also imposes some restrictions. Chief among them is that communi-
cation needs to take place using a vocabulary that is mutually understood. This can
make it difficult to collaborate efficiently when one side is unfamiliar with the way in
which ideas are put forward, or does not have the necessary skills to apply the solutions
suggested.
Unfortunately, modelling does not avoid this issue. Many formal computational
approaches have been proposed, applied and tested. In spite of encouraging results,
they suffer from one important drawback: they are, for the most part, heavily mathe-
matical, and their syntax is not accessible to someone coming from a purely biological
background. This puts them at a disadvantage and severely limits their usability. If
their use is to become more widespread, steps should be taken to ensure they are able
to be used by biologists with greater ease.
1.1 Achievements
For the purposes of this project, we developed and implemented an algorithm to trans-
late a model written in the Narrative Language into Bio-PEPA. The latter is a stochastic
process algebra whose formal nature allows a model to be analysed in various ways.
The former is a language written with the express purpose of allowing biologists to
specify models easily. This effectively presents biologists with a familiar interface for
the description of systems, and at the same time makes available to them the analysis
capabilities and tools that a formal language offers, while keeping its syntax hidden.
In addition, we attempted to integrate the algorithm implementation in the existing
Bio-PEPA workbench, which is the main way of writing models in the language. This
would make the process of importing a model smoother and set the way for possible
further integration of the Narrative Language in the plugin.
1.2 Structure of the dissertation
The rest of this document is structured as follows. Chapter 2 is a brief review of
biological modelling, with particular emphasis on Bio-PEPA and the Narrative Language.
A description of the translation algorithm that we developed is found in Chapter 3,
including the main complexity issue which inevitably arises and possible solutions.
Chapter 4 contains details of the implementation of the algorithm and other design
issues, while information on testing and evaluation is included in Chapter 5. Finally,
Chapter 6 has a summary of the work done, some closing remarks and ideas for possible
future work.
Chapter 2
Background
2.1 Modelling in biology and formal modelling
Modelling the behaviour of biological systems has long been a topic of interest and is
the central issue in the field of systems biology. Traditionally, biologists use a com-
bination of natural language and graphical depictions to describe biological systems.
This approach, while established, suffers from the drawback of being ambiguous. In
the absence of a set of standards, there is no unique way to interpret such diagrams,
which could result in confusion. Additionally, this lack of formality makes it difficult,
or even impossible, to process the descriptions automatically.
For these reasons, biological systems are an attractive field for the application of
formal computational modelling approaches and, indeed, many such methods have
been proposed and used. These methods can be thought of as considering the sys-
tem in question as a state machine, and the various interactions as transitions between
states. This has led to them being termed “executable models”, since the system can be
simulated or otherwise analysed by performing a “run” of the state machine [1]. This
computational view of biological processes allows the use of tools and ideas originally
developed for other areas of computer science. These can either be applied directly,
or serve as a base for the development of methods specifically suited to biological
systems.
The various approaches differ in their philosophy and, consequently, in their strengths
and weaknesses. There is no consensus on which approach is best, in part because the
suitability of each depends on the details of the specific modelling task. However, there
are some desirable properties that the ideal method would possess. First, it should be
unambiguous: a given model should allow a single interpretation. This can be achieved,
for instance, by the definition of a formal (mathematical) semantics. Secondly, it
should be executable, in the sense described earlier. Thirdly, it should allow various
types of analysis to be performed. Finally, it should offer a convenient interface for the
specification of models. This last point is particularly important when one considers
that the description language is intended to be used not only by computer scientists or
modelling experts, but also by experimental biologists.
The two methods with which this dissertation is concerned are presented in more
detail in the following sections.
2.2 Process algebras and Bio-PEPA
Process algebras (or process calculi) are formalisms originally introduced for modelling
and studying systems of concurrent computation. The family of process algebras
comprises various, diverse languages, all of which share the underlying concept
of processes which can communicate with each other. A subset of those, stochastic
process algebras (SPA), additionally incorporate the element of quantified randomness
and lend themselves to simulation using well-known algorithms (e.g. [2]).
It has been suggested that the events taking place in biological systems are in some
ways similar to concurrent computation [3], and therefore process algebras (especially
stochastic ones) are particularly suited to biological modelling [4]. Much research has
taken place in applying them to this field, with promising results. Examples include
case studies using the stochastic π-calculus [5], CCS-R [6] and PEPA [7]. In these stu-
dies, a process usually serves as an abstraction for either a molecule or a biochemical
species. Similarly, communication events are used to represent biochemical reactions.
In general, process algebras offer a number of attractive features for describing
biochemical interactions. Firstly, they allow one to work at the desired level of abs-
traction. For instance, it is possible to generate a purely mathematical specification of
the system, e.g. in the form of ordinary differential equations (ODEs), that describes
the laws governing the evolution of the quantities involved, while at the same time one
can reason at a higher level about the behaviour of the system as a whole, e.g. via
model-checking [8]. The latter can also be used to verify whether the model matches
the expected behaviour (that of the real system), which can help in identifying mis-
takes. Furthermore, since process algebras are established modelling methods, these
analyses are already supported both by a theoretical background and by a number of
available tools. An additional advantage is that of compositionality, meaning that
descriptions specified this way are modular, making them easier to combine together or
alter. In general terms, therefore, apart from offering an array of analysis methods, the
use of process algebras makes biological models more structured, less error-prone and
easier to manipulate.
Despite these advantages and encouraging results, it became apparent that some
features of biological systems were difficult to capture using existing languages. This
led to the development of new process algebras, such as Beta-binders [9], whose ex-
pressivity was designed with biological systems in mind. Development of these lan-
guages was often realised by extending existing process algebras, adapting them to the
needs of biological modelling. Another example of this is Bio-PEPA [10], adapted
from the Performance Evaluation Process Algebra (PEPA) [11].
A Bio-PEPA model consists of a number of species, which are defined in terms of
the reactions they can take part in. The syntax allows the modeller to specify both the
role of the species in the reaction (e.g. reactant, product, activator) and its stoichio-
metric coefficient. Additionally, each reaction is accompanied by a kinetic rate law,
which represents the kinetics of the corresponding biochemical reaction. Of particular
importance is the ability to specify arbitrary kinetic laws, although special provisions
are taken in the syntax for three commonly used laws (mass action, Michaelis-Menten
and Hill kinetics). This support for general kinetic laws, as well as for specifying
stoichiometry, are two of the features that make Bio-PEPA particularly suitable for
this kind of modelling, and were introduced to deal with the limitations of traditional
process algebras, as mentioned above.
In addition to the definition of species and kinetic laws, a Bio-PEPA model consists
of two more parts. The first is a list of compartment definitions. The second is the
model component, which specifies the initial state of the system (which species are
present and in what quantities), as well as the interactions between species.
As an example of the syntax, the following snippet contains the species definitions
for a system with three proteins, A, B and C, in which B can activate A and C can
deactivate it.
A_active = activation↑ + deactivation↓ ;
A_inactive = activation↓ + deactivation↑ ;
B = activation⊕ ;
C = deactivation⊕ ;
There are two reactions, named activation and deactivation, and four species. The
model component for this system, assuming that initially all molecules of A are active,
would have the following form, in which the numbers in brackets are molecular counts:
A_active[10] <*> A_inactive[0] <*> B[1] <*> C[1]
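The shape of the model component above can be illustrated with a minimal sketch. The function name `model_component` and the use of a dictionary of initial counts are assumptions made purely for illustration; they are not part of Bio-PEPA or of the plugin.

```python
def model_component(initial_counts):
    """Join 'species[count]' terms with the <*> operator.

    `initial_counts` is a hypothetical mapping from species name to initial
    molecular count; insertion order decides the order of the terms.
    """
    terms = [f"{species}[{count}]" for species, count in initial_counts.items()]
    return " <*> ".join(terms)


# The three-protein system described in the text, with all of A initially active.
counts = {"A_active": 10, "A_inactive": 0, "B": 1, "C": 1}
print(model_component(counts))
```

Under these assumptions, the printed line is the model component shown above.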
Development of Bio-PEPA models is typically done using the Bio-PEPA plugin
[12] for the Eclipse IDE, which provides a graphical interface and connections to other
tools, such as PRISM [13] for model checking or numerical ODE solvers. It also
allows exporting the model to SBML format.
2.3 The Narrative Language
Despite the effectiveness and suitability of process algebras in modelling biological
systems, their syntax effectively restricts their use to computer scientists and makes
them difficult to use by experts in the field, i.e. biologists. The Narrative Language
(NL) [14] was proposed as a formalism that would be approachable by biologists,
allowing them to use a syntax closer to natural language and a vocabulary familiar to
them.
Instead of a formal definition, the semantics of the NL was initially defined indi-
rectly, by a translation to Beta-binders [15]. This was an additional influence on the
syntax of the language, leading to the inclusion or exclusion of certain features.
An NL model consists of four types of entities. Compartments describe the different
locations of the system. Components are the biochemical species, which can hold in-
ternal states. These states are binary and can either be defined for a specific site of the
component, or be component-wide. Reactions represent the biochemical reactions, in-
cluding a kinetic rate law. Finally, events describe the sequence of interactions which
can take place in the system as well as the conditions under which they occur, thus
specifying a narrative. Every element of the model is specified as a tuple of values, like
a record in a database. These generally include an integer ID and/or a name, as well
as other information related to the entity’s type; for instance, compartment definitions
include a size attribute, while components specify an initial quantity and state.
Definitions also refer to other parts of the model: an event definition, for example, includes
a reference to a reaction, which is taken to be the underlying reaction “implementing”
the event.
Some more details should be given about events, as they define the rules of the
dynamic behaviour of the system and will be referred to very often in the next chapters.
An event’s formal description is composed of two parts. The first, which is optional,
is a list of conditions. There are two types of these: state conditions, which impose
constraints on the values of a component’s states and sites, and location conditions,
which specify the compartment in which the component is found. Both of these types
can exist in positive or negative form. An example of a positive state condition is “if A
is active”, whereas one of a negative location condition is “if A is not in 1”.
The second part is the event description, which describes a change to one or more
components. There are various types of descriptions, each with its own syntax. Some
events describe a change in the state of a component, such as a phosphorylation. A
special case of this, which is treated differently here, is that of binding and unbinding
events. Other events describe the synthesis or degradation of a component, while others
describe the relocation of a component from one compartment to another.
For example, if modelled in the NL, the example system from the previous section
would contain the event descriptions:
if A is not active then B activates A
if A is active then C deactivates A
which would follow definitions for the components A, B and C. The definition for A,
for instance, would specify that it has a state of type “active”.
The development of the NL is still ongoing, which allowed a certain flexibility in
the context of the project. Specifically, we investigated whether its expressive power
could be altered, keeping in mind what features can be expressed in Bio-PEPA. We
found that, with some rare exceptions which will be described in the following chap-
ter, all features of the original syntax can be supported when translating to Bio-PEPA,
and at the same time we took this opportunity to extend the language by adding two
features. The first is the definition of constants, which can have either integer or real
values, and can be used everywhere a numerical value can occur. The second is sup-
port for Michaelis-Menten and Hill kinetic laws, in accordance with the Bio-PEPA
predefined functions, while the original version of the NL only supported mass-action
kinetics. The full grammar of the language supported for this project can be found in
Appendix A, adapted from [15].
2.4 Other modelling languages
Petri Nets [16] are another example of a modelling formalism that had been mostly
used for computational systems before being applied to systems biology (although
they were initially conceived for concurrent systems of any kind, including biochemi-
cal). They are networks composed of two types of elements, places and transitions,
and are fully specified using a set of functions defined on these. They were introduced
for the analysis of concurrent processing systems and, as with (Bio-)PEPA, the cor-
respondence between computational and biological concepts is straightforward. Their
semantics offers a method for the stochastic simulation of the described system, similar
to that in Bio-PEPA.
Rewriting systems, such as P systems [17] and Kappa [18], model a system using
a set of rules, each of which describes a change that one or more components of the
system can undergo. In this sense, the Narrative Language is closer to these languages
than to process algebras, which are component-based rather than rule-based.
Lastly, for the sake of completeness, mention should be made of the efforts to-
wards the standardisation of graphical depictions of systems. The Systems Biology
Graphical Notation (SBGN) [19] has been introduced for that purpose, and offers a
library of symbols and a standard set of rules for their use. While neither executable
nor fully formal or unambiguous, it is an important step towards formalising graphical
descriptions.
Chapter 3
Translation Algorithm
This chapter describes the algorithm that was developed for translating a model des-
cription from the NL into Bio-PEPA. The algorithm takes as input a model written
in the NL using the syntax presented in Appendix A and outputs a Bio-PEPA model.
For the translation to be accurate, the model thus produced should be equivalent to the
original, meaning that they should exhibit the same dynamic behaviour. Additionally,
it is desirable that the resulting model be as easily readable as possible, and as similar
as possible to what someone would write if they were modelling the same system in
Bio-PEPA directly.
The algorithm described here assumes that an NL model has been extracted from
the input text file, i.e. we have access to all the parts defined therein. This obviously
requires that the input file be parsed but, since this is a technical issue rather than
a part of the algorithm itself, details about the parsing stage and its implementation
are presented in the next chapter.
3.1 Validation
Before the translation itself, some preprocessing needs to take place, in order to ensure
that the translation can be completed without problems. This stage entails checking
the definitions in the model for internal inconsistencies and reporting them. A com-
ponent is deemed to be valid if its definition satisfies a set of conditions, detailed in the
following. Similarly, conditions are defined for compartments and events. In order for
a model to be valid, all its parts must be valid. All the parts are checked one by one; if
any inconsistencies are found at any time, the validation of the model fails and we do
not proceed with the translation.
In the case of a component, the conditions concern the compartments in which it can
be found. A component’s definition lists a number of compartments, each accompanied
by a number, which indicates the percentage of the component found in that
compartment in the initial state. For a component to be valid, all the following
must be true:
• All the compartments referred to in the list are defined in the model.
• The percentages of the initial concentration across all listed compartments sum
to exactly 100.
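The two component checks can be sketched as follows. This is a minimal illustration, not the thesis implementation: the `Component` record, its `locations` field and the function name `component_errors` are all hypothetical, and the small floating-point tolerance is an added assumption for the case where percentages are given as real numbers.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    # Hypothetical field: compartment ID -> percentage of the initial quantity.
    locations: dict


def component_errors(component, defined_compartments):
    """Return a list of problems; an empty list means the component is valid."""
    errors = []
    # Check 1: every listed compartment must be defined in the model.
    for comp_id in component.locations:
        if comp_id not in defined_compartments:
            errors.append(f"{component.name}: compartment {comp_id} is not defined")
    # Check 2: the percentages must sum to 100 (tolerance is an assumption).
    total = sum(component.locations.values())
    if abs(total - 100) > 1e-9:
        errors.append(f"{component.name}: percentages sum to {total}, not 100")
    return errors


print(component_errors(Component("A", {1: 60, 2: 40}), {1, 2}))
print(component_errors(Component("B", {1: 60, 3: 30}), {1, 2}))
```

Collecting all problems, rather than stopping at the first, is a design choice of this sketch; a validator that fails on the first inconsistency would simply return early.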
A compartment is valid if the following conditions hold:
• If an external compartment is specified, it corresponds to a compartment defini-
tion in the model.
• If an external compartment is specified, it is a different compartment, that is,
circular definitions are not allowed.
• The dimension of the compartment is either two (indicating a membrane) or
three.
An event is valid if the following hold:
• The reaction specified for the event is defined in the model.
• The type of event (e.g. phosphorylation) matches the type of reaction, as speci-
fied in the reaction definition.
• If a reverse or alternative event is specified, that event is defined in the model.
• The event description is valid.
• All of its conditions are valid.
• There is at most one positive location condition (of the form “A is in 1”) for
every component.
The reasoning behind the last requirement is that a conjunction of positive location
conditions for the same component (e.g. “A is in 1 and A is in 2”) would make
an event impossible, and we would like to exclude that possibility before processing
the model.
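The last event requirement lends itself to a short illustration. In the sketch below, the `LocationCondition` record and the function name are hypothetical; only the rule itself (at most one positive location condition per component) comes from the text.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class LocationCondition:
    component: str
    compartment: int
    positive: bool  # True for "A is in 1", False for "A is not in 1"


def components_with_conflicting_locations(conditions):
    """Names of components that have more than one positive location condition."""
    counts = Counter(c.component for c in conditions if c.positive)
    return sorted(name for name, n in counts.items() if n > 1)


conditions = [
    LocationCondition("A", 1, True),
    LocationCondition("A", 2, True),   # conflicts with the previous condition
    LocationCondition("B", 1, True),
    LocationCondition("B", 2, False),  # negative conditions are not counted
]
print(components_with_conflicting_locations(conditions))
```

An event would pass this particular check only when the returned list is empty.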
An event description is valid if the following hold:
• All components referred to are defined in the model.
• If a site is referred to, the corresponding component contains the site in its defi-
nition.
• If the event is a relocation event, the new compartment is in the list of compart-
ments defined for the component.
• If the event describes a component-wide state change, the component’s definition
contains the corresponding state.
• If the event describes a site-specific state change, the site has the corresponding
state in its definition.
An event condition is valid if the following hold:
• The component mentioned is defined in the model.
• If it is a location condition, the compartment specified is in the list of compart-
ments defined for the component.
• If it is a state condition referring to a specific site, the component has the speci-
fied site and the state mentioned (e.g. active) corresponds to the one in the site
definition.
• If it is a state condition referring to a component-wide state, that state is defined
in the component definition.
It is easy to see that these conditions depend directly and in a simple way on the
definitions contained in the model, and therefore the validation can take place without
any need for analysing the dynamic behaviour of the system described.
Example 1. Consider two events with the following formal descriptions:
if A is in 2 and B is active then B phosphorylates A
if A.Y23 is active then A deactivates C
In order for the first definition to be valid, the definition of component A must
include compartment 2 in the list of compartments. Additionally, B must have a
component-wide state named (of type) “active”, and A must have a component-wide
state of type “phosphorylated”. There is a positive location condition referring to A
but it is the only one, so that poses no problem.
For the second definition, A must have a site named “Y23” of type “active”. Com-
ponent C must have a component-wide state of type “active”.
In both cases, the full event definition will include a reference to a reaction. This
reaction must be defined in the model and its type must correspond to the event des-
cription, i.e. the reaction mentioned by the first event must be declared as a phospho-
rylation, and the one by the second as a deactivation.
These conditions are checked one by one and, if any of them are found to be false,
the validation process stops with a negative result. □
3.2 Processing
Before the algorithm is presented, an issue must be pointed out that is central to the
translation process. In the NL, a component can record changes by changing the status
of its sites and component-wide states. On the other hand, Bio-PEPA species have
no way of holding internal state. This means that there is not a one-to-one mapping
from NL components to Bio-PEPA species, but rather a one-to-many correspondence.
Each of the species to which a component corresponds represents a variant of that
component, i.e. a different combination of values in its sites and states.
In order to enumerate the variants of a component, in the following we use the
naming convention explained here, which is the same one used in the implementation
of the algorithm. If a component is named A, then the names of its variants begin
with A. Then, for every component-wide state, the value of that state (e.g. “active”,
“unphosphorylated”) is appended. After that, for every site, a string of the form x:y is
appended, where x is the name of the site and y is its state (as before). All the appended
strings are separated by underscore characters (“_”). The component-wide
states are appended in alphabetical order, according to the positive value of the state
(i.e. “active” would come before “phosphorylated”, therefore “inactive” would come
before “unphosphorylated”). The strings corresponding to the sites are added in alpha-
betical order, according to the site name (i.e. “active-site:inactive” would come before
“hydrolysis-site:unhydrolysed”).
Additionally, if we need to refer to a variant which exists in a specific location, we
adopt the notation used in the Bio-PEPA plugin. This uses the form x@y, where x is
the variant’s name and y is the name of the location.
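The naming convention can be made concrete with a short sketch. The function names, the data layout and the `NEGATIVE` lookup table are assumptions introduced for illustration; the table lists only the state types mentioned in this chapter and sidesteps the question of how negative value names are derived in general.

```python
# Hypothetical lookup from the positive value of a state type to its negative value.
NEGATIVE = {
    "active": "inactive",
    "phosphorylated": "unphosphorylated",
    "hydrolysed": "unhydrolysed",
}


def value_name(state_type, is_on):
    """Return the positive or negative value name for a state type."""
    return state_type if is_on else NEGATIVE[state_type]


def variant_name(component, states, sites, location=None):
    """Build a variant name.

    states: {state_type: bool} for component-wide states
    sites:  {site_name: (state_type, bool)} for site-specific states
    """
    parts = [component]
    # Component-wide states, ordered by the positive value of the state.
    for state_type in sorted(states):
        parts.append(value_name(state_type, states[state_type]))
    # Sites, ordered by site name, in the form x:y.
    for site in sorted(sites):
        state_type, is_on = sites[site]
        parts.append(f"{site}:{value_name(state_type, is_on)}")
    name = "_".join(parts)
    # Optional location suffix, in the x@y form.
    return name if location is None else f"{name}@{location}"


print(variant_name("A", {"phosphorylated": False, "active": True},
                   {"Y23": ("active", False)}, location="cytosol"))
# prints "A_active_unphosphorylated_Y23:inactive@cytosol"
```

Sorting by the positive value is what places “unphosphorylated” after “active” here, even though the string “unphosphorylated” itself does not determine the order.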
Type           Description    Affected  Involved  Keywords
Monomolecular  A keyword      A         -         (de)phosphorylates, (de)hydrolyses,
                                                  synthesises, degrades
Bimolecular    A keyword B    B         A         as above, plus (de)activates
(Un)binding    A (un)binds B  A, B      -

Table 3.1: Affected and involved components by event type
3.2.1 Getting the species involved
This is done in two steps. The first is to obtain the components that are involved in the
event, which is easily achieved by examining the event’s formal description. We dis-
tinguish between two ways in which a component can take part in an event: an affected
component is one that undergoes some change during the event, such as a state change
or binding, while an involved component is one that acts without being affected. In
biochemical terms, the affected components correspond to the reactants and products
of a reaction, and the involved ones to enzymes or inhibitors. This distinction is useful
for determining the Bio-PEPA species that participate in a reaction and their roles in
the resulting model, as will be explained. In most cases, an event will have one affec-
ted and at most one involved component, but this depends exclusively on the type of
event, as seen from its description.
For monomolecular events of the form “A keyword”, where keyword is one of
“phosphorylates”, “dephosphorylates”, “hydrolyses” or “dehydrolyses”, the only com-
ponent that takes part is A, and it undergoes the change indicated by the keyword. The-
refore, the only affected component is A, with no involved components. Bimolecular
events have the form “A keyword B”, where keyword can also be “activates” or “deac-
tivates”, in addition to the previous options. In this case, A causes a change in B but is
itself unaffected. Therefore, B is an affected and A is an involved component.
There are four types of events that are treated differently: synthesis, degradation,
binding and unbinding. The first two can, at this stage, be treated exactly as described
in the previous paragraph, depending on whether they are monomolecular (“A synthe-
sises/degrades”) or bimolecular (“A synthesises/degrades B”). Finally, for a binding
event of the form “A binds B”, both A and B are considered affected and there are no
involved components, and the same is true for unbinding events.
The above descriptions are summarised in Table 3.1.
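The classification in Table 3.1 can be sketched as a small function. This is an illustration under simplifying assumptions: the event description is taken to be a plain string of two or three whitespace-separated words with any conditions already removed, and the function name is hypothetical.

```python
def affected_and_involved(description):
    """Split an event description into (affected, involved) component names."""
    words = description.split()
    if len(words) == 2:
        # Monomolecular: "A keyword". A is affected, nothing is involved.
        subject, _keyword = words
        return [subject], []
    if len(words) == 3:
        subject, keyword, target = words
        if keyword in ("binds", "unbinds"):
            # (Un)binding: both components are affected.
            return [subject, target], []
        # Bimolecular: A acts on B, so B is affected and A is involved.
        return [target], [subject]
    raise ValueError(f"unsupported event description: {description!r}")


print(affected_and_involved("A phosphorylates"))
print(affected_and_involved("A activates B"))
print(affected_and_involved("A binds B"))
```

Synthesis and degradation need no special case at this stage, since they fall under the monomolecular or bimolecular branch depending on their form.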
Once the affected and involved components are identified, we need to obtain the
corresponding Bio-PEPA species. As mentioned above, one NL component can, in
general, correspond to more than one species in the Bio-PEPA model. However, if a
component is involved in or affected by an event, it is not necessarily the case that all
of its variants can take part in it. This is because an event description can include the
conditions under which it can occur, imposing restrictions on when the component can
take part. It is therefore these conditions that determine which variants of a component
are associated with an event.
Let getVariants(component,conditions) indicate a function which takes a
component and a list of conditions and returns a list of the variants of that component which
satisfy all the conditions in the list. Since conditions can refer either to the state or to
the location of the component, the task of identifying the variants can correspondingly
be divided into two sub-tasks.
The first is that of applying the restrictions indicated by the state conditions. This
process is denoted getStateVariants and can be achieved by iterating through all
of the component’s sites and component-wide states and checking whether a condition
exists for that site or state. If so, the corresponding value is appended to the variant
name. If no condition is present, then two new variants are created, one for each value
of the site or state. This is described in Algorithm 3.1, where “positiveValue” and
“negativeValue” denote the two possible values of a site or state; for example, for a
site of type “active”, the values are “active” and “inactive”, respectively.
Example 2. Consider a simple model with two components, A and B. A has a site x
which can be active or inactive, and the entire component can be hydrolysed or
unhydrolysed. B can be phosphorylated or unphosphorylated. There are two compartments
in the system, cytosol (with ID 1) and exterior (with ID 2). B can be in either
compartment, while A can only exist in the cytosol. There is only one event, with the formal
description:
if A.x is active and B is in 1 and B is not phosphorylated then A phosphorylates B
This is a bimolecular state-changing event, so according to Table 3.1, B is the affected
and A is the involved component. We first obtain the variants of each component
specified by the state conditions, using getStateVariants. The only condition on
A is that the site x be active, but there are no constraints on the component-wide
hydrolysis state. Therefore, there are two variants of A that can take part in the event,
and getStateVariants(A) = {A_hydrolysed_x:active, A_unhydrolysed_x:active}.
B only has one state and is constrained by its state condition, so getStateVariants(B)
= {B_unphosphorylated}. □
Algorithm 3.1 getStateVariants: Creating the variants from the state conditions
variants := {component}
for each component-wide state do
    newVariants := ∅
    if list contains a condition for state then
        for each variant ∈ variants do
            add variant_condition to newVariants
        end for
    else
        for each variant ∈ variants do
            add variant_positiveValue to newVariants
            add variant_negativeValue to newVariants
        end for
    end if
    variants := newVariants
end for
for each site do
    newVariants := ∅
    if list contains a condition for site then
        for each variant ∈ variants do
            add variant_site:condition to newVariants
        end for
    else
        for each variant ∈ variants do
            add variant_site:positiveValue to newVariants
            add variant_site:negativeValue to newVariants
        end for
    end if
    variants := newVariants
end for
return variants
Algorithm 3.2 getCompartments: Processing the location conditions
compartments := component.compartments
for each condition ∈ conditions do
    if condition is positive then
        return condition.compartment
    else
        compartments := compartments \ {condition.compartment}
    end if
end for
return compartments
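As an informal companion to Algorithm 3.1, the following Python sketch reproduces the results of Example 2. The data layout is an assumption made for illustration: states and sites are dictionaries mapping a name to its (positive, negative) value pair, and conditions map a state or site name to the required value. The function name get_state_variants is likewise hypothetical.

```python
def get_state_variants(component, states, sites, conditions):
    """Expand a component name into the variants allowed by the state conditions.

    states / sites: dict mapping a name to its (positiveValue, negativeValue) pair.
    conditions: dict mapping a state or site name to the value it must have.
    """
    variants = [component]
    # Component-wide states contribute "_value" to the name.
    for state, (positive, negative) in states.items():
        values = [conditions[state]] if state in conditions else [positive, negative]
        variants = [f"{v}_{value}" for v in variants for value in values]
    # Sites contribute "_site:value" to the name.
    for site, (positive, negative) in sites.items():
        values = [conditions[site]] if site in conditions else [positive, negative]
        variants = [f"{v}_{site}:{value}" for v in variants for value in values]
    return variants
```

Each unconstrained state or site doubles the number of variants, whereas a constrained one only extends the existing names.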
The second sub-task concerns applying the location conditions. This can be done
independently of the results of the previous procedure, as described here and also shown
in Algorithm 3.2. If there is a positive location condition specifying a compartment
(if one exists, then the validation process ensures that it will be unique), we return
that compartment. Otherwise, we start with a list of possible compartments from the
component’s definition, and exclude those for which a negative condition exists. In
the general case, this yields a list of compartments (containing a single element in the
case where a positive condition is present). We call the function performing this task
getCompartments.
Example 3. Continuing the previous example, we now use getCompartments for the
two components. There are no location conditions for A, so the function will return
all the compartments listed in its definition: getCompartments(A) = {cytosol}. The
condition on B is positive, so Algorithm 3.2 terminates immediately and returns the
corresponding compartment: getCompartments(B) = {cytosol}. □
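The location step can be sketched in the same style. In this hypothetical representation, a location condition is a pair (compartment, is_positive); the function name and argument layout are assumptions for illustration only.

```python
def get_compartments(component_compartments, conditions):
    """Return the compartments allowed by a list of (compartment, is_positive) conditions."""
    compartments = list(component_compartments)
    for compartment, is_positive in conditions:
        if is_positive:
            # A positive condition (unique after validation) fixes the location.
            return [compartment]
        # A negative condition excludes one compartment.
        compartments = [c for c in compartments if c != compartment]
    return compartments
```

With no conditions the component's full compartment list is returned unchanged, which is the case the simplification described next relies on.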
We then combine the results of the two algorithms as follows. For every variant
returned by getStateVariants, we create a set of new variants, obtained by
appending to it every compartment returned by getCompartments. We then return the union
of those sets (Algorithm 3.3). The Bio-PEPA syntax allows one simplification of this
procedure: if no compartment is specified after a species name, then that refers to the
species in all possible locations. Therefore, if no compartments are excluded during
Algorithm 3.2, i.e. if there are no location conditions, we simply return the variants as
obtained from Algorithm 3.1, without appending anything.
Algorithm 3.3 getVariants: Getting the final list of species
stateVariants := getStateVariants(component,stateConditions)
compartments := getCompartments(component,locConditions)
if compartments = component.compartments then
    return stateVariants
else
    variants := ∅
    for each variant ∈ stateVariants do
        for each compartment ∈ compartments do
            variants := variants ∪ {variant@compartment}
        end for
    end for
    return variants
end if
Example 4. Using the results of the two previous examples, we can use Algorithm 3.3
to obtain the final list of variants. For A, we note that getCompartments returned all
possible compartments, and so we do not need to append anything to the state variants.
Thus, getVariants(A) = {A_hydrolysed_x:active, A_unhydrolysed_x:active}. For
B, getCompartments returned only the cytosol compartment. Since this is not the
full list mentioned in the definition, we will append the compartment name to the
variants returned by getStateVariants. In this case, there was only one, and so
getVariants(B) = {B_unphosphorylated@cytosol}. □
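The combination step of Algorithm 3.3 can be illustrated as follows. The sketch takes the outputs of the two earlier steps as plain lists; the function name combine_variants is hypothetical.

```python
def combine_variants(state_variants, compartments, all_compartments):
    """Attach '@compartment' suffixes unless every compartment is allowed."""
    if set(compartments) == set(all_compartments):
        # No compartment was excluded: the bare name covers all locations.
        return list(state_variants)
    return [f"{variant}@{compartment}"
            for variant in state_variants
            for compartment in compartments]
```

Applied to the values from Examples 2 and 3, this leaves the two variants of A unsuffixed and appends @cytosol to the single variant of B.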
3.2.2 Applying the event
The steps so far have given us the species that act as enzymes (from the involved
components) and as reactants (from the affected components). However, we still need
to specify the products of the reaction. From the definitions in the previous sections,
we know that the products will be variants of the affected components, differing from
the variants that act as reactants to reflect the change described by the event. For an
activation event, for instance, the input (reactant) variant may be A_inactive, in which
case the output (product) would be A_active.
In general, the product species depends on the reactant. This can be seen from the
fact that an event changes a site or state of a component, but, since input variants can
differ on other sites or states, the corresponding outputs should also differ. However,
for a given reactant species and event, the product species is fully determined. The
exact method for determining the output depends on the type of the event.
Event type        Old state          New state
Activation        inactive           active
Phosphorylation   unphosphorylated   phosphorylated
Hydrolysis        unhydrolysed       hydrolysed
Table 3.2: Substitutions for state-changing events
For events that change the state of a component or a site (activation,
phosphorylation, hydrolysis and their inverse), the mapping from input to output is straightforward.
Using the naming convention described earlier, applying the event consists of a simple
substitution in the name of the species, changing the old state to the new one, as in the
example above. The possible changes are shown in Table 3.2. For the inverse events
(e.g. deactivation), the old and new states are simply switched. In the case where an
event affects a site rather than a component-wide state, the pattern to be substituted is
site:old-state, changing to site:new-state.
Relocation events can similarly be represented as name changes. In this case, the
part of the species name that changes is the @location suffix, and it is updated to reflect
the destination compartment.
The above processes can be encapsulated in a function applyChange, which takes
a species name and an event as arguments, and returns the species name obtained after
applying the change described by the event.
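A possible shape for applyChange is sketched below. It is an illustration rather than the implemented method: the argument layout, the names apply_change and relocate, and the choice to substitute whole name tokens (so that replacing “phosphorylated” never touches “unphosphorylated” on another site) are assumptions. It also assumes component names contain no underscore.

```python
# Old/new values per event type, following Table 3.2.
SUBSTITUTIONS = {
    "activation": ("inactive", "active"),
    "phosphorylation": ("unphosphorylated", "phosphorylated"),
    "hydrolysis": ("unhydrolysed", "hydrolysed"),
}

def apply_change(species, event_type, site=None, inverse=False):
    """Return the product name for a state-changing event applied to a species."""
    old, new = SUBSTITUTIONS[event_type]
    if inverse:  # e.g. deactivation: old and new are switched
        old, new = new, old
    name, at, location = species.partition("@")
    target = f"{site}:{old}" if site else old
    replacement = f"{site}:{new}" if site else new
    # Replace only the exact token, leaving other sites and states untouched.
    tokens = [replacement if token == target else token for token in name.split("_")]
    return "_".join(tokens) + at + location

def relocate(species, destination):
    """Return the species name with its @location suffix set to the destination."""
    return species.partition("@")[0] + "@" + destination
```

For the running example, apply_change("B_unphosphorylated@cytosol", "phosphorylation") gives B_phosphorylated@cytosol.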
The cases of degradation and synthesis events are treated differently. For the
former, there is no output. For the latter, the output should be a variant of the
affected component. However, the event description does not provide any conditions
for the product, and so we cannot use the getStateVariants function described
previously to get a concrete species. In light of this, we have decided to use the
component definition, specifically the initial state described in it. We define a
function initial which acts on a component, treating its initial state specification
as a list of conditions and returning the corresponding variant. Note that, unlike
getStateVariants, this will always return a single variant and not a list, since the
initial state specifies the value of all sites and component-wide states.
For bindings, the output species will be a complex and its name will have the form
x::y, where x and y are the input variants. Similarly, the output of an unbinding event
can be found by splitting the name of the input variant. A more detailed analysis of
how complexes and (un)bindings are treated is contained in the following chapter.
Table 3.3 summarises the different ways of obtaining the output for the various
reaction types.
Event type                   Input   Output
State-changing, relocation   A       applyChange(A)
Synthesis                    ∅       initial(A)
Degradation                  A       ∅
Binding                      A, B    A::B
Table 3.3: Output by event type
3.2.3 Defining the reactions
Once all the participating species have been determined through the above process,
the final part of the algorithm is trivial and consists mainly of combining the species
obtained in the previous steps. First, we note that if the event conditions give rise to
more than one variant for both the affected and involved components, there is more
than one way to combine them and no reason why a combination should be excluded.
For every input variant (or combination of them from different affected
components), we obtain the corresponding output variant(s). We then take every combination
of these with the variants of the involved components and, for every such combination,
we define a new Bio-PEPA reaction. That reaction will have the same kinetic law as
the reaction corresponding to the original event. We also record each species’ role
within the reaction: input variants as reactants, output ones as products and variants of
the involved components as enzymes. This can be thought of as creating a variant of
the NL reaction specified by the event.
Example 5. In the running example, we have one affected component (with one
variant) and one involved (with two variants). We start with the affected (input) variant,
B_unphosphorylated@cytosol. Since the event is a phosphorylation, in order to obtain
the product of the reaction we use applyChange, which results in the output variant
B_phosphorylated@cytosol. We now have to combine these with the two variants of
the involved component. This will define two Bio-PEPA reactions, both of which will
have the same reactants and products (the input and output variants referred to above,
respectively). The difference between the two reactions will be the enzyme, as in each
one it will be a different variant of the involved component.
If there were more variants of B, we would repeat this process, using a different
variant as input each time, finding the output and taking the combinations, as above. □
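The combination step for the simple case of one affected and one involved component might look like the sketch below. The function name define_reactions, the dictionary layout and the numbering scheme for reaction names are hypothetical; output_of stands in for whatever function produces the product of a given reactant.

```python
from itertools import product

def define_reactions(base_name, inputs, involved, output_of):
    """Create one reaction record per (input variant, involved variant) pair."""
    reactions = []
    for index, (reactant, enzyme) in enumerate(product(inputs, involved), start=1):
        reactions.append({
            "name": f"{base_name}{index}",   # hypothetical naming scheme
            "reactant": reactant,
            "product": output_of(reactant),  # fully determined by the reactant
            "enzyme": enzyme,
        })
    return reactions
```

With one input variant and two enzyme variants, as in Example 5, this yields two reactions that differ only in their enzyme.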
3.3 Output
Having completed the processing steps, writing the output model also becomes
straightforward.
The first parts to be written are constant and location definitions. These will not
be affected by the translation (unlike components and reactions) and, since all the
necessary information is contained in the NL definitions for them, all that is needed is
simply converting the definitions to the Bio-PEPA syntax.
The next part of the model is the specification of kinetic laws. Each of the newly-
defined Bio-PEPA reactions has the same kinetic law as the NL reaction of which it is
a variant, and this can be retrieved from the reaction definitions in the original model.
The NL syntax we have adopted for this project supports constant rates, as well as
mass-action, Michaelis-Menten and Hill kinetics, all of which are supported in Bio-
PEPA, so we simply need to iterate over the reaction variants and use the appropriate
syntax.
Next, we write the species definitions. As we have recorded every species’ role
when defining the reaction variants, this is simply a question of iterating over the roles
associated with a species and converting them to Bio-PEPA syntax. The conversion
is trivial, since it only involves writing the name of the reaction variant and then the
symbol for the role, both of which are readily available.
Example 6. To complete the running example, assume that the two reactions defined
in the last example are named r1 and r2. The species definitions will then be:
B_unphosphorylated = r1↓@cytosol + r2↓@cytosol
B_phosphorylated = r1↑@cytosol + r2↑@cytosol
A_hydrolysed_x:active = r1⊕
A_unhydrolysed_x:active = r2⊕ □
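The writing of species definitions from recorded roles can be sketched as follows. The sketch only reproduces the abbreviated notation used in Example 6, not the full concrete Bio-PEPA syntax; the triple layout for roles and the function name species_definitions are assumptions.

```python
from collections import defaultdict

# Role symbols as they appear in Example 6: reactant, product, enzyme.
SYMBOLS = {"reactant": "↓", "product": "↑", "enzyme": "⊕"}

def species_definitions(roles):
    """Group (species, reaction, role) triples into one definition line per species."""
    terms = defaultdict(list)
    for species, reaction, role in roles:
        # The location suffix moves from the species name to each term.
        name, at, location = species.partition("@")
        terms[name].append(f"{reaction}{SYMBOLS[role]}{at}{location}")
    return [f"{name} = {' + '.join(parts)}" for name, parts in terms.items()]
```

Feeding it the six roles recorded for reactions r1 and r2 gives back the four definition lines of the example.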
The final part is the model component. This must include all variants of every
component in the original model, as well as their initial quantities. For every component,
we first retrieve all its variants using getVariants with an empty condition list. We
must then combine the names of these variants with all possible compartments. The
component’s initial state definition specifies only one variant, so all the species will
have an initial quantity of zero, except for the one returned by initial, which will
have the initial quantity found in the definition.
3.4 Analysis: combinatorial explosion
Perhaps the most interesting and important characteristic of the algorithm as described
above is its complexity in terms of the number of species in the resulting model. As
described previously, one NL component can correspond to many Bio-PEPA species.
In general, a component with n states (that is, sites or component-wide states, as for
the purposes of this analysis the distinction is irrelevant) can result in 2^n variants, given
that all states are binary. This exponential increase means that output models are likely
to be much larger than their corresponding input models, if size is measured as the
number of components/species.
The situation is further encumbered when it comes to reactions. We mentioned
above that every combination of the affected and involved variants must be considered,
and a new reaction defined for each of them. This means that the number of reactions
will also be exponential in the number of states. For the case of an event with one
involved component with n states and one affected component with m, if there are
no state conditions to impose, there are 2^n and 2^m variants which can take part in
the reaction as enzyme or reactant, respectively. Since we have to take all possible
combinations, this will result in 2^(n+m) reaction variants being defined. While this
worst-case scenario might not be entirely realistic, as some conditions would likely exist, thus
limiting the number of variants, the number of reactions would still be exponential in
the number of free (unrestricted by conditions) states.
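The growth described above can be checked by brute-force enumeration. The sketch below assumes, for simplicity, that every state is a binary activation site; the function name enumerate_variants and the site names are illustrative only.

```python
from itertools import product

def enumerate_variants(name, sites):
    """List every variant of a component whose sites are all binary and unconstrained."""
    return [
        name + "".join(f"_{site}:{value}" for site, value in zip(sites, values))
        for values in product(("active", "inactive"), repeat=len(sites))
    ]

enzymes = enumerate_variants("A", ["x", "y", "z"])   # n = 3 free states
reactants = enumerate_variants("B", ["p", "q"])      # m = 2 free states
reaction_variants = list(product(enzymes, reactants))
```

Here the enzyme has 2^3 variants, the reactant has 2^2, and the number of reaction variants is their product, 2^(3+2).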
Unfortunately, it is not possible to completely avoid this issue. Ultimately, it is
rooted in the lack of a mechanism for specifying internal state in Bio-PEPA. This
is the reason for the one-to-many mapping from components to species, which then
propagates to the number of reactions.
However, we can attempt to reduce the number of species by analysing the model
and looking for useful properties. Informally, we can think of the transitions between
variants as forming a graph with the variants as nodes. We believe that this
representation makes it easier to describe the problem and possible solutions, and also offers a
variety of tools in the form of graph theory. We present the general idea behind two
optimizations, illustrated by examples of cases in which they may be useful.
Figure 3.1: State transition graph for the variant-pruning optimization example
(nodes: x:inactive/y:inactive, x:inactive/y:active, x:active/y:active, x:active/y:inactive)
3.5 Optimizations
The first optimization involves reducing the number of species by identifying variants
which are never going to appear, as shown in the example.
Example 7. Consider the component A with two sites, x and y, both of which can be
active or inactive. Their activation is described by the following four events, which
determine the order in which activations can occur.
if A.x is not active and A.y is not active then B activates A on y
if A.x is not active and A.y is active then B activates A on x
if A.x is active and A.y is active then C deactivates A on x
if A.x is not active and A.y is active then C deactivates A on y
Examining the sequence of events that can occur, we can see that there is no way
for x to be active if y is not also active. The corresponding graph of state transitions
is shown in Figure 3.1. The variant A_x:active_y:inactive is unreachable and can
therefore be removed, reducing the number of species derived from A to three. Any
reactions in which this species was involved can also be deleted, further reducing the
size of the output model. □
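One way to detect such variants is a breadth-first search over the transition graph. The sketch below encodes the four events of Example 7 by hand as edges between (x, y) value pairs and assumes, since the example does not state it, that the initial state has both sites inactive. The function name unreachable is hypothetical.

```python
from collections import deque

# Variants of A as (x, y) pairs; edges follow the four events of Example 7.
TRANSITIONS = {
    ("inactive", "inactive"): [("inactive", "active")],    # B activates A on y
    ("inactive", "active"): [("active", "active"),         # B activates A on x
                             ("inactive", "inactive")],    # C deactivates A on y
    ("active", "active"): [("inactive", "active")],        # C deactivates A on x
    ("active", "inactive"): [],                            # no event applies
}

def unreachable(initial, transitions):
    """Return the variants that cannot be reached from the initial variant."""
    seen = {initial}
    queue = deque([initial])
    while queue:
        node = queue.popleft()
        for target in transitions.get(node, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return set(transitions) - seen
```

Under the assumed initial state, the only variant reported is the one with x active and y inactive.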
One issue to consider with this optimization is whether the nodes to be pruned
should be unreachable from anywhere or from the initial state. Choosing the latter
imposes a less strict condition and may therefore allow more variants to be pruned. However,
it would mean that changing the initial state specification in a component’s definition
would require the translation algorithm (or at least this optimization step) to be
executed again, as different variants may be reachable from the new initial state. On the other
hand, searching for nodes that are completely isolated might be too strict a restriction
for any noticeable optimization.
The idea behind the second optimization is to find groups of variants which behave
in a similar way, and then replace each group by a single variant.
Example 8. Consider a component A with three phosphorylation sites, x, y and z and
the following event descriptions:
if A.x is not phosphorylated then B phosphorylates A on x
if A.y is not phosphorylated then B phosphorylates A on y
if A.z is not phosphorylated then B phosphorylates A on z
Assume that the three sites are identical, in the sense that they are not involved
in any other reactions (or, if they are, those are exactly the same for each site) and
that all the above phosphorylation reactions occur at the same rate (or simply that the
three events are defined to have the same underlying reaction). It can be seen that the
only thing that matters is the number of phosphorylated sites, and not the specific sites
themselves. For instance, the variants
A_x:phosphorylated_y:unphosphorylated_z:unphosphorylated
A_x:unphosphorylated_y:phosphorylated_z:unphosphorylated
A_x:unphosphorylated_y:unphosphorylated_z:phosphorylated
can be considered equivalent, since they all have exactly one site phosphorylated. If
we replace them by a species A_1phosphorylated, and similarly define the species
A_0phosphorylated, A_2phosphorylated and A_3phosphorylated, the resulting model
has only four species, compared to eight if no optimization takes place. □
In the general case of a component with n identical sites, this optimization would
produce n + 1 variants, while the original unoptimized version would contain 2^n
species. This reduction from exponential to linear complexity indicates that this
optimization may prove more powerful in reducing the size of the model than the first one.
However, it would require a deeper analysis of the model, and the conditions it requires
might prove too strict for it to be applied beyond a limited number of cases.
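The grouping in Example 8 can be illustrated by mapping each full variant name to a lumped name that records only the number of phosphorylated sites. The sketch assumes the sites have already been established as identical, which is the hard part and is not addressed here; the function name lump is hypothetical.

```python
from itertools import product

def lump(name, sites):
    """Group all variants of a component by their number of phosphorylated sites."""
    groups = {}
    for values in product((True, False), repeat=len(sites)):
        variant = name + "".join(
            f"_{site}:{'phosphorylated' if on else 'unphosphorylated'}"
            for site, on in zip(sites, values)
        )
        lumped = f"{name}_{sum(values)}phosphorylated"
        groups.setdefault(lumped, []).append(variant)
    return groups
```

For three sites this produces four groups covering all eight original variants, in line with the n + 1 versus 2^n comparison above.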
Chapter 4
Implementation
In this chapter, we give more details about the concrete implementation of the
algorithm presented in the previous chapter. We point out some small differences between
the theoretical description and the implemented version, as well as some issues we
came across and how they affected the implementation process.
4.1 Parsing
The first necessary step in the translation is to parse the NL model. For this project,
this was done using Xtext [20], an Eclipse plugin for the definition of domain-specific
languages, which offers several advantages. First is its ease of use: the user only needs
to describe the grammar of the language using a syntax similar to Backus-Naur Form.
In our case, therefore, this description closely resembled the contents of Appendix A,
simplifying the process of writing it. The plugin then generates code for the lexical and
syntactical analysis of files written in that language, and, once the parsing is complete,
makes available an object-oriented model of the file, which can be easily accessed by
short and simple code. Additionally, Xtext allows the user to create dynamic tools
such as editors, integrated into the Eclipse environment. While this was not realised
in the project, having this potential for future expansion was a further motive for the
selection of Xtext.
4.2 Processing
The main code for the system is written in Java, for maximum compatibility with the
already existing Bio-PEPA plugin. In general, the implementation closely follows the
description of Chapter 3, reflected in Figure 4.1.
Figure 4.1: Activity diagram of the system
The Validator class is tasked with the validation procedure which, apart from
making sure the relevant conditions as laid out earlier are satisfied, also involves some
preprocessing on the model, such as building hash tables for efficient retrieval of
compartments using their ID or name. All the results of this preprocessing are packaged
in an object of class ProcessedModel, which serves as a means of communication
between this stage and the main processing module. If any part of the NL model is found
to violate the consistency criteria, an exception is raised and the application exits. We
define the exception class NarrLangModelException for use in these situations, with
an appropriate error message detailing the reason why the model was rejected.
Figure 4.2: (Simplified) class diagram of the system
The next step is the processing itself, in which the main bulk of the work takes
place. The corresponding class is Transformer, and it has two tasks. The first is to
process each event in turn, and so produce a list of roles for the species involved.
These are represented by objects of type Role or its subtype, RelocationRole. The
reason for distinguishing roles in relocation reactions is their different syntax in species
definitions, which can be easily dealt with by subclassing, as we do here. Figure 4.2
shows the different classes mentioned so far.
Keeping with the different treatment of the various event types in the algorithm
description, we also have different actions based on an event’s type here. This is
reflected in the class hierarchy shown in Figure 4.3. The idea is that, while obtaining
the involved and affected components and processing the conditions is more or less a
shared process across event types, the ways in which these species are combined and
their products are obtained are quite different. Here, we use subclasses to let an event’s
type determine the appropriate course of action for the species taking part.
Following the processing, the second task of Transformer is writing the model.
This is done as described in the algorithm, using the Role objects from the previous
step. One interesting detail has to do with the writing of the model component, during
which we take advantage of the syntax to define submodels, each comprising all the
species variants corresponding to an NL component. A long list of species is thus broken
down into smaller ones, making the model component more readable and easier to
modify.
Figure 4.3: The different event classes. Class names have been shortened from
“ProcessedEvent” etc. for the sake of presentation.
4.3 Integration with the plugin
The original plan for the project was to integrate the translation application in the
existing Bio-PEPA plugin. The addition would contribute a new menu item, which
would bring up a simple dialog. From there, the user would be able to browse for an
NL file and select the location where the output file would be created. The process
would entail the creation of two new classes, one to serve as delegate for the action
of selecting the menu item, and another for the simple dialog wizard that would be
brought up by that action and, finally, some minor edits to an XML file describing the
plugin to include the addition.
Unfortunately, due to time limitations and the additional need to become familiar
with the Eclipse plugin development environment, this final part of the project could
not be completed. It should be stressed, however, that it is only the graphical
interface and integration that were not realised, and that the translation application itself is
functional.
4.4 Design issues and decisions
In Section 3.1, we presented the requirements that a model must fulfil in order to
be considered valid for the purposes of this project. However, as presented there,
these rules allow for various constructions which may seem unintuitive or biologically
unsound. As an example, an event in which two components interact even though they
are in different compartments is considered valid under these criteria. Moreover, no
steps are taken during the processing of the event to enforce that the components must
be in the same compartment (e.g. by only considering the intersection of their possible
locations).
This was a conscious design decision on our part: we chose to allow the modeller
the highest degree of flexibility (within the limits of the syntax and insofar as no
logical contradictions or inconsistencies occur). This has the obvious side-effect of making
the process more error-prone, as fewer checks are in place to secure the validity of the
model. Nevertheless, we believe that, overall, our decision is not detrimental to the
quality of the translation system. Firstly, a very serious modelling error may well be
obvious or at least easy to detect from analysing the resulting Bio-PEPA model, so
our approach does not necessarily mean errors will go unnoticed. Furthermore, the
additional flexibility may prove to be useful in cases where a higher-level abstraction
is desired, for instance if someone wishes to omit (possibly unknown) intermediate
reactions, which when removed result in the image of components interacting from
different locations, as mentioned earlier.
Another example of this is that we allow components that do not directly take
part in a reaction to influence it indirectly. Thus, an event’s conditions may involve
a component even if it does not act as a reactant or enzyme, such as in the event
if A is phosphorylated, B binds C. In this case, the variants of A specified by
the condition will be represented in the corresponding Bio-PEPA reactions as generic
modifiers, a role that means they must be present for the reaction to occur, but they do
not affect its rate. It is not immediately clear to us in what biochemical context such an
event might arise, except perhaps when considering abstractions as described above,
but the fact that it could be expressed in Bio-PEPA made us allow it for the translation.
On the other hand, in some cases we were restricted by the Bio-PEPA syntax in
what we could express. Such is the case of relocation reactions, for which the current
syntax does not allow the definition of modifiers (including enzymes). We therefore
have to ignore any conditions on non-participating species, which would have
otherwise been represented as generic modifiers, as in the previous paragraph. This also
means that the use of some kinetic laws is not allowed, as it requires that a species
be defined as an enzyme for the reaction. Therefore, for the translation, all relocation
reactions must either have a constant rate or follow mass-action kinetics.
4.4.1 Treatment of bindings and complexes
One part of the implementation worth focusing on is the treatment of bindings. As
mentioned in the previous chapter, we decided to represent a complex as a single
species with a name of the form A::B, rather than two species with their bound states
active. This representation is closer to what one would do if one were modelling directly
in Bio-PEPA, and makes the output model more readable. However, it introduced a
number of complications, because binding sites and the conditions on them now had
to be treated completely differently from other types of sites. For instance, if A has a
binding site and can bind to B or C, then the condition “if A.x is active” might
not represent only the species A_x:active, but also A_x:active::B and A_x:active::C,
or even more if B and C have more than one variant.
In order to avoid confusion, we decided to make the following assumption: unless
there is explicit mention of a component being bound somewhere in an event’s
description, that component is assumed to be free (i.e. unbound). This means that,
unless there is a state condition that forces the component to be bound, or the
component participates in the event as part of a complex, we do not consider what
other components it might be bound to.
Chapter 5
Evaluation
5.1 Test cases and procedure
Testing of the project’s performance occurred in two stages. In the first, while the
code was still being developed, the aim was mostly to locate bugs. For that reason, we
used a number of ad-hoc examples, chosen not because of their biological significance
(since they had none) but in order to test the system under a variety of event types and
make sure it responded correctly.
Once the implementation was finished and adequately tested, we chose two example
models on which to evaluate its behaviour. The evaluation criteria we chose are related
to the task for which the system is designed, keeping in mind that the translation is a
one-off procedure, but the result might be used multiple times. Therefore, we chose
not to measure the execution time of the application, but rather to focus on the
characteristics of the output. Specifically, we wanted to see how easy the output model is to
read, and how close it is to something written directly in Bio-PEPA. Obviously, these
two “metrics” are very subjective and difficult (if not impossible) to measure.
However, we believe that these aspects are more important than the temporal performance of
the algorithm, given what its purpose is. Additionally, for each example, we compared
the dynamic behaviour of the translated system to that of the “original” Bio-PEPA one,
to make sure that they match.
The two models that we used for evaluation are relatively simple but are often used
as examples for modelling. The first is a model of a generic enzymatic reaction with
two steps, which was at the same time a good way to test how the implementation
handles binding events. The second models part of the MAPK cascade [21], involving
three proteins, each of which activates the next.
Figure 5.1: Simulation results (1000 replications) on the original Bio-PEPA enzymatic
model (axes are molecule count vs. time)
5.2 Results and comments
The conclusions we drew for both examples are similar. We ran a stochastic simulation
on the original and the translated model, and verified that their behaviours are identical,
as can be seen from Figures 5.1 and 5.2. In the enzymatic example, there is a one-to-one
correspondence between the species of the two models, and we could say that the
translation is quite close to the “natural” model.
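As an illustration only, the kind of check described above can be sketched outside Bio-PEPA. The sketch assumes a two-step enzymatic mechanism (reversible binding followed by catalysis) with mass-action kinetics; the rate constants, initial counts and species names are hypothetical and are not taken from the thesis models. It runs Gillespie's direct method [2] on two copies of the network that differ only in their species names and checks that, with a shared random seed, the trajectories coincide under the renaming.

```python
import random

def gillespie(reactions, state, t_end, seed):
    """Gillespie's direct method for mass-action reactions.

    reactions: list of (rate_constant, reactant_names, product_names);
    each reactant is assumed to appear at most once per reaction.
    """
    rng = random.Random(seed)
    state = dict(state)
    t = 0.0
    trace = [(t, dict(state))]
    while True:
        # Propensity = rate constant times the count of every reactant.
        props = []
        for rate, reactants, _ in reactions:
            a = rate
            for s in reactants:
                a *= state[s]
            props.append(a)
        total = sum(props)
        if total == 0:
            break  # no reaction can fire any more
        t += rng.expovariate(total)
        if t > t_end:
            break
        # Choose the next reaction with probability proportional to its propensity.
        threshold = rng.random() * total
        acc = 0.0
        for (_, reactants, products), a in zip(reactions, props):
            acc += a
            if threshold < acc:
                break
        for s in reactants:
            state[s] -= 1
        for s in products:
            state[s] += 1
        trace.append((t, dict(state)))
    return trace

def enzymatic(names):
    """Assumed mechanism: E + S <-> E:S -> E + P (rates are illustrative)."""
    E, S, ES, P = names
    return [
        (0.01, [E, S], [ES]),  # binding
        (0.1, [ES], [E, S]),   # unbinding
        (0.5, [ES], [E, P]),   # catalysis
    ]

orig_names = ("E", "S", "ES", "P")
# Hypothetical long names standing in for machine-generated ones.
trans_names = ("E_free", "S_free", "E_S_bound", "P_free")
init = (10, 100, 0, 0)

orig = gillespie(enzymatic(orig_names), dict(zip(orig_names, init)), 50.0, seed=1)
trans = gillespie(enzymatic(trans_names), dict(zip(trans_names, init)), 50.0, seed=1)

# Map the translated names back and compare the two traces event by event.
rename = dict(zip(trans_names, orig_names))
same = len(orig) == len(trans) and all(
    t1 == t2 and s1 == {rename[k]: v for k, v in s2.items()}
    for (t1, s1), (t2, s2) in zip(orig, trans)
)
print(same)
```

The exact match relies on the two networks listing structurally identical reactions in the same order. When the translated model is organised differently, the comparison has to be made on averages over many replications, as in Figures 5.1 and 5.2.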
The translated model is inherently harder to read, for at least two reasons which
have to do with the translation procedure. The first is that, during translation, the
reactions are given alphanumeric names indicating which NL reaction they originated
from. While systematic, this is not very helpful when reading the model and trying to
understand the function of a reaction from the species involved. This is clearly easier
when one has the freedom to name reactions according to their purpose, something
which is impossible in the algorithm.
The other reason is that the names of the variants generated by the algorithm tend
to be long, especially for components with many sites or, worse, when they are bound.
Compare, for instance, the names in Figures 5.1 and 5.2. This makes the model look
more encumbered, especially in some cases where the species name has to be repeated
in its definition, leading to definitions that can become hard to understand or
manipulate. An argument could be made that, even in the original Bio-PEPA model, this
could not be avoided if it is necessary to distinguish between different states of the
same species, as the names could once again end up being variations with a similar
prefix. In any case, although the naming convention we have adopted leads to
particularly long species names, at least it makes it clear what each species
represents, since the state is directly reflected in the name. In this sense, it has
perhaps the opposite effect of the reaction naming algorithm.
Figure 5.2: Simulation results (1000 replications) on the translated Bio-PEPA enzymatic
model (axes as above)
In the case of the MAPK example, the number of species was also the same bet-
ween the two models (it was, however, double the number of NL components, as ex-
pected, since all components had one binary state). The small scale of the two example
systems did not lead to huge absolute increases in the number of species from the trans-
lation.
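The doubling follows from simple counting: a component with k independent binary states or sites expands into 2^k variants, so one binary state per component gives two species each. The sketch below makes this concrete. The component names, the flag names and the naming scheme (component name followed by one tag per flag) are hypothetical and are not the convention implemented in this project.

```python
from itertools import product

def variants(component, flags):
    """Enumerate every on/off combination of a component's binary flags.

    The naming scheme used here is an assumption made for illustration.
    """
    names = []
    for values in product([False, True], repeat=len(flags)):
        tags = [flag if on else "not_" + flag for flag, on in zip(flags, values)]
        names.append("_".join([component] + tags))
    return names

# Three proteins with one binary state each (placeholder names).
components = {"MAPKKK": ["active"], "MAPKK": ["active"], "MAPK": ["active"]}
species = [name for comp, flags in components.items() for name in variants(comp, flags)]
print(len(components), "components ->", len(species), "species")

# A single component with three binary flags already yields 2**3 variants.
for name in variants("R", ["s1_phosphorylated", "s2_phosphorylated", "bound"]):
    print(name)
```

The second loop also shows how quickly names lengthen once a component carries several sites, which is the readability cost discussed above.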
Chapter 6
Conclusions
We developed an algorithm for translating models written in the Narrative Language
into Bio-PEPA, and implemented it in a Java application. Although we did not manage
to integrate the application into the Bio-PEPA plugin, test results show that it succeeds
in capturing the behaviour described in the Narrative Language model. Additionally, in
the two test cases we analysed, the resulting models appeared to be comprehensible and
easy to read, although not as much as the ones written directly in Bio-PEPA. However,
they were very similar to them, which we can take as an encouraging sign for the
usefulness of this project.
In our view, the greatest problem with the algorithm is the combinatorial explosion
in the number of species during the translation. We did not implement any
optimizations to the algorithm to deal with the issue, but this was not a necessary part
of the original proposal. Furthermore, we believe the optimization ideas proposed earlier
in the discussion merit further investigation, and we reiterate that the problem cannot
be completely solved, as it is inherent in the current syntax of the language.
6.1 Future work
There are a number of ways in which the work described here can be extended. Firstly,
we believe that there is potential in exploring optimizations to the algorithm in order
to address the combinatorial issues. A first approach would therefore be to implement
a more sophisticated version of the algorithm, using the optimizations described in
Section 3.5. This would require researching how effective they are and in which cases
they can be applied with significant results. Another possibility that could be investi-
gated is the use of model-checking, instead of graph theory, to check for properties of
the system. This line of research could perhaps shed light on particular aspects of the
problem and lead to the development of further optimizations.
An alternative approach for dealing with this, at least on a more superficial level,
would be to extend the Bio-PEPA syntax to include some form of denoting internal
state. This would be in the same spirit as the extension of the language to support
locations more extensively [22]. This included special syntax for the definition of
compartments and membranes, as well as for relocation of species, which led to models
being more readable and compact. However, the extension is provided only to help the
modeller and these descriptions are then automatically converted to the “traditional”
syntax. Similar “syntactic sugar” extensions could be introduced for specifying states
for species. It may also be interesting to look into extending the “true” syntax, i.e.
supporting such state qualifications without needing to map them to the original syntax.
A different direction in which this work can be extended would be to widen the
range of input languages supported. For instance, an approach similar to this one could
perhaps be applied in order to translate models from other rule-based formalisms, such
as rewriting systems. A more ambitious project would be to accept a model written
in something resembling, or at least closer to, actual natural language. Similar work,
although in a different scientific context, that of electrical systems, has been done in
[23]. The current algorithm can also be updated in order to include features that are
introduced in future versions of the NL.
Finally, on the implementation level, the integration with the Bio-PEPA plugin
could be furthered by developing an editor or other graphical tools for the specification
of models in the NL from within the plugin. The definition of the NL grammar in
Xtext, which was realised as part of the project, would serve as a useful basis for the
development of these tools.
Bibliography
[1] Jasmin Fisher and Thomas A Henzinger. Executable cell biology. Nat Biotech,
25(11):1239–1249, 2007.
[2] Daniel T. Gillespie. Stochastic Simulation of Chemical Kinetics. Annual Review
of Physical Chemistry, 58:35–55, 2007.
[3] Aviv Regev and Ehud Shapiro. Cells as Computation. Nature, 419(6905):343,
2002.
[4] Federica Ciocchetta and Jane Hillston. Process Algebras in Systems Biology. In
SFM’08, volume 5016 of LNCS, pages 265–312. Springer-Verlag, 2008.
[5] C. Priami, A. Regev, W. Silverman, and E. Shapiro. Application of a stochastic
name-passing calculus to representation and simulation of molecular processes.
Information Processing Letters, 80(1):25–31, 2001.
[6] Vincent Danos and Jean Krivine. Formal Molecular Biology Done in CCS-R.
Electronic Notes in Theoretical Computer Science, 180(3):31 – 49, 2007. Procee-
dings of the First Workshop on Concurrent Models in Molecular Biology (Bio-
Concur 2003).
[7] Muffy Calder, Stephen Gilmore, and Jane Hillston. Modelling the influence of
RKIP on the ERK signalling pathway using the stochastic process algebra PEPA.
In Transactions on Computational Systems Biology VII, number 4230 in LNCS,
pages 1–23. Springer, 2006.
[8] Marta Kwiatkowska, Gethin Norman, and David Parker. Using probabilistic mo-
del checking in systems biology. ACM SIGMETRICS Performance Evaluation
Review, 35(4):14–21, 2008.
[9] Corrado Priami and Paola Quaglia. Operational patterns in Beta-binders. Tran-
sactions on Computational Systems Biology, 1:50–65, 2005.
[10] Federica Ciocchetta and Jane Hillston. Bio-PEPA: A framework for the model-
ling and analysis of biological systems. Theoretical Computer Science, 410(33-
34):3065 – 3084, 2009.
[11] Jane Hillston. A Compositional Approach to Performance Modelling. Cambridge
University Press, 1996.
[12] The Bio-PEPA Eclipse Plugin. http://homepages.inf.ed.ac.uk/s9552712/
bio-pepa/plugin.html.
[13] Andrew Hinton, Marta Kwiatkowska, Gethin Norman, and David Parker.
PRISM: A tool for automatic verification of probabilistic systems. In Proc. of
TACAS’06, volume 3920 of LNCS, pages 441–444, 2006.
[14] Maria Luisa Guerriero. From Intuitive Descriptions of Biochemical Systems to
Their Formal Analysis. PhD thesis, ICT School - DIT - University of Trento,
2007.
[15] Maria Luisa Guerriero, John K. Heath, and Corrado Priami. An Automated
Translation from a Narrative Language for Biological Modelling into Process Al-
gebra. In Proceedings of Computational Methods in Systems Biology (CMSB’07),
volume 4695 of LNCS, pages 136–151. Springer, 2007.
[16] Monika Heiner, David Gilbert, and Robin Donaldson. Petri Nets for Systems and
Synthetic Biology. In SFM’08, volume 5016 of LNCS, pages 215–264. Springer,
2008.
[17] Mario J. Pérez-Jiménez and Francisco J. Romero-Campero. P Systems, a New
Computational Modelling Tool for Systems Biology. Transactions on Computa-
tional Systems Biology, 6:176–197, 2006.
[18] Vincent Danos, Jérôme Feret, Walter Fontana, Russell Harmer, and Jean Krivine.
Rule-Based Modelling of Cellular Signalling. In CONCUR 2007 - Concurrency
Theory, volume 4703 of Lecture Notes in Computer Science, pages 17–41. Sprin-
ger Berlin / Heidelberg, 2007.
[19] Nicolas Le Novere, Michael Hucka, Huaiyu Mi, Stuart Moodie, Falk Schrei-
ber, Anatoly Sorokin, Emek Demir, Katja Wegner, Mirit I Aladjem, Sarala M
Wimalaratne, Frank T Bergman, Ralph Gauges, Peter Ghazal, Hideya Kawaji,
Lu Li, Yukiko Matsuoka, Alice Villeger, Sarah E Boyd, Laurence Calzone, Me-
lanie Courtot, Ugur Dogrusoz, Tom C Freeman, Akira Funahashi, Samik Ghosh,
Akiya Jouraku, Sohyoung Kim, Fedor Kolpakov, Augustin Luna, Sven Sahle, Es-
ther Schmidt, Steven Watterson, Guanming Wu, Igor Goryanin, Douglas B Kell,
Chris Sander, Herbert Sauro, Jacky L Snoep, Kurt Kohn, and Hiroaki Kitano.
The Systems Biology Graphical Notation. Nature Biotechnology, 27(8):735–741,
2009.
[20] Xtext. http://www.eclipse.org/Xtext/.
[21] Rony Seger and Edwin G. Krebs. The MAPK signaling cascade. The FASEB
Journal, 9(9):726–735, 1995.
[22] Federica Ciocchetta and Maria Luisa Guerriero. Modelling Biological Compart-
ments in Bio-PEPA. In Proc. of MeCBIC’08, volume 227 of ENTCS, pages 77–
95. Elsevier, 2009.
[23] Alexander Holt and Ewan Klein. A semantically-derived subset of English for
hardware verification. In Proceedings of the 37th annual meeting of the Asso-
ciation for Computational Linguistics on Computational Linguistics, ACL ’99,
pages 451–456, 1999.
Appendix A
Narrative Language Syntax
〈model〉 ::= 〈constants_decl〉〈comparts_decl〉〈compons_decl〉〈reacts_decl〉〈procs_decl〉
〈constants_decl〉 ::= Constants 〈constants_list〉
〈comparts_decl〉 ::= Compartments 〈comparts_list〉
〈compons_decl〉 ::= Components 〈compons_list〉
〈reacts_decl〉 ::= Reactions 〈reacts_list〉
〈procs_decl〉 ::= Narrative 〈procs_list〉
〈constants_list〉 ::= 〈constant〉 | 〈constant〉〈constants_list〉
〈comparts_list〉 ::= 〈compartment〉 | 〈compartment〉〈comparts_list〉
〈compons_list〉 ::= 〈component〉 | 〈component〉〈compons_list〉
〈reacts_list〉 ::= 〈reaction〉 | 〈reaction〉〈reacts_list〉
〈procs_list〉 ::= 〈proc〉 | 〈proc〉〈procs_list〉
〈constant〉 ::= (〈const〉,〈quantity〉)
〈compartment〉 ::= (〈id〉,〈compart_name〉,〈opt_size〉,〈opt_unit〉,〈opt_dim〉)
〈component〉 ::= (〈name〉,〈opt_inform_descr〉,〈opt_sites_def〉,
    〈opt_states_def〉,〈opt_comparts_def〉,〈initial_quantity〉)
〈reaction〉 ::= (〈id〉,〈react_type〉,〈rate〉)
〈proc〉 ::= Process 〈opt_inform_descr〉〈events_list〉
〈events_list〉 ::= 〈event〉
    | 〈event〉〈events_list〉
〈event〉 ::= (〈id〉,〈form_descr〉,〈react_id〉,〈opt_altern_event〉)
〈opt_sites_def〉 ::=
    | 〈sites_def〉
〈sites_def〉 ::= 〈site_def〉
    | 〈site_def〉;〈sites_def〉
〈site_def〉 ::= 〈name〉 : 〈state_name〉 : 〈is_active〉
〈opt_states_def〉 ::=
    | 〈states_def〉
〈states_def〉 ::= 〈state_def〉
    | 〈state_def〉;〈states_def〉
〈state_def〉 ::= 〈state_name〉 : 〈is_active〉
〈opt_comparts_def〉 ::=
    | 〈comparts_def〉
〈comparts_def〉 ::= 〈compart_def〉
    | 〈compart_def〉;〈comparts_def〉
〈compart_def〉 ::= 〈id〉 : 〈is_active〉
〈initial_quantity〉 ::= (〈quantity〉,〈opt_reliability〉)
〈rate〉 ::= rate_const
    | rate_law
〈rate_const〉 ::= (〈rate_value〉,〈opt_unit〉,〈opt_reliability〉)
〈rate_law〉 ::= fMA(quantity)
    | fMM(quantity,quantity)
    | fH(quantity,quantity, Int)
〈form_descr〉 ::= 〈event_descr〉 | if 〈conds〉 then 〈event_descr〉
〈conds〉 ::= 〈cond〉 | 〈cond〉 and 〈conds〉
〈cond〉 ::= 〈names〉 is 〈state_name〉
    | 〈names〉 is not 〈state_name〉
    | 〈names〉 is in 〈id〉
    | 〈names〉 is not in 〈id〉
〈names〉 ::= 〈name〉 | 〈name〉.〈name〉 | 〈name〉;〈names〉 | 〈name〉.〈name〉;〈names〉
〈sites〉 ::= 〈name〉 | 〈name〉;〈sites〉
〈event_descr〉 ::= 〈complex_name〉〈bimol_react〉〈complex_name〉 on 〈sites〉
    | 〈complex_name〉〈bimol_react〉〈complex_name〉
    | 〈complex_name〉〈monomol_react〉 on 〈sites〉
    | 〈complex_name〉〈monomol_react〉
    | 〈complex_name〉 relocates to 〈id〉
    | 〈complex_name〉 degrades
    | 〈complex_name〉 degrades 〈complex_name〉
    | 〈complex_name〉 synthesises 〈complex_name〉
    | 〈complex_name〉 homodimerizes
    | 〈complex_name〉 dehomodimerizes
    | 〈complex_name〉 dimerizes with 〈complex_name〉
    | 〈complex_name〉 dedimerizes from 〈complex_name〉
〈complex_name〉 ::= 〈name〉 | 〈name〉 : 〈complex_name〉
〈id〉 ::= Int
〈opt_size〉 ::=
    | Int | const
〈opt_unit〉 ::=
    | Str
〈opt_dim〉 ::=
    | Int
〈name〉 ::= Ide
〈opt_inform_descr〉 ::=
    | Str
〈quantity〉 ::= value | const
〈value〉 ::= Int | Real
〈const〉 ::= Ide
〈opt_reliability〉 ::=
    | Int
〈rate_value〉 ::= quantity
〈react_id〉 ::= Int
〈opt_altern_event〉 ::=
    | alternative to 〈id〉
〈is_active〉 ::= Bool
〈compart_name〉 ::= nucleus | cytosol | exosol
    | cellMembrane | nucleusMembrane | Ide
〈react_type〉 ::= phosphorylation | dephosphorylation
    | binding | unbinding
    | homodimerization | dehomodimerization
    | dimerization | dedimerization
    | activation | deactivation
    | hydrolysis | dehydrolysis
    | degradation | synthesis | relocation
〈state_name〉 ::= phosphorylated | bound | active | hydrolysed | dimer
〈bimol_react〉 ::= phosphorylates | dephosphorylates | binds | unbinds
    | activates | deactivates | hydrolyses | dehydrolyses
〈monomol_react〉 ::= phosphorylates | dephosphorylates | hydrolyses | dehydrolyses
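To show how a fragment of this grammar reads in practice, the following sketch recognises the 〈conds〉, 〈cond〉 and 〈names〉 productions with a regular expression. It is not the Xtext grammar developed in the project. The grammar leaves the lexical form of Ide unspecified, so the identifier pattern below is an assumption, and the function names and the dictionary layout of the result are hypothetical.

```python
import re

STATE_NAMES = {"phosphorylated", "bound", "active", "hydrolysed", "dimer"}

IDE = r"[A-Za-z_][A-Za-z0-9_]*"   # assumed lexical form of Ide
NAME = rf"{IDE}(?:\.{IDE})?"      # <name> or <name>.<name>
COND = re.compile(
    rf"^\s*(?P<names>{NAME}(?:\s*;\s*{NAME})*)"   # <names>, separated by ';'
    r"\s+is\s+(?P<neg>not\s+)?"                   # 'is' or 'is not'
    r"(?:in\s+(?P<id>\d+)|(?P<state>\w+))\s*$"    # 'in <id>' or <state_name>
)

def parse_cond(text):
    """Parse one <cond>; raise ValueError if it does not fit the grammar."""
    m = COND.match(text)
    if m is None:
        raise ValueError(f"not a condition: {text!r}")
    result = {
        "names": [n.strip() for n in m.group("names").split(";")],
        "negated": m.group("neg") is not None,
    }
    if m.group("id") is not None:
        result["compartment"] = int(m.group("id"))
    else:
        state = m.group("state")
        if state not in STATE_NAMES:
            raise ValueError(f"unknown state name: {state!r}")
        result["state"] = state
    return result

def parse_conds(text):
    """Parse <conds>: one or more <cond> joined by 'and'."""
    return [parse_cond(part) for part in re.split(r"\s+and\s+", text.strip())]

for cond in parse_conds("A.s1; B is phosphorylated and C is not in 2"):
    print(cond)
```

A full parser would also need the 〈event_descr〉 alternatives and the tuple syntax of events, which a grammar-based tool handles more comfortably than regular expressions.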