From a Narrative Language for biology to

Bio-PEPA

Anastasios Andreas Georgoulas

Master of Science

School of Informatics

University of Edinburgh

Abstract

We present an algorithm for the translation of biological models from the Narrative

Language into the process algebra Bio-PEPA. The aim is to allow biologists to use

modelling methods and language familiar to them, while at the same time enjoying the

benefits of formal modelling languages, circumventing the obstacle that is their syntax.

We also describe our implementation of the algorithm and present some preliminary

results from successfully testing it on two examples. Finally, we suggest potential

improvements and extensions to this work, which may increase both the usefulness of

this project and the reach of process algebras in biology, beyond modelling experts.

Acknowledgements

I would first like to thank my supervisor, Maria Luisa Guerriero, for her guidance and

help during the entire time I was working on this project. I also wish to thank Stephen

Gilmore for suggesting the use of Xtext in the project, and Allan Clarke for giving me

directions on plugin development. Finally, I must extend my thanks to Jane Hillston

for her feedback and advice during the last year.

Declaration

I declare that this thesis was composed by myself, that the work contained herein is

my own except where explicitly stated otherwise in the text, and that this work has not

been submitted for any other degree or professional qualification except as specified.

(Anastasios Andreas Georgoulas)

Contents

1 Introduction
  1.1 Achievements
  1.2 Structure of the dissertation
2 Background
  2.1 Modelling in biology and formal modelling
  2.2 Process algebras and Bio-PEPA
  2.3 The Narrative Language
  2.4 Other modelling languages
3 Translation Algorithm
  3.1 Validation
  3.2 Processing
    3.2.1 Getting the species involved
    3.2.2 Applying the event
    3.2.3 Defining the reactions
  3.3 Output
  3.4 Analysis: combinatorial explosion
  3.5 Optimizations
4 Implementation
  4.1 Parsing
  4.2 Processing
  4.3 Integration with the plugin
  4.4 Design issues and decisions
    4.4.1 Treatment of bindings and complexes
5 Evaluation
  5.1 Test cases and procedure
  5.2 Results and comments
6 Conclusions
  6.1 Future work
Bibliography
A Narrative Language Syntax

List of Figures

3.1 State transition graph for the variant-pruning optimization example
4.1 Activity diagram of the system
4.2 (Simplified) class diagram of the system
4.3 The different event classes. Class names have been shortened from “ProcessedEvent” etc. for the sake of presentation.
5.1 Simulation results (1000 replications) on the original Bio-PEPA enzymatic model (axes are molecule count vs. time)
5.2 Simulation results (1000 replications) on the translated Bio-PEPA enzymatic model (axes as above)

List of Tables

3.1 Affected and involved components by event type
3.2 Substitutions for state-changing events
3.3 Output by event type

Chapter 1

Introduction

Recent years have seen biology requiring, and benefitting from, the use of computational

approaches, partly dictated by the need to deal with large volumes of data.

Not completely unrelated to this, computer scientists have been viewing biology as an

interesting field for researching new applications, inspired by the problems it offers.

The interface between the two sciences is a rich area for exploration, and indeed much

research has been taking place to investigate what collaborations it offers.

One particular area which has been extensively researched is that of modelling bio-

logical systems. The benefits of developing and using a good model are very attractive,

including saving time and funds. Moreover, a wide variety of modelling methods are

already known and well-established from previous work in other fields, and the task of

applying them to biology has proven very successful, further encouraging work in this field.

The interdisciplinary character of the field means that experts from very different

backgrounds often need to collaborate. While this certainly offers important advan-

tages, such as permitting a problem to be considered and addressed from different

perspectives, it also imposes some restrictions. Chief among them is that communi-

cation needs to take place using a vocabulary that is mutually understood. This can

make it difficult to collaborate efficiently when one side is unfamiliar with the way in

which ideas are put forward, or does not have the necessary skills to apply the solutions

suggested.

Unfortunately, modelling does not avoid this issue. Many formal computational

approaches have been proposed, applied and tested. In spite of encouraging results,

they suffer from one important drawback: they are, for the most part, heavily mathe-

matical, and their syntax is not accessible to someone coming from a purely biological

background. This puts them at a disadvantage and severely limits their usability. If

their use is to become more widespread, steps should be taken to ensure that biologists

can use them with greater ease.

1.1 Achievements

For the purposes of this project, we developed and implemented an algorithm to trans-

late a model written in the Narrative Language into Bio-PEPA. The latter is a stochastic

process algebra whose formal nature allows a model to be analysed in various ways.

The former is a language written with the express purpose of allowing biologists to

specify models easily. This effectively presents biologists with a familiar interface for

the description of systems, and at the same time makes available to them the analysis

capabilities and tools that a formal language offers, while keeping its syntax hidden.

In addition, we attempted to integrate the algorithm implementation in the existing

Bio-PEPA workbench, which is the main way of writing models in the language. This

would make the process of importing a model smoother and pave the way for possible

further integration of the Narrative Language in the plugin.

1.2 Structure of the dissertation

The rest of this document is structured as follows. Chapter 2 is a brief review of biolo-

gical modelling, with particular emphasis on Bio-PEPA and the Narrative Language. A

description of the translation algorithm that we developed is found in Chapter 3, inclu-

ding the main complexity issue which inevitably arises and possible solutions. Chapter

4 contains details of the implementation of the algorithm and other design issues, while

information on testing and evaluation is included in Chapter 5. Finally, Chapter 6 has

a summary of the work done, some closing remarks and ideas for possible future work.

Chapter 2

Background

2.1 Modelling in biology and formal modelling

Modelling the behaviour of biological systems has long been a topic of interest and is

the central issue in the field of systems biology. Traditionally, biologists use a com-

bination of natural language and graphical depictions to describe biological systems.

This approach, while established, suffers from the drawback of being ambiguous. In

the absence of a set of standards, there is no unique way to interpret such diagrams,

which could result in confusion. Additionally, this lack of formality makes it difficult,

or even impossible, to process the descriptions automatically.

For these reasons, biological systems are an attractive field for the application of

formal computational modelling approaches and, indeed, many such methods have

been proposed and used. These methods can be thought of as considering the sys-

tem in question as a state machine, and the various interactions as transitions between

states. This has led to them being termed “executable models”, since the system can be

simulated or otherwise analysed by performing a “run” of the state machine [1]. This

computational view of biological processes allows the use of tools and ideas originally

developed for other areas of computer science. These can either be applied directly,

or serve as a base for the development of methods specifically suited to biological

systems.

The various approaches differ in their philosophy and, consequently, in their strengths

and weaknesses. There is no consensus on which approach is best, in part because the

suitability of each depends on the details of the specific modelling task. However, there

are some desirable properties that the ideal method would possess. First, it should be

unambiguous: a given model should allow a single interpretation. This can be achieved,

for instance, by the definition of a formal (mathematical) semantics. Secondly, it

should be executable, in the sense described earlier. Thirdly, it should allow various

types of analysis to be performed. Finally, it should offer a convenient interface for the

specification of models. This last point is particularly important when one considers

that the description language is intended to be used not only by computer scientists or

modelling experts, but also by experimental biologists.

The two methods with which this dissertation is concerned are presented in more

detail in the following sections.

2.2 Process algebras and Bio-PEPA

Process algebras (or process calculi) are formalisms originally introduced for modelling

and studying systems of concurrent computation. The family of process algebras

comprises various diverse languages, all of which share the underlying concept

of processes which can communicate with each other. A subset of those, stochastic

process algebras (SPA), additionally incorporate the element of quantified randomness

and lend themselves to simulation using well-known algorithms (e.g. [2]).

It has been suggested that the events taking place in biological systems are in some

ways similar to concurrent computation [3], and therefore process algebras (especially

stochastic ones) are particularly suited to biological modelling [4]. Much research has

taken place in applying them to this field, with promising results. Examples include

case studies using the stochastic π-calculus [5], CCS-R [6] and PEPA [7]. In these stu-

dies, a process usually serves as an abstraction for either a molecule or a biochemical

species. Similarly, communication events are used to represent biochemical reactions.

In general, process algebras offer a number of attractive features for describing

biochemical interactions. Firstly, they allow one to work at the desired level of abs-

traction. For instance, it is possible to generate a purely mathematical specification of

the system, e.g. in the form of ordinary differential equations (ODEs), that describes

the laws governing the evolution of the quantities involved, while at the same time one

can reason at a higher level about the behaviour of the system as a whole, e.g. via

model-checking [8]. The latter can also be used to verify whether the model matches

the expected behaviour (that of the real system), which can help in identifying mis-

takes. Furthermore, since process algebras are established modelling methods, these

analyses are already supported both by a theoretical background and by a number of

available tools. An additional advantage is that of compositionality, meaning that des-

criptions specified this way are modular, making them easier to combine together or

alter. In general terms, therefore, apart from offering an array of analysis methods, the

use of process algebras makes biological models more structured, less error-prone and

easier to manipulate.

Despite these advantages and encouraging results, it became apparent that some

features of biological systems were difficult to capture using existing languages. This

led to the development of new process algebras, such as Beta-binders [9], whose ex-

pressivity was designed with biological systems in mind. Development of these lan-

guages was often realised by extending existing process algebras, adapting them to the

needs of biological modelling. Another example of this is Bio-PEPA [10], adapted

from the Performance Evaluation Process Algebra (PEPA) [11].

A Bio-PEPA model consists of a number of species, which are defined in terms of

the reactions they can take part in. The syntax allows the modeller to specify both the

role of the species in the reaction (e.g. reactant, product, activator) and its stoichio-

metric coefficient. Additionally, each reaction is accompanied by a kinetic rate law,

which represents the kinetics of the corresponding biochemical reaction. Of particular

importance is the ability to specify arbitrary kinetic laws, although the syntax makes

special provision for three commonly used laws (mass action, Michaelis-Menten

and Hill kinetics). Support for general kinetic laws and support for specifying

stoichiometry are two of the features that make Bio-PEPA particularly suitable for

this kind of modelling, and were introduced to deal with the limitations of traditional

process algebras, as mentioned above.

In addition to the definition of species and kinetic laws, a Bio-PEPA model consists

of two more parts. The first is a list of compartment definitions. The second is the

model component, which specifies the initial state of the system (which species are

present and in what quantities), as well as the interactions between species.

As an example of the syntax, the following snippet contains the species definitions

for a system with three proteins, A, B and C, in which B can activate A and C can

deactivate it.

A_active = activation↑ + deactivation↓ ;

A_inactive = activation↓ + deactivation↑ ;

B = activation⊕ ;

C = deactivation⊕ ;

There are two reactions, named activation and deactivation, and four species. The

model component for this system, assuming that initially all molecules of A are active,

would have the following form, in which the numbers in brackets are molecular counts:

A_active[10] <*> A_inactive[0] <*> B[1] <*> C[1]
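
To make the species-centric reading of these definitions concrete, the following Python sketch regroups the four species by reaction, recovering the reactants, products and activators of each. It is not part of the Bio-PEPA tooling: the dictionary layout and the names SPECIES and reactions_from_species are assumptions made purely for illustration.

```python
from collections import defaultdict

# Species definitions from the snippet above, written as (reaction, role) pairs.
# Roles follow the symbols used there: down arrow = reactant,
# up arrow = product, circled plus = activator.
SPECIES = {
    "A_active": [("activation", "product"), ("deactivation", "reactant")],
    "A_inactive": [("activation", "reactant"), ("deactivation", "product")],
    "B": [("activation", "activator")],
    "C": [("deactivation", "activator")],
}


def reactions_from_species(species):
    """Group the per-species roles by reaction name (hypothetical helper)."""
    reactions = defaultdict(
        lambda: {"reactant": [], "product": [], "activator": []}
    )
    for name, roles in species.items():
        for reaction, role in roles:
            reactions[reaction][role].append(name)
    return dict(reactions)


if __name__ == "__main__":
    for reaction, parts in reactions_from_species(SPECIES).items():
        print(reaction, parts)
```

Read this way, the activation reaction consumes A_inactive and produces A_active with B acting as activator, and deactivation is its mirror image with C.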

Development of Bio-PEPA models is typically done using the Bio-PEPA plugin

[12] for the Eclipse IDE, which provides a graphical interface and connections to other

tools, such as PRISM [13] for model checking or numerical ODE solvers. It also

allows exporting the model to SBML format.

2.3 The Narrative Language

Despite the effectiveness and suitability of process algebras in modelling biological

systems, their syntax effectively restricts their use to computer scientists and makes

them difficult to use by experts in the field, i.e. biologists. The Narrative Language

(NL) [14] was proposed as a formalism that would be approachable by biologists,

allowing them to use a syntax closer to natural language and a vocabulary familiar to them.

Instead of a formal definition, the semantics of the NL was initially defined indi-

rectly, by a translation to Beta-binders [15]. This was an additional influence on the

syntax of the language, leading to the inclusion or exclusion of certain features.

An NL model consists of four types of entities. Compartments describe the different

locations of the system. Components are the biochemical species, which can hold in-

ternal states. These states are binary and can either be defined for a specific site of the

component, or be component-wide. Reactions represent the biochemical reactions, in-

cluding a kinetic rate law. Finally, events describe the sequence of interactions which

can take place in the system as well as the conditions under which they occur, thus

specifying a narrative. Every element of the model is specified as tuples of values, like

records in a database. These generally include an integer ID and/or a name, as well

as other information related to the entity’s type; for instance, compartment definitions

include a size attribute, while components specify an initial quantity and state. Defini-

tions also refer to other parts of the model: an event definition, for example, includes

a reference to a reaction, which is taken to be the underlying reaction “implementing”

the event.

Some more details should be given about events, as they define the rules of the

dynamic behaviour of the system and will be referred to very often in the next chapters.

An event’s formal description is composed of two parts. The first, which is optional,

is a list of conditions. There are two types of these: state conditions, which impose

constraints on the values of a component’s states and sites, and location conditions,

which specify the compartment in which the component is found. Both of these types

can exist in positive or negative form. An example of a positive state condition is “if A

is active”, whereas one of a negative location condition is “if A is not in 1”.

The second part is the event description, which describes a change to one or more

components. There are various types of descriptions, each with its own syntax: some

events describe a change in the state of a component, such as a phosphorylation.

Binding and unbinding events are a special case of these and are treated differently here.

Other events describe the synthesis or degradation of a component, and others the

relocation of a component from one compartment to another.

For example, if modelled in the NL, the example system from the previous section

would contain the event descriptions:

if A is not active then B activates A

if A is active then C deactivates A

which would follow definitions for the components A, B and C. The definition for A,

for instance, would specify that it has a state of type “active”.

The development of the NL is still ongoing, which allowed certain flexibility in

the context of the project. Specifically, we investigated whether its expressive power

could be altered, keeping in mind what features can be expressed in Bio-PEPA. We

found that, with some rare exceptions which will be described in the following chap-

ter, all features of the original syntax can be supported when translating to Bio-PEPA,

and at the same time we took this opportunity to extend the language by adding two

features. The first is the definition of constants, which can have either integer or real

values, and can be used everywhere a numerical value can occur. The second is sup-

port for Michaelis-Menten and Hill kinetic laws, in accordance with the Bio-PEPA

predefined functions, while the original version of the NL only supported mass-action

kinetics. The full grammar of the language supported for this project can be found in

Appendix A, adapted from [15].

2.4 Other modelling languages

Petri Nets [16] are another example of a modelling formalism that had been mostly

used for computational systems before being applied to systems biology (although

they were initially conceived for concurrent systems of any kind, including biochemi-

cal). They are networks composed of two types of elements, places and transitions,

and are fully specified using a set of functions defined on these. They were introduced

for the analysis of concurrent processing systems and, as with (Bio-)PEPA, the cor-

respondence between computational and biological concepts is straightforward. Their

semantics offers a method for the stochastic simulation of the described system, similar

to that in Bio-PEPA.

Rewriting systems, such as P systems [17] and Kappa [18], model a system using

a set of rules, each of which describes a change that one or more components of the

system can undergo. In this sense, the Narrative Language is closer to these languages

than to process algebras, which are component-based rather than rule-based.

Lastly, for the sake of completeness, mention should be made of the efforts to-

wards the standardisation of graphical depictions of systems. The Systems Biology

Graphical Notation (SBGN) [19] has been introduced for that purpose, and offers a

library of symbols and a standard set of rules for their use. While neither executable

nor fully formal or unambiguous, it is an important step towards formalising graphical

descriptions.

Chapter 3

Translation Algorithm

This chapter describes the algorithm that was developed for translating a model des-

cription from the NL into Bio-PEPA. The algorithm takes as input a model written

in the NL using the syntax presented in Appendix A and outputs a Bio-PEPA model.

For the translation to be accurate, the model thus produced should be equivalent to the

original, meaning that they should exhibit the same dynamic behaviour. Additionally,

it is desirable that the resulting model be as easily readable as possible, and as similar

as possible to what someone would write if they were modelling the same system in

Bio-PEPA directly.

The algorithm described here assumes that an NL model has been extracted from

the input text file, i.e. we have access to all the parts defined therein. This obviously

requires that the input file be parsed, but, since this is more of a technical issue

than a part of the algorithm itself, details about the parsing stage and its implementation

are presented in the next chapter.

3.1 Validation

Before the translation itself, some preprocessing needs to take place, in order to ensure

that the translation can be completed without problems. This stage entails checking

the definitions in the model for internal inconsistencies and reporting them. A com-

ponent is deemed to be valid if its definition satisfies a set of conditions, detailed in the

following. Similarly, conditions are defined for compartments and events. In order for

a model to be valid, all its parts must be valid. All the parts are checked one by one; if

any inconsistencies are found at any time, the validation of the model fails and we do

not proceed with the translation.

For a component, the conditions concern the compartments in which it can be

found. A component’s definition lists a number of compartments, each accompanied

by a number, which indicates the percentage of the component that is found

in that compartment in the initial state. For a component to be valid, all of the following

must be true:

• All the compartments referred to in the list are defined in the model.

• The percentages of the initial concentration across all listed compartments sum

to exactly 100.
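
The two component checks above can be sketched in a few lines of Python. This is only an illustration under assumed data structures (a dictionary from compartment name to initial percentage, and a set of defined compartment names); the function name validate_component is hypothetical and is not taken from the implementation described later. Integer percentages are assumed, so the sum can be compared to 100 exactly.

```python
def validate_component(locations, defined_compartments):
    """Return a list of problems; an empty list means the component is valid.

    locations: dict mapping compartment name -> initial percentage (int).
    defined_compartments: set of compartment names defined in the model.
    """
    problems = []
    # Every compartment listed for the component must exist in the model.
    for compartment in locations:
        if compartment not in defined_compartments:
            problems.append(f"undefined compartment: {compartment}")
    # The initial percentages must add up to exactly 100.
    total = sum(locations.values())
    if total != 100:
        problems.append(f"percentages sum to {total}, expected 100")
    return problems
```

Returning a list of problems rather than a single boolean is a design choice of this sketch; it makes it easier to report every inconsistency, although the procedure described in the text stops at the first failure.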

A compartment is valid if the following conditions hold:

• If an external compartment is specified, it corresponds to a compartment defini-

tion in the model.

• If an external compartment is specified, it is a different compartment, that is,

circular definitions are not allowed.

• The dimension of the compartment is either two (indicating a membrane) or

three.
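
A corresponding sketch for compartments is given below, again in Python and again with hypothetical names (validate_compartment and its parameters are assumptions for illustration). As in the conditions above, the circularity check only rejects a compartment that names itself as its external compartment.

```python
def validate_compartment(name, dimension, external, defined_compartments):
    """Return a list of problems; an empty list means the compartment is valid.

    external is the name of the enclosing compartment, or None if unspecified.
    """
    problems = []
    if external is not None:
        # The external compartment must itself be defined in the model.
        if external not in defined_compartments:
            problems.append(f"undefined external compartment: {external}")
        # A compartment may not be its own external compartment.
        if external == name:
            problems.append("compartment is its own external compartment")
    # Dimension 2 denotes a membrane; 3 denotes an ordinary volume.
    if dimension not in (2, 3):
        problems.append(f"invalid dimension: {dimension}")
    return problems
```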

An event is valid if the following hold:

• The reaction specified for the event is defined in the model.

• The type of event (e.g. phosphorylation) matches the type of reaction, as speci-

fied in the reaction definition.

• If a reverse or alternative event is specified, that event is defined in the model.

• The event description is valid.

• All of its conditions are valid.

• There is at most one positive location condition (of the form “A is in 1”) for

every component.

The reasoning behind the last requirement is that a conjunction of positive location

conditions for the same component (e.g. “A is in 1 and A is in 2”) would make

an event impossible, and we would like to have excluded that possibility when proces-

sing the model.
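
The last requirement lends itself to a very small check. In the Python sketch below, each location condition is assumed to be a (component, compartment, positive) tuple; this representation and the function name are illustrative assumptions, not the data model of the implementation.

```python
from collections import Counter


def components_with_conflicting_locations(conditions):
    """Return the components that have more than one positive location condition.

    conditions: iterable of (component, compartment, positive) tuples,
    where positive is False for conditions of the form "A is not in 1".
    """
    counts = Counter(
        component for component, _compartment, positive in conditions if positive
    )
    return sorted(component for component, n in counts.items() if n > 1)
```

An event passes this part of the validation when the returned list is empty. Negative location conditions are deliberately ignored, since any number of them can hold at once.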

An event description is valid if the following hold:

• All components referred to are defined in the model.

• If a site is referred to, the corresponding component contains the site in its defi-

nition.

• If the event is a relocation event, the new compartment is in the list of compart-

ments defined for the component.

• If the event describes a component-wide state change, the component’s definition

contains the corresponding state.

• If the event describes a site-specific state change, the site has the corresponding

state in its definition.

An event condition is valid if the following hold:

• The component mentioned is defined in the model.

• If it is a location condition, the compartment specified is in the list of compart-

ments defined for the component.

• If it is a state condition referring to a specific site, the component has the speci-

fied site and the state mentioned (e.g. active) corresponds to the one in the site

definition.

• If it is a state condition referring to a component-wide state, that state is defined

in the component definition.

It is easy to see that these conditions depend directly and in a simple way on the

definitions contained in the model, and therefore the validation can take place without

any need for analysing the dynamic behaviour of the system described.

Example 1. Consider two events with the following formal descriptions:

if A is in 2 and B is active then B phosphorylates A

if A.Y23 is active then A deactivates C

In order for the first definition to be valid, the definition of component A must

include compartment 2 in the list of compartments. Additionally, B must have a

component-wide state named (of type) “active”, and A must have a component-wide

state of type “phosphorylated”. There is a positive location condition referring to A

but it is the only one, so that poses no problem.

For the second definition, A must have a site named “Y23” of type “active”. Com-

ponent C must have a component-wide state of type “active”.

In both cases, the full event definition will include a reference to a reaction. This

reaction must be defined in the model and its type must correspond to the event des-

cription, i.e. the reaction mentioned by the first event must be declared as a phospho-

rylation, and the one by the second as a deactivation.

These conditions are checked one by one and, if any of them are found to be false,

the validation process stops with a negative result. □

3.2 Processing

Before the algorithm is presented, an issue must be pointed out that is central to the

translation process. In the NL, a component can record changes by changing the status

of its sites and component-wide states. On the other hand, Bio-PEPA species have

no way of holding internal state. This means that there is not a one-to-one mapping

from NL components to Bio-PEPA species, but rather a one-to-many correspondence.

Each of the species to which a component corresponds represents a variant of that

component, i.e. a different combination of values in its sites and states.

In order to enumerate the variants of a component, in the following we use the

naming convention explained here, which is the same one used in the implementation

of the algorithm. If a component is named A, then the names of its variants begin

with A. Then, for every component-wide state, the value of that state (e.g. “active”,

“unphosphorylated”) is appended. After that, for every site, a string of the form x:y is

appended, where x is the name of the site and y is its state (as before). All the appended

strings are separated by underscore characters (“_”). The component-wide

states are appended in alphabetical order, according to the positive value of the state

(i.e. “active” would come before “phosphorylated”, therefore “inactive” would come

before “unphosphorylated”). The strings corresponding to the sites are added in alpha-

betical order, according to the site name (i.e. “active-site:inactive” would come before

“hydrolysis-site:unhydrolysed”).
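
The naming convention can be summarised by the following Python sketch. The function name variant_name and the two dictionaries are assumptions made for illustration: component-wide states are keyed by their positive value (which determines the ordering), and sites are keyed by site name.

```python
def variant_name(component, states, sites):
    """Build a variant name following the convention described above.

    states: dict mapping the positive value of each component-wide state
            (e.g. "active") to its current value (e.g. "inactive").
    sites:  dict mapping each site name to the current value of that site.
    """
    parts = [component]
    # Component-wide states, ordered alphabetically by their positive value.
    parts += [states[positive] for positive in sorted(states)]
    # Sites as "name:value", ordered alphabetically by site name.
    parts += [f"{site}:{sites[site]}" for site in sorted(sites)]
    return "_".join(parts)


if __name__ == "__main__":
    print(variant_name(
        "A",
        {"phosphorylated": "unphosphorylated", "active": "inactive"},
        {"hydrolysis-site": "unhydrolysed", "active-site": "inactive"},
    ))
    # A_inactive_unphosphorylated_active-site:inactive_hydrolysis-site:unhydrolysed
```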

Additionally, if we need to refer to a variant which exists in a specific location, we

adopt the notation used in the Bio-PEPA plugin. This uses the form x@y, where x is

the variant’s name and y is the name of the location.

Type           Description     Affected  Involved  Keywords
Monomolecular  A keyword       A         -         (de)phosphorylates, (de)hydrolyses, synthesises, degrades
Bimolecular    A keyword B     B         A         as above, plus (de)activates
(Un)binding    A (un)binds B   A, B      -

Table 3.1: Affected and involved components by event type

3.2.1 Getting the species involved

This is done in two steps. The first is to obtain the components that are involved in the

event, which is easily achieved by examining the event’s formal description. We dis-

tinguish between two ways in which a component can take part in an event: an affected

component is one that undergoes some change during the event, such as a state change

or binding, while an involved component is one that acts without being affected. In

biochemical terms, the affected components correspond to the reactants and products

of a reaction, and the involved ones to enzymes or inhibitors. This distinction is useful

for determining the Bio-PEPA species that participate in a reaction and their roles in

the resulting model, as will be explained. In most cases, an event will have one affec-

ted and at most one involved component, but this depends exclusively on the type of

event, as seen from its description.

For monomolecular events of the form “A keyword”, where keyword is one of

“phosphorylates”, “dephosphorylates”, “hydrolyses” or “dehydrolyses”, the only com-

ponent that takes part is A, and it undergoes the change indicated by the keyword. The-

refore, the only affected component is A, with no involved components. Bimolecular

events have the form “A keyword B”, where keyword can also be “activates” or “deac-

tivates”, in addition to the previous options. In this case, A causes a change in B but is

itself unaffected. Therefore, B is an affected and A is an involved component.

There are four types of events that are treated differently: synthesis, degradation,

binding and unbinding. The first two can, at this stage, be treated exactly as described

in the previous paragraph, depending on whether they are monomolecular (“A synthe-

sises/degrades”) or bimolecular (“A synthesises/degrades B”). Finally, for a binding

event of the form “A binds B”, both A and B are considered affected and there are no

involved components, and the same is true for unbinding events.

The above descriptions are summarised in Table 3.1.
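
The rules summarised in Table 3.1 can be expressed as a small classification function. The Python sketch below assumes that an event description is a plain whitespace-separated string of the form "A keyword" or "A keyword B"; site references and other details of the NL syntax are ignored, and the function name is hypothetical.

```python
MONO_KEYWORDS = {
    "phosphorylates", "dephosphorylates",
    "hydrolyses", "dehydrolyses",
    "synthesises", "degrades",
}
BI_KEYWORDS = MONO_KEYWORDS | {"activates", "deactivates"}
BINDING_KEYWORDS = {"binds", "unbinds"}


def affected_and_involved(description):
    """Return (affected, involved) component lists for an event description."""
    words = description.split()
    if len(words) == 2 and words[1] in MONO_KEYWORDS:
        # "A keyword": A undergoes the change, nothing else takes part.
        return [words[0]], []
    if len(words) == 3 and words[1] in BINDING_KEYWORDS:
        # "A (un)binds B": both partners are affected.
        return [words[0], words[2]], []
    if len(words) == 3 and words[1] in BI_KEYWORDS:
        # "A keyword B": A acts on B but is itself unchanged.
        return [words[2]], [words[0]]
    raise ValueError(f"unrecognised event description: {description!r}")
```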

Once the affected and involved components are identified, we need to obtain the

corresponding Bio-PEPA species. As mentioned above, one NL component can, in

general, correspond to more than one species in the Bio-PEPA model. However, if a

component is involved in or affected by an event, it is not necessarily the case that all

of its variants can take part in it. This is because an event description can include the

conditions under which it can occur, imposing restrictions on when the component can

take part. It is therefore these conditions that determine which variants of a component

are associated with an event.

Let getVariants(component, conditions) denote a function which takes a component

and a list of conditions and returns a list of the variants of that component which

satisfy all the conditions in the list. Since conditions can refer either to the state or to

the location of the component, the task of identifying the variants can correspondingly

be divided into two sub-tasks.

The first is that of applying the restrictions indicated by the state conditions. This

process is denoted getStateVariants and can be achieved by iterating through all

of the component’s sites and component-wide states and checking whether a condition

exists for that site or state. If so, the corresponding value is appended to the variant

name. If no condition is present, then two new variants are created, one for each value

of the site or state. This is described in Algorithm 3.1, where “positiveValue” and

“negativeValue” denote the two possible values of a site or state; for example, for a

site of type “active”, the values are “active” and “inactive”, respectively.

Example 2. Consider a simple model with two components, A and B. A has a site x

which can be active or inactive, and the entire component can be hydrolysed or unhydrolysed.

B can be phosphorylated or unphosphorylated. There are two compartments

in the system, cytosol (with ID 1) and exterior (with ID 2). B can be in either compartment,

while A can only exist in the cytosol. There is only one event, with the formal

description:

if A.x is active and B is in 1 and B is not phosphorylated then A phosphorylates B

This is a bimolecular state-changing event, so according to Table 3.1, B is the affected

and A is the involved component. We first obtain the variants of each component

specified by the state conditions, using getStateVariants. The only condition on

A is that the site x be active, but there are no constraints on the component-wide hydrolysis

state. Therefore, there are two variants of A that can take part in the event,

and getStateVariants(A) = {A_hydrolysed_x:active, A_unhydrolysed_x:active}.

B only has one state and is constrained by its state condition, so getStateVariants(B)

= {B_unphosphorylated}. □

Algorithm 3.1 getStateVariants: Creating the variants from the state conditions

variants := {component}
for each component-wide state do
    newVariants := ∅
    if list contains a condition for state then
        for each variant ∈ variants do
            add variant_condition to newVariants
        end for
    else
        for each variant ∈ variants do
            add variant_positiveValue to newVariants
            add variant_negativeValue to newVariants
        end for
    end if
    variants := newVariants
end for
for each site do
    newVariants := ∅
    if list contains a condition for site then
        for each variant ∈ variants do
            add variant_site:condition to newVariants
        end for
    else
        for each variant ∈ variants do
            add variant_site:positiveValue to newVariants
            add variant_site:negativeValue to newVariants
        end for
    end if
    variants := newVariants
end for
return variants

Algorithm 3.2 getCompartments: Processing the location conditions

compartments := component.compartments
for each condition ∈ conditions do
    if condition is positive then
        return {condition.compartment}
    else
        compartments := compartments \ {condition.compartment}
    end if
end for
return compartments
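Algorithm 3.1 can be sketched in Python as follows. This is a minimal illustration rather than our Java implementation; the argument layout is an assumption: component-wide states are given as (positiveValue, negativeValue) pairs, sites as (site, positiveValue, negativeValue) triples, and conditions as a dictionary keyed by the site name or, for component-wide states, by the positive value.

```python
def get_state_variants(name, states, sites, conditions):
    """Sketch of Algorithm 3.1 (data layout is hypothetical).

    states:     list of (positive, negative) component-wide state values
    sites:      list of (site, positive, negative)
    conditions: dict mapping a site name (or the positive value of a
                component-wide state) to the value required by the event
    """
    variants = [name]
    for positive, negative in states:
        # A condition fixes the value; otherwise both values are possible.
        values = ([conditions[positive]] if positive in conditions
                  else [positive, negative])
        variants = [f"{v}_{value}" for v in variants for value in values]
    for site, positive, negative in sites:
        values = ([conditions[site]] if site in conditions
                  else [positive, negative])
        variants = [f"{v}_{site}:{value}" for v in variants for value in values]
    return variants
```

Applied to Example 2, get_state_variants("A", [("hydrolysed", "unhydrolysed")], [("x", "active", "inactive")], {"x": "active"}) returns the two variants of A listed there.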

The second sub-task concerns applying the location conditions. This can be done

independently of the results of the previous procedure, as described here and also shown

in Algorithm 3.2. If there is a positive location condition specifying a compartment

(if one exists, then the validation process ensures that it will be unique), we return

that compartment. Otherwise, we start with a list of possible compartments from the

component’s definition, and exclude those for which a negative condition exists. In

the general case, this yields a list of compartments (containing a single element in the

case where a positive condition is present). We call the function performing this task

getCompartments.

Example 3. Continuing the previous example, we now use getCompartments for the

two components. There are no location conditions for A, so the function will return

all the compartments listed in its definition: getCompartments(A) = {cytosol}. The

condition on B is positive, so Algorithm 3.2 terminates immediately and returns the

corresponding compartment: getCompartments(B) = {cytosol}. □
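A Python sketch of Algorithm 3.2 is given below. It assumes, for illustration only, that location conditions are passed as (compartment, is_positive) pairs.

```python
def get_compartments(component_compartments, location_conditions):
    """Sketch of Algorithm 3.2.

    location_conditions: list of (compartment, is_positive) pairs
    (a hypothetical representation chosen for this example).
    """
    compartments = list(component_compartments)
    for compartment, is_positive in location_conditions:
        if is_positive:
            # Validation guarantees at most one positive condition.
            return [compartment]
        # A negative condition excludes the compartment.
        compartments = [c for c in compartments if c != compartment]
    return compartments
```

With the data of Example 3, get_compartments(["cytosol"], []) and get_compartments(["cytosol", "exterior"], [("cytosol", True)]) both return ["cytosol"].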

We then combine the results of the two algorithms as follows. For every variant

returned by getStateVariants, we create a set of new variants, obtained by appending

to it every compartment returned by getCompartments. We then return the union

of those sets (Algorithm 3.3). The Bio-PEPA syntax allows one simplification of this

procedure: if no compartment is specified after a species name, then that refers to the

species in all possible locations. Therefore, if no compartments are excluded during

Algorithm 3.2, i.e. if there are no location conditions, we simply return the variants as

obtained from Algorithm 3.1, without appending anything.

Algorithm 3.3 getVariants: Getting the final list of species

stateVariants := getStateVariants(component, stateConditions)
compartments := getCompartments(component, locConditions)
if compartments = component.compartments then
    return stateVariants
else
    variants := ∅
    for each variant ∈ stateVariants do
        for each compartment ∈ compartments do
            variants := variants ∪ {variant@compartment}
        end for
    end for
    return variants
end if

Example 4. Using the results of the two previous examples, we can use Algorithm 3.3

to obtain the final list of variants. For A, we note that getCompartments returned all

possible compartments, and so we do not need to append anything to the state variants.

Thus, getVariants(A) = {A_hydrolysed_x:active, A_unhydrolysed_x:active}. For

B, getCompartments returned only the cytosol compartment. Since this is not the

full list mentioned in the definition, we will append the compartment name to the

variants returned by getStateVariants. In this case, there was only one, and so

getVariants(B) = {B_unphosphorylated@cytosol}. □
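The combination step of Algorithm 3.3 can be sketched as follows. To keep the example self-contained, the function takes the results of the two previous procedures as arguments instead of calling them; this signature is an assumption made for illustration.

```python
def get_variants(state_variants, compartments, all_compartments):
    """Sketch of the combination step of Algorithm 3.3.

    state_variants:   result of getStateVariants
    compartments:     result of getCompartments
    all_compartments: compartments listed in the component definition
    """
    if set(compartments) == set(all_compartments):
        # No compartment was excluded: the bare name covers all locations.
        return list(state_variants)
    return [f"{variant}@{compartment}"
            for variant in state_variants
            for compartment in compartments]
```

For B in Example 4, get_variants(["B_unphosphorylated"], ["cytosol"], ["cytosol", "exterior"]) returns ["B_unphosphorylated@cytosol"].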

3.2.2 Applying the event

The steps so far have given us the species that act as enzymes (from the involved

components) and as reactants (from the affected components). However, we still need

to specify the products of the reaction. From the definitions in the previous sections,

we know that the products will be variants of the affected components, differing from

the variants that act as reactants to reflect the change described by the event. For an

activation event, for instance, the input (reactant) variant may be A_inactive, in which

case the output (product) would be A_active.

In general, the product species depends on the reactant. This can be seen from the

fact that an event changes a site or state of a component, but, since input variants can

differ on other sites or states, the corresponding outputs should also differ. However,

for a given reactant species and event, the product species is fully determined. The

exact method for determining the output depends on the type of the event.

Event type        Old state          New state
Activation        inactive           active
Phosphorylation   unphosphorylated   phosphorylated
Hydrolysis        unhydrolysed       hydrolysed

Table 3.2: Substitutions for state-changing events

For events that change the state of a component or a site (activation, phosphorylation,

hydrolysis and their inverses), the mapping from input to output is straightforward.

Using the naming convention described earlier, applying the event consists of a simple

substitution in the name of the species, changing the old state to the new one, as in the

example above. The possible changes are shown in Table 3.2. For the inverse events

(e.g. deactivation), the old and new states are simply switched. In the case where an

event affects a site rather than a component-wide state, the pattern to be substituted is

site:old-state, changing to site:new-state.

Relocation events can similarly be represented as name changes. In this case, the

part of the species name that changes is the @location suffix, and it is updated to reflect

the destination compartment.

The above processes can be encapsulated in a function applyChange, which takes

a species name and an event as arguments, and returns the species name obtained after

applying the change described by the event.
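A possible shape for applyChange is sketched below in Python. The signature and the event-type labels are assumptions made for this example. The sketch compares whole name tokens rather than substrings, because "active" also occurs inside "inactive"; it assumes component names contain no underscore and does not handle complexes.

```python
# Old/new state per event type, following Table 3.2.
SUBSTITUTIONS = {
    "activation": ("inactive", "active"),
    "phosphorylation": ("unphosphorylated", "phosphorylated"),
    "hydrolysis": ("unhydrolysed", "hydrolysed"),
}


def apply_change(species, event_type, site=None, inverse=False,
                 destination=None):
    """Sketch of applyChange (hypothetical signature)."""
    name, _, location = species.partition("@")
    if event_type == "relocation":
        # Only the @location suffix changes.
        return f"{name}@{destination}"
    old, new = SUBSTITUTIONS[event_type]
    if inverse:                      # e.g. deactivation
        old, new = new, old
    prefix = f"{site}:" if site else ""
    tokens = name.split("_")
    if prefix + old not in tokens:
        raise ValueError(f"{species} is not a valid input for {event_type}")
    tokens = [prefix + new if t == prefix + old else t for t in tokens]
    changed = "_".join(tokens)
    return f"{changed}@{location}" if location else changed
```

For example, apply_change("B_unphosphorylated@cytosol", "phosphorylation") gives "B_phosphorylated@cytosol".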

The cases of degradation and synthesis events are treated differently. For the

former, there is no output. For the latter, the output should be a variant of the affected

component. However, the event description does not provide any conditions

for the product, and so we cannot use the getStateVariants function described

previously to get a concrete species. In light of this, we have decided to use the

component definition: specifically, we use the initial state described in it. We define a

function initial which acts on a component, treating its initial state specification

as a list of conditions and returning the corresponding variant. Note that, unlike

getStateVariants, this will always return a single variant and not a list, since the

initial state specifies the value of all sites and component-wide states.
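A minimal Python sketch of initial follows. It assumes that the initial state is available as a list of component-wide state values plus a list of (site, value) pairs, and that names are built in the same order as in getStateVariants; both are assumptions made for illustration.

```python
def initial(name, initial_states, initial_sites):
    """Sketch of the function initial (hypothetical data layout).

    initial_states: values of the component-wide states, in definition order
    initial_sites:  list of (site, value) pairs, in definition order
    """
    parts = [name]
    parts += list(initial_states)
    parts += [f"{site}:{value}" for site, value in initial_sites]
    # Every state and site has a value, so exactly one variant results.
    return "_".join(parts)
```

For a component A that starts unhydrolysed with site x inactive, initial("A", ["unhydrolysed"], [("x", "inactive")]) returns "A_unhydrolysed_x:inactive".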

For bindings, the output species will be a complex and its name will have the form

x::y, where x and y are the input variants. Similarly, the output of an unbinding event

can be found by splitting the name of the input variant. A more detailed analysis of

how complexes and (un)bindings are treated is contained in the following chapter.

Table 3.3 summarises the different ways of obtaining the output for the various

reaction types.

Event type                   Input   Output
State-changing, relocation   A       applyChange(A)
Synthesis                    ∅       initial(A)
Degradation                  A       ∅
Binding                      A, B    A::B

Table 3.3: Output by event type

3.2.3 Defining the reactions

Once all the participating species have been determined through the above process,

the final part of the algorithm is trivial and consists mainly of combining the species

obtained in the previous steps. First, we note that if the event conditions give rise to

more than one variant for both the affected and involved components, there is more

than one way to combine them and no reason why a combination should be excluded.

For every input variant (or combination of them from different affected components),

we obtain the corresponding output variant(s). We then take every combination

of these with the variants of the involved components and, for every such combination,

we define a new Bio-PEPA reaction. That reaction will have the same kinetic law as

the reaction corresponding to the original event. We also record each species’ role within

the reaction: input variants as reactants, output ones as products and variants of

the involved components as enzymes. This can be thought of as creating a variant of

the NL reaction specified by the event.

Example 5. In the running example, we have one affected component (with one variant)

and one involved (with two variants). We start with the affected (input) variant,

B_unphosphorylated@cytosol. Since the event is a phosphorylation, in order to obtain

the product of the reaction we use applyChange, which results in the output variant

B_phosphorylated@cytosol. We now have to combine these with the two variants of

the involved component. This will define two Bio-PEPA reactions, both of which will

have the same reactants and products (the input and output variants referred to above,

respectively). The difference between the two reactions will be the enzyme, as in each

one it will be a different variant of the involved component.

If there were more variants of B, we would repeat this process, using a different

variant as input each time, finding the output and taking the combinations, as above. □
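The combination step can be sketched in Python for the common case of a single affected component. The function name, the dictionary layout of a reaction and the naming scheme for the reaction variants are all hypothetical; the product is obtained through a caller-supplied function standing in for applyChange.

```python
from itertools import product


def define_reactions(event_name, inputs, enzymes, output_of):
    """Sketch: one reaction variant per (input, enzyme) combination.

    inputs:    variants of the single affected component
    enzymes:   variants of the involved component (may be empty)
    output_of: function mapping an input variant to its product
    """
    reactions = []
    # With no involved component, each input still yields one reaction.
    combinations = product(inputs, enzymes or [None])
    for index, (reactant, enzyme) in enumerate(combinations, start=1):
        reactions.append({
            "name": f"{event_name}_{index}",      # hypothetical naming scheme
            "reactants": [reactant],
            "products": [output_of(reactant)],
            "enzymes": [] if enzyme is None else [enzyme],
        })
    return reactions
```

Called with the one variant of B and the two variants of A from Example 5, this produces two reactions that share reactant and product and differ only in the enzyme.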

3.3 Output

Having completed the processing steps, writing the output model also becomes straightforward.

The first parts to be written are constant and location definitions. These will not

be affected by the translation (unlike components and reactions) and, since all the

necessary information is contained in the NL definitions for them, all that is needed is

simply converting the definitions to the Bio-PEPA syntax.

The next part of the model is the specification of kinetic laws. Each of the newly-

defined Bio-PEPA reactions has the same kinetic law as the NL reaction of which it is

a variant, and this can be retrieved from the reaction definitions in the original model.

The NL syntax we have adopted for this project supports constant rates, as well as

mass-action, Michaelis-Menten and Hill kinetics, all of which are supported in Bio-PEPA,

so we simply need to iterate over the reaction variants and use the appropriate

syntax.

Next, we write the species definitions. As we have recorded every species’ role

when defining the reaction variants, this is simply a question of iterating over the roles

associated with a species and converting them to Bio-PEPA syntax. The conversion

is trivial, since it only involves writing the name of the reaction variant and then the

symbol for the role, both of which are readily available.

Example 6. To complete the running example, assume that the two reactions defined

in the last example are named r1 and r2. The species definitions will then be:

B_unphosphorylated = r1↓@cytosol + r2↓@cytosol

B_phosphorylated = r1↑@cytosol + r2↑@cytosol

A_hydrolysed_x:active = r1⊕

A_unhydrolysed_x:active = r2⊕ □
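The writing of a species definition from its recorded roles can be sketched as below. The output mirrors the notation used in Example 6 (↓ for reactant, ↑ for product, ⊕ for enzyme) and is not meant to reproduce the concrete syntax accepted by the Bio-PEPA tools; the role representation as (reaction, role, location) triples is an assumption.

```python
# Role symbols as used in Example 6.
ROLE_SYMBOLS = {"reactant": "↓", "product": "↑", "enzyme": "⊕"}


def species_definition(species, roles):
    """Sketch: build one definition line from a species' recorded roles.

    roles: list of (reaction_name, role, location_or_None) triples
    (a hypothetical representation chosen for this example).
    """
    terms = []
    for reaction, role, location in roles:
        term = f"{reaction}{ROLE_SYMBOLS[role]}"
        if location:
            term += f"@{location}"
        terms.append(term)
    return f"{species} = " + " + ".join(terms)
```

For instance, species_definition("A_hydrolysed_x:active", [("r1", "enzyme", None)]) returns the third line of Example 6.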

The final part is the model component. This must include all variants of every component

in the original model, as well as their initial quantities. For every component,

we first retrieve all its variants using getVariants with an empty condition list. We

must then combine the names of these variants with all possible compartments. The

component’s initial state definition specifies only one variant, so all the species will

have an initial quantity of zero, except for the one returned by initial, which will

have the initial quantity found in the definition.

3.4 Analysis: combinatorial explosion

Perhaps the most interesting and important characteristic of the algorithm as described

above is its complexity in terms of the number of species in the resulting model. As

described previously, one NL component can correspond to many Bio-PEPA species.

In general, a component with n states (that is, sites or component-wide states, as for

the purposes of this analysis the distinction is irrelevant) can result in $2^n$ variants, given

that all states are binary. This exponential increase means that output models are likely

to be much larger than their corresponding input models, if size is measured as the

number of components/species.

The situation is further encumbered when it comes to reactions. We mentioned

above that every combination of the affected and involved variants must be considered,

and a new reaction defined for each of them. This means that the number of reactions

will also be exponential in the number of states. For the case of an event with one

involved component with n states and one affected component with m, if there are

no state conditions to impose, there are $2^n$ and $2^m$ variants which can take part in

the reaction as enzyme or reactant, respectively. Since we have to take all possible

combinations, this will result in $2^{n+m}$ reaction variants being defined. While this worst-case

scenario might not be entirely realistic, as some conditions would likely exist, thus

limiting the number of variants, the number of reactions would still be exponential in

the number of free (unrestricted by conditions) states.
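The counts discussed above can be checked with a short Python sketch. The function names are hypothetical, all states are assumed to be binary, and compartments are ignored.

```python
def variant_count(total_states, constrained_states=0):
    """Number of variants of a component with binary states,
    of which some are fixed by event conditions."""
    return 2 ** (total_states - constrained_states)


def reaction_count(enzyme_free_states, reactant_free_states):
    """Number of reaction variants for one involved and one affected
    component: every enzyme variant is paired with every reactant variant."""
    return 2 ** enzyme_free_states * 2 ** reactant_free_states
```

In the running example, A has two states of which one is constrained and B has one constrained state, so reaction_count(1, 0) gives the two reactions found in Example 5.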

Unfortunately, it is not possible to completely avoid this issue. Ultimately, it is

rooted in the lack of a mechanism for specifying internal state in Bio-PEPA. This

is the reason for the one-to-many mapping from components to species, which then

propagates to the number of reactions.

However, we can attempt to reduce the number of species by analysing the model

and looking for useful properties. Informally, we can think of the transitions between

variants as forming a graph with the variants as nodes. We believe that this representation

makes it easier to describe the problem and possible solutions, and also offers a

variety of tools in the form of graph theory. We present the general idea behind two

optimizations, illustrated by examples of cases in which they may be useful.

Figure 3.1: State transition graph for the variant-pruning optimization example (nodes: x:inactive/y:inactive, x:inactive/y:active, x:active/y:active, x:active/y:inactive)

3.5 Optimizations

The first optimization involves reducing the number of species by identifying variants

which are never going to appear, as shown in the example.

Example 7. Consider the component A with two sites, x and y, both of which can be

active or inactive. Their activation is described by the following four events, which

determine the order in which activations can occur.

if A.x is not active and A.y is not active then B activates A on y

if A.x is not active and A.y is active then B activates A on x

if A.x is active and A.y is active then C deactivates A on x

if A.x is not active and A.y is active then C deactivates A on y

Examining the sequence of events that can occur, we can see that there is no way

for x to be active if y is not also active. The corresponding graph of state transitions

is shown in Figure 3.1. The variant A_x:active_y:inactive is unreachable and can therefore

be removed, reducing the number of species derived from A to three. Any

reactions in which this species was involved can also be deleted, further reducing the

size of the output model. □
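The pruning idea can be illustrated with a breadth-first search over the transition graph of Example 7. This Python sketch was not part of the implementation; it assumes that the initial variant is the one with both sites inactive, which the example does not state, and it encodes the four events by hand as graph edges.

```python
from collections import deque


def reachable(start, transitions):
    """Return the set of variants reachable from start (breadth-first)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for target in transitions.get(node, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


# Variants of A in Example 7 (i = inactive, a = active, order x then y).
ii = "A_x:inactive_y:inactive"
ia = "A_x:inactive_y:active"
aa = "A_x:active_y:active"
ai = "A_x:active_y:inactive"

# Edges derived from the four events of Example 7.
transitions = {ii: [ia], ia: [aa, ii], aa: [ia]}

# Assumed initial variant: both sites inactive.
unreachable = {ii, ia, aa, ai} - reachable(ii, transitions)
```

Here unreachable contains only A_x:active_y:inactive, the variant identified in the example.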

One issue to consider with this optimization is whether the nodes to be pruned

should be unreachable from anywhere or from the initial state. Choosing the latter imposes

a less strict condition and may therefore allow more variants to be pruned. However,

it would mean that changing the initial state specification in a component’s definition

would require the translation algorithm (or at least this optimization step) to be executed

again, as different variants may be reachable from the new initial state. On the other

hand, searching for nodes that are completely isolated might be too strict a restriction

for any noticeable optimization.

The idea behind the second optimization is to find groups of variants which behave

in a similar way, and then replace each group by a single variant.

Example 8. Consider a component A with three phosphorylation sites, x, y and z and

the following event descriptions:

if A.x is not phosphorylated then B phosphorylates A on x

if A.y is not phosphorylated then B phosphorylates A on y

if A.z is not phosphorylated then B phosphorylates A on z

Assume that the three sites are identical, in the sense that they are not involved

in any other reactions (or, if they are, those are exactly the same for each site) and

that all the above phosphorylation reactions occur at the same rate (or simply that the

three events are defined to have the same underlying reaction). It can be seen that the

only thing that matters is the number of phosphorylated sites, and not the specific sites

themselves. For instance, the variants

A_x:phosphorylated_y:unphosphorylated_z:unphosphorylated

A_x:unphosphorylated_y:phosphorylated_z:unphosphorylated

A_x:unphosphorylated_y:unphosphorylated_z:phosphorylated

can be considered equivalent, since they all have exactly one site phosphorylated. If

we replace them by a species A_1phosphorylated, and similarly define the species

A_0phosphorylated, A_2phosphorylated and A_3phosphorylated, the resulting model

has only four species, compared to eight if no optimization takes place. □

In the general case of a component with n identical sites, this optimization would

produce $n+1$ variants, while the original unoptimized version would contain $2^n$ species.

This reduction from exponential to linear complexity indicates that this optimization

may prove more powerful in reducing the size of the model than the first one.

However, it would require a deeper analysis of the model, and the conditions it requires

might prove too strict for it to be applied beyond a limited number of cases.
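The grouping of Example 8 can be illustrated in Python as follows. The sketch simply enumerates all variants of a component with identical sites and groups them by the number of sites in the positive state; it does not check that the sites really are identical, which is the hard part of the optimization. The function name and the merged naming scheme follow the example, but the code itself is hypothetical.

```python
from collections import defaultdict
from itertools import product


def group_by_count(component, sites, positive="phosphorylated",
                   negative="unphosphorylated"):
    """Map each merged species name to the original variants it replaces."""
    groups = defaultdict(list)
    for values in product([positive, negative], repeat=len(sites)):
        variant = component + "".join(
            f"_{site}:{value}" for site, value in zip(sites, values))
        # Only the number of sites in the positive state matters.
        merged = f"{component}_{values.count(positive)}{positive}"
        groups[merged].append(variant)
    return dict(groups)
```

For A with sites x, y and z, group_by_count("A", ["x", "y", "z"]) maps the eight original variants onto four merged species, with three variants under A_1phosphorylated.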

Chapter 4

Implementation

In this chapter, we give more details about the concrete implementation of the algorithm

presented in the previous chapter. We point out some small differences between

the theoretical description and the implemented version, as well as some issues we

came across and how they affected the implementation process.

4.1 Parsing

The first necessary step in the translation is to parse the NL model. For this project,

this was done using Xtext [20], an Eclipse plugin for the definition of domain-specific

languages, which offers several advantages. First is its ease of use: the user only needs

to describe the grammar of the language using a syntax similar to Backus-Naur Form.

In our case, therefore, this description closely resembled the contents of Appendix A,

simplifying the process of writing it. The plugin then generates code for the lexical and

syntactical analysis of files written in that language, and, once the parsing is complete,

makes available an object-oriented model of the file, which can be easily accessed by

short and simple code. Additionally, Xtext allows the user to create dynamic tools

such as editors, integrated into the Eclipse environment. While this was not realised

in the project, having this potential for future expansion was a further motive for the

selection of Xtext.

4.2 Processing

The main code for the system is written in Java, for maximum compatibility with the

already existing Bio-PEPA plugin. In general, the implementation closely follows the

description of Chapter 3, as reflected in Figure 4.1.

Figure 4.1: Activity diagram of the system

The Validator class is tasked with the validation procedure which, apart from making

sure the relevant conditions as laid out earlier are satisfied, also involves some

preprocessing on the model, such as building hash tables for efficient retrieval of compartments

using their ID or name. All the results of this preprocessing are packaged

in an object of class ProcessedModel, which serves as a means of communication between

this stage and the main processing module. If any part of the NL model is found

to violate the consistency criteria, an exception is raised and the application exits. We

define the exception class NarrLangModelException for use in these situations, with

an appropriate error message detailing the reason why the model was rejected.
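The preprocessing idea can be illustrated with the following Python sketch. Our implementation is written in Java, so this is only an analogue: apart from the name NarrLangModelException, everything here is hypothetical, and the duplicate check stands in for the consistency criteria without reproducing them.

```python
class NarrLangModelException(Exception):
    """Raised when the NL model violates a consistency criterion."""


def build_compartment_tables(compartments):
    """Build lookup tables for compartments given as (id, name) pairs.

    Returns (by_id, by_name). The duplicate check is an assumed
    example of a consistency criterion.
    """
    by_id, by_name = {}, {}
    for compartment_id, name in compartments:
        if compartment_id in by_id or name in by_name:
            raise NarrLangModelException(
                f"duplicate compartment: {compartment_id} / {name}")
        by_id[compartment_id] = name
        by_name[name] = compartment_id
    return by_id, by_name
```

With the compartments of Example 2, build_compartment_tables([(1, "cytosol"), (2, "exterior")]) allows a condition such as "B is in 1" to be resolved to the cytosol in constant time.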

Figure 4.2: (Simplified) class diagram of the system

The next step is the processing itself, in which the main bulk of the work takes

place. The corresponding class is Transformer, and it has two tasks. The first is to

process each event in turn, and so produce a list of roles for the species involved.

These are represented by objects of type Role or its subtype, RelocationRole. The

reason for distinguishing roles in relocation reactions is their different syntax in species

definitions, which can be easily dealt with by subclassing, as we do here. Figure 4.2

shows the different classes mentioned so far.

In keeping with the different treatment of the various event types in the algorithm

description, we also have different actions based on an event’s type here. This is reflected

in the class hierarchy shown in Figure 4.3. The idea is that, while obtaining

the involved and affected components and processing the conditions is more or less a

shared process across event types, the ways in which these species are combined and

their products are obtained are quite different. Here, we use subclasses to let an event’s

type determine the appropriate course of action for the species taking part.

Following the processing, the second task of Transformer is writing the model.

This is done as described in the algorithm, using the Role objects from the previous

step. One interesting detail has to do with the writing of the model component, during

which we take advantage of the syntax to define submodels, each comprising all the

species variants corresponding to a NL component. A long list of species is thus broken

down into smaller ones, making the model component more readable and easier to

modify.

Figure 4.3: The different event classes. Class names have been shortened from “ProcessedEvent”

etc. for the sake of presentation.

4.3 Integration with the plugin

The original plan for the project was to integrate the translation application in the

existing Bio-PEPA plugin. The addition would contribute a new menu item, which

would bring up a simple dialog. From there, the user would be able to browse for a

NL file and select the location where the output file would be created. The process

would entail the creation of two new classes, one to serve as delegate for the action

of selecting the menu item, and another for the simple dialog wizard that would be

brought up by that action and, finally, some minor edits to an XML file describing the

plugin to include the addition.

Unfortunately, due to time limitations and the additional need to become familiar

with the Eclipse plugin development environment, this final part of the project could

not be completed. It should be stressed, however, that it is only the graphical interface

and integration that were not realised, and that the translation application itself is

functional.

4.4 Design issues and decisions

In Section 3.1, we presented the requirements that a model must fulfil in order to

be considered valid for the purposes of this project. However, as presented there,

these rules allow for various constructions which may seem unintuitive or biologically

unsound. As an example, an event in which two components interact even though they

are in different compartments is considered valid under these criteria. Moreover, no

steps are taken during the processing of the event to enforce that the components must

be in the same compartment (e.g. by only considering the intersection of their possible

locations).

This was a conscious design decision on our part: we chose to allow the modeller

the highest degree of flexibility (within the limits of the syntax and insofar as no logical

contradictions or inconsistencies occur). This has the obvious side-effect of making

the process more error-prone, as fewer checks are in place to secure the validity of the

model. Nevertheless, we believe that, overall, our decision is not detrimental to the

quality of the translation system. Firstly, a very serious modelling error may well be

obvious or at least easy to detect from analysing the resulting Bio-PEPA model, so

our approach does not necessarily mean errors will go unnoticed. Furthermore, the

additional flexibility may prove to be useful in cases where a higher-level abstraction

is desired, for instance if someone wishes to omit (possibly unknown) intermediate

reactions whose removal gives the appearance of components interacting from

different locations, as mentioned earlier.

Another example of this is that we allow components that do not directly take

part in a reaction to influence it indirectly. Thus, an event’s conditions may involve

a component even if it does not act as a reactant or enzyme, such as in the event

if A is phosphorylated, B binds C. In this case, the variants of A specified by

the condition will be represented in the corresponding Bio-PEPA reactions as generic

modifiers, a role that means they must be present for the reaction to occur, but they do

not affect its rate. It is not immediately clear to us in what biochemical context such an

event might arise, except perhaps when considering abstractions as described above,

but the fact that it could be expressed in Bio-PEPA made us allow it for the translation.

On the other hand, in some cases we were restricted by the Bio-PEPA syntax in

what we could express. Such is the case of relocation reactions, for which the current

syntax does not allow the definition of modifiers (including enzymes). We therefore

have to ignore any conditions on non-participating species, which would have otherwise

been represented as generic modifiers, as in the previous paragraph. This also

means that the use of some kinetic laws is not allowed, as it requires that a species

be defined as an enzyme for the reaction. Therefore, for the translation, all relocation

reactions must either have a constant rate or follow mass-action kinetics.

4.4.1 Treatment of bindings and complexes

One part of the implementation worth focusing on is the treatment of bindings. As

mentioned in the previous chapter, we decided to represent a complex as a single species

with a name of the form A::B, rather than two species with their bound states active.

This representation is closer to what one would do if one were modelling directly

in Bio-PEPA, and makes the output model more readable. However, it introduced a

number of complications, because binding sites and the conditions on them now had

to be treated completely differently from other types of sites. For instance, if A has a

binding site and can bind to B or C, then the condition “if A.x is active” might

not represent only the species A_x:active, but also A_x:active::B and A_x:active::C,

or even more if B and C have more than one variant.

In order to avoid confusion, we decided to make the following assumption: unless

there is explicit mention of a component being bound somewhere in an event’s

description, that component is assumed to be free (i.e. unbound). This means that,

unless there is a state condition that forces the component to be bound, or the component

participates in the event as part of a complex, we do not consider what

other components it might be bound to.

Chapter 5

Evaluation

5.1 Test cases and procedure

Testing of the project’s performance occurred in two stages. In the first, while the

code was still being developed, the aim was mostly to locate bugs. For that reason, we

used a number of ad-hoc examples, chosen not because of their biological significance

(since they had none) but in order to test the system under a variety of event types and

make sure it responded correctly.

Once the implementation was finished and adequately tested, we chose two example

models on which to evaluate its behaviour. The evaluation criteria we chose are related

to the task for which the system is designed, keeping in mind that the translation is a

one-off procedure, but the result might be used multiple times. Therefore, we chose

not to measure the execution time of the application, but rather to focus on the characteristics

of the output. Specifically, we wanted to see how easy the output model is to

read, and how close it is to something written directly in Bio-PEPA. Obviously, these

two “metrics” are very subjective and difficult (if not impossible) to measure. However,

we believe that these aspects are more important than the temporal performance of

the algorithm, given what its purpose is. Additionally, for each example, we compared

the dynamic behaviour of the translated system to that of the “original” Bio-PEPA one,

to make sure that they match.

The two models that we used for evaluation are relatively simple but are often used

as examples for modelling. The first is a model of a generic enzymatic reaction with

two steps, which was at the same time a good way to test how the implementation

handles binding events. The second models part of the MAPK cascade [21], involving

three proteins, each of which activates the next.

Figure 5.1: Simulation results (1000 replications) on the original Bio-PEPA enzymatic

model (axes are molecule count vs. time)

5.2 Results and comments

The conclusions we drew for both examples are similar. We ran stochastic simulations on the original and the translated model and verified that their behaviours match, as can be seen from Figures 5.1 and 5.2 for the enzymatic model. In this case, there is a one-to-one correspondence between the species of the two models, so the translation is quite close to the “natural” model.
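The kind of comparison described above can be sketched in Python with a small Gillespie-style simulator. This is only an illustration under stated assumptions: the two-step scheme (binding followed by catalysis), the rate constants, the initial amounts and the renamed complex `E::S` are all hypothetical, and the sketch compares runs with equal random seeds rather than averaging 1000 replications as in the figures.

```python
import random


def gillespie(reactions, initial, t_end, seed):
    """Direct-method stochastic simulation with mass-action kinetics.

    reactions -- list of (rate_constant, reactants, products); each reactant
                 is assumed to appear once, which is enough for this sketch
    Returns the list of (time, state) pairs visited.
    """
    rng = random.Random(seed)
    state = dict(initial)
    t = 0.0
    trace = [(t, dict(state))]
    while True:
        # Propensity of each reaction: rate constant times reactant counts.
        props = []
        for k, reactants, _ in reactions:
            a = k
            for s in reactants:
                a *= state[s]
            props.append(a)
        total = sum(props)
        if total == 0:
            break  # no reaction can fire any more
        t += rng.expovariate(total)
        if t > t_end:
            break
        # Choose a reaction with probability proportional to its propensity.
        r = rng.random() * total
        acc = 0.0
        for a, (_, reactants, products) in zip(props, reactions):
            acc += a
            if r < acc:
                break
        for s in reactants:
            state[s] -= 1
        for s in products:
            state[s] += 1
        trace.append((t, dict(state)))
    return trace


# Hypothetical rate constants and initial amounts, for illustration only.
ORIGINAL = [
    (0.01, ["E", "S"], ["ES"]),   # binding
    (0.10, ["ES"], ["E", "P"]),   # catalysis
]
INITIAL = {"E": 10, "S": 100, "ES": 0, "P": 0}

# Hypothetical renaming standing in for the translator's longer names.
RENAME = {"E": "E", "S": "S", "ES": "E::S", "P": "P"}

TRANSLATED = [
    (k, [RENAME[s] for s in rs], [RENAME[s] for s in ps])
    for k, rs, ps in ORIGINAL
]
TRANSLATED_INITIAL = {RENAME[s]: n for s, n in INITIAL.items()}


def same_behaviour(seed, t_end=50.0):
    """True if both models produce the same trace, up to species renaming."""
    a = gillespie(ORIGINAL, INITIAL, t_end, seed)
    b = gillespie(TRANSLATED, TRANSLATED_INITIAL, t_end, seed)
    a_renamed = [(t, {RENAME[s]: n for s, n in st.items()}) for t, st in a]
    return a_renamed == b
```

Because the two models differ only in species names, `same_behaviour` holds for any seed. For models whose species do not correspond one-to-one, one would instead compare averages over many replications, as in Figures 5.1 and 5.2.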

The translated model is nevertheless harder to read, for at least two reasons that stem from the translation procedure. The first is that, during the translation, the reactions are given alphanumeric names indicating which NL reaction they originated from. While systematic, this is not very helpful when reading the model and trying to understand the function of a reaction from the species involved. This is clearly easier when one has the freedom to name reactions according to their purpose, something the algorithm cannot do.

The other reason is that the names of the variants generated by the algorithm tend to be long, especially for components with many sites or, worse, when they are bound. Compare, for instance, the names in Figures 5.1 and 5.2. This makes the model look more cluttered, especially where the species name has to be repeated in its definition, leading to definitions that can become hard to understand or manipulate. One could argue that, even in the original Bio-PEPA model, this cannot be avoided if it is necessary to distinguish between different states of the same species, as the names would once again end up being variations on a common prefix. In any case, although the naming convention we have adopted leads to particularly long species names, it at least makes clear what each species represents, since the state is directly reflected in the name. In this sense, it has perhaps the opposite effect of the reaction naming scheme.

Figure 5.2: Simulation results (1000 replications) on the translated Bio-PEPA enzymatic model (axes as above)

In the MAPK example, the number of species was also the same in the two models (it was, however, double the number of NL components, as expected, since every component had one binary state). Because both example systems are small, the translation did not lead to large absolute increases in the number of species.
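The doubling can be checked with a short Python sketch that enumerates one variant per on/off combination of a component's binary states. The component names (`K3`, `K2`, `K1`), the state name and the `_state` suffix are hypothetical and do not reproduce the naming convention of the translator; binding between components, which adds further species, is deliberately left out.

```python
from itertools import product


def variants(component, binary_states):
    """All variants of a component, one per on/off combination of its states."""
    result = []
    for flags in product([False, True], repeat=len(binary_states)):
        on = [s for s, flag in zip(binary_states, flags) if flag]
        result.append(component + "".join(f"_{s}" for s in on))
    return result


def species_count(components):
    """Number of species if every variant becomes a separate species."""
    return sum(len(variants(name, states)) for name, states in components.items())


# Hypothetical three-protein cascade, each protein with one binary state.
cascade = {"K3": ["active"], "K2": ["active"], "K1": ["active"]}
print(species_count(cascade))           # 3 components, 2 variants each
print(len(variants("A", ["x", "y", "z"])))  # 3 binary states on one component
```

With one binary state per component the count is twice the number of components, while a single component with n independent binary states already yields 2^n variants, which is why larger models are expected to grow much faster than the two examples here.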

Chapter 6

Conclusions

We developed an algorithm for translating models written in the Narrative Language

into Bio-PEPA, and implemented it in a Java application. Although we did not manage

to integrate the application into the Bio-PEPA plugin, test results show that it succeeds

in capturing the behaviour described in the Narrative Language model. Additionally, in

the two test cases we analysed, the resulting models appeared to be comprehensible and

easy to read, although not as much as the ones written directly in Bio-PEPA. However,

they were very similar to them, which we can take as an encouraging sign for the

usefulness of this project.

In our view, the greatest problem with the algorithm is the combinatorial explosion

in the number of species during the translation. Even though we did not implement any

optimizations to the algorithm to deal with the issue, this was not a necessary part of

the original proposal. Furthermore, we believe the optimization ideas proposed earlier

in the discussion merit further investigation, and we reiterate that the problem cannot

be completely solved, as it is inherent in the current syntax of the language.

6.1 Future work

There are a number of ways in which the work described here can be extended. Firstly,

we believe that there is potential in exploring optimizations to the algorithm in order

to address the combinatorial issues. A first approach would therefore be to implement

a more sophisticated version of the algorithm, using the optimizations described in

Section 3.5. This would require researching how effective they are and in which cases

they can be applied with significant results. Another possibility that could be investigated is the use of model-checking, instead of graph theory, to check for properties of the system. This line of research could perhaps shed light on particular aspects of the problem and lead to the development of further optimizations.

An alternative approach for dealing with this, at least on a more superficial level,

would be to extend the Bio-PEPA syntax to include some form of denoting internal

state. This would be in the same spirit as the extension of the language to support

locations more extensively [22]. This included special syntax for the definition of

compartments and membranes, as well as for relocation of species, which led to models

being more readable and compact. However, the extension is provided only to help the

modeller and these descriptions are then automatically converted to the “traditional”

syntax. Similar “syntactic sugar” extensions could be introduced for specifying states

for species. It may also be interesting to look into extending the “true” syntax, i.e.

supporting such state qualifications without needing to map them to the original syntax.

A different direction in which this work can be extended would be to widen the

range of input languages supported. For instance, an approach similar to this one could

perhaps be applied in order to translate models from other rule-based formalisms, such

as rewriting systems. A more ambitious project would be to accept a model written

in something resembling, or at least closer to, actual natural language. Similar work,

although in a different scientific context, that of electrical systems, has been done in

[23]. The current algorithm can also be updated in order to include features that are

introduced in future versions of the NL.

Finally, on the implementation level, the integration with the Bio-PEPA plugin

could be furthered by developing an editor or other graphical tools for the specification

of models in the NL from within the plugin. The definition of the NL grammar in

Xtext, which was realised as part of the project, would serve as a useful basis for the

development of these tools.

Bibliography

[1] Jasmin Fisher and Thomas A Henzinger. Executable cell biology. Nat Biotech, 25(11):1239–1249, 2007.

[2] Daniel T. Gillespie. Stochastic Simulation of Chemical Kinetics. Annual Review of Physical Chemistry, 58:35–55, 2007.

[3] Aviv Regev and Ehud Shapiro. Cells as Computation. Nature, 419(6905):343,

[4] Federica Ciocchetta and Jane Hillston. Process Algebras in Systems Biology. In SFM’08, volume 5016 of LNCS, pages 265–312. Springer-Verlag, 2008.

[5] C. Priami, A. Regev, W. Silverman, and E. Shapiro. Application of a stochastic name-passing calculus to representation and simulation of molecular processes. Information Processing Letters, 80(1):25–31, 2001.

[6] Vincent Danos and Jean Krivine. Formal Molecular Biology Done in CCS-R. Electronic Notes in Theoretical Computer Science, 180(3):31–49, 2007. Proceedings of the First Workshop on Concurrent Models in Molecular Biology (BioConcur 2003).

[7] Muffy Calder, Stephen Gilmore, and Jane Hillston. Modelling the influence of RKIP on the ERK signalling pathway using the stochastic process algebra PEPA. In Transactions on Computational Systems Biology VII, number 4230 in LNCS, pages 1–23. Springer, 2006.

[8] Marta Kwiatkowska, Gethin Norman, and David Parker. Using probabilistic model checking in systems biology. ACM SIGMETRICS Performance Evaluation Review, 35(4):14–21, 2008.

[9] Corrado Priami and Paola Quaglia. Operational patterns in Beta-binders. Transactions on Computational Systems Biology, 1:50–65, 2005.

[10] Federica Ciocchetta and Jane Hillston. Bio-PEPA: A framework for the modelling and analysis of biological systems. Theoretical Computer Science, 410(33-34):3065–3084, 2009.

[11] Jane Hillston. A Compositional Approach to Performance Modelling. Cambridge University Press, 1996.

[12] The Bio-PEPA Eclipse Plugin. http://homepages.inf.ed.ac.uk/s9552712/bio-pepa/plugin.html.

[13] Andrew Hinton, Marta Kwiatkowska, Gethin Norman, and David Parker. PRISM: A tool for automatic verification of probabilistic systems. In Proc. of TACAS’06, volume 3920 of LNCS, pages 441–444, 2006.

[14] Maria Luisa Guerriero. From Intuitive Descriptions of Biochemical Systems to Their Formal Analysis. PhD thesis, ICT School - DIT - University of Trento,

[15] Maria Luisa Guerriero, John K. Heath, and Corrado Priami. An Automated Translation from a Narrative Language for Biological Modelling into Process Algebra. In Proceedings of Computational Methods in Systems Biology (CMSB’07), volume 4695 of LNCS, pages 136–151. Springer, 2007.

[16] Monika Heiner, David Gilbert, and Robin Donaldson. Petri Nets for Systems and Synthetic Biology. In SFM’08, volume 5016 of LNCS, pages 215–264. Springer,

[17] Mario J. Pérez-Jiménez and Francisco J. Romero-Campero. P Systems, a New Computational Modelling Tool for Systems Biology. Transactions on Computational Systems Biology, 6:176–197, 2006.

[18] Vincent Danos, Jérôme Feret, Walter Fontana, Russell Harmer, and Jean Krivine. Rule-Based Modelling of Cellular Signalling. In CONCUR 2007 - Concurrency Theory, volume 4703 of Lecture Notes in Computer Science, pages 17–41. Springer Berlin / Heidelberg, 2007.

[19] Nicolas Le Novere, Michael Hucka, Huaiyu Mi, Stuart Moodie, Falk Schreiber, Anatoly Sorokin, Emek Demir, Katja Wegner, Mirit I Aladjem, Sarala M Wimalaratne, Frank T Bergman, Ralph Gauges, Peter Ghazal, Hideya Kawaji, Lu Li, Yukiko Matsuoka, Alice Villeger, Sarah E Boyd, Laurence Calzone, Melanie Courtot, Ugur Dogrusoz, Tom C Freeman, Akira Funahashi, Samik Ghosh, Akiya Jouraku, Sohyoung Kim, Fedor Kolpakov, Augustin Luna, Sven Sahle, Esther Schmidt, Steven Watterson, Guanming Wu, Igor Goryanin, Douglas B Kell, Chris Sander, Herbert Sauro, Jacky L Snoep, Kurt Kohn, and Hiroaki Kitano. The Systems Biology Graphical Notation. Nature Biotechnology, 27(8):735–741,

[20] Xtext. http://www.eclipse.org/Xtext/.

[21] Rony Seger and Edwin G. Krebs. The MAPK signaling cascade. The FASEB Journal, 9(9):726–735, 1995.

[22] Federica Ciocchetta and Maria Luisa Guerriero. Modelling Biological Compartments in Bio-PEPA. In Proc. of MeCBIC’08, volume 227 of ENTCS, pages 77–95. Elsevier, 2009.

[23] Alexander Holt and Ewan Klein. A semantically-derived subset of English for hardware verification. In Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics, ACL ’99, pages 451–456, 1999.

Appendix A

Narrative Language Syntax

〈model〉 ::= 〈constants_decl〉〈comparts_decl〉〈compons_decl〉〈reacts_decl〉〈procs_decl〉

〈constants_decl〉 ::= Constants 〈constants_list〉
〈comparts_decl〉 ::= Compartments 〈comparts_list〉
〈compons_decl〉 ::= Components 〈compons_list〉
〈reacts_decl〉 ::= Reactions 〈reacts_list〉
〈procs_decl〉 ::= Narrative 〈procs_list〉

〈constants_list〉 ::= 〈constant〉 | 〈constant〉〈constants_list〉
〈comparts_list〉 ::= 〈compartment〉 | 〈compartment〉〈comparts_list〉
〈compons_list〉 ::= 〈component〉 | 〈component〉〈compons_list〉
〈reacts_list〉 ::= 〈reaction〉 | 〈reaction〉〈reacts_list〉
〈procs_list〉 ::= 〈proc〉 | 〈proc〉〈procs_list〉

〈constant〉 ::= (〈const〉,〈quantity〉)
〈compartment〉 ::= (〈id〉,〈compart_name〉,〈opt_size〉,〈opt_unit〉,〈opt_dim〉)
〈component〉 ::= (〈name〉,〈opt_inform_descr〉,〈opt_sites_def〉,
    〈opt_states_def〉,〈opt_comparts_def〉,〈initial_quantity〉)
〈reaction〉 ::= (〈id〉,〈react_type〉,〈rate〉)

〈proc〉 ::= Process 〈opt_inform_descr〉〈events_list〉
〈events_list〉 ::= 〈event〉
    | 〈event〉〈events_list〉
〈event〉 ::= (〈id〉,〈form_descr〉,〈react_id〉,〈opt_altern_event〉)

〈opt_sites_def〉 ::=
    | 〈sites_def〉
〈sites_def〉 ::= 〈site_def〉
    | 〈site_def〉;〈sites_def〉
〈site_def〉 ::= 〈name〉 : 〈state_name〉 : 〈is_active〉

〈opt_states_def〉 ::=
    | 〈states_def〉
〈states_def〉 ::= 〈state_def〉
    | 〈state_def〉;〈states_def〉
〈state_def〉 ::= 〈state_name〉 : 〈is_active〉

〈opt_comparts_def〉 ::=
    | 〈comparts_def〉
〈comparts_def〉 ::= 〈compart_def〉
    | 〈compart_def〉;〈comparts_def〉
〈compart_def〉 ::= 〈id〉 : 〈is_active〉

〈initial_quantity〉 ::= (〈quantity〉,〈opt_reliability〉)
〈rate〉 ::= rate_const
    | rate_law
〈rate_const〉 ::= (〈rate_value〉,〈opt_unit〉,〈opt_reliability〉)
〈rate_law〉 ::= fMA(quantity)
    | fMM(quantity,quantity)
    | fH(quantity,quantity, Int)

〈form_descr〉 ::= 〈event_descr〉
    | if 〈conds〉 then 〈event_descr〉
〈conds〉 ::= 〈cond〉 | 〈cond〉 and 〈conds〉
〈cond〉 ::= 〈names〉 is 〈state_name〉
    | 〈names〉 is not 〈state_name〉
    | 〈names〉 is in 〈id〉
    | 〈names〉 is not in 〈id〉
〈names〉 ::= 〈name〉
    | 〈name〉.〈name〉
    | 〈name〉;〈names〉
    | 〈name〉.〈name〉;〈names〉
〈sites〉 ::= 〈name〉 | 〈name〉;〈sites〉

    | 〈complex_name〉 degrades 〈complex_name〉
    | 〈complex_name〉 synthesises 〈complex_name〉
    | 〈complex_name〉 homodimerizes
    | 〈complex_name〉 dehomodimerizes
    | 〈complex_name〉 dimerizes with 〈complex_name〉
    | 〈complex_name〉 dedimerizes from 〈complex_name〉

〈complex_name〉 ::= 〈name〉 | 〈name〉 : 〈complex_name〉

〈id〉 ::= Int
〈opt_size〉 ::=
    | Int | const
〈opt_unit〉 ::=
〈opt_dim〉 ::=
〈name〉 ::= Ide
〈opt_inform_descr〉 ::=
〈quantity〉 ::= value | const
〈value〉 ::= Int | Real
〈const〉 ::= Ide
〈opt_reliability〉 ::=
〈rate_value〉 ::= quantity
〈react_id〉 ::= Int
〈opt_altern_event〉 ::=
    | alternative to 〈id〉
〈is_active〉 ::= Bool

〈compart_name〉 ::= nucleus | cytosol | exosol
    | cellMembrane | nucleusMembrane | Ide

〈react_type〉 ::= phosphorylation | dephosphorylation
    | binding | unbinding
    | homodimerization | dehomodimerization
    | dimerization | dedimerization
    | activation | deactivation
    | hydrolysis | dehydrolysis
    | degradation | synthesis | relocation

〈state_name〉 ::= phosphorylated | bound | active | hydrolysed | dimer

〈bimol_react〉 ::= phosphorylates | dephosphorylates | binds | unbinds
    | activates | deactivates | hydrolyses | dehydrolyses

〈monomol_react〉 ::= phosphorylates | dephosphorylates | hydrolyses | dehydrolyses
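To make the grammar more concrete, the following Python sketch recognises the 〈cond〉 production only. It is an illustration and not the Xtext grammar developed in the project: the function `parse_cond`, the returned dictionary layout and the regular expression assumed for Ide (a letter or underscore followed by word characters) are all hypothetical.

```python
import re

# Terminals of <state_name>, copied from the grammar above.
STATE_NAMES = {"phosphorylated", "bound", "active", "hydrolysed", "dimer"}

NAME = r"[A-Za-z_]\w*"  # assumption: the shape of an Ide token
# <names>: one or more <name> or <name>.<name>, separated by ';'
NAMES = rf"{NAME}(?:\.{NAME})?(?:;{NAME}(?:\.{NAME})?)*"
COND = re.compile(rf"^({NAMES})\s+is\s+(not\s+)?(?:in\s+(\d+)|(\w+))$")


def parse_cond(text):
    """Parse '<names> is [not] <state_name>' or '<names> is [not] in <id>'."""
    match = COND.match(text.strip())
    if match is None:
        raise ValueError(f"not a condition: {text!r}")
    names, negation, compartment, state = match.groups()
    if compartment is None and state not in STATE_NAMES:
        raise ValueError(f"unknown state name: {state!r}")
    return {
        "names": names.split(";"),
        "negated": negation is not None,
        "kind": "location" if compartment is not None else "state",
        "value": int(compartment) if compartment is not None else state,
    }


print(parse_cond("A.x;B is not phosphorylated"))
print(parse_cond("A is in 2"))
```

Dotted names such as `A.x` are kept as single strings here; how they are interpreted is left to later stages of a translator.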
