A Hidden Markov Model for Transcriptional Regulation in Single Cells
John Goutsias
Abstract: We discuss several issues pertaining to the use of stochastic biochemical systems for modeling transcriptional regulation in
single cells. By appropriately choosing the system state, we can model transcriptional regulation by a hidden Markov model (HMM).
This opens the possibility of using well-known techniques for the statistical analysis and stochastic control of HMMs to mathematically
and computationally study transcriptional regulation in single cells. Unfortunately, in all but a few simple cases, analytical
characterization of the statistical behavior of the proposed HMM is not possible. Moreover, analysis by Monte Carlo simulation is
computationally cumbersome. We discuss several techniques for approximating the HMM by one that is more tractable. We employ
simulations, based on a biologically relevant transcriptional regulatory system, to show the relative merits and limitations of various
approximation techniques and provide general guidelines for their use.
Index Terms: Hidden Markov models, Monte Carlo simulation, stochastic biochemical systems, stochastic dynamical systems,
transcriptional regulation, transcriptional regulatory systems.
1 INTRODUCTION
TRANSCRIPTIONAL regulation is a fundamental biological process used by cells to control their actions and properties through protein synthesis. Transcription maps genetic information encoded in a DNA molecule into RNA molecules, which are then used for protein synthesis by translation. Understanding transcriptional regulation is fundamental to cell biology and may eventually lead to novel techniques for the prevention and treatment of human diseases [1], [2]. Since some readers may not be familiar with basic cell biology, we provide a simple introduction to transcriptional regulation in Section 2. For more information, we refer the reader to [3], [4].
Most work on transcriptional regulation requires extensive biological experimentation, which is time consuming and expensive. However, it is becoming increasingly clear that mathematical modeling of transcriptional regulation may lead to inexpensive computational tools that can be used to understand and predict basic principles underlying this important biological process and guide biological experimentation via simulation [5], [6].
We may consider transcriptional regulation in a large population of cells [7] or in single cells [8], [9]. In the former case, we may construct a model to predict the dynamic evolutions of the concentrations of molecular species in the population. In the latter case, we may construct a model to predict the dynamic evolutions of various statistics (e.g., means and standard deviations) of the copy numbers of each molecular species in a single cell. To model transcriptional regulation in a population of cells, we need to assume that a large number of genotypically identical cells are available [7], which express the same set of genes using identical molecular machineries. Unfortunately, we cannot satisfy this assumption in practice. Moreover, the averaging effect of studying transcriptional regulation in a large population of cells may mask important biological behavior and may lead to false conclusions. Therefore, it is more appropriate to study transcriptional regulation in single cells [10].
Due to the fact that biochemical reactions in single cells may be initiated by molecular collisions at random times, fluctuations may dominate transcriptional regulation dynamics [11], [12], [13]. This necessitates the use of a stochastic approach to transcriptional regulation [8], [9], [11], [14]. To develop such an approach, we may assume that a cell is a well-stirred homogeneous medium at thermal equilibrium, comprised of a number of interacting molecules. New molecules are synthesized by biochemical reactions initiated at random times by stochastic interactions among existing molecules. This simplified view allows us to model transcriptional regulation in single cells by a mathematically tractable stochastic biochemical system. We discuss this approach in Section 3.
Our main objective in this paper is to discuss several important issues pertaining to the use of stochastic biochemical systems for modeling transcriptional regulation in single cells. It is most common in the literature to characterize the state of a stochastic biochemical system by the vector $\mathbf{X}(t)$ of the copy numbers of the molecular species present in the system at time $t$. Then, a continuous-time vector-valued Markov chain (and, more precisely, a birth-death process) is used to characterize the dynamic evolution of that state. In Section 3, we argue that it may be more appropriate to characterize the state of a stochastic biochemical system by the vector $\mathbf{Z}(t)$ of the numbers of occurrences of the underlying reactions, from which the copy numbers of the molecular species may be directly calculated. In this case, a continuous-time vector-valued Markov chain (and, more precisely, a birth process) is used to characterize the dynamic evolution of that state. Unfortunately, we cannot measure the state $\mathbf{Z}(t)$ or calculate it from $\mathbf{X}(t)$. Therefore, we may model transcriptional regulation by a hidden Markov model (HMM) [15], with $\mathbf{Z}(t)$ being the hidden state and $\mathbf{X}(t)$ being the observable state of that model. This opens the possibility of using well-known techniques for the statistical analysis and stochastic control of HMMs to mathematically and computationally study transcriptional regulation in single cells.
IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 3, NO. 1, JANUARY-MARCH 2006 57
The author is with the Whitaker Biomedical Engineering Institute, Clark Hall 308A, The Johns Hopkins University, Baltimore, MD 21218. E-mail: [email protected].
Manuscript received 28 Mar. 2005; revised 21 July 2005; accepted 15 Aug. 2005; published online 31 Jan. 2006. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TCBB-0018-0305. 1545-5963/06/$20.00 © 2006 IEEE. Published by the IEEE CS, CI, and EMB Societies & the ACM.
In all but a few simple cases, we cannot analytically characterize the statistical behavior of the hidden and observable states. However, we can use Monte Carlo techniques to stochastically simulate the system and estimate relevant statistics. Unfortunately, this is a computationally intensive approach. In Section 4, we discuss several techniques that can be used to approximate the dynamic evolutions of the hidden states. Some techniques have been used in the literature to approximate the dynamic evolutions of the observable states. Simulations, based on a biologically relevant transcriptional regulatory system, clearly show the relative merits and limitations of various approximations. It turns out that some techniques may not be appropriate, whereas others may produce excellent approximations. Finally, we summarize our conclusions in Section 5.
2 TRANSCRIPTIONAL REGULATION
Transcription and translation are two important biological mechanisms used by cells for protein synthesis. During transcription, the DNA coding region of a gene is copied into messenger RNA (mRNA) molecules. A gene is usually associated with two DNA regions, known as the regulatory region and the promoter of the gene. Proteins, known as transcription factors (TFs), bind at specific sites along the regulatory region of a gene and recruit a large enzyme, the RNA polymerase II, at the promoter of the gene. After binding at the promoter, the RNA polymerase II locally separates the two DNA strands and transcribes the gene by moving along one of the strands. The TFs regulate transcription by either activating or repressing the recruitment and binding of the RNA polymerase II at the promoter of the gene.
During translation, the information encoded in mRNA molecules is used for protein synthesis. This is done by a large molecular complex, the ribosome. After binding an mRNA molecule, the ribosome converts the encoded genetic information into one of 20 amino acids and chemically links these amino acids to form a protein.
mRNAs and proteins may be subject to degradation. Proteins may also be subject to chemical modifications and processing (e.g., dimerization, cleavage, phosphorylation, etc.). These steps may alter mRNA and protein activity and exert additional control on transcriptional regulation.
To illustrate the previous steps, we refer to the one-gene regulatory system depicted in Fig. 1 (see footnote 1). The gene's regulatory region consists of two distinct binding sites, R1 and R2. Moreover, its promoter coincides with R2. The TF D may bind at site R1 and, at sufficiently high concentrations, may also bind at site R2. The binding of D at R1 activates transcription of the gene by recruiting the RNA polymerase II at the promoter. Activation of transcription produces mRNA transcripts that are translated into proteins M. After synthesis, two M molecules may bind (i.e., dimerize) to form a stable TF molecule D. These steps form a positive feedback loop that, if left unchecked, may produce an infinite number of proteins M and TFs D. However, since the number of D molecules increases as a function of time, a D molecule will eventually bind at site R2. This will exclude RNA polymerase II from binding at the promoter, in which case, transcription will be repressed. The resulting negative feedback will eventually stabilize protein synthesis at some desired level.
The reader should keep in mind that transcriptional regulation is controlled by several additional and not well-understood biological mechanisms, such as mRNA and protein localization, alternative splicing, protein folding, and chromatin modification and remodeling [3], [4]. By limiting ourselves to the previously discussed mechanisms, we obtain an approximation that allows us to design simple and tractable mathematical models for transcriptional regulation. We discuss one such model next.
3 A HIDDEN MARKOV MODEL
3.1 Stochastic Biochemical Systems
A stochastic biochemical system consists of $M$ elementary (monomolecular or bimolecular) irreversible reaction channels, which react at random times. A monomolecular reaction channel converts a reactant molecule into one or more product molecules. A bimolecular reaction channel converts two reactant molecules into one or more product molecules. We can decompose a reaction channel that involves more than two reactant molecules into a cascade of elementary reaction channels and model a reversible reaction channel by two irreversible reaction channels.
We characterize the state of a stochastic biochemical system at time $t$ by the $M$-dimensional random vector $\mathbf{Z}(t) = [Z_1(t)\; Z_2(t)\; \cdots\; Z_M(t)]^T$, where $Z_m(t) = z$ if the $m$th reaction has occurred $z$ times during the time interval $[0, t]$, and $T$ denotes vector or matrix transposition. The random variable $Z_m(t)$ is referred to as the degree of advancement (DA) of the $m$th reaction [17]. In the following, we denote by $X_n(t)$ the number of molecules of the $n$th reactant or product species present in the system at time $t$. By assuming $N$ distinct species, we set $\mathbf{X}(t) = [X_1(t)\; X_2(t)\; \cdots\; X_N(t)]^T$.
Footnote 1. This is a simplified version of a basic biological mechanism of a genetic switch that controls the fate of an E. coli cell infected by the bacteriophage λ virus [16]. This mechanism controls transcriptional regulation of the bacteriophage λ repressor protein (CI), responsible for maintaining a passive integration of the λ chromosome into the host DNA, a state known as lysogeny.
Fig. 1. A simple transcriptional regulatory system.
Given that the biochemical system is at state $\mathbf{X}(t) = \mathbf{x}$ at time $t$, let $q_m(\mathbf{x})$ be the number of all possible distinct combinations of the reactant molecules associated with the $m$th reaction channel when the system is at state $\mathbf{x}$. Note that
$$q_m(\mathbf{x}) = \begin{cases} x_i, & \text{for monomolecular reactions} \\ x_i (x_i - 1)/2, & \text{for bimolecular reactions with identical reactants} \\ x_i x_j, & \text{for bimolecular reactions with different reactants,} \end{cases} \tag{1}$$
for some $1 \le i, j \le N$, $i \ne j$. Moreover, let $c_m > 0$ be the probability per unit time that a randomly chosen combination of reactant molecules will react through the $m$th reaction channel. This probability is known as the specific probability rate constant of the $m$th reaction. Then, the probability that one $m$th reaction will occur during a time interval $[t, t + dt)$ will approximately be equal to $\pi_m(\mathbf{x})\,dt$, for a sufficiently small $dt$, where
$$\pi_m(\mathbf{x}) \triangleq c_m q_m(\mathbf{x}), \quad m \in \mathcal{M} \triangleq \{1, 2, \ldots, M\},$$
is known as the propensity function of the $m$th reaction channel [18], [19].
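The three cases in (1) and the resulting propensity can be sketched in a few lines of Python. This is only an illustration: the function names, the encoding of a channel by the list of species indices it consumes, and the numerical values are assumptions made here and do not come from the paper.

```python
from typing import Sequence


def combinations_count(x: Sequence[int], reactants: Sequence[int]) -> int:
    """Number of distinct reactant combinations q_m(x), following (1).

    `reactants` (a hypothetical encoding) lists the 0-based species indices
    consumed by the channel: one index for a monomolecular channel, two for
    a bimolecular one (repeated when the two reactants are identical).
    """
    if len(reactants) == 1:
        return x[reactants[0]]
    if len(reactants) == 2:
        i, j = reactants
        if i == j:
            return x[i] * (x[i] - 1) // 2  # unordered pairs of one species
        return x[i] * x[j]                 # pairs of two different species
    raise ValueError("only elementary (mono- or bimolecular) channels")


def propensity(c_m: float, x: Sequence[int], reactants: Sequence[int]) -> float:
    """Propensity c_m * q_m(x) of one reaction channel."""
    return c_m * combinations_count(x, reactants)


x = [5, 3]                                 # illustrative copy numbers
print(combinations_count(x, [0]))          # monomolecular
print(combinations_count(x, [0, 0]))       # bimolecular, identical reactants
print(combinations_count(x, [0, 1]))       # bimolecular, different reactants
print(propensity(0.5, x, [0, 0]))
```

With five molecules of the first species, the identical-reactant case counts the 10 unordered pairs rather than the 25 ordered ones, which is why the factor 1/2 appears in (1).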
Note that, given the state $\mathbf{z}(t)$ of the biochemical system at time $t$, we can uniquely determine the state $\mathbf{x}(t)$ of the system at time $t$. This is due to the fact that
$$X_n(t) = g_n(\mathbf{Z}(t)) \triangleq x_{0,n} + \sum_{m \in \mathcal{M}} s_{nm} Z_m(t), \quad t \ge 0, \tag{2}$$
for $n \in \mathcal{N} \triangleq \{1, 2, \ldots, N\}$, where $x_{0,n}$ is the initial number of molecules of the $n$th species present in the cell at time $t = 0$ and $s_{nm}$ is the stoichiometric coefficient. This coefficient quantifies the change in the number of molecules of the $n$th molecular species caused by one occurrence of the $m$th reaction. The state $\mathbf{z}(t)$ cannot be determined from $\mathbf{x}(t)$ in general since there might be several states $\mathbf{z}(t)$ that lead to the same state $\mathbf{x}(t)$. To distinguish $\mathbf{Z}(t)$ from $\mathbf{X}(t)$, we refer to $\mathbf{Z}(t)$ as the hidden state and to $\mathbf{X}(t)$ as the observable state.
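A small numerical illustration of why the map (2) cannot be inverted: in the hypothetical reversible pair A → B, B → A below (a toy system chosen here, not the one in Fig. 1), every extra forward-and-back cycle changes the hidden state but leaves the observable state untouched.

```python
# Hypothetical toy system: reaction 1 is A -> B, reaction 2 is B -> A.
# Rows of S are species (A, B); columns are reactions (1, 2).
S = [[-1, 1],
     [1, -1]]
x0 = [3, 0]  # three molecules of A and none of B at t = 0


def observable_state(z):
    """Evaluate (2): x_n = x0_n + sum_m s_nm * z_m."""
    return [x0[n] + sum(S[n][m] * z[m] for m in range(len(z)))
            for n in range(len(x0))]


x_a = observable_state([1, 0])  # one forward reaction
x_b = observable_state([2, 1])  # two forward reactions, one backward
print(x_a, x_b)                 # both hidden states give the same x
```

Both hidden states map to two molecules of A and one of B, so observing the copy numbers alone does not reveal how many times each reaction has fired.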
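The discrete-valued random process $\mathbf{Z} = \{\mathbf{Z}(t), t \ge 0\}$ characterizes the dynamic evolution of the hidden state of a biochemical system. This process is specified by the probability mass function (PMF) $P_{\mathbf{z}}(\mathbf{z}; t) = \Pr[\mathbf{Z}(t) = \mathbf{z} \mid \mathbf{Z}(0) = \mathbf{0}]$, for every $t \ge 0$. Simple probabilistic arguments show that $P_{\mathbf{z}}(\mathbf{z}; t)$ satisfies the following first-order differential equation [20]:
$$\frac{\partial P_{\mathbf{z}}(\mathbf{z}; t)}{\partial t} = \sum_{m \in \mathcal{M}} \left\{ \alpha_m(\mathbf{z} - \mathbf{e}_m) P_{\mathbf{z}}(\mathbf{z} - \mathbf{e}_m; t) - \alpha_m(\mathbf{z}) P_{\mathbf{z}}(\mathbf{z}; t) \right\}, \tag{3}$$
for $t > 0$, with initial condition $P_{\mathbf{z}}(\mathbf{0}; 0) = 1$, where $\mathbf{e}_m$ is the $m$th column of the $M \times M$ identity matrix and
$$\alpha_m(\mathbf{z}) \triangleq \pi_m(\mathbf{g}(\mathbf{z})) = c_m q_m(\mathbf{g}(\mathbf{z})), \tag{4}$$
with $\mathbf{g}(\mathbf{z}) = [g_1(\mathbf{z})\; g_2(\mathbf{z})\; \cdots\; g_N(\mathbf{z})]^T$. This is the well-known forward Kolmogorov differential equation [21], [22], [23] that governs the stochastic evolution of a continuous-time Markov chain. However, in computational biochemistry, (3) is referred to as the chemical master equation (CME) [17], a term that we also use in this paper. It turns out that $\mathbf{Z}$ is a multivariate birth process [21], [23].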
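We can show from (3) that the means $\mu_m(t) \triangleq E[Z_m(t)]$ and covariances $\sigma_{mm'}(t) \triangleq \mathrm{Cov}[Z_m(t), Z_{m'}(t)]$ of the hidden state process satisfy the following system of first-order differential equations (see footnote 2):
$$\frac{d\mu_m(t)}{dt} = E[\alpha_m(\mathbf{Z}(t))], \quad m \in \mathcal{M}, \tag{5}$$
$$\frac{d\sigma_{mm'}(t)}{dt} = E[\alpha_m(\mathbf{Z}(t))]\,\delta(m - m') + E[Z_m(t)\,\alpha_{m'}(\mathbf{Z}(t))] - \mu_m(t)\,E[\alpha_{m'}(\mathbf{Z}(t))] + E[Z_{m'}(t)\,\alpha_m(\mathbf{Z}(t))] - \mu_{m'}(t)\,E[\alpha_m(\mathbf{Z}(t))], \quad m, m' \in \mathcal{M}, \tag{6}$$
where $\delta(0) = 1$ and $\delta(m) = 0$, for $m \ne 0$. Note that the time derivatives $d\mu_m(t)/dt$, $m \in \mathcal{M}$, given by (5), define the reaction rates of the reactions in $\mathcal{M}$. These derivatives are also known as (time-dependent) fluxes [24].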
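As a check on (5) and (6), consider a hypothetical single monomolecular channel A → B (an illustration chosen here, not part of the system in Fig. 1) that starts with $x_0$ molecules of A. Then $M = 1$ and, writing $\alpha(z)$ for the propensity as a function of the DA, $\alpha(z) = c\,(x_0 - z)$ is linear in $z$. Under these assumptions the moment equations close and can be solved by hand:

```latex
\begin{align*}
\frac{d\mu(t)}{dt}
  &= E\bigl[c\,(x_0 - Z(t))\bigr] = c\,\bigl(x_0 - \mu(t)\bigr)
  &&\Longrightarrow\quad \mu(t) = x_0\,\bigl(1 - e^{-ct}\bigr), \\
\frac{d\sigma(t)}{dt}
  &= E[\alpha(Z)] + 2\,\bigl(E[Z\,\alpha(Z)] - \mu\,E[\alpha(Z)]\bigr) \\
  &= c\,\bigl(x_0 - \mu(t)\bigr) - 2c\,\sigma(t)
  &&\Longrightarrow\quad \sigma(t) = x_0\,e^{-ct}\,\bigl(1 - e^{-ct}\bigr),
\end{align*}
% using E[Z alpha(Z)] - mu E[alpha(Z)] = Cov[Z, c (x_0 - Z)] = -c sigma
% and the initial conditions mu(0) = sigma(0) = 0.
```

These are the mean and variance of a binomial random variable with $x_0$ trials and success probability $1 - e^{-ct}$, as one would expect for molecules that react independently of each other. For a bimolecular channel the propensity is quadratic in the DAs, so the right-hand sides of (5) and (6) involve higher-order moments and the system no longer closes.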
By following probabilistic arguments similar to the ones that lead to (3), we can show that the PMF $P_{\mathbf{x}}(\mathbf{x}; t) = \Pr[\mathbf{X}(t) = \mathbf{x} \mid \mathbf{X}(0) = \mathbf{x}_0]$ of the observable state process $\mathbf{X} = \{\mathbf{X}(t), t \ge 0\}$ satisfies the following CME [18]:
$$\frac{\partial P_{\mathbf{x}}(\mathbf{x}; t)}{\partial t} = \sum_{m \in \mathcal{M}} \left\{ \pi_m(\mathbf{x} - \mathbf{s}_m) P_{\mathbf{x}}(\mathbf{x} - \mathbf{s}_m; t) - \pi_m(\mathbf{x}) P_{\mathbf{x}}(\mathbf{x}; t) \right\}, \tag{7}$$
for $t > 0$, with $P_{\mathbf{x}}(\mathbf{x}_0; 0) = 1$, where $\mathbf{s}_m = [s_{1m}\; s_{2m}\; \cdots\; s_{Nm}]^T$ is the $N$-dimensional vector of the stoichiometric coefficients associated with the $m$th reaction. In this case, $\mathbf{X}$ is a multivariate birth-death process [21], [23].
In most publications (except [20], [25]), only the molecular population process $\mathbf{X}$ is used to characterize the state of a stochastic biochemical system. In certain cases, however, we must also use the DA process $\mathbf{Z}$. For example, we may want to evaluate the efficiency of a transcriptional reaction by calculating the average number of mRNA molecules synthesized during a given time period or the mean waiting time between successive occurrences of mRNA synthesis events. Since we cannot, in general, evaluate these quantities analytically, we must repeatedly sample the hidden states of the biochemical system and use the resulting DA trajectories to obtain Monte Carlo estimates of these quantities. Another important use of the DA process comes from our need to elucidate the biochemical mechanisms of transcriptional regulation and investigate how these mechanisms affect cellular function. It has been noted in [24] that a promising approach to this problem is to develop a quantitative methodology that allows us to systematically study how various reactions in a transcriptional pathway determine the molecular population dynamics and fluxes. We believe that, in view of the fact that the observable system dynamics are determined by a (linear) superposition of individual hidden state dynamics (recall (2)), the DA process will play a key role in constructing such a methodology. Finally, there might be cases in which we can only specify a stochastic biochemical system by a CME over the DA process. This is true in Section 4.5, where we approximate a stochastic biochemical system that contains slow and fast reactions by one that contains only slow reactions. It turns out that the approximating system can only be specified by a CME similar to (3); see (25) below. Since we cannot determine the DA process $\mathbf{Z}$ from the molecular population process $\mathbf{X}$, we must characterize a stochastic biochemical system by using both states.
Footnote 2. Although most statistical quantities used in this paper depend on the initial conditions $\mathbf{Z}(0) = \mathbf{0}$ or $\mathbf{X}(0) = \mathbf{x}_0$, for simplicity, we do not show this dependence in our formulation.
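It is clear from the previous discussion that we can use the following HMM to characterize a stochastic biochemical system:
$$\mathbf{z}(t) \sim P_{\mathbf{z}}(\mathbf{z}; t) \quad \text{(hidden state)}, \tag{8}$$
$$\mathbf{x}(t) = \mathbf{g}(\mathbf{z}(t)) \quad \text{(observable state)}, \tag{9}$$
$$\mathbf{y}(t) \sim p_{\mathbf{y}|\mathbf{x}}(\mathbf{y}(t) \mid \mathbf{x}(t)) \quad \text{(measurements)}, \tag{10}$$
where $p_{\mathbf{y}|\mathbf{x}}(\cdot \mid \cdot)$ in (10) is the conditional probability density function of obtaining measurements $\mathbf{y}(t)$ of the observable system state $\mathbf{x}(t)$. Since, in this paper, we are not interested in modeling the measurement process, we focus our attention on (8) and (9).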
3.2 Simulation
Except in simple cases, it is not possible to analytically derive a solution to the CME (3). However, it is possible to simulate the dynamics of the HMM (8), (9) by an exact stochastic simulation algorithm, known as the Gillespie algorithm [26], and estimate relevant statistics (e.g., means, variances, and PMFs) by Monte Carlo simulation [27].
The Gillespie algorithm, applied to (3), generates a sample trajectory, $\{\mathbf{z}(t), t \ge 0\}$, by following two steps. First, given the hidden state $\mathbf{z}(t)$ of the biochemical system at time $t$, the time $t + \tau$ of the next reaction is determined by drawing a sample $\tau$ from the exponential distribution:
$$T(\tau; t) = \alpha_0(\mathbf{z}(t))\, e^{-\alpha_0(\mathbf{z}(t))\,\tau}, \quad \tau \ge 0,$$
where
$$\alpha_0(\mathbf{z}) \triangleq \sum_{m \in \mathcal{M}} \alpha_m(\mathbf{z}).$$
Then, the choice of the next reaction is determined by drawing a sample from the PMF:
$$R(m; t) = \frac{\alpha_m(\mathbf{z}(t))}{\alpha_0(\mathbf{z}(t))}, \quad m \in \mathcal{M},$$
and the DA of that reaction is increased by one.
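The two steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's Matlab code: the function name `gillespie_da`, the callback `propensities`, and the toy chain A → B → (degraded) with its rate constants are all hypothetical choices made here.

```python
import random


def gillespie_da(propensities, n_reactions, t_end, rng):
    """Sample one trajectory of the DA vector z(t) on [0, t_end].

    `propensities(z)` must return the list of propensities of all channels,
    expressed as functions of the hidden state z.
    """
    t, z = 0.0, [0] * n_reactions
    path = [(t, tuple(z))]
    while True:
        a = propensities(z)
        a0 = sum(a)
        if a0 <= 0.0:
            break                      # no channel can fire any more
        t += rng.expovariate(a0)       # step 1: exponential waiting time
        if t > t_end:
            break
        u, acc = rng.random() * a0, 0.0
        for m, a_m in enumerate(a):    # step 2: pick m with prob. a_m / a0
            acc += a_m
            if u < acc:
                break
        z[m] += 1                      # advance the DA of the chosen reaction
        path.append((t, tuple(z)))
    return path


# Hypothetical chain: reaction 1 is A -> B, reaction 2 degrades B.
x0, c1, c2 = 5, 1.0, 1.0
chain = lambda z: [c1 * (x0 - z[0]), c2 * (z[0] - z[1])]
path = gillespie_da(chain, 2, t_end=1e6, rng=random.Random(0))
print(path[-1][1])  # every A is converted and every B degraded: (5, 5)
```

Because the state is the DA vector rather than the copy numbers, each component of the trajectory is nondecreasing, which reflects the birth-process structure of the hidden state.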
The previous algorithm is referred to as the direct Gillespie algorithm, to distinguish it from another variation known as the first reaction method [28]. Unfortunately, both versions of the Gillespie algorithm are computationally intensive, especially when applied to large and highly reactive biochemical systems. Recent attempts to accelerate the Gillespie algorithm have produced a number of refinements [29], [30], [31], [32], [33], [34], [35], [36]. However, these algorithms remain computationally intensive as biochemical systems become progressively more complex.
In the following section, we discuss techniques that, under specific assumptions, can be effectively used to approximate the dynamic evolution of the hidden state $\mathbf{Z}(t)$. These techniques lead to a more efficient implementation of the Gillespie algorithm and, under additional assumptions, allow us to analytically approximate the solution of the CME (3) by a multivariate normal distribution whose means and covariances are calculated by recursively solving a system of first-order ordinary differential equations.
3.3 Example
We will be illustrating various concepts and techniques discussed in this paper by using the simple transcriptional regulatory system depicted in Fig. 1. This system consists of $N = 6$ molecular species which react in accordance with $M = 10$ reactions. We summarize these reactions in Table 1 and provide biologically relevant values for the associated specific probability rate constants, obtained from our work in [25]. The first reaction models translation of mRNA into protein M, whereas reaction 3 models transcription. Reactions 2 and 4 model the degradation of M and mRNA, respectively. Reactions 5-8 model dimer/DNA binding/unbinding. Finally, reactions 9 and 10 model dimerization of M to D. Note that reactions 1-4, 6, 8, and 10 are monomolecular, whereas reactions 5, 7, and 9 are bimolecular.
We initialize the system with two monomers and four dimers and assume two DNA templates per cell. In this case:
$$\begin{aligned}
\alpha_1(\mathbf{z}) &= c_1 (z_3 - z_4) \\
\alpha_2(\mathbf{z}) &= c_2 (2 + z_1 - z_2 - 2z_9 + 2z_{10}) \\
\alpha_3(\mathbf{z}) &= c_3 (z_5 - z_6 - z_7 + z_8) \\
\alpha_4(\mathbf{z}) &= c_4 (z_3 - z_4) \\
\alpha_5(\mathbf{z}) &= c_5 (4 - z_5 + z_6 - z_7 + z_8 + z_9 - z_{10})(2 - z_5 + z_6) \\
\alpha_6(\mathbf{z}) &= c_6 (z_5 - z_6 - z_7 + z_8) \\
\alpha_7(\mathbf{z}) &= c_7 (4 - z_5 + z_6 - z_7 + z_8 + z_9 - z_{10})(z_5 - z_6 - z_7 + z_8) \\
\alpha_8(\mathbf{z}) &= c_8 (z_7 - z_8) \\
\alpha_9(\mathbf{z}) &= c_9 (2 + z_1 - z_2 - 2z_9 + 2z_{10})(1 + z_1 - z_2 - 2z_9 + 2z_{10})/2 \\
\alpha_{10}(\mathbf{z}) &= c_{10} (4 - z_5 + z_6 - z_7 + z_8 + z_9 - z_{10}).
\end{aligned} \tag{11}$$
Moreover, (2) results in:
$$\begin{aligned}
X_1(t) &= 2 + Z_1(t) - Z_2(t) - 2Z_9(t) + 2Z_{10}(t) \\
X_2(t) &= 4 - Z_5(t) + Z_6(t) - Z_7(t) + Z_8(t) + Z_9(t) - Z_{10}(t) \\
X_3(t) &= Z_3(t) - Z_4(t) \\
X_4(t) &= 2 - Z_5(t) + Z_6(t) \\
X_5(t) &= Z_5(t) - Z_6(t) - Z_7(t) + Z_8(t) \\
X_6(t) &= Z_7(t) - Z_8(t),
\end{aligned} \tag{12}$$
where the correspondence between $X_n$ and a particular molecular species is depicted in Table 1.
In Fig. 2, we depict typical realizations of the dynamic evolutions of some hidden and some observable states of the transcriptional regulatory system depicted in Fig. 1. These realizations have been obtained by the exact simulation method of Gillespie, applied on (3), during a 35 minute period (a typical time between successive divisions of E. coli cells). We also depict the dynamic evolutions of the means and standard deviations about the means, estimated by Monte Carlo simulation of 1,000 sample trajectories. Moreover, in Fig. 3a, we depict the PMFs of the monomers, dimers, and mRNA transcripts at time $t = 10$ min, estimated by the same Monte Carlo simulation.
Initially, there are no DNA templates that are bound at both regulatory sites R1 and R2 by D, in which case, active transcription of the gene takes place. The resulting positive feedback sustains mRNA synthesis, which results in a gradual increase of monomer M and dimer D molecules. Eventually, dimers D bind at both regulatory sites, in which case, transcription is effectively repressed. The resulting negative feedback gradually represses mRNA synthesis. The number of mRNA molecules present in the cell reaches a maximum of eight molecules. Subsequently, all mRNA molecules are consumed by degradation. Overall, positive feedback gradually increases the population of dimers, which is then stabilized by negative feedback. The simulations depicted in Fig. 2 were coded in Matlab and took, on average, 60 sec of CPU time per sample trajectory on a 2 GHz Xeon PC running Windows 2000 (see footnote 3).
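The expressions (11) and (12) can be cross-checked against the general definitions (1), (2), and (4). The sketch below reads the stoichiometric coefficients off (12), lists the reactants of each channel, and verifies that composing (1) with (2) reproduces (11) at a sample hidden state. The rate constants are set to 1.0 as placeholders because the values of Table 1 are not reproduced in this text, and the species labels in the comments are inferred from the discussion rather than copied from Table 1.

```python
# Stoichiometric coefficients s_nm read off from (12): rows are species
# X1..X6, columns are reactions 1..10. Species labels are inferred.
S = [
    [1, -1, 0,  0,  0,  0,  0,  0, -2,  2],  # X1: monomer M
    [0,  0, 0,  0, -1,  1, -1,  1,  1, -1],  # X2: dimer D
    [0,  0, 1, -1,  0,  0,  0,  0,  0,  0],  # X3: mRNA
    [0,  0, 0,  0, -1,  1,  0,  0,  0,  0],  # X4: DNA with no D bound
    [0,  0, 0,  0,  1, -1, -1,  1,  0,  0],  # X5: DNA with D at R1
    [0,  0, 0,  0,  0,  0,  1, -1,  0,  0],  # X6: DNA with D at R1 and R2
]
X0 = [2, 4, 0, 2, 0, 0]  # two monomers, four dimers, two DNA templates
# Reactant species (0-based) of each channel, used to evaluate q_m in (1).
REACTANTS = [(2,), (0,), (4,), (2,), (1, 3), (4,), (1, 4), (5,), (0, 0), (1,)]
C = [1.0] * 10  # HYPOTHETICAL rate constants, not the values of Table 1


def observable_state(z):
    """Evaluate (2) for all six species."""
    return [X0[n] + sum(S[n][m] * z[m] for m in range(10)) for n in range(6)]


def q(x, reactants):
    """Evaluate (1) for one channel."""
    if len(reactants) == 1:
        return x[reactants[0]]
    i, j = reactants
    return x[i] * (x[i] - 1) // 2 if i == j else x[i] * x[j]


def propensities_via_x(z):
    """Evaluate (4): propensities as c_m * q_m(g(z))."""
    x = observable_state(z)
    return [C[m] * q(x, REACTANTS[m]) for m in range(10)]


def propensities_eq11(z):
    """Evaluate the explicit expressions (11)."""
    z1, z2, z3, z4, z5, z6, z7, z8, z9, z10 = z
    mono = 2 + z1 - z2 - 2 * z9 + 2 * z10
    dimer = 4 - z5 + z6 - z7 + z8 + z9 - z10
    dna_d = z5 - z6 - z7 + z8
    return [C[0] * (z3 - z4), C[1] * mono, C[2] * dna_d, C[3] * (z3 - z4),
            C[4] * dimer * (2 - z5 + z6), C[5] * dna_d, C[6] * dimer * dna_d,
            C[7] * (z7 - z8), C[8] * mono * (mono - 1) / 2, C[9] * dimer]


z = [3, 1, 2, 1, 2, 1, 1, 0, 1, 0]  # an arbitrary reachable hidden state
print(observable_state(z))
print(propensities_via_x(z) == propensities_eq11(z))
```

Note that the sum of rows X4, X5, and X6 of `S` is zero, which encodes the conservation of the two DNA templates.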
4 APPROXIMATIONS
We mentioned in the previous section that, in most cases, it is not possible to derive an analytical solution of the CME. Instead, we need to use the Gillespie algorithm, in conjunction with Monte Carlo simulation techniques, to stochastically simulate the CME (3) and estimate hidden and observable state statistics (e.g., means and variances). It turns out that this approach is computationally intensive. For example, the simulations depicted in Fig. 2 and Fig. 3a took about 16 hrs. of CPU time. For this reason, it is very important to approximate the CME (3) by a more tractable equation. In this section, we present a number of approximations and discuss their relative merits and limitations.
4.1 Langevin Approximation
A useful approximation to the CME (3) is obtained by assuming that there exists a time step $dt$ such that the following two conditions are satisfied:
C1. Changes in the hidden system states that occur during any time interval $[t, t + dt)$ do not appreciably affect the propensity functions $\alpha_m(\mathbf{z})$, $m \in \mathcal{M}$.
C2. The expected number of occurrences of each reaction in a time interval $[t, t + dt)$ is much larger than one.
It can be shown that, under both conditions, the dynamic evolution of the hidden state process $\mathbf{Z}$ is governed by the following system of stochastic differential equations [19]:
$$dZ_m(t) = \alpha_m(\mathbf{Z}(t))\,dt + \sqrt{\alpha_m(\mathbf{Z}(t))}\,dW_m(t), \quad m \in \mathcal{M}, \tag{13}$$
for $t > 0$, where $\{W_m, m \in \mathcal{M}\}$ are mutually independent standard Brownian motions, which are also independent of $\mathbf{Z}$. Each equation in (13) is a Langevin equation [17].
The system (13) can be numerically solved by discretizing time at equally spaced points $k\,dt$, $k = 0, 1, \ldots$, and by integrating (13) using the well-known Euler-Maruyama method [37]. Because of condition C1, this leads to the following iterations:
$$Z_m((k+1)dt) = Z_m(k\,dt) + \alpha_m(\mathbf{Z}(k\,dt))\,dt + \sqrt{\alpha_m(\mathbf{Z}(k\,dt))\,dt}\; N_m, \quad k = 0, 1, \ldots;\; m \in \mathcal{M}, \tag{14}$$
initialized by $Z_m(0) = 0$, for every $m \in \mathcal{M}$, where $\{N_m, m \in \mathcal{M}\}$ are mutually independent zero mean Gaussian random variables with unit variance, which are also independent of $\mathbf{Z}$. We refer to the resulting approximation technique as the Langevin approximation (LA) method. A similar approximation, applied on the observable states $\mathbf{X}(t)$, has been extensively used for modeling biological fluctuations in single cells (e.g., see [8], [38], [39], [40], [41], [42]).
TABLE 1. Reactions Associated with the Transcriptional Regulatory System Depicted in Fig. 1
Footnote 3. The reader can find our code as supplementary material, which can be accessed on the Computer Society Digital Library at http://computer.org/tcbb/archives.htm. The cited CPU times are not the best possible since our code is not optimized. However, they provide a clear distinction between the computational requirements of the techniques discussed in this paper.
In many cases, we may not be able to simultaneously satisfy the previous conditions. Referring to the transcriptional regulatory system depicted in Fig. 1, we may pick a sufficiently small value for $dt$ so that the propensity functions do not appreciably change during any time interval $[t, t + dt)$, thus satisfying condition C1. However, transcription and translation are slow reactions, which means that they will occur infrequently during the time interval $[t, t + dt)$, as compared to other reactions. In this case, condition C2 will not be satisfied for the chosen value of $dt$ and the LA method may fail to provide a satisfactory approximation.
We illustrate this problem in Fig. 4, where we depict typical realizations of the dynamic evolutions of some hidden and some observable states of the transcriptional regulatory system depicted in Fig. 1, obtained by (14) and (2), with $dt = 0.05$ s. We also depict the dynamic evolutions of their means and standard deviations about the means, estimated by Monte Carlo simulation of 1,000 sample trajectories. Moreover, in Fig. 3b, we depict the PMFs of the monomers, dimers, and mRNA transcripts at time $t = 10$ min, estimated by the same Monte Carlo simulation.
The sample trajectories of the numbers of reactions and mRNA transcripts depicted in Fig. 4 are not satisfactory since they do not follow the integer-valued, step-like behavior of the actual trajectories (compare with Fig. 2). This is due to the fact that these reactions occur infrequently, in which case, we cannot simultaneously satisfy conditions C1 and C2. However, the LA method results in very good Monte Carlo estimates for the means, standard deviations, and PMFs. This is due to the fact that, in the limit as $dt \to 0$, (14) implies that the hidden state means and covariances of the approximating system satisfy the same system of differential equations as the original system, given by (5), (6). Therefore, the LA method always provides an exact match of the first and second-order statistics of $\mathbf{Z}$ for a sufficiently small time step $dt$.
On the average, it took about 10 sec of CPU time to sample the system states, which is six times faster than the CPU time required by the exact method. Note that the computational savings obtained by using the LA method are moderate. This is due to the fact that, to satisfy condition C1, we need to choose a rather small time step, $dt$, which leads to a large number of iterations (42,000 iterations).
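The Euler-Maruyama iteration (14) can be sketched as follows. The code is an illustration under stated assumptions rather than the implementation used for Fig. 4: the function name `langevin_path`, the clipping of negative propensities, and the single hypothetical channel A → B with 100,000 initial molecules (chosen so that conditions C1 and C2 both hold for the time step used) are all choices made here.

```python
import math
import random


def langevin_path(propensities, n_reactions, t_end, dt, rng):
    """Iterate (14) from Z(0) = 0 and return the DA vector at time t_end."""
    z = [0.0] * n_reactions
    for _ in range(round(t_end / dt)):
        # Clipping at zero is a practical safeguard added here, not part of
        # (14): Gaussian increments can push a propensity slightly negative.
        a = [max(a_m, 0.0) for a_m in propensities(z)]
        z = [z_m + a_m * dt + math.sqrt(a_m * dt) * rng.gauss(0.0, 1.0)
             for z_m, a_m in zip(z, a)]
    return z


# Hypothetical channel A -> B with x0 molecules of A and c = 1 per unit time.
# With dt = 1e-3 the propensity changes by about 0.1% per step (C1) and
# between roughly 37 and 100 reactions are expected per step (C2).
x0, c = 100_000, 1.0
z_end = langevin_path(lambda z: [c * (x0 - z[0])], 1,
                      t_end=1.0, dt=1e-3, rng=random.Random(0))
print(z_end[0], x0 * (1.0 - math.exp(-c)))  # sample vs. exact mean
```

The returned DA is a real number rather than an integer, which is the behavior criticized above for infrequent reactions; the sample nevertheless stays close to the exact mean $x_0(1 - e^{-c})$ because both conditions hold in this toy setting.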
4.2 Linear Noise Approximation
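Unfortunately, the LA method does not allow us to obtain an expression for the joint probability density function (PDF) $p_{\mathbf{Z}}(\mathbf{z}; t)$ of the hidden states. However, by using additional approximations, we can characterize the hidden states by a multivariate Gaussian PDF $\tilde{p}_{\mathbf{Z}}(\mathbf{z}; t)$, given by
$$\tilde{p}_{\mathbf{Z}}(\mathbf{z}; t) = \frac{1}{(2\pi)^{M/2}\, |V\mathbf{Q}(t)|^{1/2}} \exp\left\{ -\frac{1}{2V} [\mathbf{z} - V\boldsymbol{\zeta}(t)]^T \mathbf{Q}^{-1}(t)\, [\mathbf{z} - V\boldsymbol{\zeta}(t)] \right\}, \tag{15}$$
for $t > 0$, with mean vector $V\boldsymbol{\zeta}(t)$ and covariance matrix $V\mathbf{Q}(t)$, where $V$ is the cellular volume and $|\mathbf{Q}|$ denotes the determinant of matrix $\mathbf{Q}$. For completeness, we outline the mathematical steps that lead to (15) in the Appendix.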
Fig. 2. Typical dynamic evolutions of some hidden and some observable states (gray solid lines) of the transcriptional regulatory system depicted in Fig. 1, their estimated means (black solid lines), and standard deviations about the means (dotted lines), obtained by Monte Carlo simulation of the CME (3) based on the exact simulation method.
Fig. 3. PMF estimates of monomer, dimer, and mRNA transcript distributions in the transcriptional regulatory system depicted in Fig. 1 at time $t = 10$ min obtained by: (a) exact simulation, (b) Langevin approximation, (c) Poisson approximation, (d) mean-field approximation, and (e) quasi-equilibrium approximation.
The vector $\boldsymbol{\zeta}(t)$ satisfies the following system of first-order ordinary differential equations:
$$\frac{d\zeta_m(t)}{dt} = \tilde{\alpha}_m(\boldsymbol{\zeta}(t)), \quad t > 0,\; m \in \mathcal{M}, \tag{16}$$
where
$$\tilde{\alpha}_m(\mathbf{z}) \triangleq \frac{1}{V}\,\alpha_m(V\mathbf{z}). \tag{17}$$
Moreover, $\mathbf{Q}(t)$ satisfies the matrix Riccati differential equation
$$\frac{d\mathbf{Q}(t)}{dt} = \mathbf{B}(t)\mathbf{Q}(t) + \mathbf{Q}(t)\mathbf{B}^T(t) + \mathbf{C}(t), \quad t > 0, \tag{18}$$
where $\mathbf{B}(t)$ is an $M \times M$ matrix with elements $b_{mm'}(t)$ and $\mathbf{C}(t)$ is an $M \times M$ diagonal matrix with elements $c_m(t)$, given by
$$b_{mm'}(t) = \frac{\partial \tilde{\alpha}_m(\boldsymbol{\zeta}(t))}{\partial z_{m'}} \quad \text{and} \quad c_m(t) = \tilde{\alpha}_m(\boldsymbol{\zeta}(t)). \tag{19}$$
The system (16) and (18) can be solved numerically (e.g., by the standard Euler method) to provide an approximation to the dynamic evolutions of the DA means and covariances. This approach is substantially faster than Monte Carlo simulations and can be used to provide a rapid assessment of the statistical behavior of the HMM (8), (9).
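A forward-Euler integration of (16) and (18) can be sketched as follows. Several things here are assumptions made for illustration: the function names, the zero initial conditions (taken to follow from the hidden state starting at zero with certainty), the user-supplied Jacobian, and the hypothetical linear chain A → B → (degraded) with unit volume, for which the first reaction's mean and variance are known in closed form.

```python
import math


def lna_euler(a_tilde, jacobian, n_reactions, t_end, dt):
    """Forward-Euler integration of (16) and (18).

    `a_tilde(zeta)` returns the scaled propensities (17) and
    `jacobian(zeta)` the matrix B of their partial derivatives (19).
    Returns the scaled mean vector zeta and the matrix Q at time t_end.
    """
    M = n_reactions
    zeta = [0.0] * M                      # assumed: mean starts at zero
    Q = [[0.0] * M for _ in range(M)]     # assumed: no initial uncertainty
    for _ in range(round(t_end / dt)):
        a = a_tilde(zeta)
        B = jacobian(zeta)
        # Right-hand side of (18): B Q + Q B^T + C, with C = diag(a).
        dQ = [[sum(B[i][k] * Q[k][j] + Q[i][k] * B[j][k] for k in range(M))
               + (a[i] if i == j else 0.0)
               for j in range(M)] for i in range(M)]
        zeta = [zeta[m] + dt * a[m] for m in range(M)]
        Q = [[Q[i][j] + dt * dQ[i][j] for j in range(M)] for i in range(M)]
    return zeta, Q


# Hypothetical chain: reaction 1 is A -> B, reaction 2 degrades B.
V, x0, c1, c2 = 1.0, 100.0, 1.0, 0.5
a_tilde = lambda zt: [c1 * (x0 / V - zt[0]), c2 * (zt[0] - zt[1])]
jacobian = lambda zt: [[-c1, 0.0], [c2, -c2]]
zeta, Q = lna_euler(a_tilde, jacobian, 2, t_end=1.0, dt=1e-3)
mean_z1, var_z1 = V * zeta[0], V * Q[0][0]
p = 1.0 - math.exp(-c1)
print(mean_z1, x0 * p)            # approximate vs. exact mean of Z1
print(var_z1, x0 * p * (1 - p))   # approximate vs. exact variance of Z1
```

For this toy system all propensities are linear, so the linearization behind the method introduces no error and the only discrepancy is the Euler discretization; with bimolecular channels the result would be an approximation.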
For reasons we explain in the Appendix, we refer to the resulting technique as the linear noise approximation (LNA) method. A detailed description of how this method can be used in certain biochemical systems can be found in [43], [44]. The LNA method conveniently characterizes the HMM (8), (9) by the multivariate Gaussian PDF (15), which is determined by the system of first-order differential equations (16) and (18). In this case, and from (2), $X_n(t)$ is a linear combination of Gaussian DAs. Therefore, the observable state $\mathbf{X}(t)$ will follow a multivariate Gaussian distribution as well.
The LNA method is based on (A.4), which is obtained by linearizing the propensity function $\alpha_m(\mathbf{Z}(t))$ about the mean vector $\boldsymbol{\mu}(t) = E[\mathbf{Z}(t)]$. However, when the derivatives $\partial^2 \alpha_m / \partial z_m \partial z_{m'}$ and the covariances $\sigma_{mm'}$ are not negligible, then (A.4) may not hold. Another problem is that the LNA method is obtained from a Langevin approximation in the limit as the system volume $V$ tends to infinity. However, $V$ is a biological parameter that cannot be artificially increased to improve the accuracy of the LNA method. Finally, since the LNA method is obtained from the LA method by additional approximations, it suffers from similar drawbacks. For these reasons, we need more accurate and versatile approximation techniques. In view of the approximation techniques discussed in Sections 4.3 and 4.4 below, we believe that there is no advantage in using the LNA method.
4.3 Poisson Approximation
A better approximation of the HMM (8), (9) may be obtained by employing a time step $dt$ that satisfies condition C1, but may not necessarily satisfy condition C2. Since reactions that occur during the time interval $[k\,dt, (k+1)\,dt)$ will not appreciably change the values of the propensity functions, given the DA values at time $k\,dt$, these reactions will occur independently of each other. Moreover, the number of occurrences of the $m$th reaction during $[k\,dt, (k+1)\,dt)$ will be a Poisson random variable with parameter $\pi_m(z(k\,dt))\,dt$ [19]. In this case, (14) becomes

$$Z_m((k+1)\,dt) = Z_m(k\,dt) + R_m\big(\pi_m(Z(k\,dt))\,dt\big), \quad k = 0, 1, \ldots, \; m \in \mathcal{M}, \tag{20}$$

initialized by $Z_m(0) = 0$, for every $m \in \mathcal{M}$. Given the DA values at time $k\,dt$, $R_m(\pi_m(z(k\,dt))\,dt)$, $m \in \mathcal{M}$, are mutually independent Poisson random variables with parameters $\pi_m(z(k\,dt))\,dt$, $m \in \mathcal{M}$, respectively. We refer to the resulting approximation as the Poisson approximation (PA) method.
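The recursion (20) can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used for the reported results: the birth-death system, the rate constants `C_BIRTH` and `C_DEATH`, and the function names are hypothetical, and the clipping of negative propensities is an added safeguard that (20) itself does not include.

```python
import numpy as np

# Hypothetical birth-death system (not the system of Fig. 1):
#   reaction 1: 0 -> A with propensity C_BIRTH
#   reaction 2: A -> 0 with propensity C_DEATH * x_A
C_BIRTH, C_DEATH = 10.0, 0.5
S = np.array([[1, -1]])      # stoichiometry s_nm: one species, two reactions
X0 = np.array([0])           # initial population x_{0,n}


def propensities(x):
    """Propensities pi_m evaluated at the populations x."""
    return np.array([C_BIRTH, C_DEATH * x[0]])


def poisson_approximation(propensities, S, x0, dt, n_steps, rng):
    """Iterate update (20) and map the DAs to populations via x = x0 + S z."""
    n_reactions = S.shape[1]
    Z = np.zeros((n_steps + 1, n_reactions), dtype=np.int64)  # Z_m(0) = 0
    for k in range(n_steps):
        x = x0 + S @ Z[k]
        # Added safeguard: a leap may overshoot and make a population negative.
        rates = np.maximum(propensities(x), 0.0)
        # Independent Poisson counts with parameters pi_m * dt.
        Z[k + 1] = Z[k] + rng.poisson(rates * dt)
    X = x0 + Z @ S.T
    return Z, X


rng = np.random.default_rng(0)
Z, X = poisson_approximation(propensities, S, X0, dt=0.05, n_steps=200, rng=rng)
print(Z[-1], X[-1])
```

The DAs are nondecreasing counters, so only the observable populations, obtained through the linear map of (2), can move in both directions.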
The PA method has recently been used to develop computational improvements of the stochastic simulation algorithm of Gillespie [30], [31], [32], [34], [35], [36]. Note that, in the limit as $dt \to 0$, (20) implies that the hidden state means and covariances of the approximating system satisfy the same differential equations as the original system, given by (5), (6). Therefore, the PA method always provides an exact match of the first- and second-order statistics of $Z$ for a sufficiently small time step $dt$. But, most importantly, it may result in a better approximation than the LA method.

In Fig. 5, we depict typical realizations of the dynamic evolutions of some hidden and some observable states of the transcriptional regulatory system depicted in Fig. 1, obtained by (20) and (2), with $dt = 0.05$ s. We also depict the dynamic evolutions of the means and standard deviations about the means, estimated by Monte Carlo simulation of
64 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 3, NO. 1, JANUARY-MARCH 2006
Fig. 4. Typical dynamic evolutions of some hidden and some observable states (gray solid lines) of the transcriptional regulatory system depicted in
Fig. 1, their estimated means (black solid lines), and standard deviations about the means (dotted lines), obtained by Monte Carlo simulation based
on the Langevin approximation method.
1,000 sample trajectories. Moreover, in Fig. 3c, we depict the PMFs of the monomers, dimers, and mRNA transcripts at time $t = 10$ min, estimated by the same Monte Carlo simulation. Similarly to the LA method, it took about 12 sec of CPU time on average to sample the system states.

The PA method produces very good approximations of the dynamic evolutions of the hidden and observable states, accurately preserving the discrete nature of these states.
Moreover, the method results in very good Monte Carlo
estimates for the means, standard deviations, and PMFs, as
expected. Therefore, we believe that this method should be
preferred over the LA method.
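The Monte Carlo estimates mentioned above (means, standard deviations, and PMFs) can be computed from an array of sampled trajectories as sketched below. The function name and the synthetic Poisson data are illustrative assumptions; in practice the array would hold trajectories produced by a simulation method such as the PA method.

```python
import numpy as np


def monte_carlo_statistics(samples):
    """Estimate mean, standard deviation, and final-time PMF of one state.

    samples: nonnegative integer array of shape (n_trajectories, n_times).
    """
    mean = samples.mean(axis=0)
    std = samples.std(axis=0, ddof=1)        # sample standard deviation
    counts = np.bincount(samples[:, -1])     # histogram over 0, 1, 2, ...
    pmf = counts / samples.shape[0]          # relative frequencies
    return mean, std, pmf


# Synthetic stand-in for 1,000 simulated trajectories at 3 time points.
rng = np.random.default_rng(1)
samples = rng.poisson(lam=5.0, size=(1000, 3))
mean, std, pmf = monte_carlo_statistics(samples)
print(mean, std, pmf.sum())
```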
4.4 Mean-Field Approximation
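Similarly to the LA method, the PA method does not allow us to derive an expression for the joint PMF $P_Z(z;t)$ of the hidden states. However, we show in the Appendix that we can approximately characterize the hidden states by a PMF $\tilde{P}_Z(z;t)$, given by

$$\tilde{P}_Z(z;t) = \frac{1}{\zeta(t)} \exp\left\{ -\frac{1}{2} [z - \tilde{\mu}(t)]^T \tilde{\mathbb{R}}^{-1}(t) [z - \tilde{\mu}(t)] \right\}, \tag{21}$$

for $t > 0$, where

$$\zeta(t) \triangleq \sum_{z > 0} \exp\left\{ -\frac{1}{2} [z - \tilde{\mu}(t)]^T \tilde{\mathbb{R}}^{-1}(t) [z - \tilde{\mu}(t)] \right\}. \tag{22}$$

In (21), (22), the elements of the mean vector $\tilde{\mu}(t)$ and covariance matrix $\tilde{\mathbb{R}}(t)$ satisfy the following first-order differential equations:

$$\frac{d\tilde{\mu}_m(t)}{dt} = \pi_m(\tilde{\mu}(t)) + \frac{1}{2} \sum_{k,l=1}^{M} d_{2,mkl}\, \tilde{\rho}_{kl}(t), \quad m \in \mathcal{M}, \tag{23}$$

$$\frac{d\tilde{\rho}_{mm'}(t)}{dt} = \left[ \pi_m(\tilde{\mu}(t)) + \frac{1}{2} \sum_{k,l=1}^{M} d_{2,mkl}\, \tilde{\rho}_{kl}(t) \right] \delta(m - m') + \sum_{k=1}^{M} \left[ d_{1,mk}(t)\, \tilde{\rho}_{m'k}(t) + d_{1,m'k}(t)\, \tilde{\rho}_{mk}(t) \right], \quad m, m' \in \mathcal{M}, \tag{24}$$

where

$$d_{1,mk}(t) \triangleq \frac{\partial \pi_m(\tilde{\mu}(t))}{\partial z_k} \quad \text{and} \quad d_{2,mkl} \triangleq \frac{\partial^2 \pi_m(\tilde{\mu}(t))}{\partial z_k \partial z_l}.$$

Note that $d_{2,mkl}$ does not depend on $t$. Moreover, $\tilde{P}_Z(z;t)$ is a normal Gibbs distribution at temperature $2/k_B$, with energy function $[z - \tilde{\mu}(t)]^T \tilde{\mathbb{R}}^{-1}(t) [z - \tilde{\mu}(t)]$ and partition function $\zeta(t)$, where $k_B$ is Boltzmann's constant.$^4$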
For reasons we explain in the Appendix, we refer to the
resulting technique as the mean-field approximation (MFA)
method. This method conveniently characterizes the stochas-
tic biochemical system by the dynamic evolution of the
normal Gibbs distribution (21), (22), which is determined by
the system of coupled first-order differential equations (23),
(24). From (2), we may approximate the PMF $P_X(x;t)$ by a
Fig. 5. Typical dynamic evolutions of some hidden and some observable states (gray solid lines) of the transcriptional regulatory system depicted in
Fig. 1, their estimated means (black solid lines), and standard deviations about the means (dotted lines), obtained by Monte Carlo simulation based
on the Poisson approximation method.
4. The Gibbs distribution (21), (22) may be approximated by a sampled Gaussian distribution, in which case $\zeta(t) = (2\pi)^{M/2} |\tilde{\mathbb{R}}(t)|^{1/2}$. However, the accuracy of this approximation depends on the values of $\tilde{\mu}$ and $\tilde{\mathbb{R}}$ and may not always be acceptable. For example, in the univariate case, when the mean is $\tilde{\mu} = 100$ and the standard deviation is $\tilde{\sigma} = 40$, the values of the sampled Gaussian distribution are 0.6 percent smaller than the values of the Gibbs distribution, but, when $\tilde{\mu} = \tilde{\sigma} = 30$, this error increases to 15.5 percent.
normal Gibbs distribution $\tilde{P}_X(x;t)$ with means and covariances given by

$$\tilde{E}[X_n(t)] = x_{0,n} + \sum_{m \in \mathcal{M}} s_{nm}\, \tilde{\mu}_m(t),$$

$$\widetilde{\mathrm{Cov}}[X_n(t), X_{n'}(t)] = \sum_{m, m' \in \mathcal{M}} s_{nm}\, s_{n'm'}\, \tilde{\rho}_{mm'}(t),$$

for $n, n' \in \mathcal{N}$.

In Fig. 6, we depict dynamic evolutions of the means and standard deviations of some hidden and some observable states of the transcriptional regulatory system depicted in Fig. 1, approximated by the MFA method. The means $\tilde{\mu}_m(t)$, $m \in \mathcal{M}$, and covariances $\tilde{\rho}_{mm'}(t)$, $m, m' \in \mathcal{M}$, are calculated by using Euler's method, with $dt = 0.05$ s, to recursively solve (23), (24). These quantities are superimposed on the state realizations depicted in Fig. 5. In Fig. 3d, we depict the marginal normal Gibbs approximations $\tilde{P}_{X_n}(x;t)$ of the PMFs of the monomers, dimers, and mRNA transcripts at time $t = 10$ min, where

$$\tilde{P}_{X_n}(x;t) = \frac{1}{\zeta_n(t)} \exp\left\{ -\frac{\big[x - \tilde{E}[X_n(t)]\big]^2}{2\, \widetilde{\mathrm{Var}}[X_n(t)]} \right\}, \quad t > 0,$$

with

$$\zeta_n(t) \triangleq \sum_{x \ge 0} \exp\left\{ -\frac{\big[x - \tilde{E}[X_n(t)]\big]^2}{2\, \widetilde{\mathrm{Var}}[X_n(t)]} \right\}.$$

It took about 16 seconds of CPU time to obtain the dynamic evolution of the means and standard deviations depicted in Fig. 6, which is about 750 times faster than the Monte Carlo approach based on the PA method.
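The Euler recursion for (23), (24) can be sketched as follows. This is an illustration under stated assumptions rather than the code behind Fig. 6: it uses a hypothetical two-reaction system (2A -> B with propensity (C1/2) x_A (x_A - 1) and B -> 2A with propensity C2 x_B), with made-up constants `A0`, `B0`, `C1`, `C2`, so that the gradient and Hessian arrays of the propensities can be written out by hand.

```python
import numpy as np

# Hypothetical reversible dimerization (not the system of Fig. 1):
#   reaction 1: 2A -> B, propensity (C1 / 2) * x_A * (x_A - 1)
#   reaction 2: B -> 2A, propensity C2 * x_B
# with x_A = A0 - 2 z_1 + 2 z_2 and x_B = B0 + z_1 - z_2.
A0, B0 = 20, 0
C1, C2 = 0.05, 0.5
GRAD_XA = np.array([-2.0, 2.0])   # d x_A / d z
GRAD_XB = np.array([1.0, -1.0])   # d x_B / d z

# Second derivatives d_{2,mkl}; constant because propensities are quadratic.
D2 = np.zeros((2, 2, 2))
D2[0] = C1 * np.outer(GRAD_XA, GRAD_XA)


def propensities_and_gradients(mu):
    """Return pi_m(mu) and the first derivatives d_{1,mk} evaluated at mu."""
    x_a = A0 + GRAD_XA @ mu
    x_b = B0 + GRAD_XB @ mu
    pi = np.array([0.5 * C1 * x_a * (x_a - 1.0), C2 * x_b])
    d1 = np.vstack([0.5 * C1 * (2.0 * x_a - 1.0) * GRAD_XA, C2 * GRAD_XB])
    return pi, d1


def mfa_euler(t_end, dt):
    """Integrate the mean equations (23) and covariance equations (24)."""
    mu = np.zeros(2)           # DAs start at zero
    rho = np.zeros((2, 2))     # with zero covariance
    for _ in range(int(round(t_end / dt))):
        pi, d1 = propensities_and_gradients(mu)
        drift = pi + 0.5 * np.einsum("mkl,kl->m", D2, rho)   # right side of (23)
        d_rho = np.diag(drift) + d1 @ rho.T + rho @ d1.T     # right side of (24)
        mu = mu + dt * drift
        rho = rho + dt * d_rho
    return mu, rho


mu, rho = mfa_euler(t_end=10.0, dt=0.01)
mean_a = A0 + GRAD_XA @ mu             # approximate mean of x_A
var_a = GRAD_XA @ rho @ GRAD_XA        # approximate variance of x_A
print(mean_a, var_a)
```

The last two lines apply the linear map of (2) to obtain the mean and variance of an observable state from the DA statistics, mirroring the expressions for the observable means and covariances given above.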
By comparing the results depicted in Fig. 5 and Fig. 6 and, more specifically, the evolution of the standard deviation associated with the reaction DNA·D + D → DNA·2D, we see that we may need to increase the accuracy of the MFA method in certain cases. As we explain in the Appendix, this may be accomplished by including higher-order (≥ 3) moments in the differential equation (24). Such an inclusion will, however, result in increasing the complexity of the method.

Although the MFA method may produce results that are not as accurate as the ones obtained by Monte Carlo estimation, this method is very attractive since, similarly to the LNA method, it may be used to provide a rapid assessment of the statistical behavior of a biochemical system. Moreover, this method is superior to the LNA method for three main reasons: 1) It is based on the more accurate Poisson approximation, 2) its approximation accuracy does not depend on the cellular volume, and 3) it does not require linearization of the underlying propensity functions.
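The normalization issue raised in footnote 4 can be checked numerically in the univariate case. The sketch below compares the partition function, summed over the nonnegative integers, with the Gaussian normalizing constant; the function name and the truncation point `x_max` are illustrative choices, and the second argument is taken to be a standard deviation.

```python
import numpy as np


def sampled_gaussian_shortfall(mean, std, x_max=10_000):
    """Percent by which sampled Gaussian values fall below normal Gibbs values."""
    x = np.arange(0, x_max + 1)
    weights = np.exp(-0.5 * ((x - mean) / std) ** 2)
    zeta = weights.sum()                          # partition function over x >= 0
    gaussian_norm = np.sqrt(2.0 * np.pi) * std    # (2 pi)^(1/2) * std
    # Gaussian value / Gibbs value = zeta / gaussian_norm at every x.
    return 100.0 * (1.0 - zeta / gaussian_norm)


print(round(sampled_gaussian_shortfall(100.0, 40.0), 1))   # 0.6
print(round(sampled_gaussian_shortfall(30.0, 30.0), 1))    # 15.5
```

The gap grows as more of the Gaussian mass falls below zero, which is why the discrepancy is larger when the mean is comparable to the standard deviation.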
4.5 Stochastic Quasi-Equilibrium Approximation
Most often, reactions occur on vastly different time scales. For example, the transcription and translation reactions depicted in Fig. 1 are typically slow reactions, whereas dimerization is a fast reaction. This means that transcription and translation may occur infrequently, whereas dimerization may occur numerous times within successive occurrences of slow reactions. In such cases, the Gillespie algorithm will be spending the most time simulating fast reaction events. It may, however, be less important to know the activity of fast reactions in detail since the system's dynamic evolution may be mostly determined by the activity of the slow reactions. This is illustrated by the simulation results depicted in Fig. 2, which show that it may not be important to know the detailed dynamic
Fig. 6. Typical dynamic evolutions of some hidden and some observable states (gray solid lines) of the transcriptional regulatory system depicted in Fig. 1, their estimated means (black solid lines), and standard deviations about the means (dotted lines). The state evolutions have been obtained by the Poisson approximation method, whereas the evolutions of the means and standard deviations have been obtained by the mean-field approximation method.
evolution of the monomer and dimer states since the large fluctuations underlying these states do not seem to affect the remaining states. Therefore, we may be able to approximate the CME (3) by one that involves only slow reactions.

This idea has recently been explored by several investigators, who have proposed a number of techniques for eliminating fast reaction kinetics [20], [25], [45], [46]. The techniques proposed in [20], [25] are based on the CME (3), whereas the techniques proposed in [45], [46] are based on the CME (7). Since our interest here focuses on the CME (3), we briefly discuss the quasi-equilibrium approximation technique proposed by us in [25].
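In the following, we assume that the first $M_0$ reactions of a biochemical system are slow, whereas the remaining $M - M_0$ reactions are fast. We set

$$Z(t) = \begin{bmatrix} Z_s(t) \\ Z_f(t) \end{bmatrix}, \quad z = \begin{bmatrix} z_s \\ z_f \end{bmatrix}, \quad \mathbf{e}_m = \begin{bmatrix} e_m \\ 0 \end{bmatrix}, \; m \in \mathcal{M}_s, \quad \text{and} \quad \mathbf{e}_m = \begin{bmatrix} 0 \\ e_m \end{bmatrix}, \; m \in \mathcal{M}_f,$$

where $\mathcal{M}_s \triangleq \{1, 2, \ldots, M_0\}$ and $\mathcal{M}_f \triangleq \{M_0 + 1, M_0 + 2, \ldots, M\}$, with $Z_s(t)$, $z_s$, and $e_m$, $m \in \mathcal{M}_s$, being $M_0$-dimensional vectors, and $Z_f(t)$, $z_f$, and $e_m$, $m \in \mathcal{M}_f$, being $(M - M_0)$-dimensional vectors. From (3), we can show that the marginal PMF $P_Z(z_s;t)$ of the slow reactions satisfies the following CME [25]:

$$\frac{\partial P_Z(z_s;t)}{\partial t} = \sum_{m \in \mathcal{M}_s} \left\{ \pi_m^t(z_s - e_m)\, P_Z(z_s - e_m; t) - \pi_m^t(z_s)\, P_Z(z_s;t) \right\}, \tag{25}$$

where

$$\pi_m^t(z_s) \triangleq \sum_{z_f} \pi_m(z_s, z_f)\, P(z_f \mid z_s; t), \quad m \in \mathcal{M}_s, \tag{26}$$

with $P(z_f \mid z_s; t)$ being the conditional probability of the fast DAs at time $t$, given the state of the slow reactions at $t$.

If we focus our interest on stochastic biochemical systems for which the slow propensity functions depend linearly on fast DAs (which is the case for the transcriptional regulatory system depicted in Fig. 1), we can show that [25]

$$\pi_m^t(z_s) = \pi_m(z_s, \mu(z_s;t)), \quad m \in \mathcal{M}_s, \tag{27}$$

where $\mu(z_s;t) \triangleq [\mu_{M_0+1}(z_s;t) \;\; \mu_{M_0+2}(z_s;t) \;\; \cdots \;\; \mu_M(z_s;t)]^T$, with $\mu_m(z_s;t) \triangleq E[Z_m(t) \mid Z_s(t) = z_s]$, $m \in \mathcal{M}_f$, being the mean DA of the $m$th fast reaction at time $t$, given the state $z_s$ of the slow reactions at $t$. Equations (25), (26) show that we can approximate the biochemical system by one that is comprised of only slow reactions, provided that we can calculate the conditional mean DAs of the fast reactions, given the hidden states of the slow reactions. In this case, the fast reactions will exert their influence on the slow reactions by means of their conditional mean DAs through the propensity functions of the slow reactions.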
Given that $Z_s(t) = z_s$, the optimal mean-square estimate $\hat{x}_n(t)$ of the observable system state $X_n(t)$ of the original biochemical system will be given by (recall (2)):

$$\hat{x}_n(t) = E[X_n(t) \mid Z_s(t) = z_s] = x_{0,n} + \sum_{m \in \mathcal{M}_s} s_{nm}\, z_m(t) + \sum_{m \in \mathcal{M}_f} s_{nm}\, \mu_m(z_s;t), \tag{28}$$

for $n \in \mathcal{N}$. Therefore, the original biochemical system can be approximated by one whose hidden state is governed by (25), whereas its observable state is given by (28). This leads to the following approximation of the state-space model (8), (9):

$$z_s(t) \sim P_Z(z_s;t) \quad \text{(hidden state)}$$

$$\hat{x}(t) = \hat{g}(z_s(t)) \quad \text{(observable state)},$$

where

$$\hat{g}_n(z_s(t)) = x_{0,n} + \sum_{m \in \mathcal{M}_s} s_{nm}\, z_m(t) + \sum_{m \in \mathcal{M}_f} s_{nm}\, \mu_m(z_s;t).$$

From (25) and (26), we can show that the means and covariances of the slow hidden states satisfy the same differential equations as the original system, given by (5), (6). Note also that, if the $n$th observable state is not affected by a fast reaction, then (28) implies that $\hat{X}_n(t) = X_n(t)$. For all other states, $E[\hat{X}_n(t)] = E[X_n(t)]$, for $t \ge 0$, but $\mathrm{Cov}[\hat{X}_n(t), \hat{X}_{n'}(t)] \ne \mathrm{Cov}[X_n(t), X_{n'}(t)]$. Therefore, the observable states of the approximating and original systems that are not affected by any fast reaction are identical. The mean values of the remaining states are also identical, but their covariances may be different.

To calculate the conditional mean DAs of the fast reactions required by (25), we have proposed in [25] a quasi-equilibrium approach, based on a principle of statistical microscopic reversibility, according to which the fast reactions are assumed to rapidly reach a state of equilibrium between consecutive occurrences of slow reactions. We illustrate this approach by using the example depicted in Fig. 1.

Since dimerization is a fast reaction, we set $z_s = [z_1 \; z_2 \; z_3 \; z_4 \; z_5 \; z_6 \; z_7 \; z_8]^T$ and $z_f = [z_9 \; z_{10}]^T$. In this case, (11) and (27) result in

$$\begin{aligned}
\pi_1^t(z_s) &= c_1 (z_3 - z_4) \\
\pi_2^t(z_s) &= c_2 \left[ 2 + z_1 - z_2 - 2\mu_9(z_s;t) + 2\mu_{10}(z_s;t) \right] \\
\pi_3^t(z_s) &= c_3 (z_5 - z_6 - z_7 + z_8) \\
\pi_4^t(z_s) &= c_4 (z_3 - z_4) \\
\pi_5^t(z_s) &= c_5 \left[ 4 - z_5 + z_6 - z_7 + z_8 + \mu_9(z_s;t) - \mu_{10}(z_s;t) \right] (2 - z_5 + z_6) \\
\pi_6^t(z_s) &= c_6 (z_5 - z_6 - z_7 + z_8) \\
\pi_7^t(z_s) &= c_7 \left[ 4 - z_5 + z_6 - z_7 + z_8 + \mu_9(z_s;t) - \mu_{10}(z_s;t) \right] (z_5 - z_6 - z_7 + z_8) \\
\pi_8^t(z_s) &= c_8 (z_7 - z_8).
\end{aligned} \tag{29}$$

Moreover,

$$\hat{X}_1(t) = 2 + Z_1(t) - Z_2(t) - 2\mu_9(Z_s;t) + 2\mu_{10}(Z_s;t) \tag{30}$$

$$\hat{X}_2(t) = 4 - Z_5(t) + Z_6(t) - Z_7(t) + Z_8(t) + \mu_9(Z_s;t) - \mu_{10}(Z_s;t), \tag{31}$$
with $\hat{X}_n(t) = X_n(t)$, for $n = 3, 4, 5, 6$, where $X_n(t)$ is given by (12). Since the 10th reaction is the reverse of the 9th reaction, it is expected that, between successive occurrences of slow reactions, dimerization will rapidly reach equilibrium. At equilibrium, given the state $z_s$ of the slow reactions, the probability that a reaction 9 will occur within the time interval $[t, t + dt)$ will approximately equal the probability that a reaction 10 will occur within the next time interval $[t + dt, t + 2dt)$. Otherwise, dimerization may eventually result in an unreasonably large number of dimers (if the first probability is larger than the second) or no dimers at all (if the second probability is larger than the first), conditions that will be harmful to the cell. This implies that $\pi_9(z_s, Z_f - e_9) = \pi_{10}(z_s, Z_f)$. However, since the value of $Z_f$ rapidly becomes large, a slight change in $Z_f$ will not affect the value of the propensity function. Therefore, we can approximately assume that $\pi_9(z_s, Z_f) = \pi_{10}(z_s, Z_f)$. This equality leads to [25]:

$$\mu_9(z_s;t) - \mu_{10}(z_s;t) = \frac{1}{2} \left[ A(z_s) - \sqrt{A^2(z_s) - 4B(z_s)} \right], \tag{32}$$

within successive occurrences of slow reactions, where

$$A(z_s) = 2 + z_1 - z_2 + \frac{c_{10}}{2c_9} - \frac{1}{2}, \tag{33}$$

$$B(z_s) = \frac{1}{4} (2 + z_1 - z_2)(2 + z_1 - z_2 - 1) - \frac{c_{10}}{2c_9} (4 - z_5 + z_6 - z_7 + z_8). \tag{34}$$

We refer to the resulting approximation as the stochastic quasi-equilibrium approximation (SQEA) method.
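The closed form (32)-(34) can be checked with a short numerical sketch. It assumes, as the structure of (33), (34) suggests, that the dimerization propensity is (c9/2) x_M (x_M - 1) and the dissociation propensity is c10 x_D, with 2 monomers and 4 dimers initially as in (30), (31). The function name, the rate constants, and the slow state used below are hypothetical.

```python
import math


def net_dimerization(z_slow, c9, c10, monomers0=2, dimers0=4):
    """Net mean dimerization mu_9 - mu_10 for slow DAs (z1, ..., z8)."""
    z1, z2, _, _, z5, z6, z7, z8 = z_slow
    a = monomers0 + z1 - z2                   # monomers before net dimerization
    b = dimers0 - z5 + z6 - z7 + z8           # dimers before net dimerization
    ratio = c10 / (2.0 * c9)
    A = a + ratio - 0.5                       # coefficient A(z_s)
    B = 0.25 * a * (a - 1) - ratio * b        # coefficient B(z_s)
    return 0.5 * (A - math.sqrt(A * A - 4.0 * B))


# Hypothetical rate constants and slow state.
C9, C10 = 0.05, 0.5
Z_SLOW = (10, 3, 0, 0, 1, 0, 0, 0)
y = net_dimerization(Z_SLOW, C9, C10)

# Populations implied by the net dimerization y.
monomers = 2 + Z_SLOW[0] - Z_SLOW[1] - 2.0 * y
dimers = 4 - Z_SLOW[4] + Z_SLOW[5] - Z_SLOW[6] + Z_SLOW[7] + y

# At quasi-equilibrium the two fast propensities should balance.
pi9 = 0.5 * C9 * monomers * (monomers - 1.0)
pi10 = C10 * dimers
print(y, pi9, pi10)
```

Under these assumptions the smaller root of the quadratic is the one that keeps the monomer count nonnegative, which matches the minus sign in front of the square root in (32).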
In general, we may not be able to obtain a CME for the molecular population process $\hat{X}$ of a stochastic biochemical system obtained by SQEA. The CME (7) is derived by relating the molecular population process $X(t + dt)$ at time $t + dt$ to $X(t)$ [18]. This is possible due to the linear relationship between $X(t)$ and $Z(t)$, given by (2), which implies that $X_n(t + dt) = X_n(t) + s_{nm}$, if the $m$th reaction occurs during the time interval $[t, t + dt)$. However, we may not be able to relate $\hat{X}(t + dt)$ and $\hat{X}(t)$ since $\hat{X}(t)$ may be a nonlinear function of the DA process $Z(t)$, e.g., see (30)-(34).

In Fig. 7, we depict typical realizations, obtained by the direct Gillespie algorithm, of the dynamic evolutions of some hidden and some observable states of the transcriptional regulatory system depicted in Fig. 1, approximated by the CME (25) and (28), with slow propensity functions given by (29) and (32)-(34). We also depict the dynamic evolutions of their means and standard deviations about the means, estimated by Monte Carlo simulation of 1,000 sample trajectories. Moreover, in Fig. 3e, we depict the PMFs of the monomers, dimers, and mRNA transcripts at time $t = 10$ min, estimated by the same Monte Carlo simulation. On average, it took about 0.5 sec of CPU time to sample the system states, which is 120 times faster than the exact method and 24 times faster than the PA method.

The results depicted in Fig. 7 show that the SQEA method produces a relatively smooth approximation of the dynamic evolution of the number of monomers and dimers. This is expected since $\hat{X}_1(t)$ and $\hat{X}_2(t)$ are the conditional expectations of $X_1(t)$ and $X_2(t)$, respectively, conditioned on the state of the slow reactions. Note, however, that the use of the SQEA method is based on the premise that we are not interested in the detailed evolutions of the observable states $X_1(t)$ and $X_2(t)$. On the other hand, the SQEA method provides very good approximations of the remaining states and rapidly produces good Monte Carlo estimates for the means, standard deviations, and PMFs.
Fig. 7. Typical dynamic evolutions of some hidden and some observable states (gray solid lines) of the transcriptional regulatory system depicted in
Fig. 1, their estimated means (black solid lines), and standard deviations about the means (dotted lines), obtained by Monte Carlo simulation based
on the stochastic quasi-equilibrium approximation method.
5 CONCLUSION
In this paper, we introduced an HMM for transcriptional regulation in single cells. The reaction DAs are used as the hidden states of the model, whereas the molecular populations are used as the observable states. The dynamic evolution of the hidden states is characterized by the CME (3), whereas the observable states are directly calculated from the hidden states by means of (2).

Unfortunately, analytical characterization of the proposed HMM is not possible. A Monte Carlo simulation approach, based on the Gillespie algorithm, can be used to estimate various statistics. This approach is computationally intensive and often not practical. Therefore, we are forced to seek analytical and computationally tractable approximations of the original HMM. We presented several approximation techniques, including the LA, PA, and SQEA methods. Moreover, we discussed the LNA and MFA methods, which approximate the state probabilities by multivariate normal distributions. We pointed out that the LA and LNA methods should not be used unless the two conditions required for their validity can be verified.

If we can separate the reactions of a transcriptional regulatory system into slow and fast reactions, we may use the SQEA method to simplify the system. This method eliminates the fast reactions, provided that we are not interested in such reactions. We can also use an HMM to characterize the resulting slow reaction subsystem. We can study its dynamical behavior by employing a Monte Carlo simulation approach based on the Gillespie algorithm. If this approach turns out to be computationally intensive, we may use the MFA method for a rapid assessment and the PA method for a more precise assessment of the statistical behavior of the simplified system.

Although we have focused our discussion on modeling transcriptional regulation in single cells, the techniques presented in this paper are general enough to be applied to other types of stochastic biochemical systems, such as signaling and metabolic networks.
APPENDIX
Linear Noise Approximation. The PDF $p_Z(z;t)$ of the hidden state vector $Z(t)$, governed by the system of Langevin equations (13), satisfies the following nonlinear Fokker-Planck equation [17], [19]:

$$\frac{\partial p_Z(z;t)}{\partial t} = \sum_{m \in \mathcal{M}} \left\{ -\frac{\partial [\pi_m(z)\, p_Z(z;t)]}{\partial z_m} + \frac{1}{2} \frac{\partial^2 [\pi_m(z)\, p_Z(z;t)]}{\partial z_m^2} \right\}, \tag{A.1}$$

for $t > 0$. Unfortunately, $p_Z(z;t)$ cannot be determined since finding a solution to this equation is as difficult as finding a solution to the original CME (3). However, we can approximate (A.1) by the linear Fokker-Planck equation (A.7) below, whose solution leads to the Gaussian PDF $\tilde{p}_Z(z;t)$, given by (15).

Indeed, let $\xi(t)$ be a noise process with PDF $p(\xi;t)$, such that

$$Z(t) = V \varphi(t) + \sqrt{V}\, \xi(t), \quad t > 0, \tag{A.2}$$

where $\varphi(t)$ satisfies (16) and $V$ is the cellular volume. A Taylor series expansion of the propensity function $\pi_m(Z(t))$ about the mean vector $\mu(t) = E[Z(t)]$ results in

$$\pi_m(Z(t)) = \pi_m(\mu(t)) + [Z(t) - \mu(t)]^T \frac{\partial \pi_m(\mu(t))}{\partial z} + \frac{1}{2} [Z(t) - \mu(t)]^T \frac{\partial^2 \pi_m(\mu(t))}{\partial z^2} [Z(t) - \mu(t)], \tag{A.3}$$

where $\partial g(z)/\partial z$ and $\partial^2 g(z)/\partial z^2$ denote the gradient and Hessian of $g$, respectively. From (1), (2), and (4), note that the derivatives of $\pi_m(z)$ with respect to $z$ of order greater than 2 are zero. Therefore, (A.3) is exact. By taking expectation on both sides of (A.3) and by assuming that the third term on the right-hand side is negligible (thus, at each time $t$, linearizing the propensity functions $\pi_m(z)$, $m \in \mathcal{M}$, about the mean vector $\mu(t)$), we approximately obtain

$$E[\pi_m(Z(t))] \approx \pi_m(\mu(t)), \quad t > 0, \; m \in \mathcal{M}. \tag{A.4}$$

Then, (5), (16), (17), (A.2), and (A.4) imply that $E[\xi(t)] = 0$, $t > 0$. Note that

$$p(\xi;t) = p_Z(V \varphi(t) + \sqrt{V}\, \xi; t), \tag{A.5}$$

whereas

$$\frac{1}{V} \pi_m(V \varphi(t) + \sqrt{V}\, \xi) = \tilde{\pi}_m(\varphi(t) + V^{-1/2} \xi) = \tilde{\pi}_m(\varphi(t)) + \frac{1}{\sqrt{V}}\, \xi^T \frac{d\tilde{\pi}_m(\varphi(t))}{dz} + O(1/V). \tag{A.6}$$

The first equality in (A.6) is due to (17), whereas the second equality is due to a Taylor series expansion of $\tilde{\pi}_m(z)$ about $\varphi(t)$. By using (A.5) and (A.6), we can expand (A.1) in terms of order $V^{1/2}$, $V^0$, $V^{-1/2}$, etc. It turns out that the term of order $V^{1/2}$ is zero due to (16). Moreover, for a sufficiently large volume $V$, we may ignore all terms of order other than $V^0$, in which case, the PDF $p(\xi;t)$ will approximately satisfy the following linear multivariate Fokker-Planck equation:

$$\frac{\partial p(\xi;t)}{\partial t} = -\sum_{m \in \mathcal{M}} \sum_{m' \in \mathcal{M}} b_{mm'}(t) \frac{\partial [\xi_{m'}\, p(\xi;t)]}{\partial \xi_m} + \frac{1}{2} \sum_{m \in \mathcal{M}} c_m(t) \frac{\partial^2 p(\xi;t)}{\partial \xi_m^2}, \quad t > 0, \tag{A.7}$$

where $b_{mm'}(t)$ and $c_m(t)$ are given by (19).

It can be verified that the solution of the previous equation is a multivariate Gaussian distribution with zero mean and covariance matrix $\mathbb{Q}$ that satisfies the matrix Riccati equation (18) [17]. Equation (A.5) implies that

$$p_Z(z;t) = p\left( \frac{z - V \varphi(t)}{\sqrt{V}}; t \right).$$

Therefore, the PDF $p_Z(z;t)$ can be approximated by the multivariate Gaussian PDF $\tilde{p}_Z(z;t)$, given by (15). Because this technique is based on linearizing, at each time $t$, the propensity functions $\pi_m(z)$, $m \in \mathcal{M}$, about the mean vector $\mu(t)$ and, since the DAs $Z(t)$ are modeled by the mean-plus-noise model in (A.2), it is referred to as the linear noise approximation (LNA) method [17].
Mean-field Approximation. To derive an expression for the PMF $P_Z(z;t)$, we may approximate the stochastic biochemical system with one whose hidden state vector $\tilde{Z}(t)$ satisfies (compare with (20)):
$$\tilde{Z}((k+1)\,dt) = \tilde{Z}(k\,dt) + \Delta(k; dt), \quad k = 0, 1, \ldots, \tag{A.8}$$

where $\Delta(k; dt)$, $k = 0, 1, \ldots$, are mutually independent discrete-valued random vectors, with $\Delta(k; dt)$ being independent of $\tilde{Z}(k\,dt)$, for every $k = 1, 2, \ldots$. We may then determine $\Delta$ and the PMF $\tilde{P}_Z(z;t)$ of $\tilde{Z}(t)$ so that the means $\tilde{\mu}_m(t) \triangleq E[\tilde{Z}_m(t)]$, $m \in \mathcal{M}$, and covariances $\tilde{\rho}_{mm'}(t) \triangleq \mathrm{Cov}[\tilde{Z}_m(t), \tilde{Z}_{m'}(t)]$, satisfy the same differential equations as the actual system; in particular (recall (5) and (6)),

$$\frac{d\tilde{\mu}_m(t)}{dt} = E[\pi_m(\tilde{Z}(t))], \quad m \in \mathcal{M}, \tag{A.9}$$

$$\frac{d\tilde{\rho}_{mm'}(t)}{dt} = E[\pi_m(\tilde{Z}(t))]\, \delta(m - m') + E[\tilde{Z}_m(t)\, \pi_{m'}(\tilde{Z}(t))] - \tilde{\mu}_m(t)\, E[\pi_{m'}(\tilde{Z}(t))] + E[\tilde{Z}_{m'}(t)\, \pi_m(\tilde{Z}(t))] - \tilde{\mu}_{m'}(t)\, E[\pi_m(\tilde{Z}(t))], \quad m, m' \in \mathcal{M}. \tag{A.10}$$

Finally, we may set $P_Z(z;t) = \tilde{P}_Z(z;t)$.

From (A.8), note that

$$\tilde{Z}(k\,dt) = \sum_{l=0}^{k-1} \Delta(l; dt), \quad k = 1, 2, \ldots.$$

Therefore, $\tilde{Z}(k\,dt)$ is the sum of $k$ independent discrete-valued random vectors. Under certain general conditions, the central limit theorem implies that, for sufficiently large $k$, $\tilde{Z}(k\,dt)$ will be governed by a multivariate sampled Gaussian distribution (e.g., see [23, p. 279]). By setting $k\,dt = t$, we obtain (21), (22).

From a Taylor series expansion of the propensity function $\pi_m(\tilde{Z}(t))$ about the mean vector $\tilde{\mu}(t)$, we have that (recall (A.3))

$$E[\pi_m(\tilde{Z}(t))] = \pi_m(\tilde{\mu}(t)) + \frac{1}{2} \sum_{k,l=1}^{M} \frac{\partial^2 \pi_m(\tilde{\mu}(t))}{\partial z_k \partial z_l}\, \tilde{\rho}_{kl}(t), \tag{A.11}$$

$$E[\tilde{Z}_{m'}(t)\, \pi_m(\tilde{Z}(t))] = \pi_m(\tilde{\mu}(t))\, \tilde{\mu}_{m'}(t) + \sum_{k=1}^{M} \frac{\partial \pi_m(\tilde{\mu}(t))}{\partial z_k}\, \tilde{\rho}_{m'k}(t) + \frac{\tilde{\mu}_{m'}(t)}{2} \sum_{k,l=1}^{M} \frac{\partial^2 \pi_m(\tilde{\mu}(t))}{\partial z_k \partial z_l}\, \tilde{\rho}_{kl}(t), \tag{A.12}$$

where we set the third-order central moments of $\tilde{Z}(t)$ in (A.12) equal to zero. Equations (A.11) and (A.12), together with (A.9) and (A.10), result in (23), (24). Note that the second-order derivatives of $\pi_m(z)$ with respect to $z$ will either be 0 or constant. These derivatives are therefore independent of $t$.

From (A.8), and since $\Delta(k; dt)$ is independent of $\tilde{Z}(k\,dt)$, we have that

$$E[\Delta_m(k; dt)] = \tilde{\mu}_m((k+1)\,dt) - \tilde{\mu}_m(k\,dt), \quad m \in \mathcal{M}, \tag{A.13}$$

$$\mathrm{Cov}[\Delta_m(k; dt), \Delta_{m'}(k; dt)] = \tilde{\rho}_{mm'}((k+1)\,dt) - \tilde{\rho}_{mm'}(k\,dt), \quad m, m' \in \mathcal{M}. \tag{A.14}$$

Because of (23), (24), (A.11), (A.13), and (A.14), we can set

$$\Delta_m(k; dt) = R_m(k) + W_m, \tag{A.15}$$

where $R_m(k)$ is a Poisson random variable with parameter $\lambda_m(k) = E[\pi_m(\tilde{Z}(k\,dt))]\,dt$ and $W_m$, $m \in \mathcal{M}$, are zero-mean random variables, with $W_m$ being independent of $R_{m'}$, $m' \in \mathcal{M}$, such that

$$\mathrm{Cov}[W_m, W_{m'}] = \sum_{k=1}^{M} \left[ \frac{\partial \pi_m(\tilde{\mu}(t))}{\partial z_k}\, \tilde{\rho}_{m'k}(t) + \frac{\partial \pi_{m'}(\tilde{\mu}(t))}{\partial z_k}\, \tilde{\rho}_{mk}(t) \right] dt. \tag{A.16}$$

In this case, the number of occurrences of the $m$th reaction during a time interval $[k\,dt, (k+1)\,dt)$ follows a Poisson distribution with parameter $E[\pi_m(\tilde{Z}(k\,dt))]\,dt$, instead of $\pi_m(\tilde{Z}(k\,dt))\,dt$. To compensate for errors introduced by this approximation, we add a zero-mean correction term $W_m$ to the Poisson random variable $R_m$. The covariances of $W_m$ are chosen so that (A.8) and (A.15) imply (A.9), (A.10).

In other words, we here assume that the most important influence on the firing rate of a given reaction is exerted by the mean propensity function of that reaction through a Poisson process. The additive correction term compensates for statistical variations not accounted for by the Poisson process. This approach is a type of mean-field approximation, similar to the one used in statistical mechanics [47], [48], which employs a linear correction term to compensate for errors in the approximation. For this reason, we refer to this technique as the mean-field approximation (MFA) method.

Note that the MFA method discussed above is based on the assumption that the third-order central moments of $\tilde{Z}(t)$ are zero; otherwise, the right-hand side of (A.12) must include a fourth term, which is a function of those moments. This assumption allows calculation of the dynamic evolutions of the means and covariances of $\tilde{Z}$ by means of the system of differential equations (A.9), (A.10). For a more accurate approximation, we may include higher-order (≥ 3) central moments in the formulation. In this case, $\Delta_m(k; dt)$ will still be given by (A.15), but the covariances of $W_m$ will be given by a more complicated formula than (A.16), in terms of higher-order central moments of $\tilde{Z}$. These moments can be calculated by differential equations that are similar to, albeit more complex than, (A.9) and (A.10).
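The role of the third-order central moments can be made concrete in a univariate setting. The following worked example is a sketch under assumptions that are not taken from the paper: a single reaction with a quadratic propensity $\pi(z)$, and $\tilde{\sigma}^2$ and $\tilde{\kappa}_3$ denoting the variance and the third central moment of $\tilde{Z}(t)$. Writing $\tilde{Z} = \tilde{\mu} + \varepsilon$ with $E[\varepsilon] = 0$, the second-order Taylor expansion is exact for a quadratic $\pi$, so

```latex
\begin{aligned}
\pi(\tilde{Z}) &= \pi(\tilde{\mu}) + \pi'(\tilde{\mu})\,\varepsilon
                 + \tfrac{1}{2}\,\pi''\,\varepsilon^{2}, \\
E[\pi(\tilde{Z})] &= \pi(\tilde{\mu}) + \tfrac{1}{2}\,\pi''\,\tilde{\sigma}^{2}, \\
E[\tilde{Z}\,\pi(\tilde{Z})]
  &= E\big[(\tilde{\mu} + \varepsilon)\,\pi(\tilde{Z})\big] \\
  &= \tilde{\mu}\,\pi(\tilde{\mu}) + \pi'(\tilde{\mu})\,\tilde{\sigma}^{2}
     + \tfrac{\tilde{\mu}}{2}\,\pi''\,\tilde{\sigma}^{2}
     + \tfrac{1}{2}\,\pi''\,\tilde{\kappa}_{3}.
\end{aligned}
```

The second line is the univariate form of (A.11), and the last line reduces to the univariate form of (A.12) when $\tilde{\kappa}_3 = 0$. The remaining term, $\tfrac{1}{2}\pi''\tilde{\kappa}_3$, is the kind of fourth term referred to above, and it vanishes only if the propensity is linear or the distribution has zero third central moment.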
REFERENCES[1] M.V. Karamouzis, V.G. Gorgoulis, and A.G. Papavassiliou,
Transcription Factors and Neoplasia: Vistas in Novel DrugDesign, Clinical Cancer Research, vol. 8, pp. 949-961, 2002.
[2] L. Hood, J.R. Heath, M.E. Phelps, and B. Lin, Systems Biologyand New Technologies Enable Predictive and Preventive Medi-cine, Science, vol. 306, pp. 640-643, 2004.
[3] M. Ptashne and A. Gann, Genes & Signals. Cold Spring Harbor,N.Y.: Cold Spring Harbor Laboratory Press, 2002.
[4] H. Lodish, A. Berk, P. Matsudaira, C.A. Kaiser, M. Krieger, M.P.Scott, S.L. Zipursky, and J. Darnell, Molecular Cell Biology, fifth ed.New York: W.H. Freeman and Company, 2003.
[5] D. Endy and R. Brent, Modelling Cellular Behaviour, Nature,
vol. 409, pp. 391-395, 2001.
70 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 3, NO. 1, JANUARY-MARCH 2006
7/31/2019 A Hidden Markov Model _ Goutsias
15/15
[6] H. Kitano, Computational Systems Biology, Nature, vol. 420,pp. 206-210, 2002.
[7] J. Goutsias and S. Kim, A Nonlinear Discrete Dynamical Modelfor Transcriptional Regulation: Construction and Properties,Biophysical J., vol. 86, pp. 1922-1945, 2004.
[8] C.V. Rao, D.M. Wolf, and A.P. Arkin, Control, Exploitation andTolerance of Intracelluar Noise, Nature, vol. 420, pp. 231-237,2002.
[9] J.M. Raser and E.K. OShea, Control of Stochasticity in Eukaryotic
Gene Expression, Science, vol. 304, pp. 1811-1814, 2004.[10] J.M. Levsky and R.H. Singer, Gene Expression and the Myth of
the Average Cell, Trends in Cell Biology, vol. 13, no. 1, pp. 4-6,2003.
[11] H.H. McAdams and A. Arkin, Stochastic Mechanisms in GeneExpression, Proc. Natl Academy of Science, vol. 94, pp. 814-819,1997.
[12] H.H. McAdams and A. Arkin, Its a Noisy Business! GeneticRegulation at the Nanomolar Scale, Trends in Genetics, vol. 15,no. 2, pp. 65-69, 1999.
[13] N. Fedoroff and W. Fontana, Small Numbers of Big Molecules,Science, vol. 297, pp. 1129-1131, 2002.
[14] M.B. Elowitz, A.J. Levine, E.D. Siggia, and P.S. Swain, StochasticGene Expression in a Single Cell, Science, vol. 297, pp. 1183-1186,2002.
[15] W.J. Ewens and G.R. Grant, Statistical Methods in Bioinformatics: AnIntroduction. New York: Springer, 2001.
[16] M. Ptashne, A Genetic Switch: Phage Lambda Revisited, third ed.Cold Spring Harbor, N.Y.: Cold Spring Harbor Laboratory Press,2004.
[17] N.G. vanKampen, Stochastic Processes in Physics and Chemistry.Amsterdam: Elsevier, 1992.
[18] D.T. Gillespie, A Rigorous Derivation of the Chemical MasterEquation, Physica A, vol. 188, pp. 404-425, 1992.
[19] D.T. Gillespie, The Chemical Langevin Equation, J. ChemicalPhysics, vol. 113, no. 1, pp. 297-306, 2000.
[20] E.L. Haseltine and J.B. Rawlings, Approximate Simulation ofCoupled Fast and Slow Reactions for Stochastic ChemicalKinetics, J. Chemical Physics, vol. 117, no. 15, pp. 6959-6969, 2002.
[21] S. Karlin and H.M. Taylor, A First Course in Stochastic Processes,second ed. San Diego, Calif.: Academic Press, 1975.
[22] S. Karlin and H.M. Taylor, A Second Course in Stochastic Processes.San Diego, Calif.: Academic Press, 1981.
[23] A. Papoulis and S.U. Pillai, Probability, Random Variables andStochastic Processes, fourth ed. New York: McGraw-Hill, 2002.
[24] R. Heinrich and S. Schuster, The Regulation of Cellular Systems.New York: Chapman & Hall, 1996.
[25] J. Goutsias, Quasiequilibrium Approximation of Fast ReactionKinetics in Stochastic Biochemical Systems, J. Chemical Physics,vol. 122, 184102, 2005.
[26] D.T. Gillespie, Exact Stochastic Simulation of Coupled ChemicalReactions, J. Physical Chemistry, vol. 81, no. 25, pp. 2340-2361,1977.
[27] J.S. Liu, Monte Carlo Strategies in Scientific Computing. New York:Springer-Verlag, 2001.
[28] D.T. Gillespie, General Method for Numerically SimulatingStochastic Time Evolution of Coupled-Chemical Reactions,
J. Computational Physics, vol. 22, pp. 403-434, 1976.[29] M.A. Gibson and J. Bruck, Efficient Exact Stochastic Simulation
of Chemical Systems with Many Species and Many Channels,
J. Physical Chemistry A, vol. 104, pp. 1876-1889, 2000.[30] D.T. Gillespie, Approximate Accelerated Stochastic Simulation of
Chemically Reacting Systems, J. Chemical Physics, vol. 115, no. 4,pp. 1716-1733, 2001.
[31] D.T. Gillespie and L.R. Petzold, Improved Leap-Size Selection forAccelerated Stochastic Simulation, J. Chemical Physics, vol. 119,no. 16, pp. 8229-8234, 2003.
[32] M. Rathinam, L.R. Petzold, Y. Cao, and D.T. Gillespie, Stiffness inStochastic Chemically Reacting Systems: The Implicit Tau-Leap-ing Method, J. Chemical Physics, vol. 119, no. 24, pp. 12784-12794,2003.
[33] Y. Cao, H. Li, and L. Petzold, Efficient Formulation of theStochastic Simulation Algorithm for Chemically Reacting Sys-tems, J. Chemical Physics, vol. 121, no. 9, pp. 4059-4067, 2004.
[34] T. Tian and K. Burrage, Bionomial Leap Methods for SimulatingStochastic Chemical Kinetics, J. Chemical Physics, vol. 121, no. 21,
[35] A. Chatterjee, D.G. Vlachos, and M.A. Katsoulakis, BinomialDistribution Based -Leap Accelerated Stochastic Simulation,
J. Chemical Physics, vol. 122, 024112, 2005.[36] Y. Cao, D.T. Gillespie, and L.R. Petzold, Avoiding Negative
Populations in Explicit Poisson Tau-Leaping, J. Chemical Physics,vol. 123, 054104, 2005.
[37] D.J. Higham, An Algebraic Introduction to Numerical Simulationof Stochastic Differential Equations, SIAM Rev., vol. 43, no. 3,pp. 525-546, 2001.
[38] J. Hasty, J. Pradines, M. Dolnik, and J.J. Collins, Noise-BasedSwitches and Amplifiers for Gene Expression, Proc. NatlAcademy of Science, vol. 97, no. 5, pp. 2075-2080, 2000.
[39] E.M. Ozbudak, M. Thattai, I. Kurtser, A.D. Grossman, and A. vanOudenaarden, Regulation of Noise in the Expression of a SingleGene, Nature Genetics, vol. 31, pp. 69-73, 2002.
[40] J.R. Pirone and T.C. Elston, Fluctuations in Transcription FactorBinding Can Explain the Graded and Binary Responses Observedin Inducible Gene Expression, J. Theoretical Biology, vol. 226,pp. 111-121, 2004.
[41] R. Steuer, Effects of Stochastisity in Models of the Cell Cycle:From Quantized Cycle Times to Noise-Induced Oscillations,
J. Theoretical Biology, vol. 228, pp. 293-301, 2004.[42] M.L. Simpson, C.D. Cox, and G.S. Sayler, Frequency Domain
Chemical Langevin Analysis of Stochasticity in Gene Transcrip-tional Regulation, J. Theoretical Biology, vol. 229, pp. 383-394, 2004.
[43] J. Elf and M. Ehrenberg, Fast Evaluation of Fluctuations in
Biochemical Networks with the Linear Noise Approximation,Genome Research, vol. 13, pp. 2475-2484, 2003.[44] R. Tomioka, H. Kimura, T.J. Kobayashi, and K. Aihara, Multi-
variate Analysis of Noise in Genetic Regulatory Networks,J. Theoretical Biology, vol. 229, pp. 501-521, 2004.
[45] C.V. Rao and A.P. Arkin, Stochastic Chemical Kinetics and the Quasi-Steady-State Assumption: Application to the Gillespie Algorithm, J. Chemical Physics, vol. 118, no. 11, pp. 4999-5010, 2003.
[46] Y. Cao, D.T. Gillespie, and L.R. Petzold, The Slow-Scale Stochastic Simulation Algorithm, J. Chemical Physics, vol. 122, 014116, 2005.
[47] G. Parisi, Statistical Field Theory. Redwood City, Calif.: Addison-Wesley, 1988.
[48] J.J. Binney, N.J. Dowrick, A.J. Fisher, and M.E.J. Newman, The Theory of Critical Phenomena: An Introduction to the Renormalization Group. Oxford, U.K.: Oxford Univ. Press, 1992.
John Goutsias received the Diploma degree in electrical engineering from the National Technical University of Athens, Greece, in 1981, and the MS and PhD degrees in electrical engineering from the University of Southern California, Los Angeles, in 1982 and 1986, respectively. In 1986, he joined the Department of Electrical and Computer Engineering at The Johns Hopkins University, Baltimore, Maryland, where he is currently a professor of electrical and computer engineering, a Whitaker Biomedical Engineering Professor, and a professor of applied mathematics and statistics. His research interests include signal processing and analysis, computational biology, and bioinformatics. Dr. Goutsias served as an associate editor for the IEEE Transactions on Signal Processing (1991-1993) and the IEEE Transactions on Image Processing (1995-1997). He is currently an area editor for the Journal of Visual Communication and Image Representation, a coeditor for the Journal of Mathematical Imaging and Vision, and an associate editor for the IEEE Transactions on Pattern Analysis and Machine Intelligence and the EURASIP Journal on Bioinformatics and Systems Biology. He is a senior member of the IEEE.