Compact Extreme Learning Machines for biological systems

Kang Li* and Jing Deng
School of Electronics, Electrical Engineering and Computer Science, Queen’s University Belfast, Belfast, BT9 5AH, UK
E-mail: [email protected]
E-mail: [email protected]
*Corresponding author

Hai-Bo He
Department of Electrical, Computer and Biomedical Engineering, University of Rhode Island, Kingston, RI 02881, USA
E-mail: [email protected]

Da-Jun Du
School of Mechatronical Engineering and Automation, Shanghai University, Shanghai 200072, China
E-mail: [email protected]

Abstract: In biological system modelling using data-driven black-box methods, it is essential to effectively and efficiently produce a parsimonious model to represent the system behaviour. The Extreme Learning Machine (ELM) is a recent development in fast learning paradigms. However, the derived model is not necessarily sparse. In this paper, an improved ELM is investigated, aiming to obtain a more compact model without significantly increasing the overall computational complexity. This is achieved by associating each model term to a regularized parameter, thus insignificant ones are automatically unselected, leading to improved model sparsity. Experimental results on biochemical data confirm its effectiveness.

Keywords: ELM; extreme learning machine; fast recursive algorithm; linear-in-the-parameter; local regularisation; radial basis function; MAPK; SOS response.

Reference to this paper should be made as follows: Li, K., Deng, J., He, H.B. and Du, D.J. (2010) ‘Compact Extreme Learning Machines for biological systems’, Int. J. Computational Biology and Drug Design, Vol. 3, No. 2, pp.112–132.


Biographical notes: Kang Li is a Reader in Intelligent Systems and Control at Queen’s University Belfast. His research interests include advanced algorithms for training and construction of neural networks, fuzzy systems and support vector machines, as well as advanced evolutionary algorithms, with applications to non-linear system modelling and control, microarray data analysis, systems biology, environmental modelling and monitoring, and polymer extrusion. He has produced over 150 research papers and co-edited seven conference proceedings in his field. He is a Chartered Engineer, a member of the IEEE and the InstMC and the current Secretary of the IEEE UK and Republic of Ireland Section.

Jing Deng received his BSc from the National University of Defence Technology, China, in 2005, and his MSc from Shanghai University, China, in 2007. From March 2008, he started his PhD research in the Intelligent Systems and Control (ISAC) Group at Queen’s University Belfast, UK. His main research interests include system modelling, pattern recognition and fault detection.

Hai-Bo He received the BS and MS in Electrical Engineering from Huazhong University of Science and Technology (HUST), Wuhan, China, in 1999 and 2002, respectively, and the PhD in Electrical Engineering from Ohio University, Athens, in 2006. He is currently an Assistant Professor in the Department of Electrical, Computer and Biomedical Engineering, University of Rhode Island, Kingston, USA. His research interests include machine learning, data mining, computational intelligence, VLSI and FPGA design, and embedded intelligent systems design.

Da-Jun Du received the BSc and MSc from Zhengzhou University, China, in 2002 and 2005, respectively, and the PhD from Shanghai University, China, in 2010. From September 2008 to September 2009, he was a visiting PhD student in the Intelligent Systems and Control (ISAC) Research Group at Queen’s University Belfast, UK. His main research interests include neural networks, system modelling and identification, and networked control systems.

1 Introduction

The phenotypic behaviour of living organisms is determined by the underlying and highly complex interactions of molecules, for example proteins, DNA, RNA or other biochemical substances (Kitano, 2002). These interactions can occur at an extremely fast rate and the overall dynamics of the cell or higher organism can exhibit high non-linearity. One of the challenges in systems biology is to utilise proven techniques that have been developed in other areas, such as control engineering, and apply these to biological systems to try to gain a better understanding of the function and behaviour of the underlying molecular processes (Wolkenhauer et al., 2003; Wang et al., 2008, 2009; Gormley et al., 2007).

In the systems biology literature, the most common approach to modelling these molecular interactions and signalling pathways is by ordinary/partial differential equations (Levchenko et al., 2000). These equations describe the concentration levels of the individual molecular species in the pathway over time. In control engineering, these are commonly known as first-principle or transparent models, as they are based on the chemical rate equations of the underlying biological process and provide a complete picture of the system at any time. Such models are perfectly feasible when the number of molecular species in the pathway is relatively small. In biological systems, however, the number of species interactions can become incredibly large, resulting in a model that is too complex to analyse and may even be impossible to solve.

As an alternative, data-driven identification of molecular interaction pathways using a black-box approach has recently been investigated by Gormley et al. (2007). This type of method creates relatively simple non-linear regression models based on the observation data, where the output of the model is a function of previous system states of interest. One of the main objectives in building black-box models, which in many cases are Linear-In-The-Parameters (LITP) ones, is to produce a sparse model that effectively represents the system behaviour.

This class of non-linear model comprises a linear combination of model terms or basis functions that are functions of past system states and inputs of interest, and it has been used to model a wide range of non-linear dynamic systems in the literature. Examples include the polynomial non-linear AutoRegressive model with eXogenous inputs (polynomial NARX) and Radial Basis Function (RBF) networks. It has been shown that LITP models have broad approximation capabilities, and they have been widely used in the modelling and control of complex non-linear engineering systems.

In the non-linear system identification and neural network literature, parameter estimation and structure determination are critical issues. For LITP models with Tunable Terms (LITP-TTs) with additive nodes, the early estimation algorithms make use of first derivative information, e.g., backpropagation with adaptive learning rate and momentum (BPAM), conjugate gradient and QuickProp (Bishop, 1995). Advanced training algorithms like the Levenberg-Marquardt (LM) algorithm, which make use of second derivative information, have proved more efficient and have been widely used in applications (Marquardt, 1963; Hagan and Menhaj, 1994).

For the construction of LITP-TTs with RBF terms, the conventional approach takes a two-stage procedure, i.e., unsupervised estimation of both the centres and impact factors for the RBF terms, followed by supervised estimation of the linear output parameters. With respect to centre location, clustering techniques have been proposed by Sutanto et al. (1997) and Musavi et al. (1992). Once the centres and the impact factors are determined, the linear output parameters can be obtained using popular matrix decomposition methods, such as Cholesky decomposition, Orthogonal Least Squares (OLS) or Singular Value Decomposition (Press et al., 1992; Lawson and Hanson, 1974; Serre, 2002; Chen et al., 1991).

In addition to parameter estimation, controlling the model complexity of LITP-TTs is another important issue, and a number of additive (or constructive, or growing) methods and subtractive (or destructive, or pruning) methods have been proposed (Mao and Huang, 2005; Chng et al., 1996; Miller, 1990; Platt, 1991; Kadirkamanathan and Niranjan, 1993; Yingwei et al., 1997; Huang et al., 2005). In particular, it has been shown that one can add nodes to the network one by one until a stopping criterion is satisfied, thus obtaining more compact networks (Chen et al., 1991). The techniques used are subset selection algorithms, where a small number of nodes are selected from a large pool of candidates based on OLS (Chen et al., 1991; Korenberg, 1988; Zhang and Billings, 1996; Mao and Huang, 2005) or the fast recursive algorithm (Li et al., 2005a). The elegance of the method is that the net contribution of a newly added node can be explicitly identified without solving the whole least squares problem, leading to significantly reduced computational complexity.


Generally speaking, the conventional construction and estimation of LITP-TTs can be extremely time consuming if both issues have to be considered simultaneously (Peng et al., 2006). Unlike the conventional learning schemes, a more general network learning concept, namely the ELM, was proposed for LITP-TTs (Huang et al., 2006b, 2006c; Liang et al., 2006a). For ELM, all the non-linear parameters in the model terms, i.e., the input parameters for LITP-TTs with additive nodes, or the centres and impact factors for LITP-TTs with RBF kernels, are randomly chosen independently of the training data. Thus, the training is transformed into a standard least squares problem, leading to a significant improvement of the generalisation performance at a fast learning speed. The output weights of the LITP-TTs are determined analytically (Huang et al., 2006b, 2006c; Liang et al., 2006a). It has been proven, in an incremental method, that LITP-TTs with model terms randomly generated independently of the training data according to any continuous sampling distribution can approximate any continuous target function (Huang et al., 2006a). The non-linear function for additive nodes can be any bounded piecewise continuous function, and the non-linear function for RBF terms can be any integrable piecewise continuous function. Further, the ELM has been extended to a much wider class of model terms, including fuzzy rules as well as additive nodes (Huang and Chen, 2007). The ELM has also been applied in other areas such as biomedical engineering (Xu et al., 2005), bioinformatics (Handoko et al., 2006; Zhang et al., 2006; Wang and Huang, 2005), Human-Computer Interface (HCI) (Liang et al., 2006b), text classification (Liu et al., 2005), terrain reconstruction (Yeu et al., 2006) and communication channel equalisation (Li et al., 2005b).

This paper investigates the development of compact ELMs for biological systems. This is achieved by incorporating regularised parameters into a forward recursive method, so that the significance of each model term can be revealed and a compact ELM can be obtained. The method is used for the model identification of the MAPK signal transduction pathway and of the transcriptional network of the SOS response to DNA damage in E. coli, and simulation results confirm its effectiveness.

2 Linear-in-the-parameter models with tunable terms and Extreme Learning Machine

In system identification, two major elements are essential:

• to determine the model structure describing the functional relationship between the system input and output variables

• to estimate the model parameters that specify any model given the chosen or derived model structure.

The natural approach to data-driven identification is to use linear difference equations to relate input and output observations. Since most practical systems are non-linear, non-linear models are often required. Defining the input of a non-linear discrete system as u(t), the system output as y(t), and given a training data set consisting of N input/output data pairs $\{u(t), y(t)\}_{t=1}^{N}$, the fundamental goal in system identification is then to find

\[ y(t) = f(\mathbf{x}(t), \theta) + e(t) \tag{1} \]


where the underlying function f(.) is unknown, θ is the vector of associated parameters, and e(t) is the noise, which is often assumed to be independent and identically distributed (i.i.d.) with constant variance σ². The model input vector is formed as x(t) = [y(t – 1), …, y(t – ny), u(t – 1), …, u(t – nu), e(t – 1), …, e(t – ne)]T, where ny, nu and ne are the lags of the output, input and noise, respectively.

Supposing that the true underlying function f(.) is continuous and smooth, the problem of non-linear system identification can be regarded as a functional approximation problem, for which an estimator f̂ of f is to be sought. Various non-linear model types may be chosen to approximate f, leading to different modelling paradigms. In approximation theory, a popular way of approximating functions is to use a linear regression of non-linear basis functions. These models represent the non-linear input/output relationship with an LITP structure given by

\[ \hat{f}(\mathbf{x}(t), \theta) = \sum_{i=1}^{n} \theta_i \phi_i(\mathbf{x}(t)) \tag{2} \]

where φi(x(t)) is a known non-linear basis function mapping, θi are unknown parameters, and n is the number of basis functions in the model.

The LITP models are well structured for adaptive learning and have provable learning and convergence conditions, with the capability of parallel processing; many applications have therefore been reported. In particular, LITP-TTs represent a wide class of flexible non-linear models and learning machines in the literature. For such models/learning machines, the ELM concept (Huang et al., 2006b, 2006c; Huang and Chen, 2007, 2008) is one of the recent developments; it facilitates extremely fast learning of data and is discussed in the following.

An LITP-TT is a linear combination of some non-linear terms or basis functions that are usually non-linear functions of the inputs. These non-linear terms have tunable parameters (sometimes called non-linear parameters), thus making the model more flexible and powerful. Generally speaking, there are two major categories of non-linear terms/functions for LITP-TTs (Huang et al., 2006a, 2006c; Peng et al., 2008):

• LITP-TTs with additive terms

• LITP-TTs with RBFs.

An LITP-TT with n additive terms can be formulated by

\[ y(\mathbf{x}) = \sum_{i=1}^{n} \theta_i \phi_i(\mathbf{w}_i^T \mathbf{x} + b_i) \tag{3} \]

where x ∈ ℜd is the input vector, wi∈ℜd is the input parameter vector connecting the inputs to the ith model term, bi∈ℜ is the extra DC parameter of the ith model term, θi is the parameter connecting the ith model term to the system output, and φi is the non-linear function of the model term.
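
As a concrete illustration (not code from the paper), the following minimal sketch evaluates equation (3) for a batch of inputs, assuming the logistic sigmoid as the non-linear function φ; all names and shapes are illustrative.

```python
import numpy as np

def additive_litp_output(X, W, b, theta):
    """Evaluate equation (3): y(x) = sum_i theta_i * phi(w_i^T x + b_i).

    X: (N, d) inputs; W: (n, d) input parameter vectors w_i;
    b: (n,) DC parameters b_i; theta: (n,) output parameters theta_i.
    """
    Z = X @ W.T + b                    # (N, n): w_i^T x + b_i for every term
    Phi = 1.0 / (1.0 + np.exp(-Z))     # sigmoid phi applied elementwise
    return Phi @ theta                 # linear combination of the n terms
```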

The RBF models belong to a specific class of LITP-TTs, which use RBF as the model term, and each RBF has its own centre and impact factor with its output being some radially symmetric function of the distance between the input and the centre (Peng et al., 2007; Li et al., 2009):

Page 6: Compact Extreme Learning Machines for biological systems · Compact Extreme Learning Machines for biological systems 113 Biographical notes: Kang Li is a Reader in Intelligent Systems

Compact Extreme Learning Machines for biological systems 117

\[ y(\mathbf{x}) = \sum_{i=1}^{n} \theta_i \phi_i(\gamma_i \|\mathbf{x} - \mathbf{c}_i\|) \tag{4} \]

where x ∈ ℜd is again the system input vector, ci ∈ ℜd and γi ∈ ℜ are the centre and impact factor of the ith RBF term, θi is the parameter connecting the ith RBF term to the system output.
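
Analogously, here is a hedged sketch of the RBF case in equation (4), assuming the common Gaussian choice φ(r) = exp(−r²) applied to γi‖x − ci‖; the function name is illustrative and is reused in later sketches.

```python
import numpy as np

def rbf_design_matrix(X, centres, gammas):
    """Column i holds phi_i(x) = exp(-(gamma_i * ||x - c_i||)^2) for all samples.

    X: (N, d) inputs; centres: (n, d) centres c_i; gammas: (n,) impact factors.
    Returns an (N, n) matrix of RBF term outputs.
    """
    # Pairwise distances ||x - c_i|| between every sample and every centre
    dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)  # (N, n)
    return np.exp(-(gammas * dists) ** 2)
```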

Now suppose N samples $\{(\mathbf{x}_i, t_i)\}_{i=1}^{N}$ are used to identify model (4). Here, $\mathbf{x}_i \in \Re^d$ is the input vector and $t_i \in \Re$ is the target output. If an LITP-TT with n model terms can approximate these N samples with zero error, this implies that there exist $\theta_i$ and $\omega_i$ such that

\[ y(\mathbf{x}_j) = \sum_{i=1}^{n} \theta_i \phi_i(\mathbf{x}_j; \omega_i) = t_j, \quad j = 1, \ldots, N. \tag{5} \]

Equation (5) can be rewritten compactly as:

\[ \mathbf{y} = \mathbf{P}\Theta = \mathbf{t} \tag{6} \]

where

\[ \mathbf{P}(\mathbf{x}_1, \ldots, \mathbf{x}_N, \omega_1, \ldots, \omega_n) = \begin{bmatrix} \phi_1(\mathbf{x}_1, \omega_1) & \cdots & \phi_n(\mathbf{x}_1, \omega_n) \\ \vdots & \ddots & \vdots \\ \phi_1(\mathbf{x}_N, \omega_1) & \cdots & \phi_n(\mathbf{x}_N, \omega_n) \end{bmatrix} \tag{7} \]

\[ \Theta = [\theta_1, \ldots, \theta_n]^T \tag{8} \]

\[ \mathbf{t} = [t_1, \ldots, t_N]^T \tag{9} \]

\[ \mathbf{y} = [y(\mathbf{x}_1), \ldots, y(\mathbf{x}_N)]^T. \tag{10} \]

It has been proven that LITP-TTs with random model terms generated according to any continuous sampling distribution can approximate any continuous target function (Huang et al., 2006a; Huang and Chen, 2007). Since the model term parameters $\omega_i$ of LITP-TTs do not need to be estimated or optimised and may simply be assigned random values (Huang et al., 2006a, 2006c; Huang and Chen, 2007), equation (6) becomes a linear system and the linear output parameters Θ are estimated as

\[ \hat{\Theta} = \mathbf{P}^{\dagger} \mathbf{t} \tag{11} \]

where $\mathbf{P}^{\dagger}$ is the Moore-Penrose generalised inverse (Rao and Mitra, 1971) of the regression matrix P.
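
Putting equations (6)–(11) together, a minimal end-to-end sketch of ELM training with RBF terms might look as follows (reusing the rbf_design_matrix sketch above); the sampling ranges for the random centres and impact factors are assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def elm_train(X, t, n_terms):
    """Randomly assign RBF parameters, then solve eq. (11) by pseudoinverse."""
    d = X.shape[1]
    centres = rng.uniform(X.min(axis=0), X.max(axis=0), size=(n_terms, d))
    gammas = rng.uniform(0.1, 2.0, size=n_terms)     # assumed range
    P = rbf_design_matrix(X, centres, gammas)        # regression matrix, eq. (7)
    theta = np.linalg.pinv(P) @ t                    # Moore-Penrose solution, eq. (11)
    return centres, gammas, theta
```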

The introduction above reveals three issues in the ELM that have to be dealt with in applications:

• the ELM usually requires a larger number of model terms, so a compact network implementation is desirable

• a large network may lead to a singularity problem in the matrix computation; controlling the network complexity can reduce or eliminate this problem

• large networks generate high computational overhead in deriving the output weights, so a more efficient algorithm is needed.

In fact, these three problems are closely coupled. If the performance of the network can be improved, the number of required model terms and the corresponding number of output weights can be significantly reduced, leading to an overall reduction of the computational complexity. Moreover, the improved network performance will eventually reduce or eliminate the singularity problem. Once $\mathbf{P}^T\mathbf{P}$ becomes non-singular, a number of efficient algorithms can be used for fast computation of the output parameters. The basic idea in this paper is to introduce an individual regulariser for each of the model terms, within an integrated framework that keeps the computational complexity low.

3 Problem formulation

A generalised LITP-TT with M model terms can be represented by

\[ \hat{y}(t) = \sum_{i=1}^{M} \phi_i(\mathbf{x}(t)) \theta_i \tag{12} \]

where $\hat{y}(t)$ is the model output for input vector x(t), $\phi_i(\mathbf{x}(t))$ is the ith model term/basis function, $\theta_i$ is the corresponding linear parameter, and the non-linear parameters in $\phi_i(\mathbf{x}(t))$ are randomly assigned according to the ELM concept.

Define the modelling residual as $e(t) = y(t) - \hat{y}(t)$, where y(t) is the actual system output. Then, equation (12) can be rewritten as

\[ y(t) = \hat{y}(t) + e(t) = \boldsymbol{\phi}^T(t) \Theta + e(t) \tag{13} \]

where $\Theta = [\theta_1, \theta_2, \ldots, \theta_M]^T$ is the linear parameter vector and $\boldsymbol{\phi}(t) = [\phi_1(t), \phi_2(t), \ldots, \phi_M(t)]^T$ is the ‘regressor’ vector.

Suppose a set of samples, denoted as $D_N = \{\mathbf{x}(t), y(t)\}_{t=1}^{N}$, is used for modelling. By taking the complete observational data set $D_N$, equation (13) can be written in the matrix form

\[ Y = \Phi\Theta + \Xi \tag{14} \]

where $Y = [y(1), y(2), \ldots, y(N)]^T$ is the desired output vector, $\Xi = [e(1), \ldots, e(N)]^T$ is the residual vector, and $\Phi = [\boldsymbol{\phi}_1, \ldots, \boldsymbol{\phi}_M] \in \Re^{N \times M}$ is the regression matrix, with column vectors $\boldsymbol{\phi}_i = [\phi_i(\mathbf{x}(1)), \ldots, \phi_i(\mathbf{x}(N))]^T$, $i = 1, \ldots, M$.

Regularisation is introduced into the cost function J as

\[ J = \Xi^T \Xi + \sum_{i=1}^{M} \lambda_i \theta_i^2 = \Xi^T \Xi + \Theta^T \Lambda \Theta \tag{15} \]

where the regularisation parameter vector is $\lambda = [\lambda_1, \ldots, \lambda_M]^T$ and $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_M)$. The least squares estimate of the linear output weights that minimises equation (15) is given by

\[ \hat{\Theta} = (\Phi^T \Phi + \Lambda)^{-1} \Phi^T Y. \tag{16} \]


Define the matrix

\[ M \triangleq \Phi^T \Phi + \Lambda. \tag{17} \]

Equation (16) can then be rewritten as

\[ \hat{\Theta} = M^{-1} \Phi^T Y. \tag{18} \]

The associated minimal cost function is computed as

\[ J(\hat{\Theta}) = Y^T (I - \Phi M^{-1} \Phi^T) Y. \tag{19} \]

The Bayesian evidence procedure (Mendes and Billings, 2001) can be readily used to optimise the regularisation parameters. The update formulas for the regularisation parameters are given as:

\[ \lambda_i^{\mathrm{new}} = \frac{\gamma_i \, \Xi^T \Xi}{(N - \gamma)\, \theta_i^2}, \quad 1 \le i \le M \tag{20} \]

\[ \gamma = \sum_{i=1}^{M} \gamma_i, \quad \gamma_i = 1 - \lambda_i N_{ii} \tag{21} \]

where $N_{ii}$ is the ith diagonal element of $M^{-1}$. Finding a (locally) optimal $\lambda_i$ ($1 \le i \le M$) usually requires some iterations.
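
As a sketch of how the solve in equations (16)–(18) alternates with the updates in equations (20)–(21), under the assumption that M stays well conditioned (a small epsilon guards against vanishing weights):

```python
import numpy as np

def evidence_update(Phi, Y, lam, n_iter=10, eps=1e-12):
    """Alternate the regularised solve (16)/(18) with the updates (20)-(21)."""
    N = Phi.shape[0]
    theta = None
    for _ in range(n_iter):
        Minv = np.linalg.inv(Phi.T @ Phi + np.diag(lam))   # M^{-1}, eq. (17)
        theta = Minv @ Phi.T @ Y                           # eq. (18)
        resid = Y - Phi @ theta                            # residual Xi
        gamma_i = 1.0 - lam * np.diag(Minv)                # eq. (21)
        gamma = gamma_i.sum()
        lam = gamma_i * (resid @ resid) / ((N - gamma) * theta**2 + eps)  # eq. (20)
    return theta, lam
```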

4 The method

The aim of the method proposed in this paper is to use a forward subset selection procedure to select a compact model. Suppose that k out of M model terms with randomly assigned non-linear parameters have already been selected, denoted by $\boldsymbol{\phi}_1, \ldots, \boldsymbol{\phi}_k$. The remaining model terms in the full regression matrix $\Phi = [\boldsymbol{\phi}_1, \ldots, \boldsymbol{\phi}_M]$ are denoted as $\boldsymbol{\phi}_{k+1}, \ldots, \boldsymbol{\phi}_M$. According to equations (17) and (18), for an LITP-TT with k terms, it is obvious that

\[ M_k = \Phi_k^T \Phi_k + \Lambda_k \tag{22} \]

\[ \hat{\Theta}_k = M_k^{-1} \Phi_k^T Y \tag{23} \]

where $\Phi_k = [\boldsymbol{\phi}_1, \ldots, \boldsymbol{\phi}_k]$, and $\Lambda_k = \mathrm{diag}\{\lambda_1, \ldots, \lambda_k\}$ contains the corresponding k regularisation parameters from the full regularisation parameter vector λ, indexed by the original positions of $\boldsymbol{\phi}_i$, $1 \le i \le k$, in Φ.

Noting equation (19), the regularised cost function expressed in equation (15) is then given by

\[ J(\hat{\Theta}_k) = Y^T (I - \Phi_k M_k^{-1} \Phi_k^T) Y. \tag{24} \]

If a new model term vector φj ∈ {φk+1, …, φM} is now selected, the selected regression matrix increases by one column, becoming Φk+1 = [Φk, φj]. The corresponding λj ∈ {λk+1, …, λM} is also selected for φj, which results in the new regularisation parameter matrix Λk+1 = diag{Λk, λj}. The regularised cost function is updated as


\[ J(\hat{\Theta}_{k+1}) = Y^T (I - \Phi_{k+1} M_{k+1}^{-1} \Phi_{k+1}^T) Y. \tag{25} \]

The net contribution of φj as the (k + 1)th newly selected model term is expressed as

\[ \Delta J_{k+1}(\boldsymbol{\phi}_j) = J(\hat{\Theta}_k) - J(\hat{\Theta}_{k+1}). \tag{26} \]

To select a new model term, the contribution in equation (26) has to be computed for each of the M – k remaining candidates as $\Delta J_{k+1}(\boldsymbol{\phi}_j)$, $\forall \boldsymbol{\phi}_j \in \{\boldsymbol{\phi}_{k+1}, \ldots, \boldsymbol{\phi}_M\}$. The one that gives the biggest net contribution in reducing the cost function is then chosen as the (k + 1)th model term. In this way, the LITP-TT model is constructed in a forward selection fashion, i.e., the model terms with randomly assigned parameters are selected into the final model one by one according to their net contributions. The model term selection procedure is terminated when some model complexity criterion is satisfied, for example, when a certain number, say n, of model terms have been chosen, or when the Mean-Squared Error (MSE) has been reduced to a given level ρ. The choice of ρ is vital. Too small a ρ typically leads to over-fitting, because excessive model terms are selected and fit the noise. Too big a ρ, on the other hand, leads to under-fitting, introducing large bias as some important model terms are missed out.

In the proposed method, the regularisation parameters $\lambda_i$, $1 \le i \le M$, are optimised every time the model structure is updated. Thus, unnecessary model terms will acquire large λ values during the iterative optimisation procedure for the regularisation parameters, which effectively forces their coefficients to zero. Therefore, it becomes obvious how many model terms should be included at the final model construction stage.

The n regularisation parameters can be updated as follows:

\[ \lambda_i^{\mathrm{new}} = \frac{\gamma_i \, \Xi^T \Xi}{(N - \gamma)\, \theta_i^2}, \quad 1 \le i \le n \tag{27} \]

\[ \gamma = \sum_{i=1}^{n} \gamma_i, \quad \gamma_i = 1 - \lambda_i N_{ii} \tag{28} \]

where $N_{ii}$ is the ith diagonal element of $M_n^{-1}$.

To facilitate a simple numerical implementation of the proposed approach, a couple of issues have to be addressed. First, updating the regularisation parameters $\lambda_i$ ($1 \le i \le n$) involves the computation of $M_n^{-1}$, which is computationally expensive; this is relieved by a recursive formula proposed in the following. Second, the update in equation (26) can be further simplified by introducing a residual matrix series.

To compute the inverse of $M_{k+1}$, define (Li et al., 2010)

\[ M_{k+1}^{-1} \triangleq \begin{bmatrix} F_k & g_k \\ g_k^T & f_k \end{bmatrix} \tag{29} \]

where $F_k \in \Re^{k \times k}$, $g_k \in \Re^{k \times 1}$ and $f_k \in \Re$. Then, define

\[ \phi_j^{\lambda} \triangleq \boldsymbol{\phi}_j^T \boldsymbol{\phi}_j + \lambda_j, \qquad \Phi_k^j \triangleq \Phi_k^T \boldsymbol{\phi}_j. \tag{30} \]


Equation (29) can be computed using the following recursive formulas:

\[ F_k = M_k^{-1} + \frac{M_k^{-1} \Phi_k^j (\Phi_k^j)^T M_k^{-1}}{\phi_j^{\lambda} - (\Phi_k^j)^T M_k^{-1} \Phi_k^j} \tag{31} \]

\[ g_k = -\frac{F_k \Phi_k^j}{\phi_j^{\lambda}} \tag{32} \]

\[ f_k = \frac{\phi_j^{\lambda} + (\Phi_k^j)^T F_k \Phi_k^j}{(\phi_j^{\lambda})^2}. \tag{33} \]
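
Equations (29)–(33) translate directly into a rank-one growing of the stored inverse; a sketch follows (illustrative names, no conditioning safeguards).

```python
import numpy as np

def grow_inverse(Minv_k, Phi_k, phi_j, lam_j):
    """Build M_{k+1}^{-1} from M_k^{-1} when column phi_j with regulariser
    lam_j is appended, following equations (29)-(33)."""
    a = Phi_k.T @ phi_j                     # Phi_k^j, eq. (30)
    phi_lam = phi_j @ phi_j + lam_j         # phi_j^lambda, eq. (30)
    F = Minv_k + np.outer(Minv_k @ a, a @ Minv_k) / (phi_lam - a @ Minv_k @ a)  # eq. (31)
    g = -(F @ a) / phi_lam                  # eq. (32)
    f = (phi_lam + a @ F @ a) / phi_lam**2  # eq. (33)
    # Assemble the (k+1) x (k+1) block matrix of eq. (29)
    return np.block([[F, g[:, None]], [g[None, :], np.array([[f]])]])
```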

Next, define a residual matrix series

\[ R_k \triangleq I - \Phi_k (\Phi_k^T \Phi_k + \Lambda_k)^{-1} \Phi_k^T = I - \Phi_k M_k^{-1} \Phi_k^T \tag{34} \]

where $\Phi_k$, k = 1, …, M, is of full column rank, and $R_0 \triangleq I$. Then

\[ R_{k+1} = I - \Phi_{k+1} M_{k+1}^{-1} \Phi_{k+1}^T. \tag{35} \]

Using equations (31)–(33), equation (35) can be further expressed in the following recursive form:

\[ R_{k+1} = R_k - \frac{R_k \boldsymbol{\phi}_j \boldsymbol{\phi}_j^T R_k}{\boldsymbol{\phi}_j^T R_k \boldsymbol{\phi}_j + \lambda_j}. \tag{36} \]

Thus, using equation (36), equation (26) can be computed as

\[ \Delta J_{k+1}(\boldsymbol{\phi}_j) = J(\hat{\Theta}_k) - J(\hat{\Theta}_{k+1}) = Y^T R_k Y - Y^T R_{k+1} Y = \frac{Y^T R_k \boldsymbol{\phi}_j \boldsymbol{\phi}_j^T R_k Y}{\boldsymbol{\phi}_j^T R_k \boldsymbol{\phi}_j + \lambda_j}. \tag{37} \]

To further simplify the computation of equation (37), for each $\boldsymbol{\phi}_j \in \{\boldsymbol{\phi}_{k+1}, \ldots, \boldsymbol{\phi}_M\}$ define

\[ R_k^j \triangleq R_k \boldsymbol{\phi}_j. \tag{38} \]

According to equation (36), $R_k^j$ can be recursively updated as

\[ R_k^j = R_{k-1}^j - \frac{R_{k-1}^k (R_{k-1}^k)^T \boldsymbol{\phi}_j}{(R_{k-1}^k)^T \boldsymbol{\phi}_k + \lambda_k}. \tag{39} \]

Substituting equation (39) into equation (37), the net contribution of $\boldsymbol{\phi}_j$ to the cost function can be explicitly expressed as

\[ \Delta J_{k+1}(\boldsymbol{\phi}_j) = \frac{(Y^T R_k^j)^2}{\boldsymbol{\phi}_j^T R_k^j + \lambda_j}. \tag{40} \]

Given the above-mentioned preparations, the procedure of selecting LITP-TT terms can be summarised as follows.


Step 1) Initialisation: Set λi, 1 ≤ i ≤ M, to some small positive value and set the iteration counter It = 1.

Step 2) LITP-TT model construction:

A Set the initial model size k = 0 and R0 = I; compute $\boldsymbol{\phi}_j^T \boldsymbol{\phi}_j$ and $R_0^j$, 1 ≤ j ≤ M.

B At the kth step, where 1 ≤ k ≤ M, compute the contribution of each remaining candidate model term φj (k ≤ j ≤ M) using equation (40).

C Find the term that gives the maximum contribution ∆Jk(φj) (k ≤ j ≤ M).

D Update $M_{k+1}^{-1}$ using equations (30)–(33) and $R_k^j$ using equation (39), then increase k = k + 1.

E Go to B) to add more model terms, until the construction criterion (ρ) is satisfied or a pre-set maximum model size is reached. This produces an LITP-TT model with n terms.

Step 3) Update λ using equations (27) and (28) with M = n. If λ remains sufficiently unchanged in two successive iterations, or a pre-set maximum iteration number is reached, stop; otherwise, set It = It + 1 and go to Step 2.
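
For reference, the whole procedure can be condensed into a short sketch. For clarity it maintains the full residual matrix Rk of equation (36) instead of the per-candidate vectors of equation (39), trading efficiency for readability, and it omits the outer regulariser-update loop of Step 3; all names are illustrative.

```python
import numpy as np

def forward_select(Phi, Y, lam, n_max, rho=1e-6):
    """Greedy term selection by net contribution, equations (36) and (40)."""
    N, M = Phi.shape
    R = np.eye(N)                              # R_0 = I
    selected = []
    remaining = list(range(M))
    for _ in range(n_max):
        RPhi = R @ Phi[:, remaining]           # R_k phi_j for every candidate
        num = (Y @ RPhi) ** 2                  # numerator of eq. (40)
        den = np.einsum('ij,ij->j', Phi[:, remaining], RPhi) + lam[remaining]
        j = remaining.pop(int(np.argmax(num / den)))
        selected.append(j)
        phi = Phi[:, j]
        Rphi = R @ phi
        R -= np.outer(Rphi, Rphi) / (phi @ Rphi + lam[j])   # eq. (36)
        if Y @ R @ Y < rho * N:                # stop once the MSE level rho is met
            break
    return selected
```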

5 Experimental results

The following sections provide a description of the steps taken to perform the identification of the MAPK signalling pathway and the transcriptional network of SOS response to DNA damage in E. coli.

5.1 The MAPK cascade

The MAPK cascade is an important intracellular signalling pathway that is involved in producing many different cellular responses, including cell growth and proliferation (Kholodenko, 2000). As such, it is an important pathway that can even be implicated in cancer development when its normal signalling process malfunctions. The pathway describes the response of a cell when it detects the binding of extracellular signalling molecules to receptor proteins at the surface of the cell membrane. The binding process results in conformational changes on the part of the receptor that is below the membrane surface, which in turn triggers the activation of a cascade of intracellular signalling proteins. This is a 3-tiered cascade where the kinase at each level is activated through dual phosphorylation at 2 amino acid sites by the activated kinase of the previous level (see Figure 1). At the end of the cascade, the terminal signalling protein activates target proteins, which alter the behaviour of the cell, for example, by regulating the expression of certain genes, by altering cell shape (by cytoskeletal proteins) or by changing cell metabolism (Alberts et al., 2002).


Figure 1 Kinetic pathway diagram of the MAPK cascade. The single and dual phosphorylation of each molecule is represented by the addition of a ‘-P’ and ‘-PP’, respectively, to the name of the kinase, where MAPK-PP represents the output-activated form of the kinase. Ras (or MKKKK) is the input protein that triggers the activation of the kinase at the top level of the cascade

5.1.1 Simulation of the MAPK cascade

To create a black-box model of the MAPK cascade, a set of input–output data is required to perform model estimation and validation. A simulation of the signalling pathway was performed to generate the data sets. The mathematical model used for the simulation is based on one derived in Kholodenko (2000), which includes the addition of negative feedback. This is an 8th-order state model with a single input and a single output (SISO). The model uses Michaelis–Menten enzyme kinetics to derive chemical rate equations for each of the pathway connections in the cascade. The rate equations are given in Tables 1 and 2. After setting the initial concentrations of each species and the rate constants, the equations can be solved for a particular time series.

Table 1 Kinetic rate equations for the concentrations of each of the 8 types of molecule found in the MAPK cascade

Source: Kholodenko (2000)


Table 2 Rate equations and parameters for each of the 10 reactions in the MAPK pathway diagram (Figure 1). The Michaelis-Menten constants (KI = 9, K1 = 10, K2 = 8, K3–K10 = 15) and molecular concentrations are given in nM. [Ras0] is the initial concentration of the input protein or MKKK kinase. The catalytic rate constants (k1 = k3 = k4 = k7 = k8 = 0.025) and the maximal enzyme rates (V2 = 0.25, V5 = V6 = 0.75, V9 = V10 = 0.5) are given in units of s–1 and nM.s–1, respectively

Source: Kholodenko (2000)

5.1.2 Identification of the MAPK cascade

A data set of 800 samples was generated from the simulation of the MAPK signalling cascade. The data was then normalised to within the range 0–1. Figure 2 shows a plot of MAPK-PP.

Figure 2 Plot of the MAPK-PP output (800 data points) (see online version for colours)

Ideally, when performing this type of regression modelling, a large data set is used to make certain that the model will capture the entire range of possible dynamics of the system. However, when dealing with biological systems, the amount of data available using current experimental techniques is much smaller than this. For example, a typical differential equation model in the systems biology literature is fitted to a set of around 30–50 data points. This could be a potential stumbling block for applying the identification methods. However, provided the derived model is able to perform well when validated on previously unseen data, the model can be said to be sufficiently accurate. To investigate the effect that data size has on performance, models were derived using two different subsets of the original samples, of 100 and 200 samples respectively. The 100-sample set was selected from the range 400–500, and the 200-sample set was selected from the range 400–600.

In each case, an LITP-TT model with RBF terms was constructed. The model input variables Ras (ut) and MAPK-PP (yt), with delays of up to 5 time steps each, were used to construct the full model set. The impact factors and centres of the RBF terms were assigned random values. The proposed method and the popular OLS were both used to construct models from the 100-sample and 200-sample sets separately, and each model was then validated on the full set of 800 data samples.
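
A hedged sketch of how such a full model set can be assembled from the two series (function and variable names are assumptions, not from the paper): lagged copies of u and y form each input vector, after which the random RBF parameters are drawn as in the earlier ELM sketch.

```python
import numpy as np

def make_lagged_inputs(u, y, n_lags=5):
    """Rows are x(t) = [y(t-1),...,y(t-n_lags), u(t-1),...,u(t-n_lags)],
    paired with targets y(t), for t = n_lags, ..., len(y)-1."""
    rows, targets = [], []
    for t in range(n_lags, len(y)):
        past_y = y[t - n_lags:t][::-1]         # y(t-1), ..., y(t-n_lags)
        past_u = u[t - n_lags:t][::-1]         # u(t-1), ..., u(t-n_lags)
        rows.append(np.concatenate([past_y, past_u]))
        targets.append(y[t])
    return np.asarray(rows), np.asarray(targets)
```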

When 100 data samples were used for modelling, the OLS algorithm selected 20 important hidden nodes. Table 3 shows the training error and the validation error on the 800 samples. For the proposed algorithm, a predefined maximum of 25 terms was used so that important terms were not missed in the updating procedure. As shown in Tables 3 and 4, the terms from 17 to 25 were penalised, leading to a smaller model size and better model performance. Figure 3 shows the model output compared with the actual output.

Table 3 MAPK validation with MSE result for different methods with 100 training samples

Method         Model size   MSE-training   MSE-testing
OLS            20           7.46E–4        4.30E–3
The proposed   16           4.01E–4        3.30E–3

Table 4 Regularisation parameters and the corresponding model coefficients (100 data samples)

Selection order   Hidden nodes   Weights     Regulariser
14                52             1.95E–01    1.44E–02
15                61             –9.18E–01   6.01E–04
16                3              4.53E–02    2.28E–01
17                4              –1.69E–02   1.01E+00
18                5              2.11E–04    9.56E+01
19                35             –8.49E–05   2.87E+02
20                57             –7.62E–10   8.34E+06
21                13             1.24E–10    1.01E+08
22                54             –1.12E–12   1.00E+10
23                49             –3.49E–12   3.21E+09
24                37             1.46E–15    1.56E+13
25                90             1.21E–16    6.61E+13

Page 15: Compact Extreme Learning Machines for biological systems · Compact Extreme Learning Machines for biological systems 113 Biographical notes: Kang Li is a Reader in Intelligent Systems

126 K. Li et al.

Figure 3 Output of the LITP-TT model trained by the proposed algorithm (100 samples) (see online version for colours)

Tables 5 and 6 show the experimental results and comparison when 200 samples were used for modelling. The proposed algorithm again provided a smaller model size and better performance. Figure 4 shows the plot of the model output.

Table 5 MAPK validation with MSE result for different methods with 200 training samples

Method         Model size   MSE-training   MSE-testing
OLS            26           9.98E–4        1.17E–3
The proposed   23           6.19E–4        1.13E–3

Table 6 Regularisation parameters and the corresponding model coefficients (200 data samples)

Selection order   Hidden nodes   Weights     Regulariser
19                95             1.95E–01    1.69E–04
20                88             –9.18E–01   6.89E–05
21                134            4.53E–02    2.20E–02
22                74             –1.69E–02   7.55E–03
23                157            2.11E–04    1.25E–02
24                111            –8.49E–05   2.65E+00
25                151            –7.62E–10   1.72E+04
26                128            1.24E–10    4.74E+04
27                2              –1.12E–12   1.00E+15
28                173            –3.49E–12   1.00E+15
29                83             1.46E–15    1.00E+15
30                187            1.21E–16    1.00E+15


Figure 4 Output of the LITP-TT model trained by the proposed algorithm (200 samples) (see online version for colours)

To compare the performances, the validation errors on the 800 samples are shown in Table 5. It should be noted that, due to the stochastic nature of the search, the identified models may vary from one run to another, and the results shown in the table are typical cases.

5.2 The SOS DNA repair system

The proposed method was then employed to model the transcriptional network of the SOS DNA repair system of E. coli (Ronen et al., 2002). Normally, a master repressor protein known as LexA binds to the promoter regions. Once part of the DNA is damaged, a sensor protein, RecA, binds itself to single-strand DNA, leading to the cleavage of LexA. The drop in the LexA level causes the de-repression of the other SOS genes (e.g., recA, lexA, umuD, polB, etc.), and the repair system is then activated. Figure 5 illustrates the activation process. As the damage is repaired, the LexA level increases and represses the SOS genes again. The transcriptional activity of the genes involved is monitored by means of a low-copy reporter plasmid in which a promoter controls Green Fluorescent Protein (GFP) (Ronen et al., 2002). The data is available from http://www.weizmann.ac.il/mcb/UriAlon.

Figure 5 The SOS DNA repair system in E. coli

Source: Cho et al. (2006)


The data consists of eight main genes involved in the repair system, and four experiments were conducted under different ultraviolet (UV) levels. Each experiment collected 50 points at an interval of 6 min. The proposed algorithm was applied to the first data set, with 6 genes (uvrD, lexA, umuD, recA, uvrA and polB (Cho et al., 2006)) chosen to build the regulation network. A power-law function was adopted, and the power parameter of each gene was randomly generated from [–2.0, 2.0]. In the candidate pool, 1000 cross terms were randomly generated, each consisting of two genes, as shown in equation (41):

\[ \varphi_i = x_k^{r_1} x_l^{r_2} \tag{41} \]

where i = 1, …, 1000 is the index of the candidate term; 1 ≤ k ≤ 6 and 1 ≤ l ≤ 6 are indices of the 6 genes, uniformly generated from [1, 6]; and $r_1$ and $r_2$ are random numbers uniformly generated from [–2.0, 2.0]. A separate model for each gene was constructed by the proposed algorithm, each model has 6–8 terms, and their prediction errors are shown in Table 7. Specifically, the gene lexA can be expressed as:

\[ y_{lexA}(t+1) = x_1^{1.7816}(t)\, y^{-1.2305}(t) + x_4^{-0.5315}(t)\, x_6^{1.7261}(t) + x_3^{-1.9317}(t)\, x_6^{-0.8213}(t) + x_6^{0.0808}(t)\, y^{-0.6379}(t) + x_4^{-0.6632}(t)\, y^{0.9410}(t) + x_5^{-1.9822}(t)\, y^{0.4654}(t) \tag{42} \]

where $x_i$ (i = 1, 3, 4, 5, 6) are the genes uvrD, umuD, recA, uvrA and polB, respectively, and y denotes lexA itself. The prediction performance on the system dynamics is illustrated in Figure 6, which confirms that the model predictions match well with the actual measurements.
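
A sketch of how the 1000 candidate cross terms of equation (41) can be drawn (uniform draws as described in the text; names are illustrative, and gene-expression levels are assumed positive so that non-integer powers are well defined):

```python
import numpy as np

rng = np.random.default_rng(0)

def powerlaw_candidates(X, n_cand=1000):
    """Candidate regressors phi_i = x_k^{r1} * x_l^{r2}, equation (41).

    X: (N, 6) positive gene-expression levels for the 6 chosen genes.
    """
    n_genes = X.shape[1]
    k = rng.integers(0, n_genes, size=n_cand)    # gene index of the first factor
    l = rng.integers(0, n_genes, size=n_cand)    # gene index of the second factor
    r1 = rng.uniform(-2.0, 2.0, size=n_cand)     # random powers from [-2, 2]
    r2 = rng.uniform(-2.0, 2.0, size=n_cand)
    Phi = (X[:, k] ** r1) * (X[:, l] ** r2)      # (N, n_cand) candidate matrix
    return Phi, (k, l, r1, r2)
```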

Table 7 Comparison of modelling errors

Gene                             uvrD     lexA     umuD     recA     uvrA     polB
Results in Ronen et al. (2002)   0.2      0.1      0.21     0.12     0.14     0.31
The proposed                     0.1069   0.0504   0.0922   0.0748   0.092    0.1774

Figure 6 The prediction of the genes involved in the DNA repair procedure in E. coli, panels (a)–(f) (dotted line: measured concentration; solid line: predicted concentration) (see online version for colours)

6 Conclusions

In this paper, an improved ELM has been investigated, which aims to obtain a more compact model without significantly increasing the overall computational complexity. This has been achieved by associating each model term with a regularised parameter, so that insignificant terms are automatically penalised and unselected, leading to improved model sparsity. The method has been applied to the model identification of the MAPK signal transduction pathway and the SOS DNA repair system. Experimental results have confirmed its effectiveness in producing more compact models with improved performance.

References

Alberts, B., Johnson, A., Lewis, J., Raff, M., Roberts, K. and Walter, P. (2002) Molecular Biology of the Cell, 4th ed., Garland Science.

Bishop, C.M. (1995) Neural Networks for Pattern Recognition, Clarendon Press, Oxford.

Chen, S., Cowan, C.F.N. and Grant, P.M. (1991) ‘Orthogonal least squares learning algorithm for radial basis functions’, IEEE Transactions on Neural Networks, Vol. 2, pp.302–309.

Chng, E.S., Chen, S. and Mulgrew, B. (1996) ‘Gradient radial basis function networks for nonlinear and nonstationary time series prediction’, IEEE Transactions on Neural Networks, Vol. 7, No. 1, pp.190–194.

Cho, D., Cho, K. and Zhang, B. (2006) ‘Identification of biochemical networks by S-tree based genetic programming’, Bioinformatics, Vol. 22, No. 13, p.1631.

Gormley, P., Li, K. and Irwin, G. (2007) ‘Modelling molecular interaction pathways using a two-stage identification algorithm’, Systems and Synthetic Biology, Vol. 1, No. 3, August, pp.145–160.

Hagan, M.T. and Menhaj, M.B. (1994) ‘Training feedforward networks with the Marquardt algorithm’, IEEE Transactions on Neural Networks, Vol. 5, No. 6, pp.989–993.

Handoko, S.D., Keong, K.C., Soon, O.Y., Zhang, G.L. and Brusic, V. (2006) ‘Extreme learning machine for predicting HLA-peptide binding’, Lecture Notes in Computer Science, Vol. 3973, pp.716–721.

Huang, G-B. and Chen, L. (2007) ‘Convex incremental extreme learning machine’, Neurocomputing (in press, available online).

Huang, G-B. and Chen, L. (2008) ‘Enhanced random search based incremental extreme learning machine’, Neurocomputing, Vol. 71, pp.3460–3468.

Huang, G-B., Chen, L. and Siew, C-K. (2006a) ‘Universal approximation using incremental constructive feedforward networks with random hidden nodes’, IEEE Transactions on Neural Networks, Vol. 17, No. 4, pp.879–892.

Huang, G-B., Saratchandran, P. and Sundararajan, N. (2005) ‘A generalized growing and pruning RBF (GGAP-RBF) neural network for function approximation’, IEEE Transactions on Neural Networks, Vol. 16, No. 1, pp.57–67.

Huang, G-B., Zhu, Q-Y., Mao, K.Z., Siew, C-K., Saratchandran, P. and Sundararajan, N. (2006b) ‘Can threshold networks be trained directly?’, IEEE Transactions on Circuits and Systems II, Vol. 53, No. 3, pp.187–191.

Huang, G-B., Zhu, Q-Y. and Siew, C-K. (2006c) ‘Extreme learning machine: theory and applications’, Neurocomputing, Vol. 70, pp.489–501.

Kadirkamanathan, V. and Niranjan, M. (1993) ‘A function estimation approach to sequential learning with neural networks’, Neural Computation, Vol. 5, pp.954–975.

Kholodenko, B.N. (2000) ‘Negative feedback and ultrasensitivity can bring about oscillations in the mitogen-activated protein kinase cascades’, European Journal of Biochemistry, Vol. 267, February, pp.1583–1588.

Kitano, H. (2002) ‘Systems biology: a brief overview’, Science, Vol. 295, March, pp.1662–1664.

Korenberg, M.J. (1988) ‘Identifying nonlinear difference equation and functional expansion representations: the fast orthogonal algorithm’, Annals of Biomedical Engineering, Vol. 16, pp.123–142.

Lawson, C.L. and Hanson, R.J. (1974) Solving Least Squares Problems, Prentice-Hall, Englewood Cliffs, NJ.

Levchenko, A., Bruck, J. and Sternberg, P.W. (2000) ‘Scaffold proteins may biphasically affect the levels of mitogen-activated protein kinase signaling and reduce its threshold properties’, Proceedings of the National Academy of Sciences, Vol. 97, No. 11, May, pp.5818–5823.

Li, K., Huang, G. and Ge, S. (2011) ‘A fast construction algorithm for feedforward neural networks’, in Rozenberg, G., Back, T. and Kok, J.N. (Eds.): Handbook for Natural Computing, Springer-Verlag, Chapter 16.

Li, K., Peng, J. and Bai, E. (2009) ‘Two-stage mixed discrete-continuous identification of radial basis function (RBF) neural models for nonlinear systems’, IEEE Transactions on Circuits and Systems I: Regular Papers, Vol. 56, No. 3, March, pp.630–643.

Li, K., Peng, J. and Irwin, G.W. (2005a) ‘A fast nonlinear model identification method’, IEEE Transactions on Automatic Control, Vol. 50, No. 8, pp.1211–1216.

Li, M-B., Huang, G-B., Saratchandran, P. and Sundararajan, N. (2005b) ‘Fully complex extreme learning machine’, Neurocomputing, Vol. 68, pp.306–314.

Liang, N-Y., Huang, G-B., Saratchandran, P. and Sundararajan, N. (2006a) ‘A fast and accurate online sequential learning algorithm for feedforward networks’, IEEE Transactions on Neural Networks, Vol. 17, No. 6, pp.1411–1423.

Liang, N-Y., Saratchandran, P., Huang, G-B. and Sundararajan, N. (2006b) ‘Classification of mental tasks from EEG signals using extreme learning machine’, International Journal of Neural Systems, Vol. 16, No. 1, pp.29–38.

Liu, Y., Loh, H.T. and Tor, S.B. (2005) ‘Comparison of extreme learning machine with support vector machine for text classification’, Lecture Notes in Computer Science, Vol. 3533, pp.390–399.

Mao, K.Z. and Huang, G-B. (2005) ‘Neuron selection for RBF neural network classifier based on data structure preserving criterion’, IEEE Transactions on Neural Networks, Vol. 16, No. 6, pp.1531–1540.

Marquardt, D. (1963) ‘An algorithm for least-squares estimation of nonlinear parameters’, SIAM J. Appl. Math., Vol. 11, pp.431–441.

Mendes, E.M.A.M. and Billings, S.A. (2001) ‘An alternative solution to the model structure selection problem’, IEEE Trans. Syst. Man Cybern., Part A, Vol. 31, No. 6, November, pp.597–608.

Miller, A.J. (1990) Subset Selection in Regression, Chapman & Hall, Victoria, Australia.

Musavi, M., Ahmed, W., Chan, K., Faris, K. and Hummels, D. (1992) ‘On training of radial basis function classifiers’, Neural Networks, Vol. 5, pp.595–603.

Peng, J., Li, K. and Huang, D.S. (2006) ‘A hybrid forward algorithm for RBF neural network construction’, IEEE Transactions on Neural Networks, Vol. 17, No. 6, pp.1439–1451.

Peng, J., Li, K. and Irwin, G.W. (2007) ‘A novel continuous forward algorithm for RBF neural modelling’, IEEE Transactions on Automatic Control, Vol. 52, No. 1, pp.117–122.

Peng, J., Li, K. and Irwin, G.W. (2008) ‘A new Jacobian matrix for optimal learning of single-layer neural nets’, IEEE Transactions on Neural Networks, Vol. 19, No. 1, pp.119–129.

Platt, J. (1991) ‘A resource-allocating network for function interpolation’, Neural Computation, Vol. 3, pp.213–225.

Press, W.H., Teukolsky, S.A., Vetterling, W.T. and Flannery, B.P. (1992) Numerical Recipes in C: The Art of Scientific Computing, Cambridge University Press, Cambridge.

Rao, C.R. and Mitra, S.K. (1971) Generalized Inverse of Matrices and its Applications, John Wiley & Sons, Inc., New York.

Ronen, M., Rosenberg, R., Shraiman, B. and Alon, U. (2002) ‘Assigning numbers to the arrows: parameterizing a gene regulation network by using accurate expression kinetics’, Proceedings of the National Academy of Sciences of the United States of America, Vol. 99, No. 16, pp.10555–10560.

Serre, D. (2002) Matrices: Theory and Applications, Springer-Verlag Inc., New York.

Sutanto, E.L., Mason, J.D. and Warwick, K. (1997) ‘Mean-tracking clustering algorithm for radial basis function centre selection’, International Journal of Control, Vol. 67, pp.961–977.

Wang, D. and Huang, G-B. (2005) ‘Protein sequence classification using extreme learning machine’, Proceedings of the International Joint Conference on Neural Networks (IJCNN2005), Montreal, Canada.

Wang, Z., Liu, X., Liu, Y., Liang, J. and Vinciotti, V. (2009) ‘An extended Kalman filtering approach to modeling nonlinear dynamic gene regulatory networks via short gene expression time series’, IEEE/ACM Transactions on Computational Biology and Bioinformatics, Vol. 6, No. 3, pp.410–419.

Wang, Z., Yang, F., Ho, D., Swift, S., Tucker, A. and Liu, X. (2008) ‘Stochastic dynamic modeling of short gene expression time-series data’, IEEE Transactions on NanoBioscience, Vol. 7, No. 1, pp.44–55.

Wolkenhauer, O., Kitano, H. and Cho, K. (2003) ‘Systems biology – looking at opportunities and challenges in applying systems theory to molecular and cell biology’, IEEE Control Systems Magazine, August, pp.38–48.

Xu, J-X., Wang, W., Goh, J.C.H. and Lee, G. (2005) ‘Internal model approach for gait modeling and classification’, 27th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBS), Shanghai, China.

Yeu, C-W.T., Lim, M-H., Huang, G-B., Agarwal, A. and Ong, Y-S. (2006) ‘A new machine learning paradigm for terrain reconstruction’, IEEE Geoscience and Remote Sensing Letters, Vol. 3, No. 3, pp.382–386.

Yingwei, L., Sundararajan, N. and Saratchandran, P. (1997) ‘A sequential learning scheme for function approximation using minimal radial basis function (RBF) neural networks’, Neural Computation, Vol. 9, pp.461–478.

Zhang, G.L. and Billings, S.A. (1996) ‘Radial basis function network configuration using mutual information and the orthogonal least squares algorithm’, Neural Networks, Vol. 9, pp.1619–1637.

Zhang, R., Huang, G-B., Sundararajan, N. and Saratchandran, P. (2006) ‘Multi-category classification using an extreme learning machine for microarray gene expression cancer diagnosis’, IEEE/ACM Transactions on Computational Biology and Bioinformatics, Vol. 4, No. 3, pp.485–495.

