Applied Mathematics and Computation 219 (2013) 4495–4502

Contents lists available at SciVerse ScienceDirect

Applied Mathematics and Computation

journal homepage: www.elsevier.com/locate/amc

The derivation of iterative convergence calculation for a nonlinear MIMO approximate dynamic programming approach

Zhijian Huang a,b,⇑, Jie Ma a, He Huang a

a Ocean Engineering State Key Laboratory, Shanghai Jiao Tong University, Shanghai 200030, PR China
b Merchant Marine College, Shanghai Maritime University, Shanghai 200135, PR China

Abstract

The standard approximate dynamic programming has only one action output. It is applied to single-control-variable systems, such as the inverted pendulum. For multi-input multi-output systems, approximate dynamic programming needs a complex scheme. Few papers have derived its iterative convergence calculation, and the algorithms that have been presented lack a rigorous mathematical basis. This paper first studies the matrix analysis foundation for the derivation of multi-input multi-output approximate dynamic programming. The study finds mathematical flaws in the derivation of a typical algorithm. Hence, we extend approximate dynamic programming to multi-input multi-output form and derive its detailed iterative convergence calculation. An experiment shows its effect. The algorithm is shown to be mathematically rigorous and not complicated. It is effective for the iterative convergence calculation of multi-input multi-output approximate dynamic programming.

Keywords: Adaptive critic designs; Approximate dynamic programming; Iterative convergence calculation; Neural network

⇑ Corresponding author. E-mail address: [email protected] (Z. Huang).

0096-3003/$ - see front matter © 2012 Elsevier Inc. All rights reserved.
http://dx.doi.org/10.1016/j.amc.2012.10.054

1. Introduction

Dynamic programming was first proposed by Bellman [1]. To overcome its curse of dimensionality, Werbos [2] proposed in 1977 an approach to approximate dynamic programming (ADP) that was later called adaptive critic designs (ACDs). This approach uses an artificial neural network as a function approximator of the cost-to-go in dynamic programming. To implement the ACDs algorithm, Werbos [3] later introduced a means to get around this numerical complexity by using ACDs formulas. A particularly impressive success that greatly motivated subsequent research was the development of a backgammon-playing program by Tesauro [4]. He clearly presented the concept of a critic neural network that approximates the optimal cost function in a control problem. Bertsekas and Tsitsiklis [5] gave an overview of neural dynamic programming in their book. In 1997, Prokhorov and Wunsch [6] developed more algorithms based on ACDs. In recent years, ADP has gained much attention from researchers. Si et al. [7] discussed its relations to artificial intelligence, approximation theory, control theory, operations research and statistics. Powell [8] showed how ADP can overcome the curse of dimensionality for complex deterministic or stochastic optimization problems, and future directions of ADP were pointed out in [9].

Conventional ADP contains three basic modules: critic, model and action. By combining the critic network and the model network into a new critic network [10], we get a form of action-dependent heuristic dynamic programming (ADHDP) in which the critic network implicitly includes a model network. However, Si's standard ADHDP [11] is for single-control-variable systems, such as the inverted pendulum. Its iterative convergence algorithm is also for single-output systems. This algorithm is mathematically rigorous [12]. But Si did not present formulas for nonlinear multi-input multi-output (MIMO) ADP in that paper [11].


In fact, only a handful of studies have been carried out in the nonlinear MIMO ADP area. For example, Enns and Si [13] demonstrated a model-free helicopter flight controller. This was the first time that ADP had been systematically applied to a complex, continuous-state, nonlinear MIMO system with uncertainty. They adopted a cascaded neural network scheme. J.A. Mulder [14] applied ADHDP to continuous adaptive critic flight control with reference to Enns and Si, and called it the approximated plant dynamics approach. This approach adopted two action networks, each of which still has only one output. Lee and coworkers [15] controlled a MIMO methyl methacrylate polymerization reactor with ADP. They used a k-nearest-neighbor averager algorithm to improve the control performance. Padhi and coworkers [16] solved a multi-critic-output control problem, such as a real-life micro-electro-mechanical system. Padhi got around numerical problems in training with a sub-critic-network structure. Other scholars have also adopted ADP in nonlinear MIMO systems; see, for example, the work of Liu [17], Murray [18], Kulkarni [19] and Lin et al. [20]. Nevertheless, they focused on the controlled objects, and the detailed algorithm was not studied further. Most of them gave the control effect directly, without the derivation procedure or the calculation formulas. To apply ADP to a nonlinear MIMO system, an extension scheme is needed. Nonlinear MIMO ADP therefore raises the problem of how to carry out its iterative convergence calculation. Lin and his students [20,21] adopted a matrix derivative algorithm. This is a typical algorithm for many scholars, especially in China. Lin et al. claimed to achieve a good effect [20], but this algorithm lacks a rigorous mathematical basis.

Therefore, this paper studies the matrix analysis foundation for the derivation of nonlinear MIMO ADP. The study finds flaws in the mathematical derivation of a typical algorithm. We therefore extend ADP to MIMO systems by extending its action module to multiple outputs, and derive the detailed iterative convergence algorithm. This algorithm adopts only the derivative of a scalar with respect to a scalar, which gives it a rigorous mathematical basis. An engine idling experiment on cylinder balance control is used to verify its feasibility. The algorithm is effective and practicable.

2. Matrix analysis foundation for derivation

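Let matrix $A = [a_{ij}]_{m \times n}$, vector $x = [x_1, x_2, \ldots, x_m]$ and vector $y = [y_1, y_2, \ldots, y_n]^T$.

Then

$$\frac{\partial x^T}{\partial x} = \frac{\partial x}{\partial x^T} = I, \quad \text{where } I \text{ is the identity matrix.} \tag{1}$$

Set $q = x \cdot y = [x_1, x_2, \ldots, x_m][y_1, y_2, \ldots, y_n]^T = \sum_{i=1}^{n} x_i y_i$, where $m = n$.

Then

$$\frac{\partial q}{\partial x} = \frac{\partial (x \cdot y)}{\partial x} = \frac{\partial \left( \sum_{i=1}^{n} x_i y_i \right)}{\partial x} = y^T, \qquad \frac{\partial q}{\partial y} = \frac{\partial (x \cdot y)}{\partial y} = \frac{\partial \left( \sum_{i=1}^{n} x_i y_i \right)}{\partial y} = x^T. \tag{2}$$

Set

$$q = Ay = \left[ \sum_{j=1}^{n} a_{1j} y_j, \; \sum_{j=1}^{n} a_{2j} y_j, \; \ldots, \; \sum_{j=1}^{n} a_{mj} y_j \right]^T.$$

Then

$$\frac{\partial q}{\partial A} = \frac{\partial (Ay)}{\partial A} = \frac{\partial \left( \left[ \sum_{j=1}^{n} a_{1j} y_j, \; \sum_{j=1}^{n} a_{2j} y_j, \; \ldots, \; \sum_{j=1}^{n} a_{mj} y_j \right]^T \right)}{\partial A} \triangleq \left[ \frac{\partial \left( \sum_{j=1}^{n} a_{1j} y_j \right)}{\partial a_{ij}}, \; \frac{\partial \left( \sum_{j=1}^{n} a_{2j} y_j \right)}{\partial a_{ij}}, \; \ldots, \; \frac{\partial \left( \sum_{j=1}^{n} a_{mj} y_j \right)}{\partial a_{ij}} \right]^T_{mm \times n} \tag{3}$$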
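The dimension expansion in Eq. (3) can be made concrete with a small numerical sketch. The Python below is only an illustration under assumed values (the matrix `A`, the vector `y` and the helper names are hypothetical, not from the paper): it differentiates $q = Ay$ with respect to one entry $a_{ij}$ at a time, so each entry yields its own length-$m$ derivative vector.

```python
def matvec(A, y):
    """Return q = A y for a matrix A (list of rows) and a vector y."""
    return [sum(a * b for a, b in zip(row, y)) for row in A]


def dq_da(A, y, i, j, eps=1e-6):
    """Central-difference derivative of the vector q = A y w.r.t. the entry a_ij."""
    hi = [row[:] for row in A]
    lo = [row[:] for row in A]
    hi[i][j] += eps
    lo[i][j] -= eps
    return [(h - l) / (2 * eps) for h, l in zip(matvec(hi, y), matvec(lo, y))]


A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # illustrative values, m = 2, n = 3
y = [0.5, -1.0, 2.0]

# One length-m derivative vector per entry a_ij: m * n * m scalars in total.
blocks = {(i, j): dq_da(A, y, i, j) for i in range(2) for j in range(3)}
for (i, j), column in sorted(blocks.items()):
    print((i, j), [round(c, 6) for c in column])
```

Each derivative vector is zero except in row $i$, where it equals $y_j$, which is why the result carries $m \cdot n$ blocks of length $m$ rather than the shape of $A$ itself.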

According to Chapter 5 of reference [12], besides compound scalar functions, the derivative of a compound vector or matrix function satisfies the chain rule in two cases.

(1) Let $f(y)$ be a real-valued function, $y(x)$ a vector-valued function of the vector $x$, and $x$ a column vector. Then:

$$\frac{\partial f(y(x))}{\partial x} = \frac{\partial y^T(x)}{\partial x} \frac{\partial f(y)}{\partial y}. \tag{4}$$

(2) Let $A$ be an $m \times n$ matrix, and let $y = f(A)$ and $g(y)$ be real-valued functions of the matrix $A$ and the scalar $y$, respectively. Then:

$$\frac{\partial g(f(A))}{\partial A} = \frac{dg(y)}{dy} \frac{\partial f(A)}{\partial A}. \tag{5}$$
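As a sanity check of case (1), the sketch below compares the right-hand side of Eq. (4) with a finite-difference gradient. The functions `y_of_x` and `f_of_y` are hypothetical examples chosen only for illustration; they do not come from the paper or from reference [12].

```python
def y_of_x(x):
    """Hypothetical vector-valued inner function y(x)."""
    return [x[0] * x[1], x[0] + x[1] ** 2]


def f_of_y(y):
    """Hypothetical real-valued outer function f(y)."""
    return y[0] ** 2 + 3.0 * y[1]


def chain_rule_gradient(x):
    """Right-hand side of Eq. (4): (d y^T / d x) (d f / d y)."""
    y = y_of_x(x)
    dyT_dx = [[x[1], 1.0],          # row i holds d y_k / d x_i for k = 0, 1
              [x[0], 2.0 * x[1]]]
    df_dy = [2.0 * y[0], 3.0]
    return [sum(dyT_dx[i][k] * df_dy[k] for k in range(2)) for i in range(2)]


def numeric_gradient(x, eps=1e-6):
    """Central-difference gradient of f(y(x)), one scalar component at a time."""
    grad = []
    for i in range(len(x)):
        hi, lo = list(x), list(x)
        hi[i] += eps
        lo[i] -= eps
        grad.append((f_of_y(y_of_x(hi)) - f_of_y(y_of_x(lo))) / (2 * eps))
    return grad


x0 = [1.5, -0.7]
print(chain_rule_gradient(x0))
print(numeric_gradient(x0))
```

The two printed gradients should agree up to finite-difference error, which is what Eq. (4) predicts for a real-valued function of a vector-valued function.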


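As matrix multiplication is generally not commutative, the derivative of a matrix or vector does not generally satisfy the chain rule except in the above two cases. That is, the equation $\frac{d}{dt}(A(t))^m = m \cdot (A(t))^{m-1} \cdot \frac{d}{dt}A(t)$ does not hold in general. Here $A(t) = (a_{ij}(t))_{n \times n}$, and each element $a_{ij}(t)$ is a differentiable function of the variable $t$. For example, take $m = 2$ and

$$A(t) = \begin{bmatrix} t & t \\ t+1 & t-1 \end{bmatrix}.$$

Then

$$(A(t))^2 = \begin{bmatrix} t & t \\ t+1 & t-1 \end{bmatrix}^2 = \begin{bmatrix} 2t^2 + t & 2t^2 - t \\ 2t^2 + t - 1 & 2t^2 - t + 1 \end{bmatrix},$$

$$\frac{d}{dt}(A(t))^m = \frac{d}{dt}(A(t))^2 = \frac{d}{dt} \begin{bmatrix} 2t^2 + t & 2t^2 - t \\ 2t^2 + t - 1 & 2t^2 - t + 1 \end{bmatrix} = \begin{bmatrix} 4t + 1 & 4t - 1 \\ 4t + 1 & 4t - 1 \end{bmatrix},$$

whereas

$$m \cdot (A(t))^{m-1} \cdot \frac{d}{dt}A(t) = 2 \cdot \begin{bmatrix} t & t \\ t+1 & t-1 \end{bmatrix} \cdot \frac{d}{dt} \begin{bmatrix} t & t \\ t+1 & t-1 \end{bmatrix} = \begin{bmatrix} 4t & 4t \\ 4t & 4t \end{bmatrix},$$

so $\frac{d}{dt}(A(t))^m \neq m \cdot (A(t))^{m-1} \cdot \frac{d}{dt}A(t)$.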

This example shows that the derivative of a matrix with respect to a scalar does not necessarily satisfy the chain rule. As the derivative of a matrix with respect to a vector includes the derivative of the matrix with respect to each element of the vector, it does not necessarily satisfy the chain rule either. The same holds for the derivative of a matrix with respect to a matrix. Therefore, we conclude that only Eqs. (4) and (5), as well as the derivative of a scalar with respect to a scalar, are sure to satisfy the chain rule.
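The counterexample can be replayed numerically. The following sketch uses the same $A(t)$ and $m = 2$; the helper names are ours, and the product rule $\frac{d}{dt}(AA) = A'A + AA'$ is the standard identity that respects the order of the factors, added here only for comparison.

```python
def A(t):
    """The 2 x 2 example matrix A(t) from the text."""
    return [[t, t], [t + 1.0, t - 1.0]]


def dA(t):
    """Element-wise derivative of A(t); constant for this example."""
    return [[1.0, 1.0], [1.0, 1.0]]


def matmul(X, Y):
    """Product of two 2 x 2 matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


def product_rule(t):
    """d/dt (A A) = A' A + A A', which keeps the factor order."""
    left, right = matmul(dA(t), A(t)), matmul(A(t), dA(t))
    return [[left[i][j] + right[i][j] for j in range(2)] for i in range(2)]


def naive_power_rule(t):
    """m A^(m-1) A' with m = 2, the scalar-style rule that fails here."""
    return [[2.0 * v for v in row] for row in matmul(A(t), dA(t))]


t = 1.0
print(product_rule(t))      # matches [[4t + 1, 4t - 1], [4t + 1, 4t - 1]]
print(naive_power_rule(t))  # matches [[4t, 4t], [4t, 4t]]
```

The two results differ because $A'A \neq AA'$ for this matrix, which is exactly the non-commutativity the text points to.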

3. The nonlinear MIMO ADHDP approach

In the ADP framework, an approximate solution is provided to the following equation [15]:

$$J[x(t), t] = \min_{u(t), u(t+1), \ldots} \left( \sum_{i=t}^{\infty} \alpha^{i-t} r[x(i), u(i), i] \right) = \min_{u(t)} \left( r[x(t), u(t), t] + \alpha J[x(t+1), t+1] \right), \tag{6}$$

where $J[x(t), t]$ is the minimum cost from time $t$ on, $r$ is the utility function, and $\alpha$ is the discount factor with $0 < \alpha \le 1$. The objective is to choose the control sequence $u(i)$, $i = t, t+1, \ldots$, so that the $J$ function is minimized.

Define the future accumulated cost at time $t$ as [11]

$$Q(t) = r(t+1) + \alpha \cdot r(t+2) + \ldots = J[x(t+1), t+1]. \tag{7}$$
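The recursive structure of Eq. (6) can be illustrated for a fixed, finite cost sequence. The sketch below is an assumption-laden toy: the utilities in `costs` are made up, and no minimization over controls is performed, so it shows only the identity $J(t) = r(t) + \alpha J(t+1)$ along one trajectory.

```python
def discounted_cost(costs, alpha):
    """Sum of alpha**i * costs[i] over a finite sequence, accumulated backwards."""
    total = 0.0
    for r in reversed(costs):
        total = r + alpha * total
    return total


alpha = 0.95
costs = [1.0, 0.5, 0.25, 0.0, 0.0]  # hypothetical utilities r(t), r(t+1), ...

J_t = discounted_cost(costs, alpha)
J_next = discounted_cost(costs[1:], alpha)
print(J_t, costs[0] + alpha * J_next)  # the two values coincide
```

Accumulating backwards applies the same recursion that the critic network is trained to satisfy approximately.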

Fig. 1 shows the principle of our nonlinear MIMO ADHDP approach. The critic module approximates the cost function $Q(t)$. The multi-output action module outputs the optimal control vector $u(t) = [u_1(t), u_2(t), \ldots, u_m(t)]$. These two modules can be regarded as an intelligent agent, which interacts with the controlled object under reinforcement learning.

The symbols are shown in Fig. 1. Define the prediction error of the critic network as follows:

$$e_c(t) = \alpha Q(t) - [Q(t-1) - r(t)], \tag{8}$$

$$E_c(t) = \frac{1}{2} e_c^2(t). \tag{9}$$

The weights of the critic network are updated by the gradient descent algorithm [11] to minimize $E_c(t)$:

$$w_c(t+1) = w_c(t) + l_c(t) \cdot \Delta w_c(t), \tag{10}$$

$$\Delta w_c(t) = -\left[ \frac{\partial E_c(t)}{\partial w_c(t)} \right] = -\left[ \frac{\partial E_c(t)}{\partial e_c(t)} \cdot \frac{\partial e_c(t)}{\partial Q(t)} \cdot \frac{\partial Q(t)}{\partial w_c(t)} \right]. \tag{11}$$
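Eqs. (8) to (11) reduce to a few lines of arithmetic for a single scalar weight. The sketch below uses made-up values for $Q(t)$, $Q(t-1)$, $r(t)$ and the gradient; the function names are ours and serve only to show the order of the operations.

```python
def critic_error(q_now, q_prev, r_now, alpha):
    """Eq. (8): e_c(t) = alpha * Q(t) - [Q(t-1) - r(t)]."""
    return alpha * q_now - (q_prev - r_now)


def critic_objective(e_c):
    """Eq. (9): E_c(t) = 0.5 * e_c(t)**2."""
    return 0.5 * e_c ** 2


def gradient_step(weight, grad_E, rate):
    """Eqs. (10)-(11) for one scalar weight: w + l_c * (-dE_c/dw)."""
    return weight + rate * (-grad_E)


e_c = critic_error(q_now=0.4, q_prev=0.5, r_now=0.1, alpha=0.95)
print(e_c, critic_objective(e_c))
print(gradient_step(weight=1.0, grad_E=0.5, rate=0.3))
```

Because $e_c$ depends on $Q(t)$ through the factor $\alpha$, the term $\partial E_c / \partial Q$ in Eq. (11) evaluates to $\alpha \cdot e_c(t)$.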

Fig. 1. Schematic diagram demonstrating the principle of our nonlinear multi-input multi-output action-dependent heuristic dynamic programming in this work. The solid lines represent signal flow, while the dashed lines are the paths for weight tuning.


Define the prediction error of the action network as follows:

$$e_a(t) = Q(t), \tag{12}$$

$$E_a(t) = \frac{1}{2} e_a^2(t). \tag{13}$$

The weights of the action network are also updated by the gradient descent algorithm [11] to minimize $E_a(t)$:

$$w_a(t+1) = w_a(t) + l_a(t) \cdot \Delta w_a(t), \tag{14}$$

$$\Delta w_a(t) = -\left[ \frac{\partial E_a(t)}{\partial w_a(t)} \right] = -\left[ \frac{\partial E_a(t)}{\partial e_a(t)} \cdot \frac{\partial e_a(t)}{\partial Q(t)} \cdot \frac{\partial Q(t)}{\partial w_a(t)} \right], \tag{15}$$

where $l_c(t) > 0$ and $l_a(t) > 0$ are the learning rates of the critic and action networks at time $t$, respectively.

Let $s$ be the input vector (16) to the critic network. For the critic network, from input to output:

$$s = [x_1(t), x_2(t), \ldots, x_n(t), u_1(t), u_2(t), \ldots, u_m(t)], \tag{16}$$

$$q_i(t) = \sum_{j=1}^{n+m} w_{cij}^{(1)}(t) s_j(t), \tag{17}$$

$$p_i(t) = \frac{1 - \exp(-q_i(t))}{1 + \exp(-q_i(t))}, \tag{18}$$

$$Q(t) = \sum_{i=1}^{N_{h1}} w_{ci}^{(2)}(t) p_i(t). \tag{19}$$

For the action network, from input to output:

$$h_i(t) = \sum_{j=1}^{n} w_{aij}^{(1)}(t) x_j(t), \tag{20}$$

$$g_i(t) = \frac{1 - \exp(-h_i(t))}{1 + \exp(-h_i(t))}, \tag{21}$$

$$v_k(t) = \sum_{i=1}^{N_{h2}} w_{aki}^{(2)}(t) g_i(t), \tag{22}$$

$$u_k(t) = \frac{1 - \exp(-v_k(t))}{1 + \exp(-v_k(t))}. \tag{23}$$
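The two forward passes in Eqs. (16) to (23) can be sketched in plain Python. Everything below is illustrative: the layer sizes are borrowed from the 2–8–4 and 6–12–1 structures described in Section 5, while the random weights, the seed and the input values are assumptions made only so that the code runs.

```python
import math
import random


def bipolar_sigmoid(z):
    """Activation used in Eqs. (18), (21) and (23)."""
    return (1.0 - math.exp(-z)) / (1.0 + math.exp(-z))


def action_forward(x, wa1, wa2):
    h = [sum(w * xj for w, xj in zip(row, x)) for row in wa1]    # Eq. (20)
    g = [bipolar_sigmoid(hi) for hi in h]                        # Eq. (21)
    v = [sum(w * gi for w, gi in zip(row, g)) for row in wa2]    # Eq. (22)
    u = [bipolar_sigmoid(vk) for vk in v]                        # Eq. (23)
    return g, u


def critic_forward(x, u, wc1, wc2):
    s = list(x) + list(u)                                        # Eq. (16)
    q = [sum(w * sj for w, sj in zip(row, s)) for row in wc1]    # Eq. (17)
    p = [bipolar_sigmoid(qi) for qi in q]                        # Eq. (18)
    Q = sum(w * pi for w, pi in zip(wc2, p))                     # Eq. (19)
    return p, Q


rng = random.Random(0)  # fixed seed, illustration only
n, m, n_h2, n_h1 = 2, 4, 8, 12
wa1 = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(n_h2)]
wa2 = [[rng.uniform(-1, 1) for _ in range(n_h2)] for _ in range(m)]
wc1 = [[rng.uniform(-1, 1) for _ in range(n + m)] for _ in range(n_h1)]
wc2 = [rng.uniform(-1, 1) for _ in range(n_h1)]

x = [0.3, -0.1]  # hypothetical state
g, u = action_forward(x, wa1, wa2)
p, Q = critic_forward(x, u, wc1, wc2)
print(u, Q)
```

The activation equals $\tanh(z/2)$, so every control output $u_k$ stays strictly inside $(-1, 1)$ and its derivative is $\frac{1}{2}(1 - u_k^2)$, the factor that appears throughout Section 4.2.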

4. The iterative convergence calculation for nonlinear MIMO ADHDP

4.1. The matrix derivative algorithm of iterative convergence

The algorithm of Lin and his students [20,21] resolves the derivative convergence calculation with an assumed chain rule. It calculates each factor individually. These factors are then combined with the matrix product and the Matlab dot product. For the weight update of the critic network, Lin et al. obtained the following [21]:

$$\Delta w_c^{(2)}(t) = -\frac{\partial E_c(t)}{\partial w_c^{(2)}(t)} = -\frac{\partial E_c(t)}{\partial e_c(t)} \cdot \frac{\partial e_c(t)}{\partial Q(t)} \cdot \frac{\partial Q(t)}{\partial w_c^{(2)}(t)} = -\alpha \cdot e_c(t) \cdot p(t)^T, \tag{24}$$

$$\Delta w_c^{(1)}(t) = -\frac{\partial E_c(t)}{\partial w_c^{(1)}(t)} = -\frac{\partial E_c(t)}{\partial e_c(t)} \cdot \frac{\partial e_c(t)}{\partial Q(t)} \cdot \frac{\partial Q(t)}{\partial w_c^{(1)}(t)} = -\frac{\partial E_c(t)}{\partial e_c(t)} \cdot \frac{\partial e_c(t)}{\partial Q(t)} \cdot \frac{\partial Q(t)}{\partial p(t)} \cdot \frac{\partial p(t)}{\partial q(t)} \cdot \frac{\partial q(t)}{\partial w_c^{(1)}(t)} = -\alpha \cdot e_c(t) \cdot w_c^{(2)}(t)^T \mathbin{.*} (1 - p(t) \mathbin{.*} p(t)) \cdot s(t)^T, \tag{25}$$

where the $.*$ symbol denotes the Matlab dot (element-wise) product. According to Eqs. (4) and (2), Eq. (24) satisfies the chain rule, and its derivative calculation is correct. However, for Eq. (25), we have shown in Section 2 that it does not necessarily satisfy the chain rule, because this formula includes the derivative of a vector with respect to a vector. In addition, the derivative of the last factor in Eq. (25) should expand its dimensions according to Eq. (3), but the derivative result does not. Therefore, this algorithm clearly lacks a rigorous mathematical basis in several respects.

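Similarly, Lin et al. update the action network weights [21] as follows. These formulas have the same flaws, except that the calculation is more complicated, so the analysis is not repeated here.

$$\Delta w_a^{(2)}(t) = -\frac{\partial E_a(t)}{\partial w_a^{(2)}(t)} = -\frac{\partial E_a(t)}{\partial Q(t)} \cdot \frac{\partial Q(t)}{\partial w_a^{(2)}(t)} = -\frac{\partial E_a(t)}{\partial Q(t)} \cdot \frac{\partial Q(t)}{\partial p(t)} \cdot \frac{\partial p(t)}{\partial q(t)} \cdot \frac{\partial q(t)}{\partial u(t)} \cdot \frac{\partial u(t)}{\partial v(t)} \cdot \frac{\partial v(t)}{\partial w_a^{(2)}(t)}$$

$$= -Q(t) \cdot \left( w_{c(:,n+1:n+m)}^{(1)}(t) \right)^T \cdot \left( w_c^{(2)}(t)^T \mathbin{.*} (1 - p(t) \mathbin{.*} p(t)) \right) \mathbin{.*} (1 - u(t) \mathbin{.*} u(t)) \cdot g(t)^T \tag{26}$$

$$\Delta w_a^{(1)}(t) = -\frac{\partial E_a(t)}{\partial w_a^{(1)}(t)} = -\frac{\partial E_a(t)}{\partial Q(t)} \cdot \frac{\partial Q(t)}{\partial w_a^{(1)}(t)} = -\frac{\partial E_a(t)}{\partial Q(t)} \cdot \frac{\partial Q(t)}{\partial p(t)} \cdot \frac{\partial p(t)}{\partial q(t)} \cdot \frac{\partial q(t)}{\partial u(t)} \cdot \frac{\partial u(t)}{\partial v(t)} \cdot \frac{\partial v(t)}{\partial g(t)} \cdot \frac{\partial g(t)}{\partial h(t)} \cdot \frac{\partial h(t)}{\partial w_a^{(1)}(t)}$$

$$= -Q(t) \cdot \left( w_a^{(2)}(t)^T \cdot \left( \left( w_{c(:,n+1:n+m)}^{(1)}(t) \right)^T \cdot \left( w_c^{(2)}(t)^T \mathbin{.*} (1 - p(t) \mathbin{.*} p(t)) \right) \mathbin{.*} (1 - u(t) \mathbin{.*} u(t)) \right) \mathbin{.*} (1 - g(t) \mathbin{.*} g(t)) \right) \cdot x(t)^T \tag{27}$$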

4.2. The presented algorithm of iterative convergence

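To satisfy the chain rule and obtain a rigorous mathematical basis, we adopt only the derivative of a scalar with respect to a scalar. That is, each formula updates only one connecting weight at a time. However, all the network weights should be updated within the same time step to prevent confusion.

Therefore, for the critic network, according to (11):

$$\Delta w_{ci}^{(2)}(t) = -\left[ \frac{\partial E_c(t)}{\partial w_{ci}^{(2)}(t)} \right] = -\left[ \frac{\partial E_c(t)}{\partial Q(t)} \cdot \frac{\partial Q(t)}{\partial w_{ci}^{(2)}(t)} \right] = -\alpha \cdot e_c(t) p_i(t), \tag{28}$$

$$\Delta w_{cij}^{(1)}(t) = -\left[ \frac{\partial E_c(t)}{\partial w_{cij}^{(1)}(t)} \right] = -\left[ \frac{\partial E_c(t)}{\partial Q(t)} \cdot \frac{\partial Q(t)}{\partial p_i(t)} \cdot \frac{\partial p_i(t)}{\partial q_i(t)} \cdot \frac{\partial q_i(t)}{\partial w_{cij}^{(1)}(t)} \right] = \begin{cases} -\alpha \cdot e_c(t) w_{ci}^{(2)}(t) \left[ \frac{1}{2} (1 - p_i^2(t)) \right] x_j(t), & 1 \le j \le n \\ -\alpha \cdot e_c(t) w_{ci}^{(2)}(t) \left[ \frac{1}{2} (1 - p_i^2(t)) \right] u_{j-n}(t), & n+1 \le j \le n+m \end{cases} \tag{29}$$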
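Eqs. (28) and (29) can be checked against finite differences of $E_c(t)$. The sketch below is a minimal illustration with a tiny critic network; the weights, the input vector `s`, and the values of $Q(t-1)$ and $r(t)$ are made-up numbers, and the function names are ours.

```python
import math


def sig(z):
    """Bipolar sigmoid of Eq. (18)."""
    return (1.0 - math.exp(-z)) / (1.0 + math.exp(-z))


def critic_Q(s, wc1, wc2):
    """Forward pass of Eqs. (17)-(19); returns hidden outputs p and Q."""
    p = [sig(sum(w * sj for w, sj in zip(row, s))) for row in wc1]
    return p, sum(w * pi for w, pi in zip(wc2, p))


def critic_objective(s, wc1, wc2, q_prev, r_now, alpha):
    """E_c(t) of Eqs. (8)-(9) as a function of the critic weights."""
    _, Q = critic_Q(s, wc1, wc2)
    return 0.5 * (alpha * Q - (q_prev - r_now)) ** 2


def critic_deltas(s, wc1, wc2, q_prev, r_now, alpha):
    """Scalar-by-scalar weight changes of Eqs. (28) and (29)."""
    p, Q = critic_Q(s, wc1, wc2)
    e_c = alpha * Q - (q_prev - r_now)
    d_wc2 = [-alpha * e_c * p[i] for i in range(len(wc2))]
    d_wc1 = [[-alpha * e_c * wc2[i] * 0.5 * (1.0 - p[i] ** 2) * s[j]
              for j in range(len(s))] for i in range(len(wc2))]
    return d_wc1, d_wc2


s = [0.2, -0.4, 0.6]  # hypothetical [x1, x2, u1], so n = 2 and m = 1
wc1 = [[0.1, -0.3, 0.5], [0.7, 0.2, -0.6]]
wc2 = [0.4, -0.8]
q_prev, r_now, alpha = 0.3, 0.1, 0.95

d_wc1, d_wc2 = critic_deltas(s, wc1, wc2, q_prev, r_now, alpha)
print(d_wc1)
print(d_wc2)
```

Since `s` already concatenates the states and the controls, the two cases of Eq. (29) collapse into the single factor `s[j]` in the code.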

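For the action network, each weight $w_{aki}^{(2)}(t)$ is connected to only one action output. The unconnected outputs in $u(t)$ have no influence on this weight. Thus, only the connected component of $u(t)$ should be considered in the update of this weight. According to (15):

$$\Delta w_{aki}^{(2)}(t) = -\left[ \frac{\partial E_a(t)}{\partial w_{aki}^{(2)}(t)} \right] = -\left[ \frac{\partial E_a(t)}{\partial Q(t)} \cdot \frac{\partial Q(t)}{\partial u_k(t)} \cdot \frac{\partial u_k(t)}{\partial v_k(t)} \cdot \frac{\partial v_k(t)}{\partial w_{aki}^{(2)}(t)} \right] = -e_a(t) \cdot \left[ \sum_{i=1}^{N_{h1}} \left( w_{ci}^{(2)}(t) \cdot \frac{1}{2} (1 - p_i^2(t)) \cdot w_{ci,n+k}^{(1)}(t) \right) \cdot \frac{1}{2} (1 - u_k^2(t)) \right] \cdot g_i(t) \tag{30}$$

where

$$\frac{\partial Q(t)}{\partial u_k(t)} = \sum_{i=1}^{N_{h1}} \frac{\partial \left( w_{ci}^{(2)}(t) \frac{1 - \exp(-q_i(t))}{1 + \exp(-q_i(t))} \right)}{\partial q_i(t)} \cdot \frac{\partial q_i(t)}{\partial u_k(t)} = \sum_{i=1}^{N_{h1}} \left( w_{ci}^{(2)}(t) \cdot \frac{1}{2} (1 - p_i^2(t)) \cdot w_{ci,n+k}^{(1)}(t) \right). \tag{31}$$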

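However, each weight $w_{aij}^{(1)}(t)$ is connected to all the output nodes through one hidden node in the multi-output action network. Thus, for the update of $w_{aij}^{(1)}(t)$, this multi-output action network can be seen as a linear combination of multiple single-output action networks. According to Eqs. (15) and (31):

$$\Delta w_{aij}^{(1)}(t) = -\sum_{k=1}^{m} \left[ \frac{\partial E_a(t)}{\partial w_{aij}^{(1)}(t)} \right] = -\sum_{k=1}^{m} \left[ \frac{\partial E_a(t)}{\partial Q(t)} \cdot \frac{\partial Q(t)}{\partial u_k(t)} \cdot \frac{\partial u_k(t)}{\partial v_k(t)} \cdot \frac{\partial v_k(t)}{\partial g_i(t)} \cdot \frac{\partial g_i(t)}{\partial h_i(t)} \cdot \frac{\partial h_i(t)}{\partial w_{aij}^{(1)}(t)} \right]$$

$$= -\sum_{k=1}^{m} \left[ e_a(t) \cdot \left[ \sum_{i=1}^{N_{h1}} \left( w_{ci}^{(2)}(t) \cdot \frac{1}{2} (1 - p_i^2(t)) \cdot w_{ci,n+k}^{(1)}(t) \right) \cdot \frac{1}{2} (1 - u_k^2(t)) \right] \cdot w_{aki}^{(2)}(t) \cdot \frac{1}{2} (1 - g_i^2(t)) \cdot x_j(t) \right]. \tag{32}$$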
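The action-network updates in Eqs. (30) to (32) can be verified the same way, by comparing them with finite differences of $E_a(t) = \frac{1}{2} Q^2(t)$. The sketch below is an illustration under assumed values: the small network sizes and all weights are made up, the function names are ours, and the inner sum over the critic hidden nodes uses the index `l` to keep it apart from the action hidden index `i`.

```python
import math


def sig(z):
    """Bipolar sigmoid used in Eqs. (18), (21) and (23)."""
    return (1.0 - math.exp(-z)) / (1.0 + math.exp(-z))


def forward(x, wa1, wa2, wc1, wc2):
    """Action network, Eqs. (20)-(23), followed by the critic, Eqs. (16)-(19)."""
    g = [sig(sum(w * xj for w, xj in zip(row, x))) for row in wa1]
    u = [sig(sum(w * gi for w, gi in zip(row, g))) for row in wa2]
    s = list(x) + u
    p = [sig(sum(w * sj for w, sj in zip(row, s))) for row in wc1]
    Q = sum(w * pl for w, pl in zip(wc2, p))
    return g, u, p, Q


def action_objective(x, wa1, wa2, wc1, wc2):
    """E_a(t) of Eqs. (12)-(13) as a function of the action weights."""
    return 0.5 * forward(x, wa1, wa2, wc1, wc2)[3] ** 2


def action_deltas(x, wa1, wa2, wc1, wc2):
    """Scalar-by-scalar weight changes of Eqs. (30) and (32)."""
    n, m = len(x), len(wa2)
    g, u, p, Q = forward(x, wa1, wa2, wc1, wc2)
    e_a = Q  # Eq. (12)
    # Eq. (31): dQ/du_k, summed over the critic hidden nodes l
    dQ_du = [sum(wc2[l] * 0.5 * (1.0 - p[l] ** 2) * wc1[l][n + k]
                 for l in range(len(wc2))) for k in range(m)]
    # signal propagated back to the pre-activation v_k of output node k
    out = [dQ_du[k] * 0.5 * (1.0 - u[k] ** 2) for k in range(m)]
    d_wa2 = [[-e_a * out[k] * g[i] for i in range(len(g))] for k in range(m)]
    d_wa1 = [[-sum(e_a * out[k] * wa2[k][i] for k in range(m))
              * 0.5 * (1.0 - g[i] ** 2) * x[j]
              for j in range(n)] for i in range(len(g))]
    return d_wa1, d_wa2


x = [0.3, -0.5]  # hypothetical state, n = 2
wa1 = [[0.2, -0.4], [0.6, 0.1]]
wa2 = [[0.5, -0.3], [-0.7, 0.9]]  # m = 2 outputs
wc1 = [[0.1, -0.2, 0.8, 0.3], [-0.5, 0.4, -0.6, 0.7]]
wc2 = [0.9, -0.4]

d_wa1, d_wa2 = action_deltas(x, wa1, wa2, wc1, wc2)
print(d_wa1)
print(d_wa2)
```

Each entry of `d_wa2` involves a single output node `k`, while each entry of `d_wa1` sums the contributions of all `m` outputs, mirroring the difference between Eq. (30) and Eq. (32).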

5. Experiment and results

We applied our nonlinear MIMO ADHDP algorithm to cylinder balance control of an idling engine. The engine model follows the models of Kim and Park [22] and Shim et al. [23] and our own simulation results [24]. The controller has a two-input four-output form. It intelligently adjusts the ignition timings of the four cylinders to achieve the control objective. The design parameters of this nonlinear MIMO ADHDP controller are listed in Table 1.
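Table 1 specifies a learning rate that starts at 0.3 and decreases by 0.05 per iteration until 0.01. One plausible reading of that schedule is sketched below; the linear decrement with a hard floor, and the function name `learning_rate`, are our assumptions rather than details stated in the paper.

```python
def learning_rate(iteration, start=0.3, step=0.05, floor=0.01):
    """Assumed schedule: linear decrease per iteration, clipped at the floor."""
    return max(start - step * iteration, floor)


for k in range(8):
    print(k, round(learning_rate(k), 4))
```

Under this reading the rate reaches its floor after six iterations and stays there for the rest of the training cycles.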

Table 1. Design parameters of our nonlinear MIMO ADHDP controller.

| Parameter | The critic network | The action network |
|---|---|---|
| Learning rate | $l_c(t) = 0.3$, decreases 0.05 per iteration until 0.01 | $l_a(t) = 0.3$, decreases 0.05 per iteration until 0.01 |
| Desired training error | $T_c = 0.05$ | $T_a = 0.005$ |
| Maximum training cycles | 50 times | 500 times |

Discount factor: $\alpha = 0.95$
Network weights initialization: (−1, +1), random

Fig. 2. Cylinder balance control procedure of engine speed at idle.

Fig. 3. Error: $E_c(t)$ and $E_a(t)$ convergence process of the critic and action networks.

The action network has a 2–8–4 structure, with two input neurons, eight hidden-layer neurons and four output neurons. The two inputs are the magnitude of the fundamental frequency in the Discrete Fourier Transform (DFT), corresponding to one full period of the engine cycle, and the magnitude of the second-harmonic frequency, corresponding to half of the period. The four outputs are the four ignition timings of cylinders 1–4. Both the hidden layer and the output layer use the sigmoidal function $y = (1 - e^{-x})/(1 + e^{-x})$, which is suitable for approximating nonlinear functions.

The critic network has a 6–12–1 structure, with six input neurons, twelve hidden-layer neurons and one output neuron. The six inputs are the magnitude of the fundamental frequency in the DFT, corresponding to one full period of the engine cycle, the magnitude of the second-harmonic frequency, corresponding to half of the period, and the four ignition timings of cylinders 1–4. The hidden layer of the critic network also adopts the above sigmoidal function.

The simulation shows that our ADHDP controller intelligently manipulates the spark ignition timings of cylinders 1–4 to suppress unbalanced combustion. It takes eight iterations to suppress the non-uniform revolution among the cylinders (Fig. 2). The control process converges uniformly, with no overshoot. The critic and action network errors quickly converge to an extremely small value. If the errors of the critic and action networks are given an initial value of 0.2 at iteration 0, that is, before training, the convergence to a minimum can be observed more clearly (Fig. 3). In addition, this algorithm has a much faster iterative convergence speed. It is probably at least two times faster than the published results, such as PI control [23], neural network control [25], the genetic algorithm [22] and the Alopex algorithm [22].

6. Conclusions

This paper starts from the current state of the ADP approach for nonlinear MIMO systems. Some mathematical foundations, such as the derivative of compound vector and matrix functions, are studied for the iterative convergence calculation of the nonlinear MIMO ADP approach. We also show that Eqs. (4) and (5), as well as the derivative of a scalar with respect to a scalar, are sure to satisfy the chain rule, whereas other cases do not necessarily satisfy it.

The mathematical flaws of a typical iterative convergence algorithm for nonlinear MIMO ADP are analyzed. We then propose our iterative convergence algorithm by extending the action network to multi-output form. This algorithm adopts only the derivative of a scalar with respect to a scalar, so it completely satisfies the chain rule. Hence, the derivation is mathematically rigorous. The algorithm has been successfully applied to a nonlinear MIMO system, namely cylinder balance control of an idling engine. The experimental results show that the derivation is correct and practicable.

Therefore, we have for the first time solved the iterative convergence calculation for nonlinear MIMO ADP in theory. The algorithm is also not complicated. It is an effective approach to the iterative convergence calculation of nonlinear MIMO ADHDP.

Acknowledgement

The authors would like to thank the anonymous reviewers for their helpful comments and high-quality suggestions. This work was supported by the NSFC Projects under Grant Nos. 50979058 and 31170952, and by the Innovation Program of Shanghai Municipal Education Commission under Grant No. 11ZZ143.

References

[1] R. Bellman, Dynamic Programming, Dover Publications Inc., Mineola, NY, 2003.
[2] P.J. Werbos, Advanced forecasting methods for global crisis warning and models of intelligence, General System Yearbook 22 (1977) 25–38.
[3] P.J. Werbos, Approximate dynamic programming for real-time control and neural modeling, in: D.A. White, D.A. Sofge (Eds.), Handbook of Intelligent Control: Neural, Fuzzy and Adaptive Approaches, Van Nostrand Reinhold, NY, 1992, 493–525.
[4] G. Tesauro, Practical issues in temporal difference learning, Machine Learning 8 (1992) 257–277.
[5] D.P. Bertsekas, J.N. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, Belmont, MA, 1996.
[6] D.V. Prokhorov, D.C. Wunsch, Adaptive critic designs, IEEE Transactions on Neural Networks 8 (5) (1997) 997–1007.
[7] J. Si, A.G. Barto, W.B. Powell, D. Wunsch, Handbook of Learning and Approximate Dynamic Programming, John Wiley & Sons Inc, Hoboken, NJ, 2004.
[8] W.B. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality, John Wiley & Sons Inc, Hoboken, NJ, 2007.
[9] F.Y. Wang, H. Zhang, D. Liu, Adaptive dynamic programming: an introduction, IEEE Computational Intelligence Magazine (2009) 39–47.
[10] D. Liu, X. Xiong, Y. Zhang, Action-dependent adaptive critic designs, in: Proceedings of the IEEE-INNS International Joint Conference on Neural Networks, Washington, DC, 2001, 990–995.
[11] J. Si, Y.T. Wang, On-line learning control by association and reinforcement, IEEE Transactions on Neural Networks 12 (2) (2001) 264–276.
[12] X.D. Zhang, Matrix Analysis and Applications, TsingHua University Press, Beijing, 2004.
[13] R. Enns, J. Si, Helicopter trimming and tracking control using direct neural dynamic programming, IEEE Transactions on Neural Networks 14 (4) (2003) 929–939.
[14] E.J. van Kampen, J.A. Mulder, Continuous Adaptive Critic Flight Control using Approximated Plant Dynamics, Master Thesis, Delft University of Technology, Delft, 2006.
[15] J.H. Lee, J.M. Lee, Approximate dynamic programming based approach to process control and scheduling, Computer and Chemical Engineering 30 (2006) 1603–1618.
[16] R. Padhi, N. Unnikrishnan, X. Wang, S.N. Balakrishnan, A single network adaptive critic (SNAC) architecture for optimal control synthesis for a class of nonlinear systems, Neural Networks 19 (10) (2006) 1648–1660.
[17] D. Liu, H. Javaherian, O. Kovalenko, T. Huang, Adaptive critic learning techniques for engine torque and air-fuel ratio control, IEEE Transactions on Systems, Man, and Cybernetics-Part B: Cybernetics 38 (4) (2008) 988–993.
[18] J.J. Murray, C.J. Cox, G.G. Lendaris, R. Saeks, Adaptive dynamic programming, IEEE Transactions on Systems, Man, and Cybernetics-Part C: Applications and Reviews 32 (2) (2002) 140–153.
[19] N.V. Kulkarni, K. Krishnakumar, Intelligent engine control using an adaptive critic, IEEE Transactions on Control Systems Technology 11 (2) (2003) 164–173.
[20] X. Lin, S. Lei, C. Song, S. Song, D. Liu, ADHDP for the pH value control in the clarifying process of sugar cane juice, Lecture Notes in Computer Science 5263 (2008) 796–805.
[21] S. Lei, X. Lin, Research on the Applications of Adaptive Dynamic Programming in pH Value Control of the Clarifying Progress in Sugar Refineries, Master Thesis, GuangXi University, Nanning, 2008.
[22] D.E. Kim, J. Park, Application of adaptive control to the fluctuation of engine speed at idle, Information Sciences 177 (16) (2007) 3341–3355.
[23] D. Shim, J. Park, P.P. Khargonekar, W.B. Ribbens, Reducing automotive engine speed fluctuation at idle, IEEE Transactions on Control Systems Technology 4 (4) (1996) 404–410.
[24] Z. Huang, J. Ma, H. Huang, An approximate dynamic programming method for multi-input multi-output nonlinear system, Optimal Control Applications and Methods (2011), http://dx.doi.org/10.1002/oca.1031.
[25] D.E. Kim, J. Park, Neural network control for reducing engine speed fluctuation at idle, IEEE Transactions on Control Systems Technology 4 (1999) 629–634.

