+ All Categories
Home > Documents > Kernel Temporal Differences for Neural Decoding

Kernel Temporal Differences for Neural Decoding

Date post: 29-Apr-2023
Category:
Upload: lsu
View: 0 times
Download: 0 times
Share this document with a friend
17
Research Article Kernel Temporal Differences for Neural Decoding Jihye Bae, 1 Luis G. Sanchez Giraldo, 1 Eric A. Pohlmeyer, 2 Joseph T. Francis, 3 Justin C. Sanchez, 2 and José C. Príncipe 1 1 Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL 32611, USA 2 Department of Biomedical Engineering, University of Miami, Coral Gables, FL 33146, USA 3 Department of Physiology and Pharmacology, Robert F. Furchgott Center for Neural & Behavioral Science, SUNY Downstate Medical Center, Brooklyn, NY 11203, USA Correspondence should be addressed to Jihye Bae; [email protected] Received 8 September 2014; Revised 28 January 2015; Accepted 3 February 2015 Academic Editor: Daoqiang Zhang Copyright © 2015 Jihye Bae et al. is is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. We study the feasibility and capability of the kernel temporal difference (KTD)() algorithm for neural decoding. KTD() is an online, kernel-based learning algorithm, which has been introduced to estimate value functions in reinforcement learning. is algorithm combines kernel-based representations with the temporal difference approach to learning. One of our key observations is that by using strictly positive definite kernels, algorithm’s convergence can be guaranteed for policy evaluation. e algorithm’s nonlinear functional approximation capabilities are shown in both simulations of policy evaluation and neural decoding problems (policy improvement). KTD can handle high-dimensional neural states containing spatial-temporal information at a reasonable computational complexity allowing real-time applications. When the algorithm seeks a proper mapping between a monkey’s neural states and desired positions of a computer cursor or a robot arm, in both open-loop and closed-loop experiments, it can effectively learn the neural state to action mapping. Finally, a visualization of the coadaptation process between the decoder and the subject shows the algorithm’s capabilities in reinforcement learning brain machine interfaces. 1. Introduction Research in brain machine interfaces (BMIs) is a multidis- ciplinary effort involving fields such as neurophysiology and engineering. Developments in this area have a wide range of applications, especially for subjects with neuromuscular disabilities, for whom BMIs may become a significant aid. Neural decoding of motor signals is one of the main tasks that needs to be executed by the BMI. Ideas from system theory can be used to frame the decod- ing problem. Bypassing the body can be achieved by mod- elling the transfer function from brain activity to limb movement and utilizing the output of the properly trained model to control a robotic device to implement the intention of movement. e design of neural decoding systems has been approached using machine learning methods. In order to choose the appropriate learning method, factors such as learning speed and stability help in determining the useful- ness of a particular method. Reinforcement learning brain machine interfaces (RLBMI) [1] have been shown to be a promising avenue for practical implementations. Fast adaptation under changing environments and neural decoding capability of an agent have been shown in [2, 3] using the actor-critic paradigm. 
Adaptive classification of event-related potential (ERP) in electroencephalography (EEG) using RL in BMI was pro- posed in [4]. Moreover, partially observable Markov decision processes (POMDPs) have been applied in the agent to account for the uncertainty when decoding noisy brain signals [5]. In a RLBMI, a computer agent and a user in the environment cooperate and learn coadaptively. e decoder learns how to correctly translate neural states into action direction pairs that indicate the subject’s intent. In the agent, the proper neural decoding of the motor signals is essential to control an external device that interacts with the physical environment. However, to realize the advantages of RLBMIs in practice, there are several challenges that need to be addressed. Hindawi Publishing Corporation Computational Intelligence and Neuroscience Volume 2015, Article ID 481375, 17 pages http://dx.doi.org/10.1155/2015/481375
Transcript

Research ArticleKernel Temporal Differences for Neural Decoding

Jihye Bae1 Luis G Sanchez Giraldo1 Eric A Pohlmeyer2 Joseph T Francis3

Justin C Sanchez2 and Joseacute C Priacutencipe1

1Department of Electrical and Computer Engineering University of Florida Gainesville FL 32611 USA2Department of Biomedical Engineering University of Miami Coral Gables FL 33146 USA3Department of Physiology and Pharmacology Robert F Furchgott Center for Neural amp Behavioral ScienceSUNY Downstate Medical Center Brooklyn NY 11203 USA

Correspondence should be addressed to Jihye Bae jbae1013gmailcom

Received 8 September 2014 Revised 28 January 2015 Accepted 3 February 2015

Academic Editor Daoqiang Zhang

Copyright copy 2015 Jihye Bae et alThis is an open access article distributed under the Creative Commons Attribution License whichpermits unrestricted use distribution and reproduction in any medium provided the original work is properly cited

We study the feasibility and capability of the kernel temporal difference (KTD)(120582) algorithm for neural decoding KTD(120582) is anonline kernel-based learning algorithm which has been introduced to estimate value functions in reinforcement learning Thisalgorithm combines kernel-based representations with the temporal difference approach to learning One of our key observationsis that by using strictly positive definite kernels algorithmrsquos convergence can be guaranteed for policy evaluation The algorithmrsquosnonlinear functional approximation capabilities are shown in both simulations of policy evaluation and neural decoding problems(policy improvement) KTD can handle high-dimensional neural states containing spatial-temporal information at a reasonablecomputational complexity allowing real-time applicationsWhen the algorithm seeks a propermapping between amonkeyrsquos neuralstates and desired positions of a computer cursor or a robot arm in both open-loop and closed-loop experiments it can effectivelylearn the neural state to action mapping Finally a visualization of the coadaptation process between the decoder and the subjectshows the algorithmrsquos capabilities in reinforcement learning brain machine interfaces

1 Introduction

Research in brain machine interfaces (BMIs) is a multidis-ciplinary effort involving fields such as neurophysiology andengineering Developments in this area have a wide rangeof applications especially for subjects with neuromusculardisabilities for whom BMIs may become a significant aidNeural decoding ofmotor signals is one of themain tasks thatneeds to be executed by the BMI

Ideas from system theory can be used to frame the decod-ing problem Bypassing the body can be achieved by mod-elling the transfer function from brain activity to limbmovement and utilizing the output of the properly trainedmodel to control a robotic device to implement the intentionof movement The design of neural decoding systems hasbeen approached using machine learning methods In orderto choose the appropriate learning method factors such aslearning speed and stability help in determining the useful-ness of a particular method

Reinforcement learning brain machine interfaces(RLBMI) [1] have been shown to be a promising avenue forpractical implementations Fast adaptation under changingenvironments and neural decoding capability of an agenthave been shown in [2 3] using the actor-critic paradigmAdaptive classification of event-related potential (ERP) inelectroencephalography (EEG) using RL in BMI was pro-posed in [4] Moreover partially observable Markov decisionprocesses (POMDPs) have been applied in the agent toaccount for the uncertainty when decoding noisy brainsignals [5] In a RLBMI a computer agent and a user in theenvironment cooperate and learn coadaptively The decoderlearns how to correctly translate neural states into actiondirection pairs that indicate the subjectrsquos intent In the agentthe proper neural decoding of the motor signals is essentialto control an external device that interacts with the physicalenvironment

However to realize the advantages of RLBMIs in practicethere are several challenges that need to be addressed

Hindawi Publishing CorporationComputational Intelligence and NeuroscienceVolume 2015 Article ID 481375 17 pageshttpdxdoiorg1011552015481375

2 Computational Intelligence and Neuroscience

The neural decoder must be able to handle high-dimensionalneural states containing spatial-temporal information Themapping from neural states to actions must be flexibleenough to avoid making strong assumptions Moreoverthe computational complexity of the decoder should bereasonable such that real time implementations are feasible

Temporal difference learning provides an efficient learn-ing procedure that can be applied to reinforcement learn-ing problems In particular TD(120582) [6] can be applied toapproximate a value function that is utilized to compute anapproximate solution to Bellmanrsquos equation The algorithmallows incremental computation directly from new experi-ence without having an associatedmodel of the environmentThis provides a means to efficiently handle high-dimensionalstates and actions by using an adaptive technique for functionapproximation that can be trained directly from the dataAlso because TD learning allows system updates directlyfrom the sequence of states online learning is possiblewithout having a desired signal at all times

Note that TD(120582) and its variants (least squares TD(120582)[7] recursive least squares TD [8] incremental least squaresTD(120582) [9] Gradient TD [10] and linear TD with gradientcorrection [11]) have been mostly treated in the context ofparametric linear function approximation This can becomea limiting factor in practical applications where little priorknowledge can be incorporated Therefore here our interestfocuses on a more general class of models with nonlinearcapabilities In particular we adopt a kernel-based functionapproximation methodology

Kernel methods are an appealing choice due to theirelegant way of dealing with nonlinear function approxima-tion problems Unlike most of the nonlinear variants of TDalgorithms which are prone to fall into local minima [12 13]the kernel based algorithms have nonlinear approximationcapabilities yet the cost function can be convex [14] One ofthe major appeals of kernel methods is the ability to handlenonlinear operations on the data by an implicit mapping tothe so called feature space (reproducing kernel Hilbert space(RKHS)) which is endowed with an inner product A linearoperation in the RKHS corresponds to a nonlinear operationin the input space In addition algorithms based on kernelmethods are still reasonably easy to compute based on thekernel trick [14]

Temporal difference algorithms based on kernel expan-sions have shown superior performance in nonlinear approx-imation problems The close relation between Gaussian pro-cesses and kernel recursive least squares was exploited in[15] to provide a Bayesian framework for temporal differencelearning Similar work using kernel-based least squares tem-poral difference learning with eligibilities (KLSTD(120582)) wasintroduced in [16] Unlike the Gaussian process temporaldifference algorithm (GPTD) KLSTD(120582) is not a probabilis-tic approach The idea in KLSTD is to extend LSTD(120582) [7]using the concept of duality However the computationalcomplexity of KLSTD(120582) per time update is 119874(1198993) whichprecludes its use for online learning

An online kernel temporal difference learning algorithmcalled kernel temporal differences (KTD)(120582) was proposed in

[17] By using stochastic gradient updates KTD(120582) reducesthe computational complexity from 119874(119899

3) to 119874(119899

2) This

reduction alongwith other capacity controlmechanisms suchas sparsification make real time implementations of KTD(120582)feasible

Even though nonparametric techniques are inherently ofgrowing structure these techniques produce better solutionsthan any other simple linear function approximation meth-odsThis hasmotivated work onmethods that help overcomescalability issues such as the growing filter size [18] In thecontext of kernel based TD algorithms sparsification meth-ods such as approximate linear dependence (ALD) [19] havebeen applied to GPTD [15] and KLSTD [20] A Quantizationapproach proposed in [21] has been used in KTD(120582) [22] In asimilar flavor the kernel distance based online sparsificationmethod was proposed for a KTD algorithm in [23] Notethat ALD is 119874(1198992) complexity whereas quantization andkernel distances are 119874(119899) The main difference between thequantization approach and the kernel distance is the spacewhere the distances are computed Quantization approachuses criterion of input space distances whereas kernel dis-tance computes them in the RKHS associated with the kernel[23]

In this paper we investigate kernel temporal differences(KTD)(120582) [17] for neural decoding We first show the advan-tages of using kernel methods Namely we show that con-vergence results of TD(120582) in policy evaluation carry overKTD(120582) when the kernel is strictly positive definite Examplesof the algorithmrsquos capability for nonlinear function approx-imation are also presented We apply the KTD algorithmto neural decoding in open-loop and closed-loop RLBMIexperiments where the algorithmrsquos ability to find proper neu-ral state to action map is verified In addition the trade offbetween the value function estimation accuracy and compu-tation complexity under growing filter size is studied Finallywe provide visualizations of the coadaptation between thedecoder and the subject highlighting the usefulness ofKTD(120582) for reinforcement learning brainmachine interfaces

This paper starts with a general background on rein-forcement learning which is given in Section 2 Section 3introduces the KTD algorithm and provides its convergenceproperties for policy evaluationThis algorithm is extended inSection 4 to policy improvement using 119876-learning Section 5introduces some of the kernel sparsification methods for theKTD algorithm that address the naturally growing structureof kernel adaptive algorithms Section 6 shows empiricalresults on simulations for policy evaluation and Section 7presents experimental results and comparisons to othermethods in neural decoding using real data sets for bothopen-loop and closed-loop RLBMI frameworks Conclusionsare provided in Section 8

2 Reinforcement Learning Brain MachineInterfaces and Value Functions

In reinforcement learning brainmachine interfaces (RLBMI)a neural decoder interacts with environment over timeand adjusts its behavior to improve performance [1]

Computational Intelligence and Neuroscience 3

Action

Reward

State

State

Computer cursorrobot arm

Target

Adaptive system

Kernel temporal

TD error

Value function Policy

x(n)

Q

r(n + 1)

a(n + 1)

x(n + 1)

differences (120582)

BMI userrsquosbrain

Agent (BMI decoder)Environment

Figure 1 The decoding structure of reinforcment learning modelin a brain machine interface using a 119876-learning based functionapproximation algorithm

The controller in the BMI can be considered as a neu-ral decoder and the environment includes the BMI user(Figure 1)

Assuming the environment is a stochastic and stationaryprocess that satisfies the Markov condition it is possibleto model the interaction between the learning agent andthe environment as a Markov decision process (MDP) Forthe sake of simplicity we assume the states and actions arediscrete but they can also be continuous

At time step 119899 the decoder receives the representationof the userrsquos neural state 119909(119899) isin X as input According tothis input the decoder selects an action 119886(119899) isin A whichcauses the state of the external device to change namely theposition of a cursor on a screen or a robotrsquos arm positionBased on the updated position the agent receives a reward119903(119899 + 1) isin R At the same time the updated position ofthe actuator will influence the userrsquos subsequent neural statesthat is going from 119909(119899) to 119909(119899 + 1) because of the visualfeedback involved in the process The new state 119909(119899 + 1)

follows the state transition probabilityP1198861199091199091015840 given the action

119886(119899) and the current state 119909(119899) At the new state 119909(119899 + 1) theprocess repeats the decoder takes an action 119886(119899+1) and thiswill result in a reward 119903(119899 + 2) and a state transition from119909(119899+ 1) to 119909(119899+ 2) This process continues either indefinitelyor until a terminal state is reached depending on the process

Note that the user has no direct access to actions and thedecoder must interpret the userrsquos brain activity correctly tofacilitate the rewards Also both systems act symbiotically bysharing the external device to complete their tasks Throughiterations both systems learn how to earn rewards basedon their joint behavior This is how the two intelligentsystems (the decoder and the user) learn coadaptively and theclosed loop feedback is created This coadaptation allows forcontinuous synergistic adaptation between the BMI decoderand the user even in changing environments [1]

The value function is a measure of long-term perfor-mance of an agent following a policy 120587 starting from a state119909(119899) The state value function is defined as

119881120587(119909 (119899)) = 119864120587 [R (119899) | 119909 (119899)] (1)

and action value function is given by

119876120587(119909 (119899) 119886 (119899)) = 119864120587 [R (119899) | 119909 (119899) 119886 (119899)] (2)

whereR(119899) is known as the return Here we apply a commonchoice for the return the infinite-horizon discounted model

R (119899) =infin

sum

119896=0

120574119896119903 (119899 + 119896 + 1) 0 lt 120574 lt 1 (3)

that takes into account the rewards in the long run but weighsthem with a discount factor to prevent the function fromgrowing unbounded as 119896 rarr infin and provides mathematicaltractability [24]Note that our goal is to find a policy120587 X rarr

A which maps a state 119909(119899) to an action 119886(119899) Estimating thevalue function is an essential step towards finding a properpolicy

3 Kernel Temporal Difference(120582)

In this section we provide a brief introduction to kernelmethods followed by the derivation of the KTD algorithm[17 22] One of the contributions of the present work is theconvergence analysis of KTD(120582) presented at the end of thissection

31 Kernel Methods Kernel methods are a family of algo-rithms for which input data are nonlinearly map to a high-dimensional feature space of vectors where linear operationsare carried out Let X be a nonempty set For a positivedefinite function 120581 X times X rarr R [14 25] there exists aHilbert spaceH and a mapping 120601 X rarr H such that

120581 (119909 119910) = ⟨120601 (119909) 120601 (119910)⟩ (4)

The inner product in the high-dimensional feature space canbe calculated by evaluating the kernel function in the inputspace Here H is called a reproducing kernel Hilbert space(RKHS) for which the following property holds

119891 (119909) = ⟨119891 120601 (119909)⟩ = ⟨119891 120581 (119909 sdot)⟩ forall119891 isinH (5)

The mapping implied by the use of the kernel function120581 can also be understood through Mercerrsquos Theorem [26]The implicit map 120601 allows one to transform conventionallinear algorithms in the feature space to nonlinear systemsin the input space and the kernel function 120581 provides animplicit way to compute inner products in the RKHS withoutexplicitly dealing with the high-dimensional space

32 Kernel Temporal Difference(120582) In the multistep pre-diction problem we consider a sequence of input-outputpairs (119909(1) 119889(1)) (119909(2) 119889(2)) (119909(119898) 119889(119898)) for whichthe desired output 119889 is only available at time 119898 + 1Consequently the system should produce a sequence ofpredictions 119910(1) 119910(2) 119910(119898) based solely on the observedinput sequences before it gets access to the desired responseIn general the predicted output is a function of all previousinputs 119910(119899) = 119891(119909(1) 119909(2) 119909(119899)) Here we assume that119910(119899) = 119891(119909(119899)) for simplicity and let the function 119891 belongto a RKHSH

In supervised learning by treating the observed inputsequence and the desired prediction as a sequence of pairs

4 Computational Intelligence and Neuroscience

(119909(1) 119889) (119909(2) 119889) (119909(119898) 119889) andmaking119889 ≜ 119910(119898+1) wecan obtain the updates of function119891 after the whole sequenceof119898 inputs has been observed as

119891 larr997888 119891 +

119898

sum

119899=1

Δ119891119899 (6)

= 119891 + 120578

119898

sum

119899=1

[119889 minus 119891 (119909 (119899))] 120601 (119909 (119899)) (7)

Here Δ119891119899= 120578[119889 minus ⟨119891 120601(119909(119899))⟩]120601(119909(119899)) are the instantaneous

updates of the function119891 from input data based on the kernelexpansion (5)

The key observation to extend the supervised learningapproach to the TD method is that the difference betweendesired and predicted output at time 119899 can be written as

119889 minus 119910 (119899) =

119898

sum

119896=119899

(119910 (119896 + 1) minus 119910 (119896)) (8)

where 119910(119898 + 1) ≜ 119889 Using this expansion in terms of thedifferences between sequential predictions we can update thesystem at each time step By replacing the error 119889 minus 119891(119909(119899))in (7) using the relation with temporal differences (8) andrearranging the equation as in [6] we obtain the followingupdate

119891 larr997888 119891 + 120578

119898

sum

119899=1

[119891 (119909 (119899 + 1)) minus 119891 (119909 (119899))]

119899

sum

119896=1

120601 (119909 (119896)) (9)

In this case all predictions are used equally Using exponen-tial weighting on recency yields the following update rule

119891 larr997888 119891 + 120578

119898

sum

119899=1

[119891 (119909 (119899 + 1)) minus 119891 (119909 (119899))]

119899

sum

119896=1

120582119899minus119896120601 (119909 (119896))

(10)

Here 120582 represent an eligibility trace rate that is added to theaveraging process over temporal differences to emphasize onthe most recently observed states and to efficiently deal withdelayed rewards

The above update rule (10) is called kernel temporaldifference (KTD)(120582) [17]The difference between predictionsof sequential inputs is called temporal difference (TD) error

119890TD (119899) = 119891 (119909 (119899 + 1)) minus 119891 (119909 (119899)) (11)

Note that the temporal differences 119891(119909(119899 + 1)) minus 119891(119909(119899)) canbe rewritten using the kernel expansions as ⟨119891 120601(119909(119899+ 1))⟩ minus⟨119891 120601(119909(119899))⟩This yields the instantaneous update of the func-tion 119891 as Δ119891

119899= 120578⟨119891 120601(119909(119899+1))minus120601(119909(119899))⟩sum

119899

119896=1120582119899minus119896120601(119909(119896))

Using the RKHS properties the evaluation of the function 119891at a given 119909 can be calculated from the kernel expansion

In reinforcement learning the prediction 119910(119899) = 119891(119909(119899))can be considered as the value function (1) or (2) Thisis how the KTD algorithm provides a nonlinear functionapproximation to Bellmanrsquos equation When the prediction119910(119899) represents the state value function the TD error (11)

is extended to the combination of a reward and sequentialvalue function predictions For instance in the case of policyevaluation the TD error is defined as

119890TD (119899) = 119903 (119899 + 1) + 120574119881 (119909 (119899 + 1)) minus 119881 (119909 (119899)) (12)

33 Convergence of Kernel Temporal Difference(120582) It hasbeen shown in [6 27] that for an absorbing Markov chainTD(120582) converges with probability 1 under certain conditionsRecall that the conventional TD algorithm assumes thefunction class to be linearly parametrized satisfying119910 = 119908⊤119909KTD(120582) can be viewed as a linear function approximation inthe RKHS Using this relation convergence of KTD(120582) canbe obtained as an extension of the convergence guaranteesalready established for TD(120582)

When 120582 = 1 by definition the KTD(120582 = 1) procedureis equivalent to the supervised learning method (7) KTD(1)yields the same per-sequence weight changes as the leastsquare solution since (9) is derived directly from supervisedlearning by replacing the error term in (8) Thus the conver-gence of KTD(1) can be established based on the convergenceof its equivalent supervised learning formulation which wasproven in [25]

Proposition 1 TheKLMS algorithm converges asymptoticallyin themean sense to the optimal solution under the ldquosmall-step-sizerdquo condition

Theorem 2 When the stepsize 120578119899satisfies 120578

119899ge 0 suminfin

119899=1120578119899=

infin and suminfin119899=1

1205782

119899lt infin KTD(1) converges asymptotically in the

mean sense to the least square solution

Proof Since by (8) the sequence of TD errors can be replacedby amultistep prediction with error 119890(119899) = 119889minus119910(119899) the resultof Proposition 1 also applies to this case

In the case of 120582 lt 1 as shown by [27] the convergenceof linear TD(120582) can be proved based on the ordinarydifferential equation (ODE) method introduced in [28] Thisresult can be easily extended to KTD(120582) as follows Letus consider the Markov estimation problem as in [6] AnabsorbingMarkov chain can be described by the terminal andnonterminal sets of states T and N transition probabilities119901119894119895between nonterminal states the transition probabilities

119904119894119895from nonterminal states to terminal states the vectors 119909

119894

representing the nonterminal states the expected terminalreturns 119889

119895from the 119895th terminal state and the probabilities

120583119894of starting at state 119894 Given an initial state 119894 isin N an

absorbing Markov chain generates an observation sequenceof 119898 vectors 119909

1198941

1199091198942

119909119894119898

where the last element 119909119894119898

ofthe sequence corresponds to a terminal state 119894

119898isin T The

expected outcome 119889 given a sequence starting at 119894 isin N isgiven by

119890lowast

119894equiv 119864 [119889 | 119894] (13)

= sum

119895isinT

119904119894119895119889119895+ sum

119895isinN

119901119894119895sum

119896isinT

119901119895119896119889119896+ sdot sdot sdot (14)

Computational Intelligence and Neuroscience 5

= [

infin

sum

119896=0

119876119896ℎ]

119894

= [(119868 minus 119876)minus1ℎ]119894 (15)

where [119909]119894denotes the 119894th element of the array 119909 119876 is the

transition matrix with entries [119876]119894119895= 119901119894119895for 119894 119895 isin N and

[ℎ]119894= sum119895isinT 119904119894119895119889119895 for 119894 isin N In linear TD(120582) a sequence

of vectors 1199081 1199082 is generated Each one of these vectors

119908119899is generated after having a complete observation sequence

that is a sequence staring at state 119894 isin N and ending at state119895 isin T with the respective return 119889

119895 Similar to linear TD(120582)

inKTD(120582)we have a sequence of functions1198911 1198912 (vectors

in a RKHS) for which we can also write a linear updateof the mean estimates of terminal return after 119899 sequenceshave been observed If 119891

119899is the actual function estimate after

sequence 119899 and 119891119899+1

is the expected function estimate afterthe next sequence we have that

119891119899+1

(119883) = 119891119899 (119883) + 120578119899+1H (119891119899 (119883) minus 119890

lowast) (16)

where H = minusK119863[119868 minus (1 minus 120582)119876(119868 minus 120582119876)minus1] [K]119894119895= 120581(119909

119894 119909119895)

with 119894 119895 isin N 119863 is a diagonal matrix and [119863]119894119894the expected

number of times the state 119894 is visited during a sequence and119891119899(119883) is a column vector of function evaluations of the state

representations such that [119891119899(119883)]119894= 119891119899(119909119894) = ⟨119891

119899 120601(119909119894)⟩

Analogously to [27] the mean estimates in (16) convergeappropriately if H has a full set of eigenvalues with negativereal parts for which we need K to be full rank For the aboveto be true it is required the set of vectors 120601(119909

119894)119894isinN to be

linearly independent in the RKHS This is exactly the casewhen the kernel 120581 is strictly positive definite as shown in thefollowing proposition

Proposition 3 If 120581 XtimesX rarr R is a strictly positive definitekernel for any finite set 119909

119894119873

119894=1sube X of distinct elements the set

120601(119909119894) is linearly independent

Proof If 120581 is strictly positive definite thensum120572119894120572119895120581(119909119894 119909119895) gt 0

for any set 119909119894where 119909

119894= 119909119895 for all 119894 = 119895 and any 120572

119894isin R

such that not all 120572119894= 0 Suppose there exists a set 119909

119894 for

which 120601(119909119894) are not linearly independent Then there must

be a set of coefficients 120572119894isin R not all equal to zero such that

sum120572119894120601(119909119894) = 0 which implies that sum120572

119894120601(119909119894)2= 0

0 = sum120572119894120572119895⟨120601 (119909119894) 120601 (119909

119895)⟩ = sum120572

119894120572119895120581 (119909119894 119909119895) (17)

which contradicts the assumption

The following Theorem is the resulting extension ofTheorem 119879 in [27] to KTD(120582)

Theorem4 For any absorbingMarkov chain for any distribu-tion of starting probailities120583

119894such that there are not inaccessible

states for any outcome distributions with finite expected values119889119895 for any strictly positive definite kernel 120581 and any set of

observation vectors 119909119894 119894 isin N such that 119909

119894= 119909119895if and only if

119894 = 119895 there exists an 120598 gt 0 such that if 120578119899= 120578 where 0 lt 120578 lt 120598

and for any initial function estimate the predictions of KTD(120582)converge in expected value to the ideal predictions of (15) If 119891

119899

denotes the function estimate after experiencing 119899 sequencesthen

lim119899rarrinfin

119864 [119891119899(119909119894)] = 119864 [119889 | 119894] = [(119868 minus 119876)

minus1ℎ]119894 forall119894 isinN

(18)

4 119876-Learning via Kernel TemporalDifferences(120582)

Since the value function represents the expected cumulativerewards given a policy the policy 120587 is better than the policy1205871015840 when the policy 120587 gives greater expected return than the

policy 1205871015840 In other words 120587 ge 1205871015840 if and only if 119876120587(119909 119886) ge

1198761205871015840

(119909 119886) for all 119909 isin X and 119886 isin A Therefore the optimalaction value function 119876 can be written as 119876lowast(119909(119899) 119886(119899)) =max120587119876120587(119909(119899) 119886(119899)) The estimation can be done online To

maximize the expected reward 119864[119903(119899 + 1) | 119909(119899) 119886(119899) 119909(119899 +1)] one-step 119876-learning update was introduced in [29]

119876 (119909 (119899) 119886 (119899)) larr997888 119876 (119909 (119899) 119886 (119899))

+ 120578 [119903 (119899 + 1) + 120574max119886119876 (119909 (119899 + 1) 119886)

minus 119876 (119909 (119899) 119886 (119899))]

(19)

At time 119899 an action 119886(119899) can be selected using methods suchas 120598-greedy or the Boltzmann distribution which are popularfor exploration and exploitation trade-off [30]

When we consider the prediction 119910 as action value func-tion 119876120587 with respect to a policy 120587 KTD(120582) can approximatethe value function119876120587 using a family of functions of the form

119876 (119909 (119899) 119886 = 119894) = 119891 (119909 | 119886 = 119894) = ⟨119891 120601 (119909 (119899))⟩ (20)

Here 119876(119909(119899) 119886 = 119894) denotes a state-action value given astate 119909(119899) at time 119899 and a discrete action 119894 Therefore theupdate rule for119876-learning via kernel temporal difference (119876-KTD)(120582) can be written as

119891 larr997888 119891 + 120578

119898

sum

119899=1

[119903 (119899 + 1) + 120574max119886119876 (119909 (119899 + 1) 119886)

minus 119876 (119909 (119899) 119886 (119899))]

119899

sum

119896=1

120582119899minus119896120601 (119909 (119896))

(21)

We can see that the temporal difference (TD) error at time 119899includes reward and action value function terms For single-step prediction problems (119898 = 1) (10) yields single updatesfor 119876-KTD(120582) of the form

119876119894(119909 (119899)) = 120578

119899minus1

sum

119895=1

119890TD119894 (119895) 119868119896 (119895) 120581 ⟨119909 (119899) 119909 (119895)⟩ (22)

Here 119876119894(119909(119899)) = 119876(119909(119899) 119886 = 119894) and 119890TD119894(119899) denotes the TD

error defined as 119890TD119894(119899) = 119903119894 + 120574119876119894119894(119909(119899 + 1)) minus 119876119894(119909(119899)) and119868119896(119899) is an indicator vector of size determined by the number

6 Computational Intelligence and Neuroscience

CalculaterewardExploration

Exploitation

Statevector

Actionx(n)

x(n)

x(1)

x(2)

x(3)

x(n minus 2)

x(n minus 1)

Σ

Σ

ΣQi(x(n))

Reward selected Q value

a(n) = 120578Σnj=1120582

nminusjIk(j)eTD(j)

Figure 2 The structure of 119876-learning via kernel temporal difference(120582)

of outputs (actions) Only the 119896th entry of the vector is setto 1 and the other entries are set to 0 The selection of theaction unit 119896 at time 119899 can be based on a greedy methodTherefore only the weight (parameter vector) correspondingto the winning action gets updated Recall that the reward 119903

119894

corresponds to the action selected by the current policy withinput 119909(119899) because it is assumed that this action causes thenext input state 119909(119899 + 1)

The structure of 119876-learning via KTD(0) is shown inFigure 2 The number of units (kernel evaluations) increasesas more input data arrives Each added unit is centered at theprevious input locations 119909(1) 119909(2) 119909(119899 minus 1)

In the reinforcement learning brain machine interface(RLBMI) paradigm kernel temporal difference(120582) helpsmodel the agent (see Figure 1) The action value function119876 can be approximated using KTD(120582) for which the ker-nel based representations enhance the functional mappingcapabilities of the system Based on the estimated 119876 valuesa policy decides a proper action Note that the policy cor-responds to the learning policy which changes over time in119876-learning

5 Online Sparsification

One characteristic of nonparametric approaches is theirinherently growing structure which is usually linear in thenumber of input data points This rate of growth becomesprohibitive for practical applications that handle increasingamounts of incoming data over time Various methods havebeen proposed to alleviate this problem (see [31] and refer-ences therein)Thesemethods known as kernel sparsificationmethods can be applied to the KTD algorithm to controlthe growth of the terms in the function expansion also

known as filter size Popular examples of kernel sparsificationmethods are the approximate linear dependence (ALD) [19]Surprise criterion [32] Quantization approach [21] andthe kernel distance based method [23] The main idea ofsparsification is to only consider a reduced set of samplescalled the dictionary to represent the function of interestThecomputational complexity ofALD is119874(1198892) where119889 is the sizeof the dictionary For the other methods mentioned abovethe complexity is 119874(119889)

Each of these methods has its own criterion to determinewhether an incoming sample should be added to the currentdictionary The Surprise criterion [32] measures the subjec-tive information of exemplar 119909 119889 with respect to a learningsystem Γ

119878Γ(119909 119889) = minus ln119901 (119909 119889 | Γ) (23)

Only samples with high values of Surprise are considered ascandidates for the dictionary In the case of the Quantizationapproach introduced in [21] the distance between a newinput 119909(119899) and the existing dictionary elements 119862(119899 minus 1) isevaluated The new input sample is added to the dictionaryif the distance between the new input 119909(119899) and the closestelement in 119862(119899 minus 1)

min119909119894isin119862(119899minus1)

1003817100381710038171003817119909 (119899) minus 1199091198941003817100381710038171003817 gt 120598119880 (24)

is larger than the Quantization size 120598119880 Otherwise the new

input state 119909(119899) is absorbed by the closest existing unit Verysimilar to the quantization approach the method presentedin [23] applies a distance threshold criterion in the RKHSThe kernel distance based criterion given a state dictionary

Computational Intelligence and Neuroscience 7

119863(119899 minus 1) adds a new unit when the new input state 119909(119899)satisfies following condition

min119909119894isin119863(119899minus1)

1003817100381710038171003817120601(119909(119899)) minus 120601(119909119894)10038171003817100381710038172gt 1205831 (25)

For some kernels such as Gaussian the Quantizationmethodand the kernel distance based criterion can be shown to beequivalent

6 Simulations

Note that the KTD algorithm has been introduced for valuefunction estimation To evaluate the algorithmrsquos nonlinearcapability we first examine the performance of theKTD(120582) inthe problem of state value function estimation given a fixedpolicy 120587 We carry out experiments on a simple illustrativeMarkov chain initially described in [33] This is a popularexperiment involving an episodic task to test TD learningalgorithms The experiment is useful in illustrating linear aswell as nonlinear functions of the state representations andshows how the state value function is estimated using theadaptive system

61 Linear Case Even though we emphasize the capabilityof KTD(120582) as a nonlinear function approximator underthe appropriate kernel size KTD(120582) should approximatelinear functions on a region of interest as well To test itsefficacy we observe the performance on a simple Markovchain (Figure 3) There are 13 states numbered from 12 to0 Each trial starts at state 12 and terminates at state 0Each state is represented by a 4-dimensional vector and therewards are assigned in such a way that the value function119881 is a linear function on the states namely 119881lowast takes thevalues [0 minus2 minus4 minus22 minus24] at states [0 1 2 11 12]In the case of 119881 = 119908

⊤119909 the optimal weights are 119908lowast =

[minus24 minus16 minus8 0]To assess the performance the updated estimate of the

state value function (119909) is compared to the optimal valuefunction119881lowast at the end of each trialThis is done by computingthe RMS error of the value function over all states

RMS = radic1

119899sum

119909isinX

(119881lowast(119909) minus (119909))2

(26)

where 119899 is the number of states 119899 = 13Stepsize scheduling is applied as follows

120578 (119899) = 1205780

1198860+ 1

1198860+ 119899

where 119899 = 1 2 (27)

where 1205780is the initial stepsize and 119886

0is the annealing

factor which controls how fast the stepsize decreases In thisexperiment 119886

0= 100 is applied Furthermore we assume that

the policy 120587 is guaranteed to terminate which means that thevalue function 119881120587 is well-behaved without using a discountfactor 120574 in (3) that is 120574 = 1

In KTD(120582) we employ the Gaussian kernel

120581 (119909 (119894) 119909 (119895)) = exp(minus1003817100381710038171003817119909(119894) minus 119909(119895)

10038171003817100381710038172

2ℎ2) (28)

Start11 10

End

3 2 1middot middot middot

minus3

minus3

minus3

minus3

minus3

minus3

minus3

minus3

minus3 minus3

minus3

minus3

minus3 minus212 0

[1 0 0 0]

[34 14 0 0]

[0 0 34 14][0 0 14 34]

[0 0 12 12] [0 0 0 1]

[12 12 0 0]

Figure 3 A 13-state Markov chain [33] For states from 2 to 12the state transition probabilities are 05 and the correspondingrewards are minus3 State 1 has state transition probability of 1 to theterminal state 0 and a reward of minus2 States 12 8 4 and 0 havethe 4-dimensional state space representations [1 0 0 0] [0 1 0 0][0 0 1 0] and [0 0 0 1] respectively The representations of theother states are linear interpolations between the above vectors

which is a universal kernel commonly encountered in prac-tice To find the optimal kernel size we fix all the other freeparameters around median values 120582 = 04 and 120578

0= 05

and the average RMS error over 10 Monte Carlo runs iscompared For this specific experiment smaller kernel sizesyield better performance since the state representations arefinite However in general applying too small kernel sizesleads to over-fitting and slow learning In particular choosinga very small kernel makes the algorithm behave very similarto the table look up method Thus we choose the kernel sizeℎ = 02 to be the largest kernel size for which we obtainsimilar mean RMS values as for smaller kernel sizes

After fixing the kernel size to ℎ = 02 the experimentalevaluation of different combinations of eligibility trace rates120582 and initial step sizes 120578

0are observed Figure 4 shows the

average performance over 10 Monte Carlo runs for 1000trials

All 120582 values with optimal stepsize show good approxima-tion to 119881lowast after 1000 trials Notice that KTD(120582 = 0) showsslightly better performance than KTD(120582 = 1) This may beattributed to the local nature ofKTDwhenusing theGaussiankernel In addition varying the stepsize has a relatively smalleffect on KTD(120582) The Gaussian kernel as well as other shift-invariant kernels provide an implicit normalized update rulewhich is known to be less sensitive to stepsize Based onFigure 4 the optimal eligibility trace rate and initial stepsizevalue 120582 = 06 and 120578

0= 03 are selected for KTD with kernel

size ℎ = 02The learning curve of KTD(120582) is compared to the con-

ventional TD algorithm TD(120582) The optimal parametersemployed in both algorithms are based on the experimentalevaluation In TD(120582) 120582 = 1 and 120578

0= 01 are applied The

RMS error is averaged over 50 Monte Carlo runs for 1000trials Comparative learning curves are given in Figure 5

In this experiment we confirm the ability of KTD(120582) tohandle the function approximation problem when the fixedpolicy yields a state value function that is linear in the staterepresentation Both algorithms reach the mean RMS valueof around 006 As we expected TD(120582) converges faster to theoptimal solution because of the linear nature of the problemKTD(120582) converges slower than TD(120582) but it is also ableto approximate the value function properly In this sense

8 Computational Intelligence and Neuroscience

1205780 = 01

1205780 = 02

1205780 = 03

1205780 = 04

1205780 = 05

1205780 = 06

1205780 = 07

1205780 = 08

1205780 = 09

0 02 04 06 08 1

120582

RMS

erro

r of v

alue

func

tion

over

all s

tate

s

05

045

04

035

03

025

02

015

01

005

0

Figure 4 Performance comparison over different combinations ofeligibility trace rates 120582 and initial step sizes 120578

0in KTD(120582) with ℎ =

02 The vertical line segments contain the mean RMS value after100 trials (top marker) 500 trials (middle marker) and 1000 trials(bottom marker)

0 200 400 600 800 10000

01

02

03

04

05

06

07

08

09

1

Trial number

RMS

erro

r of v

alue

func

tion

over

all s

tate

s

KTDTD

Figure 5 Learning curve of KTD(120582) and TD(120582) The solid lineshows the mean RMS error and the dashed line shows the +minusstandard deviations over 50Monte Carlo runs

the KTD algorithm is open to wider class of problems thanits linear counterpart

62 Nonlinear Case Previous section show the performancesof KTD(120582) on the problem of estimating a state value

Start11 10

End

3 2 112 0middot middot middot

[1 0 0 0]

[34 14 0 0]

[12 12 0 0] [0 0 34 14][0 0 14 34]

[0 0 12 12] [0 0 0 1]

minus8

minus8

minus4 minus4

minus4

minus2 minus2

minus2minus2

minus1

minus1

minus05

minus05 minus02

Figure 6 A 13-state Markov chain In states from 2 to 12 each statetransition has probability 05 and state 1 has transition probability1 to the absorbing state 0 Note that optimal state value functionscan be represented as a nonlinear function of the states andcorresponding reward values are assigned to each state

function which is a linear function of the given state repre-sentation The same problem can be turned into a nonlinearone bymodifying the reward values in the chain such that theresulting state value function119881lowast is no longer a linear functionof the states

The number of states and the state representations remainthe same as in the previous section However the optimalvalue function 119881

lowast becomes nonlinear with respect tothe representation of the states namely119881lowast = [0 minus02 minus06

minus 14 minus 3 minus 62 minus 126 minus 134 minus 135 minus 1445 minus 15975

minus 192125 minus 255938] for states 0 to 12 This implies that thereward values for each state are different from the ones givenfor the linear case (Figure 6)

Again to evaluate the performance after each trial iscompleted the estimated state value is compared to theoptimal state value 119881lowast using RMS error (26) For KTD(120582)the Gaussian kernel (28) is applied and kernel size ℎ = 02 ischosen Figure 7 shows the average RMS error over 10MonteCarlo runs for 1000 trials

The combination of 120582 = 04 and 1205780= 03 shows the best

performance but the 120582 = 0 case also shows good perfor-mances Unlike TD(120582) [6] there is no dominant value for 120582in KTD(120582) Recall that it has been proved that convergenceis guaranteed for linearly independent representations of thestates which is automatically fulfilled in KTD(120582) when thekernel is strictly positive definite Therefore the differencesare rather due to the convergence speed controlled by theinteraction between the step size and the elegibilty trace

The average RMS error over 50Monte Carlo runs is com-pared with Gaussian process temporal difference (GPTD)[15] and TD(120582) in Figure 8The purpose of GPTD implemen-tation is to have comparison among kernelized value functionapproximations Here the applied optimal parameters forKTD(120582) are 120582 = 04 120578

0= 03 and ℎ = 02 for GPTD 120582 = 1

1205902= 05 and ℎ = 02 and for TD(120582) 120582 = 08 and 120578

0= 01

The linear function approximation TD(120582) (blue line)cannot estimate the optimal state values KTD(120582) outper-forms the linear algorithm as expected since the Gaussiankernel is strictly positive definite GPTD also learns the targetstate values but the system fails to reach as low error valuesas KTD GPTD is sensitive to the selection of the covariancevalue in the noise1205902 if the value is small the system becomesunstable and larger values cause the the learning to slowdown GPTD models the residuals the difference between

Computational Intelligence and Neuroscience 9

1205780 = 01

1205780 = 02

1205780 = 03

1205780 = 04

1205780 = 05

1205780 = 06

1205780 = 07

1205780 = 08

1205780 = 09

0 02 04 06 08 1

120582

RMS

erro

r of v

alue

func

tion

over

all s

tate

s

05

045

04

035

03

025

02

015

01

005

0

Figure 7 Performance comparison over different combinations of120582 and the initial stepsize 120578

0in KTD(120582) with ℎ = 02 The plotted

segment is the mean RMS value after 100 trials (top segment) 500trials (middle segment) and 1000 trials (bottom segment)

0

1

2

3

Trial number

RMS

erro

r of v

alue

func

tion

over

all s

tate

s

KTDGPTDTD

25

15

05

101 102 103

Figure 8 Learning curves of KTD(120582) GPTD and TD(120582)The solidlines show the mean RMS error and the dashed lines represent the(+minus) standard deviation over 50Monte Carlo runs

expected return and actual return as a Gaussian processThis assumption does not hold true for the Markov chain inFigure 6 As we can observe in Figure 8 KTD(120582) reaches tothe mean value around 007 and the mean value of GPTDand TD(120582) are around 02 and 18 respectively

In the synthetic examples we presented experimentalresults to approximate the state value function under a fixedpolicy We observed that KTD(120582) performs well on bothlinear and nonlinear function approximation problems Inaddition in the previous section we showed how the linearindependence of the input state representations can affectthe performance of algorithms The use of strictly positivedefinite kernels in KTD(120582) implies the linear independencecondition and thus this algorithm converges for all 120582 isin [0 1]In the following section we will apply the extended KTDalgorithm to estimate the action value function which can beemployed in finding a proper control policy for RLBMI tasks

7 Experimental Results on Neural Decoding

In our RLBMI experiments we map the monkeyrsquos neuralsignal to action-directions (computer cursorrobot arm posi-tion) The agent starts at a naive state but the subject hasbeen trained to receive rewards from the environment Onceit reaches the assigned target the system and the subjectearn a reward and the agent updates its neural state decoderThrough iteration the agent learns how to correctly translateneural states into action-directions

71 Open-Loop RLBMI In open-loop RLBMI experimentsthe output of the agent does not directly change the stateof the environment because this is done with prerecordeddata The external device is updated based only on the actualmonkeyrsquos physical response In this sense we only considerthe monkeyrsquos neural state from successful trials to train theagentThe goal of these experiments is to evaluate the systemrsquoscapability to predict the proper state to actionmapping basedon the monkeyrsquos neural states and to assess the viability offurther closed-loop experiments

711 Environment The data employed in these experimentsis provided by SUNY Downstate Medical Center A femalebonnet macaque is trained for a center-out reaching taskallowing 8 action-directions After the subject attains about80 success rate microelectrode arrays are implanted inthe motor cortex (M1) Animal surgery is performed underthe Institutional Animal Care and Use Committee (IACUC)regulations and assisted by theDivision of LaboratoryAnimalResources (DLAT) at SUNY Downstate Medical Center

From 96-channel recordings a set of 185 units areobtained after sorting The neural states are represented bythe firing rates of each unit on 100ms window There is a setof 8 possible targets and action directions Every trial startsat the center point and the distance from the center to eachtarget is 4 cm anythingwithin a radius of 1 cm from the targetpoint is considered as a valid reach

712 Agent In the agent 119876-learning via kernel temporaldifference (119876-KTD)(120582) is applied to neural decoding For 119876-KTD(120582) we employ theGaussian kernel (28) After the neuralstates are preprocessed by normalizing their dynamic rangeto lie between minus1 and 1 they are input to the system Basedon the preprocessed neural states the system predicts which

10 Computational Intelligence and Neuroscience

Table 1 Average success rates of 119876-KTD in open-loop RLBMI

Epochs 1 2 3 4 5 6 72 target 044 096 099 099 097 099 0994 target 041 073 076 095 099 099 0998 target 032 065 079 089 096 098 098

direction the computer cursor will move Each output unitrepresents one of the 8 possible directions and among the 8outputs one action is selected by the 120598-greedy method [34]The action corresponding to the unit with the highest119876 valuegets selected with probability 1 minus 120598 Otherwise any otheraction is selected at randomThe performance is evaluated bychecking whether the updated position reaches the assignedtarget and depending on the updated position a reward valueis assigned to the system

713 Results on Single Step Tasks Here the targets should bereached within a single step rewards from the environmentare received after a single step and one action is performedby the agent per trial The assignment of reward is based onthe 1-0 distance to the target that is dist(119909 119889) = 0 if 119909 = 119889and dist(119909 119889) = 1 otherwise Once the cursor reaches theassigned target the agent gets a positive reward +06 else itreceives negative reward minus06 [35] Exploration rate 120598 = 001and discount factor 120574 = 09 are applied Also we consider 120582 =0 since our experiment performs single step updates per trialIn this experiment the firing rates of the 185 units on 100mswindows are time-embedded using 6th order tap delay Thiscreates a representation spacewhere each state is a vectorwith1295 dimensions

We start with the simplest version of the problem byconsidering only 2-targets (right and left) The total numberof trials is 43 for the 2-targets For 119876-KTD the kernel size ℎis heuristically chosen based on the distribution of the meansquared distances between pairs of input states let 119904 = 119864[119909

119894minus

1199091198952] then ℎ = radic1199042 For this particular data set the above

heuristic gives a kernel size ℎ = 7 The stepsize 120578 = 03 isselected based on the stability bound that was derived for thekernel least mean square algorithm [25]

120578 lt119873

tr [119866120601]=

119873

sum119873

119895=1120581 (119909 (119895) 119909 (119895))

= 1 (29)

where 119866120601is the gram matrix After 43 trials we count the

number of trials which received a positive reward and thesuccess rate is averaged over 50 Monte Carlo runs Theperformance of the 119876-KTD algorithm is compared with 119876-learning via time delayed neural net (119876-TDNN) and theonline selective kernel-based temporal difference learningalgorithm (119876-OSKTD) [23] in Figure 9 Note that TDNNis a conventional approach to function approximation andhas already been applied to RLBMI experiments for neuraldecoding [1 2] OSKTD is a kernel-based temporal differencealgorithm emphasizing on the online sparsifications

Both 119876-KTD and 119876-OSKTD reach around 100 successrate after 2 epochs In contrast the average success rateof 119876-TDNN slowly increases yet never reaches the same

0 5 10 15 200

01

02

03

04

05

06

07

08

09

1

Epochs

Succ

ess r

ates

Q-TDNNQ-OSKTDQ-KTD

Figure 9 The comparison of average learning curves from 50

Monte Carlo runs among 119876-TDNN 119876-OSKTD and 119876-KTD Solidlines show the mean success rates and the dashed lines show theconfidence interval based on one standard deviation

In the case of Q-OSKTD, the value function updates require one more parameter, μ₂, which determines the subspace. To validate the algorithm's capability to estimate a proper policy, we set the sparsified dictionary to the same size as the number of sample observations. In Q-OSKTD, we observed that the subspace selection parameter plays an important role in the speed of learning; for the above experiment, smaller subspaces allow faster learning. In the extreme case of Q-OSKTD, where only the current state is affected, the updates become equivalent to the update rule of Q-KTD.

Since all the experimental parameters are fixed over the 50 Monte Carlo runs, the confidence interval for Q-KTD can be attributed solely to the random effects introduced by the ε-greedy method employed for action selection with exploration, hence the narrow interval. With Q-TDNN, however, a larger variation in performance is observed, which shows how initialization influences the success of learning through local minima: Q-TDNN is able to approach the Q-KTD performance, but most of the time the system falls into a local minimum. This highlights one of the advantages of KTD compared to TDNN, namely, its insensitivity to initialization.


Figure 10: Average success rates over 50 Monte Carlo runs with respect to different filter sizes. The vertical line segments are the mean success rates after 1 epoch (bottom markers), 2 epochs (middle markers), and 20 epochs (top markers).

Table 1 shows average success rates over 50 Monte Carlo runs with respect to different numbers of targets. The first row corresponds to the mean success rates displayed in Figure 9 (red solid line); it is included in Table 1 to ease comparison with the 4- and 8-target experiments. The 4-target task involves reaching the right, up, left, and down positions from the center. Note that in all tasks 8 directions are allowed at each step. The standard deviation at each epoch is around 0.02.

One characteristic of nonparametric approaches is the growing filter structure. Here we observe how the filter size influences the overall performance of Q-KTD by applying the Surprise criterion [32] and Quantization [21] methods. In the case of the 2-target center-out reaching task, we should expect the filter size to grow to 861 units after 20 epochs without any control of the filter size. Using the Surprise criterion, the filter size can be reduced to 87 centers with acceptable performance. Quantization, however, allows the filter size to be reduced to 10 units while maintaining success rates above 90%. Figure 10 shows the effect of filter size in the 2-target experiment using the Quantization approach. For filter sizes as small as 10 units, the average success rates remain stable; with 10 units, the algorithm shows a learning speed similar to that of the linearly growing filter, with success rates above 90%. Note that quantization limits the capacity of the kernel filter, since fewer units than samples are employed, and thus it helps to avoid over-fitting.

In the 2-target center-out reaching task, quantized Q-KTD shows satisfactory results in terms of initialization and computational cost. Further analysis of Q-KTD is conducted on a larger number of targets: we increase the number of targets from 2 to 8. All experimental parameters are kept the same as in the 2-target experiment; the only change is the stepsize, η = 0.5. A total of 178 trials are used for the 8-target reaching task.

To gain more insight into the algorithm, we observe the interplay between the Quantization size ε_U and the kernel size h.

Figure 11: The effect of filter size control on the 8-target single-step center-out reaching task, for final filter sizes of 178, 133, 87, and 32. The average success rates are computed over 50 Monte Carlo runs after the 10th epoch.

Based on the distribution of squared distances between pairs of input states, various kernel sizes (h = 0.5, 1, 1.5, 2, 3, 5, 7) and Quantization sizes (ε_U = 1, 1.10, 1.20, 1.30) are considered. The corresponding success rates for final filter sizes of 178, 133, 87, and 32 are displayed in Figure 11.

With a final filter size of 178 (blue line), the success rates are superior to those of any other filter size for every kernel size tested, since the dictionary contains all input information. Especially for small kernel sizes (h ≤ 2), success rates above 96% are observed. Moreover, note that even after reduction of the state information (red line), the system still produces acceptable success rates (around 90%) for kernel sizes ranging from 0.5 to 2.

Among the best performing kernel sizes, we favor the largest one, since it provides better generalization guarantees. In this sense, a kernel size h = 2 can be selected: this is the largest kernel size that considerably reduces the filter size and still yields a neural state to action mapping that performs well (around 90% success rates). In the case of kernel size h = 2 with a final filter size of 178, the system reaches 100% success rate after 6 epochs with a maximum variance of 4%. As we can see from the number of units, higher representation capacity is required to obtain the desired performance as the task becomes more complex. Nevertheless, the results on the 8-target center-out reaching task show that the method can effectively learn the brain state-action mapping for this task with reasonable complexity.

7.1.4. Results on Multistep Tasks. Here we develop a more realistic scenario: we extend the task to multistep and multitarget experiments. This case allows us to explore the role of the eligibility traces in Q-KTD(λ). The price paid for this extension is that the eligibility trace rate λ now needs to be selected according to the best observed performance.


Figure 12: Reward distribution for the right target. The black diamond is the initial position, and the purple diamonds show the possible directions, including the assigned target direction (red diamond).

Testing based on the same experimental setup employed for the single step task, that is, assigning a discrete reward value only at the target, causes extremely slow learning, since not enough guidance is given; the system requires long periods of exploration until it actually reaches the target. Therefore, we employ a continuous reward distribution around the selected target, defined by the following expression:

$$r(s) = \begin{cases} p_{\text{reward}}\, G(s), & \text{if } G(s) > 0.1, \\ n_{\text{reward}}, & \text{if } G(s) \le 0.1, \end{cases} \tag{30}$$

where $G(s) = \exp[-(s-\mu)^{\top} C_{\theta}^{-1} (s-\mu)]$, $s \in \mathbb{R}^{2}$ is the position of the cursor, $p_{\text{reward}} = 1$, and $n_{\text{reward}} = -0.6$. The mean vector μ corresponds to the selected target location, and the covariance matrix is

$$C_{\theta} = R_{\theta} \begin{pmatrix} 7.5 & 0 \\ 0 & 0.1 \end{pmatrix} R_{\theta}^{\top}, \qquad R_{\theta} = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix}, \tag{31}$$

which depends on the angle θ of the selected target as follows: for target indices one and five the angle is 0, for two and six it is −π/4, for three and seven it is π/2, and for four and eight it is π/4 (the target indices follow the locations depicted in Figure 6 of [22]). Figure 12 shows the reward distribution for target index one; the same form of distribution is applied to the other directions, centered at the assigned target point.
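To make the reward shaping of (30)-(31) concrete, the sketch below evaluates the continuous reward for a cursor position. The diagonal entries of the covariance follow the values above; the target location and cursor coordinates are hypothetical examples, not part of the original experiment.

```python
import numpy as np

def reward(s, mu, theta, p_reward=1.0, n_reward=-0.6):
    """Continuous reward of (30): a rotated Gaussian bump G(s) centered at the target mu."""
    R = np.array([[np.cos(theta), np.sin(theta)],
                  [-np.sin(theta), np.cos(theta)]])
    C = R @ np.diag([7.5, 0.1]) @ R.T                 # covariance C_theta of (31)
    d = np.asarray(s, dtype=float) - np.asarray(mu, dtype=float)
    G = np.exp(-d @ np.linalg.solve(C, d))            # G(s) = exp[-(s - mu)^T C^-1 (s - mu)]
    return p_reward * G if G > 0.1 else n_reward

# Hypothetical example: right target (index one, theta = 0) placed at (3, 0), cursor at (2, 0.2)
print(reward([2.0, 0.2], mu=[3.0, 0.0], theta=0.0))
```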

Once the system reaches the assigned target, it earns a maximum reward of +1, and it receives partial rewards according to (30) during the approaching stage. When the system earns the maximum reward, the trial is classified as a successful trial. The maximum number of steps per trial is limited such that the cursor must approach the target in a straight line trajectory. Here we also control the complexity of the task by allowing different numbers of targets and steps; namely, 2-step 4-target (right, up, left, and down) and 4-step 3-target (right, up, and down) experiments are performed. Increasing the number of steps per trial amounts to making smaller jumps with each action. After each epoch, the number of successful trials is counted for each target direction. Figure 13 shows the learning curves for each target and the average success rates.

A larger number of steps results in lower success rates. However, both cases (two and four steps) obtain an average success rate above 60% within 1 epoch. The performance shows that all directions can achieve success rates above 70% after convergence, which encourages the application of the algorithm to online scenarios.

7.2. Closed-Loop RLBMI Experiments. In closed-loop RLBMI experiments, the behavioral task is a reaching task using a robotic arm. The decoder controls the robot arm's action direction by predicting the monkey's intent based on its neuronal activity. If the robot arm reaches the assigned target, a reward is given to both the monkey (food reward) and the decoder (positive value). Notice that the two intelligent systems learn coadaptively to accomplish the goal. These experiments are conducted in cooperation with the Neuroprosthetics Research Group at the University of Miami. The performance is evaluated in terms of task completion accuracy and speed. Furthermore, we provide a methodology to tease apart the influence of each one of the systems of the RLBMI on the overall performance.

7.2.1. Environment. During pretraining, a marmoset monkey was trained to perform a target reaching task, namely, moving a robot arm to two spatial locations denoted as A trial and B trial. The monkey was taught to associate changes in motor activity during A trials and to produce static motor responses during B trials. Once a target is assigned, a beep signals the start of the trial. To control the robot during the user training phase, the monkey is required to steadily place its hand on a touch pad for 700-1200 ms. This action produces a go beep that is followed by the activation of one of the two target LEDs (A trial: red light for the left direction; B trial: green light for the right direction). The robot arm goes to a home position, namely, the center position between the two targets, and its gripper shows an object (a food reward such as a waxworm or marshmallow for A trials, or an undesirable object, a wooden bead, for B trials). To move the robot to the A location, the monkey needed to reach out and touch a sensor within 2000 ms; to make the robot reach the B target, the monkey needed to keep its arm motionless on the touch pad for 2500 ms. When the monkey successfully moved the robot to the correct target, the target LEDs would blink, and the monkey would receive a food reward (for both the A and B targets).

After the monkey is trained to perform the assigned task properly, a microelectrode array (16-channel tungsten microelectrode arrays, Tucker Davis Technologies, FL) is surgically implanted under isoflurane anesthesia and sterile conditions. Neural states from the motor cortex (M1) are recorded; these neural states become the inputs to the neural decoder. All surgical and animal care procedures were


Figure 13: The learning curves for the multistep multitarget tasks: (a) 2-step 4-target (average, right, up, left, down); (b) 4-step 3-target (average, right, up, down).

consistent with the National Research Council Guide for the Care and Use of Laboratory Animals and were approved by the University of Miami Institutional Animal Care and Use Committee.

In the closed-loop experiments, after the initial holding time that produces the go beep, the robotic arm's position is updated based solely on the monkey's neural states. Differently from the user pretraining sessions, the monkey is not required to perform any movement. During the real-time experiments, 14 neurons are obtained from 10 electrodes. The neural states are represented by the firing rates in a 2 sec window following the go signal.

7.2.2. Agent. For the BMI decoder, we use Q-learning via kernel temporal differences (Q-KTD(λ)). One big difference between open-loop and closed-loop applications is the amount of accessible data: in the closed-loop experiments, we can only obtain information about the neural states observed up to the present, whereas in the previous offline experiments normalization and kernel selection were conducted offline based on the entire data set. The same method cannot be applied in the online setting, since we only have information about the input states up to the present time. Normalization is a scaling procedure that interplays with the choice of the kernel size, and proper selection of the kernel size brings proper scaling to the data. Thus, in contrast to the previous open-loop experiments, normalization of the input neural states is not applied, and the kernel size is automatically selected given the inputs.

The Gaussian kernel (28) is employed, and the kernel size h is automatically selected based on the history of inputs. Note that in the closed-loop experiments the dynamic range of the states varies from experiment to experiment; consequently, the kernel size needs to be readjusted each time a new experiment takes place, and it cannot be determined beforehand. At each time step, the distances between the current state and the previously observed states are computed to obtain the output values (Q in this case); therefore, we use these distance values to select the kernel size as follows:

$$h_{\text{temp}}(n) = \sqrt{\frac{1}{2(n-1)} \sum_{i=1}^{n-1} \|x(i) - x(n)\|^{2}},$$
$$h(n) = \frac{1}{n}\left[\sum_{i=1}^{n-1} h(i) + h_{\text{temp}}(n)\right]. \tag{32}$$

Using the squared distances between the current state and the previously seen input states, we obtain an estimate of the mean distance; this value is then averaged along with the past kernel sizes to obtain the current kernel size.
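A minimal sketch of the running kernel size selection in (32) is given below, assuming the neural states are stored as NumPy vectors; how the very first step (for which the formula is undefined) is handled here is an assumption of the sketch.

```python
import numpy as np

def update_kernel_size(prev_states, prev_h, x_new):
    """Running kernel size of (32) from the history of input states and past kernel sizes."""
    n = len(prev_states) + 1                      # current time index
    if n == 1:
        return None                               # formula undefined at the first step (assumption: skip it)
    d2 = [np.sum((x - x_new) ** 2) for x in prev_states]
    h_temp = np.sqrt(np.sum(d2) / (2.0 * (n - 1)))
    return (np.sum(prev_h) + h_temp) / n

# Illustrative run with hypothetical 14-dimensional firing-rate vectors
states, h_history = [], []
for _ in range(5):
    x = np.random.rand(14)
    h = update_kernel_size(states, h_history, x)
    states.append(x)
    if h is not None:
        h_history.append(h)
print(h_history)
```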

Moreover, we consider γ = 1 and λ = 0, since our experiments perform single-step trials. A stepsize η = 0.5 is applied. The output represents the 2 possible directions (left and right), and the robot arm moves based on the estimated output from the decoder.

7.2.3. Results. The overall performance is evaluated by checking whether the robot arm reaches the assigned target. Once the robot arm reaches the target, the decoder gets a positive reward of +1; otherwise, it receives a negative reward of −1.

Table 2 shows the decoder performance over 4 days in terms of success rates. Each day corresponds to a separate experiment. On Day 1, the experiment had a total of 20 trials (10 A trials and 10 B trials). The overall success rate was 90%; only the first trial for each target was incorrectly assigned.


Figure 14: Performance of Q-learning via KTD in the closed-loop RLBMI controlled by a monkey for Day 1 (left) and Day 3 (right): the success (+1) or failure (−1) index of each trial (top), the change of the TD error (middle), and the change of the Q-values (bottom), shown for A trials and B trials.

Table 2: Success rates of Q-KTD in closed-loop RLBMI.

        Total trial numbers (total A, B trials)    Success rates (%)
Day 1   20 (10, 10)                                90.00
Day 2   32 (26, 26)                                84.38
Day 3   53 (37, 36)                                77.36
Day 4   52 (37, 35)                                78.85

Note that on each day the same experimental setup was utilized, and the decoder was initialized in the same way; we did not use pretrained parameters to initialize the system. To understand the variation of the success rates across days, we look at the performance of Day 1 and Day 3. Figure 14 shows the decoder performance for these 2 experiments.

Although the success rate for Day 3 is not as high as that of Day 1, both experiments show that the algorithm learns an appropriate neural state to action map. Even though there is variation among the neural states within each day, the decoder adapts well to minimize the TD error, and the Q-values converge to the desired values for each action. Because this is a single step task and a reward of +1 is assigned for a successful trial, the estimated action value Q should be close to +1.

It is observed that the TD error and Q-values oscillate; the drastic changes in the TD error or Q-value correspond to the missed trials. The overall performance can be evaluated by checking whether the robot arm reaches the desired target (the top plots in Figure 14).


Figure 15: Estimated policy for the projected neural states from Day 1 (left column: after 3, 10, and 20 trials) and Day 3 (right column: after 3, 30, and 57 trials). The axes are the first and second principal components. The failed trials during the closed-loop experiment are marked as red stars (missed A trials) and green dots (missed B trials).


However, this assessment does not show what causes the changes in the system values. In addition, it is hard to know how the two separate intelligent systems interact during learning and how the neural states affect the overall performance.

Under the coadaptation scenario in the RLBMI architecture, it is obvious that if one system does not perform properly, it will have detrimental effects on the performance of the other system. If the BMI decoder does not give proper updates to the robotic device, it will confuse the user conducting the task; and if the user gives improper state information, or the translation is wrong, the resulting update may fail even though the BMI decoder was able to find the optimal mapping function.

Using the methodology introduced in [36], we can observe how the decoder effectively learns a good state to action mapping and how the neural states affect the prediction performance. Figure 15 shows how each participant (the agent and the user) influences the overall performance in both successful and missed trials, and how the agent adapts to the environment. By applying principal component analysis (PCA), the high-dimensional neural states can be visualized in two dimensions using the two largest principal components; in this two-dimensional space of projected neural states, we can visualize the estimated policy as well.
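A minimal sketch of this visualization step is shown below, assuming the observed neural states are rows of a matrix and that the learned decoder exposes a function returning Q-values per action; both the data and the decoder stub are hypothetical stand-ins, not the authors' implementation.

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X onto the first two principal components."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)   # rows of Vt are principal directions
    return Xc @ Vt[:2].T

# Hypothetical stand-ins: 50 observed 14-dimensional neural states and a decoder stub
states = np.random.rand(50, 14)
def q_values(x):                                         # placeholder for the learned Q-KTD decoder
    return np.array([x.sum(), -x.sum()])

proj = pca_2d(states)                                    # 2D coordinates for plotting
actions = np.array([int(np.argmax(q_values(x))) for x in states])  # estimated policy per state
print(proj.shape, np.bincount(actions))
```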

We observe the behavior of the two systems at the beginning, intermediate, and final stages of the experiment by using the neural states that have been observed, as well as the decoder learned up to the given stage. It is evident that the decoder can predict nonlinear policies. Day 1 (left column in Figure 15) shows that the neural states from the two classes are well separable. It was noted during Day 3 that the monkey seemed less engaged in the task than on Day 1, which suggests that during some trials the monkey was distracted and may not have been producing a consistent set of neural outputs. We can also see this phenomenon in the plots (right column in Figure 15): most of the misclassified neural states appear closer to the states corresponding to the opposite target in the projected state space. However, the estimated policy shows that the system learns effectively. Note that the initially misclassified A trials (red stars in Figure 15(d), located near the estimated policy boundary) are assigned to the correct direction once learning has been accomplished (Figure 15(f)). It is remarkable that the system adapts to the environment online.

8. Conclusions

The advantages of KTD(λ) in neural decoding problems were observed. The key features of this kernel-based learning algorithm are its capability for nonlinear function approximation and its convergence guarantees. We also examined the capability of the extended KTD algorithm (Q-KTD(λ)) in both open-loop and closed-loop reinforcement learning brain machine interface (RLBMI) experiments on reaching tasks.

In the open-loop experiments, the results showed that Q-KTD(λ) can effectively learn the brain state-action mapping and offers performance advantages over conventional nonlinear function approximation methods such as time-delay neural networks. We observed that Q-KTD(λ) overcomes the main issues of conventional nonlinear function approximation methods, such as local minima and the need for proper initialization.

The results of the closed-loop RLBMI experiments showed that the algorithm succeeds in finding a proper mapping between neural states and desired actions. Its advantages are that it depends neither on the initialization nor on prior information about the input states, and its parameters can be chosen on the fly based on the observed input states. Moreover, we observed how the two intelligent systems coadaptively learn in an online reaching task. The results showed that KTD is powerful for practical applications due to its nonlinear approximation capabilities in online learning.

The observation and analysis of KTD(λ) give us a basic idea of how this algorithm behaves. However, in the case of Q-KTD(λ), the convergence analysis remains challenging, since Q-learning contains both a learning policy and a greedy policy. For Q-KTD(λ), the convergence proof for Q-learning using temporal difference (TD)(λ) with linear function approximation in [37] provides basic intuition about the role of function approximation in the convergence of Q-learning.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is partially supported by DARPA Contract N66001-10-C-2008. The authors would like to thank Pratik Chhatbar and Brandi Marsh for collecting the center-out reaching task data for the open-loop experiments.

References

[1] J. DiGiovanna, B. Mahmoudi, J. Fortes, J. C. Principe, and J. C. Sanchez, "Coadaptive brain-machine interface via reinforcement learning," IEEE Transactions on Biomedical Engineering, vol. 56, no. 1, pp. 54-64, 2009.
[2] B. Mahmoudi, Integrating robotic action with biologic perception: a brain machine symbiosis theory [Ph.D. dissertation], University of Florida, Gainesville, Fla, USA, 2010.
[3] E. A. Pohlmeyer, B. Mahmoudi, S. Geng, N. W. Prins, and J. C. Sanchez, "Using reinforcement learning to provide stable brain-machine interface control despite neural input reorganization," PLoS ONE, vol. 9, no. 1, Article ID e87253, 2014.
[4] S. Matsuzaki, Y. Shiina, and Y. Wada, "Adaptive classification for brain machine interface with reinforcement learning," in Proceedings of the 18th International Conference on Neural Information Processing, vol. 7062, pp. 360-369, Shanghai, China, November 2011.
[5] M. J. Bryan, S. A. Martin, W. Cheung, and R. P. N. Rao, "Probabilistic co-adaptive brain-computer interfacing," Journal of Neural Engineering, vol. 10, no. 6, Article ID 066008, 2013.
[6] R. S. Sutton, "Learning to predict by the methods of temporal differences," Machine Learning, vol. 3, no. 1, pp. 9-44, 1988.
[7] J. A. Boyan, Learning evaluation functions for global optimization [Ph.D. dissertation], Carnegie Mellon University, 1998.
[8] S. J. Bradtke and A. G. Barto, "Linear least-squares algorithms for temporal difference learning," Machine Learning, vol. 22, pp. 33-57, 1996.
[9] A. Geramifard, M. Bowling, M. Zinkevich, and R. S. Sutton, "iLSTD: eligibility traces and convergence analysis," in Advances in Neural Information Processing Systems, pp. 441-448, 2007.
[10] R. S. Sutton, C. Szepesvari, and H. R. Maei, "A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation," in Proceedings of the 22nd Annual Conference on Neural Information Processing Systems (NIPS '08), pp. 1609-1616, MIT Press, December 2008.
[11] R. S. Sutton, H. R. Maei, D. Precup, et al., "Fast gradient-descent methods for temporal-difference learning with linear function approximation," in Proceedings of the 26th International Conference on Machine Learning (ICML '09), pp. 993-1000, June 2009.
[12] J. N. Tsitsiklis and B. Van Roy, "An analysis of temporal-difference learning with function approximation," IEEE Transactions on Automatic Control, vol. 42, no. 5, pp. 674-690, 1997.
[13] S. Haykin, Neural Networks and Learning Machines, Prentice Hall, 2009.
[14] B. Scholkopf and A. J. Smola, Learning with Kernels, MIT Press, 2002.
[15] Y. Engel, Algorithms and representations for reinforcement learning [Ph.D. dissertation], Hebrew University, 2005.
[16] X. Xu, T. Xie, D. Hu, and X. Lu, "Kernel least-squares temporal difference learning," International Journal of Information Technology, vol. 11, no. 9, pp. 54-63, 2005.
[17] J. Bae, P. Chhatbar, J. T. Francis, J. C. Sanchez, and J. C. Principe, "Reinforcement learning via kernel temporal difference," in Proceedings of the 33rd Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC '11), pp. 5662-5665, 2011.
[18] S. Zhao, From fixed to adaptive budget robust kernel adaptive filtering [Ph.D. dissertation], University of Florida, Gainesville, Fla, USA, 2012.
[19] Y. Engel, S. Mannor, and R. Meir, "The kernel recursive least-squares algorithm," IEEE Transactions on Signal Processing, vol. 52, no. 8, pp. 2275-2285, 2004.
[20] X. Xu, "A sparse kernel-based least-squares temporal difference algorithm for reinforcement learning," in Proceedings of the 2nd International Conference on Natural Computation, vol. 4221, pp. 47-56, 2006.
[21] B. Chen, S. Zhao, P. Zhu, and J. C. Principe, "Quantized kernel least mean square algorithm," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 1, pp. 22-32, 2012.
[22] J. Bae, L. S. Giraldo, P. Chhatbar, J. T. Francis, J. C. Sanchez, and J. C. Principe, "Stochastic kernel temporal difference for reinforcement learning," in Proceedings of the 21st IEEE International Workshop on Machine Learning for Signal Processing (MLSP '11), pp. 1-6, IEEE, September 2011.
[23] X. Chen, Y. Gao, and R. Wang, "Online selective kernel-based temporal difference learning," IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 12, pp. 1944-1956, 2013.
[24] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, New York, NY, USA, 1998.
[25] W. Liu, J. C. Principe, and S. Haykin, Kernel Adaptive Filtering: A Comprehensive Introduction, Wiley, 2010.
[26] J. Mercer, "Functions of positive and negative type and their connection with the theory of integral equations," Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 209, pp. 415-446, 1909.
[27] P. Dayan and T. J. Sejnowski, "TD(λ) converges with probability 1," Machine Learning, vol. 14, no. 3, pp. 295-301, 1994.
[28] H. J. Kushner and D. S. Clark, Stochastic Approximation Methods for Constrained and Unconstrained Systems, vol. 26 of Applied Mathematical Sciences, Springer, New York, NY, USA, 1978.
[29] C. J. C. H. Watkins, Learning from delayed rewards [Ph.D. dissertation], King's College, London, UK, 1989.
[30] C. Szepesvari, Algorithms for Reinforcement Learning, edited by R. J. Brachman and T. Dietterich, Morgan & Claypool, 2010.
[31] S. Zhao, B. Chen, P. Zhu, and J. C. Principe, "Fixed budget quantized kernel least-mean-square algorithm," Signal Processing, vol. 93, no. 9, pp. 2759-2770, 2013.
[32] W. Liu, I. Park, and J. C. Principe, "An information theoretic approach of designing sparse kernel adaptive filters," IEEE Transactions on Neural Networks, vol. 20, no. 12, pp. 1950-1961, 2009.
[33] J. A. Boyan, "Technical update: least-squares temporal difference learning," Machine Learning, vol. 49, pp. 233-246, 2002.
[34] C. J. C. H. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279-292, 1992.
[35] J. C. Sanchez, A. Tarigoppula, J. S. Choi, et al., "Control of a center-out reaching task using a reinforcement learning Brain-Machine Interface," in Proceedings of the 5th International IEEE/EMBS Conference on Neural Engineering (NER '11), pp. 525-528, May 2011.
[36] J. Bae, L. G. Sanchez Giraldo, E. A. Pohlmeyer, J. C. Sanchez, and J. C. Principe, "A new method of concurrently visualizing states, values, and actions in reinforcement based brain machine interfaces," in Proceedings of the 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC '13), pp. 5402-5405, July 2013.
[37] F. S. Melo, S. P. Meyn, and M. I. Ribeiro, "An analysis of reinforcement learning with function approximation," in Proceedings of the 25th International Conference on Machine Learning, pp. 664-671, July 2008.


The neural decoder must be able to handle high-dimensional neural states containing spatial-temporal information. The mapping from neural states to actions must be flexible enough to avoid making strong assumptions. Moreover, the computational complexity of the decoder should be reasonable, such that real-time implementations are feasible.

Temporal difference learning provides an efficient learning procedure that can be applied to reinforcement learning problems. In particular, TD(λ) [6] can be applied to approximate a value function that is utilized to compute an approximate solution to Bellman's equation. The algorithm allows incremental computation directly from new experience without requiring an associated model of the environment. This provides a means to efficiently handle high-dimensional states and actions by using an adaptive technique for function approximation that can be trained directly from the data. Also, because TD learning allows system updates directly from the sequence of states, online learning is possible without having a desired signal at all times.

Note that TD(λ) and its variants (least squares TD(λ) [7], recursive least squares TD [8], incremental least squares TD(λ) [9], gradient TD [10], and linear TD with gradient correction [11]) have mostly been treated in the context of parametric linear function approximation. This can become a limiting factor in practical applications where little prior knowledge can be incorporated. Therefore, our interest here focuses on a more general class of models with nonlinear capabilities; in particular, we adopt a kernel-based function approximation methodology.

Kernel methods are an appealing choice due to their elegant way of dealing with nonlinear function approximation problems. Unlike most of the nonlinear variants of TD algorithms, which are prone to fall into local minima [12, 13], kernel-based algorithms have nonlinear approximation capabilities while the cost function can remain convex [14]. One of the major appeals of kernel methods is the ability to handle nonlinear operations on the data by an implicit mapping to the so-called feature space (a reproducing kernel Hilbert space (RKHS)), which is endowed with an inner product. A linear operation in the RKHS corresponds to a nonlinear operation in the input space. In addition, algorithms based on kernel methods are still reasonably easy to compute thanks to the kernel trick [14].

Temporal difference algorithms based on kernel expansions have shown superior performance in nonlinear approximation problems. The close relation between Gaussian processes and kernel recursive least squares was exploited in [15] to provide a Bayesian framework for temporal difference learning. Similar work using kernel-based least squares temporal difference learning with eligibilities (KLSTD(λ)) was introduced in [16]. Unlike the Gaussian process temporal difference algorithm (GPTD), KLSTD(λ) is not a probabilistic approach; the idea in KLSTD is to extend LSTD(λ) [7] using the concept of duality. However, the computational complexity of KLSTD(λ) per time update is O(n³), which precludes its use for online learning.

An online kernel temporal difference learning algorithm, called kernel temporal differences (KTD)(λ), was proposed in [17]. By using stochastic gradient updates, KTD(λ) reduces the computational complexity from O(n³) to O(n²). This reduction, along with other capacity control mechanisms such as sparsification, makes real-time implementations of KTD(λ) feasible.

Even though nonparametric techniques inherently have a growing structure, they produce better solutions than simple linear function approximation methods. This has motivated work on methods that help overcome scalability issues such as the growing filter size [18]. In the context of kernel-based TD algorithms, sparsification methods such as approximate linear dependence (ALD) [19] have been applied to GPTD [15] and KLSTD [20]. A Quantization approach proposed in [21] has been used in KTD(λ) [22]. In a similar flavor, a kernel distance based online sparsification method was proposed for a KTD algorithm in [23]. Note that ALD has O(n²) complexity, whereas quantization and kernel distances are O(n). The main difference between the quantization approach and the kernel distance is the space where the distances are computed: the quantization approach uses input space distances, whereas the kernel distance computes them in the RKHS associated with the kernel [23].

In this paper, we investigate kernel temporal differences (KTD)(λ) [17] for neural decoding. We first show the advantages of using kernel methods; namely, we show that the convergence results of TD(λ) in policy evaluation carry over to KTD(λ) when the kernel is strictly positive definite. Examples of the algorithm's capability for nonlinear function approximation are also presented. We apply the KTD algorithm to neural decoding in open-loop and closed-loop RLBMI experiments, where the algorithm's ability to find a proper neural state to action map is verified. In addition, the trade-off between value function estimation accuracy and computational complexity under growing filter size is studied. Finally, we provide visualizations of the coadaptation between the decoder and the subject, highlighting the usefulness of KTD(λ) for reinforcement learning brain machine interfaces.

This paper starts with general background on reinforcement learning, given in Section 2. Section 3 introduces the KTD algorithm and provides its convergence properties for policy evaluation. This algorithm is extended in Section 4 to policy improvement using Q-learning. Section 5 introduces some of the kernel sparsification methods for the KTD algorithm that address the naturally growing structure of kernel adaptive algorithms. Section 6 shows empirical results on simulations for policy evaluation, and Section 7 presents experimental results and comparisons to other methods in neural decoding using real data sets for both open-loop and closed-loop RLBMI frameworks. Conclusions are provided in Section 8.

2. Reinforcement Learning Brain Machine Interfaces and Value Functions

In reinforcement learning brain machine interfaces (RLBMI), a neural decoder interacts with the environment over time and adjusts its behavior to improve performance [1].


Figure 1: The decoding structure of the reinforcement learning model in a brain machine interface using a Q-learning based function approximation algorithm. The agent (BMI decoder) receives the state x(n) from the environment (the BMI user's brain and the computer cursor or robot arm moving toward the target), computes the value function and TD error via kernel temporal differences(λ), and the policy selects an action a(n+1); the environment returns the reward r(n+1) and the next state x(n+1).

The controller in the BMI can be considered as a neural decoder, and the environment includes the BMI user (Figure 1).

Assuming the environment is a stochastic and stationary process that satisfies the Markov condition, it is possible to model the interaction between the learning agent and the environment as a Markov decision process (MDP). For the sake of simplicity, we assume the states and actions are discrete, but they can also be continuous.

At time step n, the decoder receives the representation of the user's neural state x(n) ∈ X as input. According to this input, the decoder selects an action a(n) ∈ A, which causes the state of the external device to change, namely, the position of a cursor on a screen or a robot arm's position. Based on the updated position, the agent receives a reward r(n+1) ∈ R. At the same time, the updated position of the actuator will influence the user's subsequent neural states, that is, going from x(n) to x(n+1), because of the visual feedback involved in the process. The new state x(n+1) follows the state transition probability $P^{a}_{xx'}$ given the action a(n) and the current state x(n). At the new state x(n+1), the process repeats: the decoder takes an action a(n+1), which results in a reward r(n+2) and a state transition from x(n+1) to x(n+2). This process continues either indefinitely or until a terminal state is reached, depending on the process.

Note that the user has no direct access to the actions, and the decoder must interpret the user's brain activity correctly to facilitate the rewards. Also, both systems act symbiotically by sharing the external device to complete their tasks. Through iterations, both systems learn how to earn rewards based on their joint behavior. This is how the two intelligent systems (the decoder and the user) learn coadaptively and how the closed-loop feedback is created. This coadaptation allows for continuous synergistic adaptation between the BMI decoder and the user, even in changing environments [1].

The value function is a measure of the long-term performance of an agent following a policy π starting from a state x(n). The state value function is defined as

$$V^{\pi}(x(n)) = E_{\pi}\left[\mathcal{R}(n) \mid x(n)\right], \tag{1}$$

and the action value function is given by

$$Q^{\pi}(x(n), a(n)) = E_{\pi}\left[\mathcal{R}(n) \mid x(n), a(n)\right], \tag{2}$$

where $\mathcal{R}(n)$ is known as the return. Here we apply a common choice for the return, the infinite-horizon discounted model,

$$\mathcal{R}(n) = \sum_{k=0}^{\infty} \gamma^{k} r(n+k+1), \qquad 0 < \gamma < 1, \tag{3}$$

which takes into account the rewards in the long run but weighs them with a discount factor to prevent the function from growing unbounded as k → ∞ and to provide mathematical tractability [24]. Note that our goal is to find a policy π: X → A which maps a state x(n) to an action a(n); estimating the value function is an essential step towards finding a proper policy.
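As a small illustration of the return in (3), the following sketch computes a truncated discounted sum of rewards; the reward sequence used in the example is hypothetical.

```python
def discounted_return(rewards, gamma=0.9):
    """Truncated version of (3): R(n) = sum_k gamma^k * r(n + k + 1) over one observed episode."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Hypothetical reward sequence for a single episode
print(discounted_return([-0.6, -0.6, 1.0], gamma=0.9))
```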

3. Kernel Temporal Difference(λ)

In this section, we provide a brief introduction to kernel methods, followed by the derivation of the KTD algorithm [17, 22]. One of the contributions of the present work is the convergence analysis of KTD(λ) presented at the end of this section.

3.1. Kernel Methods. Kernel methods are a family of algorithms in which the input data are nonlinearly mapped to a high-dimensional feature space where linear operations are carried out. Let X be a nonempty set. For a positive definite function κ: X × X → ℝ [14, 25], there exists a Hilbert space H and a mapping φ: X → H such that

$$\kappa(x, y) = \langle\phi(x), \phi(y)\rangle. \tag{4}$$

The inner product in the high-dimensional feature space can be calculated by evaluating the kernel function in the input space. Here, H is called a reproducing kernel Hilbert space (RKHS), for which the following property holds:

$$f(x) = \langle f, \phi(x)\rangle = \langle f, \kappa(x, \cdot)\rangle, \quad \forall f \in \mathcal{H}. \tag{5}$$

The mapping implied by the use of the kernel function κ can also be understood through Mercer's theorem [26]. The implicit map φ allows one to transform conventional linear algorithms in the feature space into nonlinear systems in the input space, and the kernel function κ provides an implicit way to compute inner products in the RKHS without explicitly dealing with the high-dimensional space.

3.2. Kernel Temporal Difference(λ). In the multistep prediction problem, we consider a sequence of input-output pairs (x(1), d(1)), (x(2), d(2)), ..., (x(m), d(m)) for which the desired output d is only available at time m+1. Consequently, the system should produce a sequence of predictions y(1), y(2), ..., y(m) based solely on the observed input sequences before it gets access to the desired response. In general, the predicted output is a function of all previous inputs, y(n) = f(x(1), x(2), ..., x(n)). Here we assume that y(n) = f(x(n)) for simplicity and let the function f belong to an RKHS H.

In supervised learning, by treating the observed input sequence and the desired prediction as a sequence of pairs


(x(1), d), (x(2), d), ..., (x(m), d), and making d ≜ y(m+1), we can obtain the update of the function f after the whole sequence of m inputs has been observed as

$$f \leftarrow f + \sum_{n=1}^{m} \Delta f_{n} \tag{6}$$
$$= f + \eta \sum_{n=1}^{m} \left[d - f(x(n))\right] \phi(x(n)). \tag{7}$$

Here, $\Delta f_{n} = \eta\left[d - \langle f, \phi(x(n))\rangle\right]\phi(x(n))$ are the instantaneous updates of the function f from the input data, based on the kernel expansion (5).

The key observation for extending the supervised learning approach to the TD method is that the difference between the desired and predicted outputs at time n can be written as

$$d - y(n) = \sum_{k=n}^{m} \left(y(k+1) - y(k)\right), \tag{8}$$

where y(m+1) ≜ d. Using this expansion in terms of the differences between sequential predictions, we can update the system at each time step. By replacing the error d − f(x(n)) in (7) using its relation with the temporal differences (8) and rearranging the equation as in [6], we obtain the following update:

$$f \leftarrow f + \eta \sum_{n=1}^{m} \left[f(x(n+1)) - f(x(n))\right] \sum_{k=1}^{n} \phi(x(k)). \tag{9}$$

In this case, all predictions are weighted equally. Using exponential weighting based on recency yields the following update rule:

$$f \leftarrow f + \eta \sum_{n=1}^{m} \left[f(x(n+1)) - f(x(n))\right] \sum_{k=1}^{n} \lambda^{n-k} \phi(x(k)). \tag{10}$$

Here, λ represents an eligibility trace rate that is added to the averaging process over temporal differences to emphasize the most recently observed states and to deal efficiently with delayed rewards.

The above update rule (10) is called kernel temporal differences (KTD)(λ) [17]. The difference between the predictions of sequential inputs is called the temporal difference (TD) error:

$$e_{\mathrm{TD}}(n) = f(x(n+1)) - f(x(n)). \tag{11}$$

Note that the temporal differences f(x(n+1)) − f(x(n)) can be rewritten using the kernel expansions as ⟨f, φ(x(n+1))⟩ − ⟨f, φ(x(n))⟩. This yields the instantaneous update of the function f as $\Delta f_{n} = \eta\,\langle f, \phi(x(n+1)) - \phi(x(n))\rangle \sum_{k=1}^{n} \lambda^{n-k} \phi(x(k))$. Using the RKHS properties, the evaluation of the function f at a given x can be calculated from the kernel expansion.

In reinforcement learning, the prediction y(n) = f(x(n)) can be considered as the value function (1) or (2); this is how the KTD algorithm provides a nonlinear function approximation to Bellman's equation. When the prediction y(n) represents the state value function, the TD error (11) is extended to a combination of a reward and sequential value function predictions. For instance, in the case of policy evaluation, the TD error is defined as

$$e_{\mathrm{TD}}(n) = r(n+1) + \gamma V(x(n+1)) - V(x(n)). \tag{12}$$
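A minimal sketch of a KTD(0) policy-evaluation update with a Gaussian kernel is given below; it keeps the value function as a kernel expansion over visited states. The class, parameter values, and toy transition are illustrative, not the authors' implementation.

```python
import numpy as np

class KTD0:
    """Kernel TD(0) value estimator: V(x) = sum_j alpha_j * kappa(x, c_j) with a Gaussian kernel."""
    def __init__(self, eta=0.3, gamma=0.9, h=1.0):
        self.eta, self.gamma, self.h = eta, gamma, h
        self.centers, self.alphas = [], []

    def _kappa(self, x, y):
        return np.exp(-np.sum((x - y) ** 2) / (2 * self.h ** 2))

    def value(self, x):
        return sum(a * self._kappa(x, c) for a, c in zip(self.alphas, self.centers))

    def update(self, x, r, x_next):
        # TD error of (12); with lambda = 0 the trace reduces to phi(x(n)), so one unit is added per step
        td = r + self.gamma * self.value(x_next) - self.value(x)
        self.centers.append(np.asarray(x, dtype=float))
        self.alphas.append(self.eta * td)
        return td

# Toy transition with hypothetical state vectors and reward
ktd = KTD0()
print(ktd.update(np.array([1.0, 0.0]), r=-3.0, x_next=np.array([0.9, 0.1])))
```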

3.3. Convergence of Kernel Temporal Difference(λ). It has been shown in [6, 27] that, for an absorbing Markov chain, TD(λ) converges with probability 1 under certain conditions. Recall that the conventional TD algorithm assumes the function class to be linearly parametrized, satisfying y = w^⊤x. KTD(λ) can be viewed as linear function approximation in the RKHS. Using this relation, the convergence of KTD(λ) can be obtained as an extension of the convergence guarantees already established for TD(λ).

When λ = 1, by definition, the KTD(λ = 1) procedure is equivalent to the supervised learning method (7). KTD(1) yields the same per-sequence weight changes as the least squares solution, since (9) is derived directly from supervised learning by replacing the error term in (8). Thus, the convergence of KTD(1) can be established based on the convergence of its equivalent supervised learning formulation, which was proven in [25].

Proposition 1. The KLMS algorithm converges asymptotically in the mean sense to the optimal solution under the "small-step-size" condition.

Theorem 2. When the stepsize $\eta_{n}$ satisfies $\eta_{n} \ge 0$, $\sum_{n=1}^{\infty} \eta_{n} = \infty$, and $\sum_{n=1}^{\infty} \eta_{n}^{2} < \infty$, KTD(1) converges asymptotically in the mean sense to the least squares solution.

Proof. Since, by (8), the sequence of TD errors can be replaced by a multistep prediction with error e(n) = d − y(n), the result of Proposition 1 also applies to this case.

In the case of λ < 1, as shown by [27], the convergence of linear TD(λ) can be proved based on the ordinary differential equation (ODE) method introduced in [28]. This result can be easily extended to KTD(λ) as follows. Let us consider the Markov estimation problem as in [6]. An absorbing Markov chain can be described by the terminal and nonterminal sets of states T and N, the transition probabilities $p_{ij}$ between nonterminal states, the transition probabilities $s_{ij}$ from nonterminal states to terminal states, the vectors $x_{i}$ representing the nonterminal states, the expected terminal returns $d_{j}$ from the jth terminal state, and the probabilities $\mu_{i}$ of starting at state i. Given an initial state i ∈ N, an absorbing Markov chain generates an observation sequence of m vectors $x_{i_1}, x_{i_2}, \ldots, x_{i_m}$, where the last element $x_{i_m}$ of the sequence corresponds to a terminal state $i_m \in T$. The expected outcome d given a sequence starting at i ∈ N is given by

$$e^{*}_{i} \equiv E[d \mid i] \tag{13}$$
$$= \sum_{j \in T} s_{ij} d_{j} + \sum_{j \in N} p_{ij} \sum_{k \in T} p_{jk} d_{k} + \cdots \tag{14}$$


$$= \left[\sum_{k=0}^{\infty} Q^{k} h\right]_{i} = \left[(I - Q)^{-1} h\right]_{i}, \tag{15}$$

where $[x]_{i}$ denotes the ith element of the array x, Q is the transition matrix with entries $[Q]_{ij} = p_{ij}$ for $i, j \in N$, and $[h]_{i} = \sum_{j \in T} s_{ij} d_{j}$ for $i \in N$. In linear TD(λ), a sequence of vectors $w_{1}, w_{2}, \ldots$ is generated; each of these vectors $w_{n}$ is generated after a complete observation sequence, that is, a sequence starting at a state $i \in N$ and ending at a state $j \in T$ with the respective return $d_{j}$. Similar to linear TD(λ), in KTD(λ) we have a sequence of functions $f_{1}, f_{2}, \ldots$ (vectors in an RKHS), for which we can also write a linear update of the mean estimates of the terminal return after n sequences have been observed. If $f_{n}$ is the actual function estimate after sequence n and $f_{n+1}$ is the expected function estimate after the next sequence, we have that

$$f_{n+1}(X) = f_{n}(X) + \eta_{n+1} \mathbf{H}\left(f_{n}(X) - e^{*}\right), \tag{16}$$

where $\mathbf{H} = -\mathbf{K}D\left[I - (1-\lambda)Q(I - \lambda Q)^{-1}\right]$, $[\mathbf{K}]_{ij} = \kappa(x_{i}, x_{j})$ with $i, j \in N$, D is a diagonal matrix with $[D]_{ii}$ the expected number of times the state i is visited during a sequence, and $f_{n}(X)$ is a column vector of function evaluations of the state representations such that $[f_{n}(X)]_{i} = f_{n}(x_{i}) = \langle f_{n}, \phi(x_{i})\rangle$.

Analogously to [27], the mean estimates in (16) converge appropriately if $\mathbf{H}$ has a full set of eigenvalues with negative real parts, for which we need $\mathbf{K}$ to be full rank. For the above to be true, it is required that the set of vectors $\{\phi(x_{i})\}_{i \in N}$ be linearly independent in the RKHS. This is exactly the case when the kernel κ is strictly positive definite, as shown in the following proposition.

Proposition 3. If $\kappa: X \times X \to \mathbb{R}$ is a strictly positive definite kernel, then, for any finite set $\{x_{i}\}_{i=1}^{N} \subseteq X$ of distinct elements, the set $\{\phi(x_{i})\}$ is linearly independent.

Proof. If κ is strictly positive definite, then $\sum \alpha_{i}\alpha_{j}\kappa(x_{i}, x_{j}) > 0$ for any set $\{x_{i}\}$ with $x_{i} \neq x_{j}$ for all $i \neq j$ and any $\alpha_{i} \in \mathbb{R}$ such that not all $\alpha_{i} = 0$. Suppose there exists a set $\{x_{i}\}$ for which the $\phi(x_{i})$ are not linearly independent. Then there must be a set of coefficients $\alpha_{i} \in \mathbb{R}$, not all equal to zero, such that $\sum \alpha_{i}\phi(x_{i}) = 0$, which implies that $\left\|\sum \alpha_{i}\phi(x_{i})\right\|^{2} = 0$:

$$0 = \sum \alpha_{i}\alpha_{j}\langle\phi(x_{i}), \phi(x_{j})\rangle = \sum \alpha_{i}\alpha_{j}\kappa(x_{i}, x_{j}), \tag{17}$$

which contradicts the assumption.

The following theorem is the resulting extension of Theorem T in [27] to KTD(λ).

Theorem 4. For any absorbing Markov chain, for any distribution of starting probabilities $\mu_{i}$ such that there are no inaccessible states, for any outcome distributions with finite expected values $d_{j}$, for any strictly positive definite kernel κ, and for any set of observation vectors $\{x_{i}\}$, $i \in N$, such that $x_{i} = x_{j}$ if and only if $i = j$, there exists an $\epsilon > 0$ such that, if $\eta_{n} = \eta$ with $0 < \eta < \epsilon$, then, for any initial function estimate, the predictions of KTD(λ) converge in expected value to the ideal predictions of (15). If $f_{n}$ denotes the function estimate after experiencing n sequences, then

$$\lim_{n \to \infty} E\left[f_{n}(x_{i})\right] = E[d \mid i] = \left[(I - Q)^{-1} h\right]_{i}, \quad \forall i \in N. \tag{18}$$

4. Q-Learning via Kernel Temporal Differences(λ)

Since the value function represents the expected cumulative reward given a policy, a policy π is better than a policy π′ when π gives a greater expected return than π′; in other words, π ≥ π′ if and only if $Q^{\pi}(x, a) \ge Q^{\pi'}(x, a)$ for all $x \in X$ and $a \in A$. Therefore, the optimal action value function can be written as $Q^{*}(x(n), a(n)) = \max_{\pi} Q^{\pi}(x(n), a(n))$. The estimation can be done online. To maximize the expected reward $E[r(n+1) \mid x(n), a(n), x(n+1)]$, the one-step Q-learning update was introduced in [29]:

$$Q(x(n), a(n)) \leftarrow Q(x(n), a(n)) + \eta\left[r(n+1) + \gamma \max_{a} Q(x(n+1), a) - Q(x(n), a(n))\right]. \tag{19}$$

At time n, an action a(n) can be selected using methods such as ε-greedy or the Boltzmann distribution, which are popular for the exploration and exploitation trade-off [30].
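For illustration, the following sketch implements the one-step update (19) together with ε-greedy action selection in a small tabular setting; the table, transition, and reward are placeholders rather than part of the BMI experiments.

```python
import numpy as np

def epsilon_greedy(q_row, epsilon=0.01):
    """Pick a random action with probability epsilon; otherwise the greedy one."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_row))
    return int(np.argmax(q_row))

def q_update(Q, s, a, r, s_next, eta=0.5, gamma=0.9):
    """One-step Q-learning update of (19) on a tabular Q array."""
    Q[s, a] += eta * (r + gamma * np.max(Q[s_next]) - Q[s, a])

# Placeholder problem with 3 states and 2 actions
Q = np.zeros((3, 2))
a = epsilon_greedy(Q[0])
q_update(Q, s=0, a=a, r=0.6, s_next=1)
print(Q)
```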

When we consider the prediction y as the action value function $Q^{\pi}$ with respect to a policy π, KTD(λ) can approximate the value function $Q^{\pi}$ using a family of functions of the form

$$Q(x(n), a = i) = f(x \mid a = i) = \langle f, \phi(x(n))\rangle. \tag{20}$$

Here, Q(x(n), a = i) denotes the state-action value given a state x(n) at time n and a discrete action i. Therefore, the update rule for Q-learning via kernel temporal differences (Q-KTD)(λ) can be written as

$$f \leftarrow f + \eta \sum_{n=1}^{m}\left[r(n+1) + \gamma \max_{a} Q(x(n+1), a) - Q(x(n), a(n))\right]\sum_{k=1}^{n}\lambda^{n-k}\phi(x(k)). \tag{21}$$

We can see that the temporal difference (TD) error at time n includes reward and action value function terms. For single-step prediction problems (m = 1), (10) yields single updates for Q-KTD(λ) of the form

$$Q_{i}(x(n)) = \eta \sum_{j=1}^{n-1} e_{\mathrm{TD}_{i}}(j)\, I_{k}(j)\, \kappa(x(n), x(j)). \tag{22}$$

Here, $Q_{i}(x(n)) = Q(x(n), a = i)$, $e_{\mathrm{TD}_{i}}(n)$ denotes the TD error defined as $e_{\mathrm{TD}_{i}}(n) = r_{i} + \gamma Q_{i'}(x(n+1)) - Q_{i}(x(n))$, and $I_{k}(n)$ is an indicator vector of size determined by the number of outputs (actions).


Figure 2: The structure of Q-learning via kernel temporal difference(λ). The state vector x(n) is compared with the previously stored units x(1), x(2), ..., x(n−1); the kernel evaluations are combined to produce the Q-values Q_i(x(n)), an action is selected by exploration or exploitation, the reward is calculated, and the weights are updated as $a(n) = \eta \sum_{j=1}^{n} \lambda^{n-j} I_{k}(j)\, e_{\mathrm{TD}}(j)$.

Only the kth entry of the vector is set to 1, and the other entries are set to 0. The selection of the action unit k at time n can be based on a greedy method; therefore, only the weight (parameter vector) corresponding to the winning action gets updated. Recall that the reward $r_{i}$ corresponds to the action selected by the current policy with input x(n), because it is assumed that this action causes the next input state x(n+1).

The structure of Q-learning via KTD(0) is shown in Figure 2. The number of units (kernel evaluations) increases as more input data arrive; each added unit is centered at one of the previous input locations x(1), x(2), ..., x(n−1).
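A minimal sketch of this growing structure for Q-KTD(0) is given below: each observed state adds one kernel unit, and the update follows the single-step rule (22) with a greedy TD error and ε-greedy action selection. The class, parameter values, and toy states are illustrative, assuming a Gaussian kernel; this is not the authors' implementation.

```python
import numpy as np

class QKTD0:
    """Q-KTD(0): each Q_i(x) is a kernel expansion over previously visited states."""
    def __init__(self, n_actions, eta=0.5, gamma=0.9, h=2.0, epsilon=0.01):
        self.n_actions, self.eta, self.gamma, self.h, self.epsilon = n_actions, eta, gamma, h, epsilon
        self.centers = []                 # previous input states x(1), ..., x(n-1)
        self.coeffs = []                  # eta * e_TD * I_k(j): one vector of length n_actions per center

    def q_values(self, x):
        q = np.zeros(self.n_actions)
        for c, w in zip(self.centers, self.coeffs):
            q += w * np.exp(-np.sum((x - c) ** 2) / (2 * self.h ** 2))
        return q

    def select_action(self, x):
        if np.random.rand() < self.epsilon:               # epsilon-greedy exploration
            return np.random.randint(self.n_actions)
        return int(np.argmax(self.q_values(x)))

    def update(self, x, a, r, x_next):
        # Greedy one-step TD error; only the winning action's weight is updated
        td = r + self.gamma * np.max(self.q_values(x_next)) - self.q_values(x)[a]
        w = np.zeros(self.n_actions)
        w[a] = self.eta * td                               # indicator vector I_k selects action a
        self.centers.append(np.asarray(x, dtype=float))
        self.coeffs.append(w)

# Toy usage with hypothetical 4-dimensional neural states and 2 actions
agent = QKTD0(n_actions=2)
x, x_next = np.random.rand(4), np.random.rand(4)
a = agent.select_action(x)
agent.update(x, a, r=0.6, x_next=x_next)
print(agent.q_values(x_next))
```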

In the reinforcement learning brain machine interface (RLBMI) paradigm, kernel temporal difference(λ) is used to model the agent (see Figure 1). The action value function Q can be approximated using KTD(λ), for which the kernel-based representations enhance the functional mapping capabilities of the system. Based on the estimated Q-values, a policy decides a proper action. Note that this policy corresponds to the learning policy, which changes over time in Q-learning.

5. Online Sparsification

One characteristic of nonparametric approaches is their inherently growing structure, which is usually linear in the number of input data points. This rate of growth becomes prohibitive for practical applications that handle increasing amounts of incoming data over time. Various methods have been proposed to alleviate this problem (see [31] and references therein). These methods, known as kernel sparsification methods, can be applied to the KTD algorithm to control the growth of the number of terms in the function expansion, also known as the filter size. Popular examples of kernel sparsification methods are approximate linear dependence (ALD) [19], the Surprise criterion [32], the Quantization approach [21], and the kernel distance based method [23]. The main idea of sparsification is to only consider a reduced set of samples, called the dictionary, to represent the function of interest. The computational complexity of ALD is O(d²), where d is the size of the dictionary; for the other methods mentioned above, the complexity is O(d).

Each of these methods has its own criterion to determine whether an incoming sample should be added to the current dictionary. The Surprise criterion [32] measures the subjective information of an exemplar {x, d} with respect to a learning system Γ:

$$S_{\Gamma}(x, d) = -\ln p(x, d \mid \Gamma). \tag{23}$$

Only samples with high values of Surprise are considered as candidates for the dictionary. In the case of the Quantization approach introduced in [21], the distance between a new input x(n) and the existing dictionary elements C(n−1) is evaluated; the new input sample is added to the dictionary if the distance between x(n) and the closest element in C(n−1) is larger than the Quantization size $\epsilon_{U}$,

$$\min_{x_{i} \in C(n-1)} \|x(n) - x_{i}\| > \epsilon_{U}. \tag{24}$$

Otherwise, the new input state x(n) is absorbed by the closest existing unit. Very similar to the quantization approach, the method presented in [23] applies a distance threshold criterion in the RKHS. The kernel distance based criterion, given a state dictionary

Computational Intelligence and Neuroscience 7

119863(119899 minus 1) adds a new unit when the new input state 119909(119899)satisfies following condition

min119909119894isin119863(119899minus1)

1003817100381710038171003817120601(119909(119899)) minus 120601(119909119894)10038171003817100381710038172gt 1205831 (25)

For some kernels such as Gaussian the Quantizationmethodand the kernel distance based criterion can be shown to beequivalent
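As a concrete illustration of the dictionary growth control, the sketch below applies the Quantization criterion of (24). The function name is our own, and how an absorbed sample should update the coefficient of its nearest unit (as in quantized KLMS) is omitted here.

```python
import numpy as np

def quantization_check(x_new, dictionary, eps_u):
    """Quantization criterion, Eq. (24): add x_new only if its distance to the
    closest existing center exceeds the quantization size eps_u. Returns the
    (possibly grown) dictionary and the index of the unit that represents x_new."""
    x_new = np.asarray(x_new, dtype=float)
    if not dictionary:
        return [x_new], 0
    dists = [np.linalg.norm(x_new - c) for c in dictionary]
    nearest = int(np.argmin(dists))
    if dists[nearest] > eps_u:
        dictionary.append(x_new)
        return dictionary, len(dictionary) - 1
    # otherwise x_new is absorbed by the closest existing unit
    return dictionary, nearest
```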

6. Simulations

Note that the KTD algorithm has been introduced for value function estimation. To evaluate the algorithm's nonlinear capability, we first examine the performance of KTD(λ) on the problem of state value function estimation given a fixed policy π. We carry out experiments on a simple illustrative Markov chain initially described in [33]. This is a popular episodic task for testing TD learning algorithms. The experiment is useful in illustrating linear as well as nonlinear functions of the state representations and shows how the state value function is estimated by the adaptive system.

6.1. Linear Case. Even though we emphasize the capability of KTD(λ) as a nonlinear function approximator, under the appropriate kernel size KTD(λ) should approximate linear functions on a region of interest as well. To test its efficacy, we observe the performance on a simple Markov chain (Figure 3). There are 13 states, numbered from 12 down to 0. Each trial starts at state 12 and terminates at state 0. Each state is represented by a 4-dimensional vector, and the rewards are assigned in such a way that the value function V is a linear function of the states; namely, V* takes the values [0, −2, −4, ..., −22, −24] at states [0, 1, 2, ..., 11, 12]. In the case of V = w^⊤x, the optimal weights are w* = [−24, −16, −8, 0].

To assess the performance, the updated estimate of the state value function V̂(x) is compared to the optimal value function V* at the end of each trial. This is done by computing the RMS error of the value function over all states:

$\mathrm{RMS} = \sqrt{\frac{1}{n} \sum_{x \in \mathcal{X}} \left(V^*(x) - \hat{V}(x)\right)^2},$  (26)

where n is the number of states, n = 13. Stepsize scheduling is applied as follows:

$\eta(n) = \eta_0 \,\frac{a_0 + 1}{a_0 + n}, \quad n = 1, 2, \ldots,$  (27)

where η₀ is the initial stepsize and a₀ is the annealing factor, which controls how fast the stepsize decreases. In this experiment, a₀ = 100 is applied. Furthermore, we assume that the policy π is guaranteed to terminate, which means that the value function V^π is well behaved without using a discount factor γ in (3); that is, γ = 1.
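The annealed stepsize of (27) is simple to implement; the short sketch below only restates the formula with the annealing factor used here (a₀ = 100), and the printed values are for illustration.

```python
def stepsize(n, eta0, a0=100):
    """Stepsize annealing of Eq. (27): eta(n) = eta0 * (a0 + 1) / (a0 + n)."""
    return eta0 * (a0 + 1) / (a0 + n)

# example: with eta0 = 0.5 the stepsize decays slowly from ~0.5 towards 0
print([round(stepsize(n, 0.5), 3) for n in (1, 10, 100, 1000)])
```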

In KTD(λ) we employ the Gaussian kernel

$\kappa(x(i), x(j)) = \exp\left(-\frac{\|x(i) - x(j)\|^2}{2h^2}\right),$  (28)


Figure 3: A 13-state Markov chain [33]. For states from 2 to 12, the state transition probabilities are 0.5 and the corresponding rewards are −3. State 1 has a state transition probability of 1 to the terminal state 0 and a reward of −2. States 12, 8, 4, and 0 have the 4-dimensional state space representations [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], and [0, 0, 0, 1], respectively. The representations of the other states are linear interpolations between the above vectors.

which is a universal kernel commonly encountered in practice. To find the optimal kernel size, we fix all the other free parameters around median values, λ = 0.4 and η₀ = 0.5, and compare the average RMS error over 10 Monte Carlo runs. For this specific experiment, smaller kernel sizes yield better performance since the state representations are finite. However, in general, applying too small a kernel size leads to over-fitting and slow learning; in particular, choosing a very small kernel makes the algorithm behave very similarly to the table look-up method. Thus, we choose the kernel size h = 0.2, the largest kernel size for which we obtain mean RMS values similar to those of smaller kernel sizes.
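For reference, the sketch below builds the state representations of the chain in Figure 3 by linear interpolation between the anchor states and evaluates the RMS error of (26) against the linear-case target values; the interpolation scheme is our reading of the figure.

```python
import numpy as np

# anchor states 12, 8, 4, 0 use the canonical vectors; intermediate states
# are linear interpolations between neighboring anchors (our reading of Figure 3)
anchors = {12: np.array([1., 0., 0., 0.]),
           8:  np.array([0., 1., 0., 0.]),
           4:  np.array([0., 0., 1., 0.]),
           0:  np.array([0., 0., 0., 1.])}

def state_vector(s):
    if s in anchors:
        return anchors[s]
    hi = min(a for a in anchors if a > s)   # upper anchor (e.g., 12 for s = 10)
    lo = max(a for a in anchors if a < s)   # lower anchor (e.g., 8 for s = 10)
    w = (s - lo) / (hi - lo)
    return w * anchors[hi] + (1 - w) * anchors[lo]

print(state_vector(10))  # -> [0.5, 0.5, 0.0, 0.0], matching [1/2, 1/2, 0, 0]

# linear-case optimal value function V*(s) = -2s, i.e., [0, -2, ..., -24]
v_star = np.array([-2.0 * s for s in range(13)])

def rms_error(v_hat):
    # RMS error over all 13 states, as in Eq. (26)
    return np.sqrt(np.mean((v_star - v_hat) ** 2))
```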

After fixing the kernel size to h = 0.2, different combinations of eligibility trace rates λ and initial step sizes η₀ are evaluated experimentally. Figure 4 shows the average performance over 10 Monte Carlo runs for 1000 trials.

All λ values with optimal stepsize show good approximation to V* after 1000 trials. Notice that KTD(λ = 0) shows slightly better performance than KTD(λ = 1). This may be attributed to the local nature of KTD when using the Gaussian kernel. In addition, varying the stepsize has a relatively small effect on KTD(λ); the Gaussian kernel, as well as other shift-invariant kernels, provides an implicit normalized update rule, which is known to be less sensitive to the stepsize. Based on Figure 4, the eligibility trace rate λ = 0.6 and initial stepsize η₀ = 0.3 are selected for KTD with kernel size h = 0.2.

The learning curve of KTD(λ) is compared to that of the conventional TD algorithm, TD(λ). The optimal parameters employed in both algorithms are based on the experimental evaluation; in TD(λ), λ = 1 and η₀ = 0.1 are applied. The RMS error is averaged over 50 Monte Carlo runs for 1000 trials. Comparative learning curves are given in Figure 5.

In this experiment, we confirm the ability of KTD(λ) to handle the function approximation problem when the fixed policy yields a state value function that is linear in the state representation. Both algorithms reach a mean RMS value of around 0.06. As we expected, TD(λ) converges faster to the optimal solution because of the linear nature of the problem. KTD(λ) converges more slowly than TD(λ), but it is also able to approximate the value function properly. In this sense,

Figure 4: Performance comparison over different combinations of eligibility trace rates λ and initial step sizes η₀ in KTD(λ) with h = 0.2. The vertical line segments contain the mean RMS value after 100 trials (top marker), 500 trials (middle marker), and 1000 trials (bottom marker).

Figure 5: Learning curves of KTD(λ) and TD(λ). The solid line shows the mean RMS error and the dashed lines show the ± standard deviation over 50 Monte Carlo runs.

the KTD algorithm is open to a wider class of problems than its linear counterpart.

6.2. Nonlinear Case. The previous section shows the performance of KTD(λ) on the problem of estimating a state value

Figure 6: A 13-state Markov chain. In states from 2 to 12, each state transition has probability 0.5, and state 1 has transition probability 1 to the absorbing state 0. Note that the optimal state value function can be represented as a nonlinear function of the states, and corresponding reward values are assigned to each state.

function, which is a linear function of the given state representation. The same problem can be turned into a nonlinear one by modifying the reward values in the chain such that the resulting state value function V* is no longer a linear function of the states.

The number of states and the state representations remain the same as in the previous section. However, the optimal value function V* becomes nonlinear with respect to the representation of the states; namely, V* = [0, −0.2, −0.6, −1.4, −3, −6.2, −12.6, −13.4, −13.5, −14.45, −15.975, −19.2125, −25.5938] for states 0 to 12. This implies that the reward values for each state are different from the ones given for the linear case (Figure 6).

Again, to evaluate the performance, after each trial is completed the estimated state value is compared to the optimal state value V* using the RMS error (26). For KTD(λ), the Gaussian kernel (28) is applied and the kernel size h = 0.2 is chosen. Figure 7 shows the average RMS error over 10 Monte Carlo runs for 1000 trials.

The combination of λ = 0.4 and η₀ = 0.3 shows the best performance, but the λ = 0 case also performs well. Unlike TD(λ) [6], there is no dominant value of λ in KTD(λ). Recall that convergence has been proved for linearly independent representations of the states, which is automatically fulfilled in KTD(λ) when the kernel is strictly positive definite. Therefore, the differences are rather due to the convergence speed, which is controlled by the interaction between the step size and the eligibility trace.

The average RMS error over 50 Monte Carlo runs is compared with Gaussian process temporal difference (GPTD) [15] and TD(λ) in Figure 8. The purpose of the GPTD implementation is to provide a comparison among kernelized value function approximations. Here, the applied optimal parameters are λ = 0.4, η₀ = 0.3, and h = 0.2 for KTD(λ); λ = 1, σ² = 0.5, and h = 0.2 for GPTD; and λ = 0.8 and η₀ = 0.1 for TD(λ).

The linear function approximation TD(λ) (blue line) cannot estimate the optimal state values. KTD(λ) outperforms the linear algorithm, as expected, since the Gaussian kernel is strictly positive definite. GPTD also learns the target state values, but it fails to reach error values as low as those of KTD. GPTD is sensitive to the selection of the covariance value of the noise, σ²: if the value is small, the system becomes unstable, and larger values cause the learning to slow down. GPTD models the residuals, the difference between

Figure 7: Performance comparison over different combinations of λ and the initial stepsize η₀ in KTD(λ) with h = 0.2. The plotted segments are the mean RMS values after 100 trials (top segment), 500 trials (middle segment), and 1000 trials (bottom segment).

Figure 8: Learning curves of KTD(λ), GPTD, and TD(λ). The solid lines show the mean RMS error and the dashed lines represent the ± standard deviation over 50 Monte Carlo runs.

the expected return and the actual return, as a Gaussian process. This assumption does not hold true for the Markov chain in Figure 6. As we can observe in Figure 8, KTD(λ) reaches a mean value around 0.07, while the mean values of GPTD and TD(λ) are around 0.2 and 1.8, respectively.

In the synthetic examples, we presented experimental results for approximating the state value function under a fixed policy. We observed that KTD(λ) performs well on both linear and nonlinear function approximation problems. In addition, in the previous section we showed how the linear independence of the input state representations can affect the performance of the algorithms. The use of strictly positive definite kernels in KTD(λ) implies the linear independence condition, and thus the algorithm converges for all λ ∈ [0, 1]. In the following section, we apply the extended KTD algorithm to estimate the action value function, which can be employed in finding a proper control policy for RLBMI tasks.

7. Experimental Results on Neural Decoding

In our RLBMI experiments, we map the monkey's neural signal to action-directions (computer cursor/robot arm positions). The agent starts in a naive state, but the subject has been trained to receive rewards from the environment. Once it reaches the assigned target, the system and the subject earn a reward, and the agent updates its neural state decoder. Through iteration, the agent learns how to correctly translate neural states into action-directions.

7.1. Open-Loop RLBMI. In open-loop RLBMI experiments, the output of the agent does not directly change the state of the environment, because the experiments are performed with prerecorded data. The external device is updated based only on the monkey's actual physical response. In this sense, we only consider the monkey's neural states from successful trials to train the agent. The goal of these experiments is to evaluate the system's capability to predict the proper state-to-action mapping based on the monkey's neural states and to assess the viability of further closed-loop experiments.

7.1.1. Environment. The data employed in these experiments is provided by SUNY Downstate Medical Center. A female bonnet macaque is trained for a center-out reaching task allowing 8 action-directions. After the subject attains about an 80% success rate, microelectrode arrays are implanted in the motor cortex (M1). Animal surgery is performed under the Institutional Animal Care and Use Committee (IACUC) regulations and assisted by the Division of Laboratory Animal Resources (DLAR) at SUNY Downstate Medical Center.

From the 96-channel recordings, a set of 185 units is obtained after sorting. The neural states are represented by the firing rates of each unit in 100 ms windows. There is a set of 8 possible targets and action directions. Every trial starts at the center point, and the distance from the center to each target is 4 cm; anything within a radius of 1 cm from the target point is considered a valid reach.

7.1.2. Agent. In the agent, Q-learning via kernel temporal difference (Q-KTD(λ)) is applied to neural decoding. For Q-KTD(λ), we employ the Gaussian kernel (28). After the neural states are preprocessed by normalizing their dynamic range to lie between −1 and 1, they are input to the system. Based on the preprocessed neural states, the system predicts which

Table 1: Average success rates of Q-KTD in open-loop RLBMI.

Epochs      1     2     3     4     5     6     7
2 target    0.44  0.96  0.99  0.99  0.97  0.99  0.99
4 target    0.41  0.73  0.76  0.95  0.99  0.99  0.99
8 target    0.32  0.65  0.79  0.89  0.96  0.98  0.98

direction the computer cursor will move. Each output unit represents one of the 8 possible directions, and among the 8 outputs one action is selected by the ε-greedy method [34]: the action corresponding to the unit with the highest Q value is selected with probability 1 − ε; otherwise, another action is selected at random. The performance is evaluated by checking whether the updated position reaches the assigned target, and depending on the updated position a reward value is assigned to the system.

7.1.3. Results on Single Step Tasks. Here, the targets should be reached within a single step: rewards from the environment are received after a single step, and one action is performed by the agent per trial. The assignment of reward is based on the 1-0 distance to the target, that is, dist(x, d) = 0 if x = d and dist(x, d) = 1 otherwise. Once the cursor reaches the assigned target, the agent gets a positive reward +0.6; otherwise, it receives a negative reward −0.6 [35]. An exploration rate ε = 0.01 and a discount factor γ = 0.9 are applied. Also, we consider λ = 0 since our experiment performs a single-step update per trial. In this experiment, the firing rates of the 185 units in 100 ms windows are time-embedded using a 6th-order tap delay. This creates a representation space where each state is a vector with 1295 dimensions.

We start with the simplest version of the problem by considering only 2 targets (right and left). The total number of trials is 43 for the 2-target task. For Q-KTD, the kernel size h is heuristically chosen based on the distribution of the mean squared distance between pairs of input states: let s = E[‖x_i − x_j‖²]; then h = √(s/2). For this particular data set, the above heuristic gives a kernel size h = 7. The stepsize η = 0.3 is selected based on the stability bound that was derived for the kernel least mean square algorithm [25]:

$\eta < \frac{N}{\operatorname{tr}[G_\phi]} = \frac{N}{\sum_{j=1}^{N} \kappa(x(j), x(j))} = 1,$  (29)

where G_φ is the Gram matrix.
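The two parameter choices can be sketched as follows, assuming the states are stored as NumPy vectors (the function names are ours):

```python
import numpy as np

def kernel_size_heuristic(states):
    """Kernel-size heuristic: h = sqrt(s/2), where s is the mean squared
    distance between pairs of input states (our reading of the text)."""
    n = len(states)
    sq_dists = [np.sum((states[i] - states[j]) ** 2)
                for i in range(n) for j in range(i + 1, n)]
    return np.sqrt(np.mean(sq_dists) / 2.0)

def stepsize_upper_bound(states, kernel):
    """Stability bound of Eq. (29): eta < N / tr[G_phi] = N / sum_j k(x_j, x_j).
    For a Gaussian kernel, k(x, x) = 1, so the bound equals 1."""
    return len(states) / sum(kernel(x, x) for x in states)

# usage with a Gaussian kernel of size h = 7 (illustrative)
gaussian = lambda a, b: np.exp(-np.sum((a - b) ** 2) / (2.0 * 7.0 ** 2))
```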

After 43 trials, we count the number of trials that received a positive reward, and the success rate is averaged over 50 Monte Carlo runs. The performance of the Q-KTD algorithm is compared with Q-learning via a time delayed neural net (Q-TDNN) and the online selective kernel-based temporal difference learning algorithm (Q-OSKTD) [23] in Figure 9. Note that TDNN is a conventional approach to function approximation and has already been applied to RLBMI experiments for neural decoding [1, 2]; OSKTD is a kernel-based temporal difference algorithm emphasizing online sparsification.

Both Q-KTD and Q-OSKTD reach around a 100% success rate after 2 epochs. In contrast, the average success rate of Q-TDNN increases slowly yet never reaches the same

Figure 9: Comparison of the average learning curves over 50 Monte Carlo runs among Q-TDNN, Q-OSKTD, and Q-KTD. Solid lines show the mean success rates, and the dashed lines show the confidence interval based on one standard deviation.

performance as Q-KTD. In the case of Q-OSKTD, the value function updates require one more parameter, μ₂, to decide the subspace. To validate the algorithm's capability to estimate a proper policy, we set the sparsified dictionary to the same size as the number of sample observations. In Q-OSKTD, we observed that the subspace selection parameter plays an important role in terms of the speed of learning; it turns out that for the above experiment smaller subspaces allow faster learning. In the extreme case of Q-OSKTD where only the current state is affected, the updates become equivalent to the update rule of Q-KTD.

Since all the experimental parameters are fixed over the 50 Monte Carlo runs, the confidence interval for Q-KTD can simply be associated with the random effects introduced by the ε-greedy method employed for action selection with exploration; hence the narrow interval. However, with Q-TDNN a larger variation in performance is observed, which shows how the initialization influences the success of learning due to local minima; Q-TDNN is able to approach the Q-KTD performance, but most of the time the system falls into local minima. This highlights one of the advantages of KTD compared to TDNN, namely its insensitivity to initialization.

Table 1 shows the average success rates over 50 Monte Carlo runs with respect to different numbers of targets. The first

Figure 10: Average success rates over 50 Monte Carlo runs with respect to different filter sizes. The vertical line segments are the mean success rates after 1 epoch (bottom markers), 2 epochs (middle markers), and 20 epochs (top markers).

row corresponds to the mean success rates displayed in Figure 9 (red solid line); it is included in Table 1 to ease comparison with the 4- and 8-target experiments. The 4-target task involves reaching the right, up, left, and down positions from the center. Note that in all tasks 8 directions are allowed at each step. The standard deviation at each epoch is around 0.02.

One characteristic of nonparametric approaches is the growing filter structure. Here, we observe how the filter size influences the overall performance of Q-KTD by applying the Surprise criterion [32] and Quantization [21] methods. In the case of the 2-target center-out reaching task, the filter size would grow to 861 units after 20 epochs without any control of the filter size. Using the Surprise criterion, the filter size can be reduced to 87 centers with acceptable performance. Quantization, however, allows the filter size to be reduced to 10 units while maintaining success rates above 90%. Figure 10 shows the effect of the filter size in the 2-target experiment using the Quantization approach. For filter sizes as small as 10 units, the average success rates remain stable; with 10 units, the algorithm shows a learning speed similar to that of the linearly growing filter, with success rates above 90%. Note that quantization limits the capacity of the kernel filter, since fewer units than samples are employed, and thus it helps to avoid over-fitting.

In the 2-target center-out reaching task, quantized Q-KTD shows satisfactory results in terms of initialization and computational cost. Further analysis of Q-KTD is conducted on a larger number of targets: we increase the number of targets from 2 to 8. All experimental parameters are kept the same as for the 2-target experiment; the only change is the step size, η = 0.5. A total of 178 trials are used for the 8-target reaching task.

To gain more insight into the algorithm, we observe the interplay between the Quantization size ε_U and the kernel size h. Based on the distribution of squared distances between pairs of input states, various kernel sizes (h = 0.5, 1, 1.5, 2, 3, 5, 7) and Quantization sizes (ε_U = 1, 110, 120, 130) are considered.

Figure 11: The effect of filter size control on the 8-target single-step center-out reaching task. The average success rates are computed over 50 Monte Carlo runs after the 10th epoch.

The corresponding success rates for final filter sizes of 178, 133, 87, and 32 are displayed in Figure 11.

With a final filter size of 178 (blue line), the success rates are superior to those of any other filter size for every kernel size tested, since the expansion contains all input information. Especially for small kernel sizes (h ≤ 2), success rates above 96% are observed. Moreover, note that even after reduction of the state information (red line), the system still produces acceptable success rates (around 90%) for kernel sizes ranging from 0.5 to 2.

Among the best performing kernel sizes, we favor the largest one since it provides better generalization guarantees. In this sense, a kernel size h = 2 can be selected, since this is the largest kernel size that considerably reduces the filter size and yields a neural state to action mapping that performs well (around 90% success rates). In the case of kernel size h = 2 with a final filter size of 178, the system reaches 100% success rates after 6 epochs with a maximum variance of 4%. As we can see from the number of units, higher representation capacity is required to obtain the desired performance as the task becomes more complex. Nevertheless, the results on the 8-target center-out reaching task show that the method can effectively learn the brain state-action mapping for this task with a reasonable complexity.

7.1.4. Results on Multistep Tasks. Here, we develop a more realistic scenario: we extend the task to multistep and multitarget experiments. This case allows us to explore the role of the eligibility traces in Q-KTD(λ). The price paid for this extension is that now the eligibility trace rate λ needs to be selected according to the best observed performance. Testing based on the same experimental setup


Figure 12: Reward distribution for the right target. The black diamond is the initial position, and the purple diamonds show the possible directions, including the assigned target direction (red diamond).

employed for the single-step task, that is, with a discrete reward value assigned only at the target, causes extremely slow learning, since not enough guidance is given; the system requires long periods of exploration until it actually reaches the target. Therefore, we employ a continuous reward distribution around the selected target, defined by the following expression:

$r(s) = \begin{cases} p_{\mathrm{reward}}\, G(s), & \text{if } G(s) > 0.1, \\ n_{\mathrm{reward}}, & \text{if } G(s) \leq 0.1, \end{cases}$  (30)

where G(s) = exp[−(s − μ)^⊤ C_θ^{−1} (s − μ)], s ∈ ℝ² is the position of the cursor, p_reward = 1, and n_reward = −0.6. The mean vector μ corresponds to the selected target location, and the covariance matrix is

$C_\theta = R_\theta \begin{pmatrix} 7.5 & 0 \\ 0 & 0.1 \end{pmatrix} R_\theta^\top, \qquad R_\theta = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix},$  (31)

which depends on the angle θ of the selected target as follows: for target indexes one and five the angle is 0, for two and six it is −π/4, for three and seven it is π/2, and for four and eight it is π/4. (Here the target indexes follow the locations depicted in Figure 6 of [22].) Figure 12 shows the reward distribution for target index one; the same form of distribution is applied to the other directions, centered at the assigned target point.
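A compact sketch of this continuous reward, assuming (as above) that the exponent in G(s) carries a minus sign so that the reward peaks at the target; the function signature is our own.

```python
import numpy as np

def reward(s, target, theta, p_reward=1.0, n_reward=-0.6):
    """Continuous reward of Eqs. (30)-(31): a Gaussian-shaped bump G(s)
    centered at the target and elongated along the target direction theta;
    partial reward while approaching and n_reward once G(s) <= 0.1."""
    s = np.asarray(s, dtype=float)
    mu = np.asarray(target, dtype=float)
    R = np.array([[np.cos(theta), np.sin(theta)],
                  [-np.sin(theta), np.cos(theta)]])
    C = R @ np.diag([7.5, 0.1]) @ R.T
    d = s - mu
    G = np.exp(-d @ np.linalg.solve(C, d))   # exp[-(s-mu)^T C^{-1} (s-mu)]
    return p_reward * G if G > 0.1 else n_reward
```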

Once the system reaches the assigned target, it earns the maximum reward of +1, and it receives partial rewards according to (30) during the approaching stage. When the system earns the maximum reward, the trial is classified as a successful trial. The maximum number of steps per trial is limited such that the cursor must approach the target in a nearly straight-line trajectory. Here, we also control the complexity of the task by allowing different numbers of targets and steps; namely, 2-step 4-target (right, up, left, and down) and 4-step 3-target (right, up, and down) experiments are performed. Increasing the number of steps per trial amounts to making smaller jumps for each action. After each epoch, the number of successful trials is counted for each target direction. Figure 13 shows the learning curves for each target and the average success rates.

A larger number of steps results in lower success rates. However, both cases (two and four steps) obtain an average success rate above 60% after 1 epoch. The performance shows that all directions can achieve success rates above 70% after convergence, which encourages the application of the algorithm to online scenarios.

7.2. Closed-Loop RLBMI Experiments. In closed-loop RLBMI experiments, the behavioral task is a reaching task using a robotic arm. The decoder controls the robot arm's action direction by predicting the monkey's intent based on its neuronal activity. If the robot arm reaches the assigned target, a reward is given to both the monkey (a food reward) and the decoder (a positive value). Notice that the two intelligent systems learn coadaptively to accomplish the goal. These experiments are conducted in cooperation with the Neuroprosthetics Research Group at the University of Miami. The performance is evaluated in terms of task completion accuracy and speed. Furthermore, we provide a methodology to tease apart the influence of each one of the systems of the RLBMI on the overall performance.

7.2.1. Environment. During pretraining, a marmoset monkey was trained to perform a target reaching task, namely, moving a robot arm to two spatial locations denoted as A trial and B trial. The monkey was taught to associate A trials with changes in motor activity and to produce static motor responses during B trials. Once a target is assigned, a beep signals the start of the trial. To control the robot during the user training phase, the monkey is required to steadily place its hand on a touch pad for 700~1200 ms. This action produces a go beep that is followed by the activation of one of the two target LEDs (A trial: red light for the left direction; B trial: green light for the right direction). The robot arm goes to a home position, namely, the center position between the two targets. Its gripper shows an object (a food reward such as a waxworm or marshmallow for an A trial, or an undesirable object, a wooden bead, for a B trial). To move the robot to the A location, the monkey needed to reach out and touch a sensor within 2000 ms, and to make the robot reach the B target, the monkey needed to keep its arm motionless on the touch pad for 2500 ms. When the monkey successfully moved the robot to the correct target, the target LEDs would blink and the monkey would receive a food reward (for both the A and B targets).

After the monkey is trained to perform the assigned task properly, a microelectrode array (16-channel tungsten microelectrode arrays, Tucker-Davis Technologies, FL) is surgically implanted under isoflurane anesthesia and sterile conditions. Neural states from the motor cortex (M1) are recorded; these neural states become the inputs to the neural decoder. All surgical and animal care procedures were

Figure 13: The learning curves for the multistep multitarget tasks: (a) 2-step 4-target; (b) 4-step 3-target.

consistent with the National Research Council Guide for the Care and Use of Laboratory Animals and were approved by the University of Miami Institutional Animal Care and Use Committee.

In the closed-loop experiments, after the initial holding time that produces the go beep, the robotic arm's position is updated based solely on the monkey's neural states. Differently from the user pretraining sessions, the monkey is not required to perform any movement. During the real-time experiments, 14 neurons are obtained from 10 electrodes. The neural states are represented by the firing rates in a 2 sec window following the go signal.

7.2.2. Agent. For the BMI decoder, we use Q-learning via kernel temporal differences (Q-KTD(λ)). One big difference between open-loop and closed-loop applications is the amount of accessible data: in the closed-loop experiments, we can only obtain information about the neural states that have been observed up to the present, whereas in the previous offline experiments normalization and kernel selection were conducted offline based on the entire data set. It is not possible to apply the same method in the online setting, since we only have information about the input states up to the present time. Normalization is a scaling procedure that interacts with the choice of the kernel size; proper selection of the kernel size brings proper scaling to the data. Thus, in contrast to the previous open-loop experiments, normalization of the input neural states is not applied, and the kernel size is automatically selected given the inputs.

The Gaussian kernel (28) is employed, and the kernel size h is automatically selected based on the history of inputs. Note that in the closed-loop experiments the dynamic range of the states varies from experiment to experiment. Consequently, the kernel size needs to be readjusted each time a new experiment takes place, and it cannot be determined beforehand. At each time step, the distances between the current state and the previously observed states are computed to obtain the output values (the Q values in this case); therefore, we use these distance values to select the kernel size as follows:

$h_{\mathrm{temp}}(n) = \sqrt{\frac{1}{2(n-1)} \sum_{i=1}^{n-1} \|x(i) - x(n)\|^2},$

$h(n) = \frac{1}{n} \left[ \sum_{i=1}^{n-1} h(i) + h_{\mathrm{temp}}(n) \right].$  (32)

Using the squared distances between pairs of previously seen input states, we obtain an estimate of the mean distance; this value is then averaged together with the past kernel sizes to obtain the current kernel size.
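A minimal sketch of the online rule in (32); the initial kernel size for the very first state is not specified in the text, so the starting value below is an assumption.

```python
import numpy as np

class OnlineKernelSize:
    """Online kernel-size selection of Eq. (32): h_temp(n) is computed from the
    squared distances to all previous states, and h(n) averages h_temp(n) with
    the past kernel sizes."""

    def __init__(self):
        self.states = []
        self.history = []   # h(1), ..., h(n-1)

    def update(self, x_new):
        x_new = np.asarray(x_new, dtype=float)
        if self.states:
            sq = [np.sum((x_new - x) ** 2) for x in self.states]
            h_temp = np.sqrt(np.sum(sq) / (2.0 * len(self.states)))
            h = (sum(self.history) + h_temp) / (len(self.history) + 1)
        else:
            h = 1.0   # arbitrary starting value for the first state (assumption)
        self.states.append(x_new)
        self.history.append(h)
        return h
```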

Moreover, we consider γ = 1 and λ = 0, since our experiments perform single-step trials. A stepsize η = 0.5 is applied. The output represents the 2 possible directions (left and right), and the robot arm moves based on the estimated output of the decoder.

7.2.3. Results. The overall performance is evaluated by checking whether the robot arm reaches the assigned target. Once the robot arm reaches the target, the decoder gets a positive reward +1; otherwise, it receives a negative reward −1.

Table 2 shows the decoder performance over 4 days in terms of success rates; each day corresponds to a separate experiment. On Day 1, the experiment had a total of 20 trials (10 A trials and 10 B trials), and the overall success rate was 90%; only the first trial for each target was incorrectly assigned.

Figure 14: Performance of Q-learning via KTD in the closed-loop RLBMI controlled by a monkey for Day 1 (left) and Day 3 (right): the success (+1) and failure (−1) index of each trial (top), the change of the TD error (middle), and the change of the Q values (bottom).

Table 2: Success rates of Q-KTD in closed-loop RLBMI.

        Total trial numbers (total A, B trials)    Success rates (%)
Day 1   20 (10, 10)                                90.00
Day 2   32 (26, 26)                                84.38
Day 3   53 (37, 36)                                77.36
Day 4   52 (37, 35)                                78.85

Note that the same experimental setup was used on each day, and the decoder was initialized in the same way each day; we did not use pretrained parameters to initialize the system. To understand the variation of the success rates across days, we look at the performance on Day 1 and Day 3. Figure 14 shows the decoder performance for these 2 experiments.

Although the success rate for Day 3 is not as high as that of Day 1, both experiments show that the algorithm learns an appropriate neural state to action map. Even though there is variation among the neural states within each day, the decoder adapts well to minimize the TD error, and the Q values converge to the desired values for each action. Because this is a single-step task and a reward of +1 is assigned for a successful trial, the estimated action value Q is expected to be close to +1.

It is observed that the TD error and the Q values oscillate; the drastic changes in the TD error or Q value correspond to the missed trials. The overall performance can be evaluated by checking whether the robot arm reaches the desired target

Figure 15: Estimated policy for the projected neural states from Day 1 (left: (a) after 3 trials, (c) after 10 trials, (e) after 20 trials) and Day 3 (right: (b) after 3 trials, (d) after 30 trials, (f) after 57 trials). The failed trials during the closed-loop experiment are marked as red stars (missed A trials) and green dots (missed B trials).


(the top plots in Figure 14). However, this assessment does not show what causes the changes in the system values. In addition, it is hard to know how the two separate intelligent systems interact during learning and how the neural states affect the overall performance.

Under the coadaptation scenario in the RLBMI architecture, it is obvious that if one system does not perform properly, it will cause detrimental effects on the performance of the other system. If the BMI decoder does not give proper updates to the robotic device, it will confuse the user conducting the task, and if the user gives improper state information or the translation is wrong, the resulting update may fail even though the BMI decoder was able to find the optimal mapping function.

Using the methodology introduced in [36], we can observe how the decoder effectively learns a good state to action mapping and how the neural states affect the prediction performance. Figure 15 shows how each participant (the agent and the user) influences the overall performance in both successful and missed trials, and how the agent adapts to the environment. By applying principal component analysis (PCA), the high-dimensional neural states can be visualized in two dimensions using the first two (largest) principal components. In this two-dimensional space of projected neural states, we can visualize the estimated policy as well.
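The visualization idea can be sketched as follows, using scikit-learn and matplotlib; evaluating the decoder on back-projected grid points is our own simplification of the method in [36], and the function and argument names are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def visualize_policy(states, labels, q_function):
    """Project high-dimensional neural states to 2D with PCA and color a grid
    by the greedy action predicted by the decoder (q_function maps a
    full-dimensional state to a vector of Q values)."""
    pca = PCA(n_components=2)
    z = pca.fit_transform(states)
    x_min, x_max = z[:, 0].min() - 5, z[:, 0].max() + 5
    y_min, y_max = z[:, 1].min() - 5, z[:, 1].max() + 5
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                         np.linspace(y_min, y_max, 200))
    grid = np.c_[xx.ravel(), yy.ravel()]
    # map grid points back to the original space and query the decoder
    actions = np.array([np.argmax(q_function(s))
                        for s in pca.inverse_transform(grid)])
    plt.contourf(xx, yy, actions.reshape(xx.shape), alpha=0.3)
    plt.scatter(z[:, 0], z[:, 1], c=labels)
    plt.xlabel("First component")
    plt.ylabel("Second component")
    plt.show()
```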

We observe the behavior of the two systems at the beginning, intermediate, and final stages of the experiment by using the neural states that have been observed as well as the decoder learned up to the given stage. It is evident that the decoder can predict nonlinear policies. Day 1 (left column in Figure 15) shows that the neural states from the two classes are well separable. It was noted during Day 3 that the monkey seemed less engaged in the task than on Day 1; this suggests the possibility that during some trials the monkey was distracted and may not have been producing a consistent set of neural outputs. We can also see this phenomenon in the plots (right column in Figure 15): most of the misclassified neural states appear to be closer to the states corresponding to the opposite target in the projected state space. However, the estimated policy shows that the system effectively learns. Note that the initially misclassified A trials (red stars in Figure 15(d), which are located near the estimated policy boundary) are assigned to the right direction once learning has been accomplished (Figure 15(f)). It is a remarkable fact that the system adapts to the environment online.

8. Conclusions

We have examined the advantages of KTD(λ) in neural decoding problems. The key features of this kernel-based learning algorithm are its capability for nonlinear function approximation and its convergence guarantees. We also examined the capability of the extended KTD algorithm (Q-KTD(λ)) in both open-loop and closed-loop reinforcement learning brain machine interface (RLBMI) experiments to perform reaching tasks.

In the open-loop experiments, the results showed that Q-KTD(λ) can effectively learn the brain state-action mapping and offers performance advantages over conventional nonlinear function approximation methods such as time-delay neural networks. We observed that Q-KTD(λ) overcomes the main issues of conventional nonlinear function approximation methods, such as local minima and the need for proper initialization.

The results on the closed-loop RLBMI experiments showed that the algorithm succeeds in finding a proper mapping between neural states and desired actions. Its advantages are that it depends neither on the initialization nor on any prior information about the input states; in addition, parameters can be chosen on the fly based on the observed input states. Moreover, we observed how the two intelligent systems coadaptively learn in an online reaching task. The results showed that KTD is powerful for practical applications due to its nonlinear approximation capabilities in online learning.

The observation and analysis of KTD(λ) give us a basic idea of how this algorithm behaves. However, in the case of Q-KTD(λ), the convergence analysis remains challenging, since Q-learning involves both a learning policy and a greedy policy. For Q-KTD(λ), the convergence proof for Q-learning using temporal difference (TD)(λ) with linear function approximation in [37] can provide basic intuition about the role of function approximation in the convergence of Q-learning.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is partially supported by DARPA Contract N66001-10-C-2008. The authors would like to thank Pratik Chhatbar and Brandi Marsh for collecting the center-out reaching task data for the open-loop experiments.

References

[1] J. DiGiovanna, B. Mahmoudi, J. Fortes, J. C. Principe, and J. C. Sanchez, "Coadaptive brain-machine interface via reinforcement learning," IEEE Transactions on Biomedical Engineering, vol. 56, no. 1, pp. 54-64, 2009.

[2] B. Mahmoudi, Integrating Robotic Action with Biologic Perception: A Brain Machine Symbiosis Theory [Ph.D. dissertation], University of Florida, Gainesville, Fla, USA, 2010.

[3] E. A. Pohlmeyer, B. Mahmoudi, S. Geng, N. W. Prins, and J. C. Sanchez, "Using reinforcement learning to provide stable brain-machine interface control despite neural input reorganization," PLoS ONE, vol. 9, no. 1, Article ID e87253, 2014.

[4] S. Matsuzaki, Y. Shiina, and Y. Wada, "Adaptive classification for brain machine interface with reinforcement learning," in Proceedings of the 18th International Conference on Neural Information Processing, vol. 7062, pp. 360-369, Shanghai, China, November 2011.

[5] M. J. Bryan, S. A. Martin, W. Cheung, and R. P. N. Rao, "Probabilistic co-adaptive brain-computer interfacing," Journal of Neural Engineering, vol. 10, no. 6, Article ID 066008, 2013.

[6] R. S. Sutton, "Learning to predict by the methods of temporal differences," Machine Learning, vol. 3, no. 1, pp. 9-44, 1988.

[7] J. A. Boyan, Learning Evaluation Functions for Global Optimization [Ph.D. dissertation], Carnegie Mellon University, 1998.

[8] S. J. Bradtke and A. G. Barto, "Linear least-squares algorithms for temporal difference learning," Machine Learning, vol. 22, pp. 33-57, 1996.

[9] A. Geramifard, M. Bowling, M. Zinkevich, and R. S. Sutton, "iLSTD: eligibility traces and convergence analysis," in Advances in Neural Information Processing Systems, pp. 441-448, 2007.

[10] R. S. Sutton, C. Szepesvari, and H. R. Maei, "A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation," in Proceedings of the 22nd Annual Conference on Neural Information Processing Systems (NIPS '08), pp. 1609-1616, MIT Press, December 2008.

[11] R. S. Sutton, H. R. Maei, D. Precup, et al., "Fast gradient-descent methods for temporal-difference learning with linear function approximation," in Proceedings of the 26th International Conference on Machine Learning (ICML '09), pp. 993-1000, June 2009.

[12] J. N. Tsitsiklis and B. Van Roy, "An analysis of temporal-difference learning with function approximation," IEEE Transactions on Automatic Control, vol. 42, no. 5, pp. 674-690, 1997.

[13] S. Haykin, Neural Networks and Learning Machines, Prentice Hall, 2009.

[14] B. Scholkopf and A. J. Smola, Learning with Kernels, MIT Press, 2002.

[15] Y. Engel, Algorithms and Representations for Reinforcement Learning [Ph.D. dissertation], Hebrew University, 2005.

[16] X. Xu, T. Xie, D. Hu, and X. Lu, "Kernel least-squares temporal difference learning," International Journal of Information Technology, vol. 11, no. 9, pp. 54-63, 2005.

[17] J. Bae, P. Chhatbar, J. T. Francis, J. C. Sanchez, and J. C. Principe, "Reinforcement learning via kernel temporal difference," in Proceedings of the 33rd Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC '11), pp. 5662-5665, 2011.

[18] S. Zhao, From Fixed to Adaptive Budget Robust Kernel Adaptive Filtering [Ph.D. dissertation], University of Florida, Gainesville, Fla, USA, 2012.

[19] Y. Engel, S. Mannor, and R. Meir, "The kernel recursive least-squares algorithm," IEEE Transactions on Signal Processing, vol. 52, no. 8, pp. 2275-2285, 2004.

[20] X. Xu, "A sparse kernel-based least-squares temporal difference algorithm for reinforcement learning," in Proceedings of the 2nd International Conference on Natural Computation, vol. 4221, pp. 47-56, 2006.

[21] B. Chen, S. Zhao, P. Zhu, and J. C. Principe, "Quantized kernel least mean square algorithm," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 1, pp. 22-32, 2012.

[22] J. Bae, L. S. Giraldo, P. Chhatbar, J. T. Francis, J. C. Sanchez, and J. C. Principe, "Stochastic kernel temporal difference for reinforcement learning," in Proceedings of the 21st IEEE International Workshop on Machine Learning for Signal Processing (MLSP '11), pp. 1-6, IEEE, September 2011.

[23] X. Chen, Y. Gao, and R. Wang, "Online selective kernel-based temporal difference learning," IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 12, pp. 1944-1956, 2013.

[24] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, New York, NY, USA, 1998.

[25] W. Liu, J. C. Principe, and S. Haykin, Kernel Adaptive Filtering: A Comprehensive Introduction, Wiley, 2010.

[26] J. Mercer, "Functions of positive and negative type and their connection with the theory of integral equations," Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 209, pp. 415-446, 1909.

[27] P. Dayan and T. J. Sejnowski, "TD(λ) converges with probability 1," Machine Learning, vol. 14, no. 3, pp. 295-301, 1994.

[28] H. J. Kushner and D. S. Clark, Stochastic Approximation Methods for Constrained and Unconstrained Systems, vol. 26 of Applied Mathematical Sciences, Springer, New York, NY, USA, 1978.

[29] C. J. C. H. Watkins, Learning from Delayed Rewards [Ph.D. dissertation], King's College, London, UK, 1989.

[30] C. Szepesvari, Algorithms for Reinforcement Learning, edited by R. J. Branchman and T. Dietterich, Morgan & Claypool, 2010.

[31] S. Zhao, B. Chen, P. Zhu, and J. C. Principe, "Fixed budget quantized kernel least-mean-square algorithm," Signal Processing, vol. 93, no. 9, pp. 2759-2770, 2013.

[32] W. Liu, I. Park, and J. C. Principe, "An information theoretic approach of designing sparse kernel adaptive filters," IEEE Transactions on Neural Networks, vol. 20, no. 12, pp. 1950-1961, 2009.

[33] J. A. Boyan, "Technical update: least-squares temporal difference learning," Machine Learning, vol. 49, pp. 233-246, 2002.

[34] C. J. C. H. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279-292, 1992.

[35] J. C. Sanchez, A. Tarigoppula, J. S. Choi, et al., "Control of a center-out reaching task using a reinforcement learning brain-machine interface," in Proceedings of the 5th International IEEE/EMBS Conference on Neural Engineering (NER '11), pp. 525-528, May 2011.

[36] J. Bae, L. G. Sanchez Giraldo, E. A. Pohlmeyer, J. C. Sanchez, and J. C. Principe, "A new method of concurrently visualizing states, values, and actions in reinforcement based brain machine interfaces," in Proceedings of the 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC '13), pp. 5402-5405, July 2013.

[37] F. S. Melo, S. P. Meyn, and M. I. Ribeiro, "An analysis of reinforcement learning with function approximation," in Proceedings of the 25th International Conference on Machine Learning, pp. 664-671, July 2008.

Figure 1: The decoding structure of the reinforcement learning model in a brain machine interface using a Q-learning based function approximation algorithm. [The diagram shows the environment (the BMI user's brain and the computer cursor/robot arm with its target) providing the state x(n) and reward r(n + 1) to the agent (the BMI decoder), where kernel temporal differences(λ) estimate the value function Q from the TD error and a policy selects the action a(n + 1).]

The controller in the BMI can be considered as a neural decoder, and the environment includes the BMI user (Figure 1).

Assuming the environment is a stochastic and stationary process that satisfies the Markov condition, it is possible to model the interaction between the learning agent and the environment as a Markov decision process (MDP). For the sake of simplicity, we assume the states and actions are discrete, but they can also be continuous.

At time step n, the decoder receives the representation of the user's neural state x(n) ∈ X as input. According to this input, the decoder selects an action a(n) ∈ A, which causes the state of the external device to change, namely, the position of a cursor on a screen or a robot arm's position. Based on the updated position, the agent receives a reward r(n + 1) ∈ ℝ. At the same time, the updated position of the actuator will influence the user's subsequent neural states, that is, going from x(n) to x(n + 1), because of the visual feedback involved in the process. The new state x(n + 1) follows the state transition probability P^a_{xx'}, given the action a(n) and the current state x(n). At the new state x(n + 1) the process repeats: the decoder takes an action a(n + 1), and this results in a reward r(n + 2) and a state transition from x(n + 1) to x(n + 2). This process continues either indefinitely or until a terminal state is reached, depending on the process.

Note that the user has no direct access to actions, and the decoder must interpret the user's brain activity correctly to facilitate the rewards. Also, both systems act symbiotically by sharing the external device to complete their tasks. Through iterations, both systems learn how to earn rewards based on their joint behavior. This is how the two intelligent systems (the decoder and the user) learn coadaptively and how the closed-loop feedback is created. This coadaptation allows for continuous synergistic adaptation between the BMI decoder and the user, even in changing environments [1].

The value function is a measure of the long-term performance of an agent following a policy π starting from a state x(n). The state value function is defined as

$V^\pi(x(n)) = E_\pi[\mathcal{R}(n) \mid x(n)],$  (1)

and the action value function is given by

$Q^\pi(x(n), a(n)) = E_\pi[\mathcal{R}(n) \mid x(n), a(n)],$  (2)

where R(n) is known as the return. Here, we apply a common choice for the return, the infinite-horizon discounted model

$\mathcal{R}(n) = \sum_{k=0}^{\infty} \gamma^k r(n + k + 1), \quad 0 < \gamma < 1,$  (3)

which takes into account the rewards in the long run but weighs them with a discount factor to prevent the function from growing unbounded as k → ∞ and to provide mathematical tractability [24]. Note that our goal is to find a policy π: X → A, which maps a state x(n) to an action a(n). Estimating the value function is an essential step towards finding a proper policy.

3. Kernel Temporal Difference(λ)

In this section, we provide a brief introduction to kernel methods, followed by the derivation of the KTD algorithm [17, 22]. One of the contributions of the present work is the convergence analysis of KTD(λ), presented at the end of this section.

3.1. Kernel Methods. Kernel methods are a family of algorithms for which the input data are nonlinearly mapped to a high-dimensional feature space of vectors, where linear operations are carried out. Let X be a nonempty set. For a positive definite function κ: X × X → ℝ [14, 25], there exists a Hilbert space H and a mapping φ: X → H such that

$\kappa(x, y) = \langle \phi(x), \phi(y) \rangle.$  (4)

The inner product in the high-dimensional feature space can be calculated by evaluating the kernel function in the input space. Here, H is called a reproducing kernel Hilbert space (RKHS), for which the following property holds:

$f(x) = \langle f, \phi(x) \rangle = \langle f, \kappa(x, \cdot) \rangle \quad \forall f \in \mathcal{H}.$  (5)

The mapping implied by the use of the kernel function κ can also be understood through Mercer's theorem [26]. The implicit map φ allows one to transform conventional linear algorithms in the feature space into nonlinear systems in the input space, and the kernel function κ provides an implicit way to compute inner products in the RKHS without explicitly dealing with the high-dimensional space.
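A minimal illustration of this point with a Gaussian kernel: a function in the RKHS is a kernel expansion, and both its evaluation and the implicit inner products use only kernel evaluations in the input space. The centers and coefficients below are arbitrary example values.

```python
import numpy as np

def gaussian_kernel(x, y, h=1.0):
    """Gaussian kernel; its evaluation equals the inner product <phi(x), phi(y)>
    in the RKHS without constructing phi explicitly."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return np.exp(-np.dot(d, d) / (2.0 * h ** 2))

# a function in the RKHS: f(x) = sum_i alpha_i * kappa(x_i, x); evaluating f
# at a point uses only kernel evaluations, as in Eq. (5)
centers = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]
alphas = [0.5, -0.3]
f = lambda x: sum(a * gaussian_kernel(c, x) for a, c in zip(alphas, centers))
print(f(np.array([0.5, 0.5])))
```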

3.2. Kernel Temporal Difference(λ). In the multistep prediction problem, we consider a sequence of input-output pairs (x(1), d(1)), (x(2), d(2)), ..., (x(m), d(m)) for which the desired output d is only available at time m + 1. Consequently, the system should produce a sequence of predictions y(1), y(2), ..., y(m) based solely on the observed input sequence before it gets access to the desired response. In general, the predicted output is a function of all previous inputs, y(n) = f(x(1), x(2), ..., x(n)). Here, we assume that y(n) = f(x(n)) for simplicity and let the function f belong to an RKHS H.

In supervised learning, by treating the observed input sequence and the desired prediction as a sequence of pairs (x(1), d), (x(2), d), ..., (x(m), d) and making d ≜ y(m + 1), we can obtain the update of the function f after the whole sequence of m inputs has been observed as

$f \leftarrow f + \sum_{n=1}^{m} \Delta f_n$  (6)

$= f + \eta \sum_{n=1}^{m} [d - f(x(n))]\, \phi(x(n)).$  (7)

Here, $\Delta f_n = \eta [d - \langle f, \phi(x(n)) \rangle]\, \phi(x(n))$ are the instantaneous updates of the function f from the input data, based on the kernel expansion (5).

The key observation in extending the supervised learning approach to the TD method is that the difference between the desired and predicted outputs at time n can be written as

$d - y(n) = \sum_{k=n}^{m} (y(k+1) - y(k)),$  (8)

where y(m + 1) ≜ d. Using this expansion in terms of the differences between sequential predictions, we can update the system at each time step. By replacing the error d − f(x(n)) in (7) using the relation with temporal differences (8) and rearranging the equation as in [6], we obtain the following update:

$f \leftarrow f + \eta \sum_{n=1}^{m} [f(x(n+1)) - f(x(n))] \sum_{k=1}^{n} \phi(x(k)).$  (9)

In this case, all predictions are weighted equally. Using an exponential weighting on recency yields the following update rule:

$f \leftarrow f + \eta \sum_{n=1}^{m} [f(x(n+1)) - f(x(n))] \sum_{k=1}^{n} \lambda^{n-k} \phi(x(k)).$  (10)

Here, λ represents an eligibility trace rate that is added to the averaging process over temporal differences to emphasize the most recently observed states and to deal efficiently with delayed rewards.

The above update rule (10) is called kernel temporal difference (KTD)(λ) [17]. The difference between the predictions of sequential inputs is called the temporal difference (TD) error:

$e_{TD}(n) = f(x(n+1)) - f(x(n)).$  (11)

Note that the temporal differences 119891(119909(119899 + 1)) minus 119891(119909(119899)) canbe rewritten using the kernel expansions as ⟨119891 120601(119909(119899+ 1))⟩ minus⟨119891 120601(119909(119899))⟩This yields the instantaneous update of the func-tion 119891 as Δ119891

119899= 120578⟨119891 120601(119909(119899+1))minus120601(119909(119899))⟩sum

119899

119896=1120582119899minus119896120601(119909(119896))

Using the RKHS properties the evaluation of the function 119891at a given 119909 can be calculated from the kernel expansion

In reinforcement learning, the prediction $y(n) = f(x(n))$ can be considered as the value function (1) or (2). This is how the KTD algorithm provides a nonlinear function approximation to Bellman's equation. When the prediction $y(n)$ represents the state value function, the TD error (11) is extended to the combination of a reward and sequential value function predictions. For instance, in the case of policy evaluation, the TD error is defined as
\[
e_{\mathrm{TD}}(n) = r(n+1) + \gamma V(x(n+1)) - V(x(n)). \qquad (12)
\]
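To make the update concrete, the following is a minimal sketch of KTD($\lambda$) for policy evaluation, written in Python with NumPy. It is an illustration under our own choices of data format, function names, and parameter values rather than the authors' implementation: the value estimate is kept as a growing kernel expansion $V(x) = \sum_j \alpha_j \kappa(c_j, x)$, and an eligibility trace propagates the TD error of (12) back to previously visited states as in (10).

```python
import numpy as np

def gaussian_kernel(a, b, h=0.2):
    # kappa(a, b) = exp(-||a - b||^2 / (2 h^2)), cf. (28)
    return np.exp(-np.sum((np.asarray(a) - np.asarray(b)) ** 2) / (2.0 * h ** 2))

def ktd_lambda(trajectories, eta=0.3, lam=0.6, gamma=1.0, h=0.2):
    """Policy evaluation with KTD(lambda): V(x) = sum_j alpha[j] * kappa(centers[j], x)."""
    centers, alpha = [], []                      # kernel expansion of the value estimate

    def value(x):
        return sum(a * gaussian_kernel(c, x, h) for c, a in zip(centers, alpha))

    for states, rewards in trajectories:         # one episode: states x(1..m), rewards r(2..m+1)
        trace = []                               # eligibilities of the centers added this episode
        for n in range(len(states)):
            x = states[n]
            centers.append(x)                    # new unit centered at the current state
            alpha.append(0.0)
            trace = [lam * e for e in trace] + [1.0]      # decay old traces, add current state
            v_next = value(states[n + 1]) if n + 1 < len(states) else 0.0
            td = rewards[n] + gamma * v_next - value(x)   # TD error, cf. (12)
            first = len(alpha) - len(trace)               # index of this episode's first center
            for j, e in enumerate(trace):                 # credit recent centers, cf. (10)
                alpha[first + j] += eta * td * e
    return value
```

Here a trajectory is a pair of lists `(states, rewards)` of equal length, with `rewards[n]` being the reward received after leaving `states[n]`; this episode format is an assumption made for the sketch.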

3.3. Convergence of Kernel Temporal Difference($\lambda$). It has been shown in [6, 27] that for an absorbing Markov chain, TD($\lambda$) converges with probability 1 under certain conditions. Recall that the conventional TD algorithm assumes the function class to be linearly parametrized, satisfying $y = w^{\top} x$. KTD($\lambda$) can be viewed as a linear function approximation in the RKHS. Using this relation, the convergence of KTD($\lambda$) can be obtained as an extension of the convergence guarantees already established for TD($\lambda$).

When $\lambda = 1$, by definition, the KTD($\lambda = 1$) procedure is equivalent to the supervised learning method (7). KTD(1) yields the same per-sequence weight changes as the least squares solution, since (9) is derived directly from supervised learning by replacing the error term in (7) using (8). Thus, the convergence of KTD(1) can be established based on the convergence of its equivalent supervised learning formulation, which was proven in [25].

Proposition 1. The KLMS algorithm converges asymptotically in the mean sense to the optimal solution under the "small-step-size" condition.

Theorem 2. When the stepsize $\eta_n$ satisfies $\eta_n \ge 0$, $\sum_{n=1}^{\infty} \eta_n = \infty$, and $\sum_{n=1}^{\infty} \eta_n^{2} < \infty$, KTD(1) converges asymptotically in the mean sense to the least squares solution.

Proof. Since by (8) the sequence of TD errors can be replaced by a multistep prediction with error $e(n) = d - y(n)$, the result of Proposition 1 also applies to this case.

In the case of $\lambda < 1$, as shown in [27], the convergence of linear TD($\lambda$) can be proved based on the ordinary differential equation (ODE) method introduced in [28]. This result can be easily extended to KTD($\lambda$) as follows. Let us consider the Markov estimation problem as in [6]. An absorbing Markov chain can be described by the terminal and nonterminal sets of states $\mathcal{T}$ and $\mathcal{N}$, the transition probabilities $p_{ij}$ between nonterminal states, the transition probabilities $s_{ij}$ from nonterminal states to terminal states, the vectors $x_i$ representing the nonterminal states, the expected terminal returns $d_j$ from the $j$th terminal state, and the probabilities $\mu_i$ of starting at state $i$. Given an initial state $i \in \mathcal{N}$, an absorbing Markov chain generates an observation sequence of $m$ vectors $x_{i_1}, x_{i_2}, \ldots, x_{i_m}$, where the last element $x_{i_m}$ of the sequence corresponds to a terminal state $i_m \in \mathcal{T}$. The expected outcome $d$ given a sequence starting at $i \in \mathcal{N}$ is given by
\[
e^{*}_{i} \equiv E[d \mid i] \qquad (13)
\]
\[
= \sum_{j \in \mathcal{T}} s_{ij} d_j + \sum_{j \in \mathcal{N}} p_{ij} \sum_{k \in \mathcal{T}} p_{jk} d_k + \cdots \qquad (14)
\]


\[
= \left[ \sum_{k=0}^{\infty} Q^{k} h \right]_{i} = \left[ (I - Q)^{-1} h \right]_{i}, \qquad (15)
\]

where $[x]_i$ denotes the $i$th element of the array $x$, $Q$ is the transition matrix with entries $[Q]_{ij} = p_{ij}$ for $i, j \in \mathcal{N}$, and $[h]_i = \sum_{j \in \mathcal{T}} s_{ij} d_j$ for $i \in \mathcal{N}$. In linear TD($\lambda$), a sequence of vectors $w_1, w_2, \ldots$ is generated. Each one of these vectors $w_n$ is generated after observing a complete sequence, that is, a sequence starting at a state $i \in \mathcal{N}$ and ending at a state $j \in \mathcal{T}$ with the respective return $d_j$.
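As a concrete illustration of (15), the ideal predictions of an absorbing Markov chain can be computed directly from its transition structure. The tiny chain below is a hypothetical example with three nonterminal states and a single terminal state; it is not one of the chains used later in the paper.

```python
import numpy as np

# Hypothetical absorbing chain with nonterminal states {0, 1, 2} and one terminal state.
# Q[i, j] = p_ij between nonterminal states; s[i, t] = probability of absorbing into terminal t.
Q = np.array([[0.0, 0.5, 0.0],
              [0.0, 0.0, 0.5],
              [0.0, 0.0, 0.0]])
s = np.array([[0.5],
              [0.5],
              [1.0]])
d = np.array([1.0])        # expected return from the terminal state

h = s @ d                                    # [h]_i = sum_j s_ij d_j
e_star = np.linalg.solve(np.eye(3) - Q, h)   # ideal predictions, cf. (15)
print(e_star)
```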

Similar to linear TD($\lambda$), in KTD($\lambda$) we have a sequence of functions $f_1, f_2, \ldots$ (vectors in an RKHS) for which we can also write a linear update of the mean estimates of the terminal return after $n$ sequences have been observed. If $f_n$ is the actual function estimate after sequence $n$ and $f_{n+1}$ is the expected function estimate after the next sequence, we have that
\[
f_{n+1}(X) = f_{n}(X) + \eta_{n+1} \mathbf{H} \left( f_{n}(X) - e^{*} \right), \qquad (16)
\]

where $\mathbf{H} = -\mathbf{K} D \left[ I - (1 - \lambda) Q (I - \lambda Q)^{-1} \right]$, $[\mathbf{K}]_{ij} = \kappa(x_i, x_j)$ with $i, j \in \mathcal{N}$, $D$ is a diagonal matrix whose entry $[D]_{ii}$ is the expected number of times state $i$ is visited during a sequence, and $f_n(X)$ is a column vector of function evaluations of the state representations such that $[f_n(X)]_i = f_n(x_i) = \langle f_n, \phi(x_i) \rangle$. Analogously to [27], the mean estimates in (16) converge appropriately if $\mathbf{H}$ has a full set of eigenvalues with negative real parts, for which we need $\mathbf{K}$ to be full rank. For the above to be true, it is required that the set of vectors $\{\phi(x_i)\}_{i \in \mathcal{N}}$ be linearly independent in the RKHS. This is exactly the case when the kernel $\kappa$ is strictly positive definite, as shown in the following proposition.

Proposition 3. If $\kappa : \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$ is a strictly positive definite kernel, then for any finite set $\{x_i\}_{i=1}^{N} \subseteq \mathcal{X}$ of distinct elements, the set $\{\phi(x_i)\}$ is linearly independent.

Proof. If $\kappa$ is strictly positive definite, then $\sum \alpha_i \alpha_j \kappa(x_i, x_j) > 0$ for any set $\{x_i\}$ where $x_i \neq x_j$ for all $i \neq j$ and any $\alpha_i \in \mathbb{R}$ such that not all $\alpha_i = 0$. Suppose there exists a set $\{x_i\}$ for which the $\{\phi(x_i)\}$ are not linearly independent. Then there must be a set of coefficients $\alpha_i \in \mathbb{R}$, not all equal to zero, such that $\sum \alpha_i \phi(x_i) = 0$, which implies that $\left\| \sum \alpha_i \phi(x_i) \right\|^{2} = 0$:
\[
0 = \sum \alpha_i \alpha_j \langle \phi(x_i), \phi(x_j) \rangle = \sum \alpha_i \alpha_j \kappa(x_i, x_j), \qquad (17)
\]
which contradicts the assumption.
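Proposition 3 can also be checked numerically: for distinct inputs, the Gram matrix of a strictly positive definite kernel such as the Gaussian has full rank, which is equivalent to the linear independence of the $\phi(x_i)$. The snippet below is only an illustrative check on randomly generated states; the dimensions and kernel size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))                     # 20 distinct 4-dimensional states
h = 0.5
# Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 h^2))
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / (2.0 * h ** 2))
print(np.linalg.matrix_rank(K))                  # full rank (20) for distinct inputs
```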

The following theorem is the resulting extension of Theorem T in [27] to KTD($\lambda$).

Theorem 4. For any absorbing Markov chain, for any distribution of starting probabilities $\mu_i$ such that there are no inaccessible states, for any outcome distributions with finite expected values $d_j$, for any strictly positive definite kernel $\kappa$, and for any set of observation vectors $\{x_i\}$, $i \in \mathcal{N}$, such that $x_i = x_j$ if and only if $i = j$, there exists an $\epsilon > 0$ such that, if $\eta_n = \eta$ with $0 < \eta < \epsilon$, then for any initial function estimate the predictions of KTD($\lambda$) converge in expected value to the ideal predictions of (15). If $f_n$ denotes the function estimate after experiencing $n$ sequences, then
\[
\lim_{n \rightarrow \infty} E[f_n(x_i)] = E[d \mid i] = \left[ (I - Q)^{-1} h \right]_{i}, \quad \forall i \in \mathcal{N}. \qquad (18)
\]

4. $Q$-Learning via Kernel Temporal Differences($\lambda$)

Since the value function represents the expected cumulative rewards given a policy, the policy $\pi$ is better than the policy $\pi'$ when the policy $\pi$ gives a greater expected return than the policy $\pi'$. In other words, $\pi \ge \pi'$ if and only if $Q^{\pi}(x, a) \ge Q^{\pi'}(x, a)$ for all $x \in \mathcal{X}$ and $a \in \mathcal{A}$. Therefore, the optimal action value function $Q$ can be written as $Q^{*}(x(n), a(n)) = \max_{\pi} Q^{\pi}(x(n), a(n))$. The estimation can be done online. To maximize the expected reward $E[r(n+1) \mid x(n), a(n), x(n+1)]$, the one-step $Q$-learning update was introduced in [29]:

\[
Q(x(n), a(n)) \leftarrow Q(x(n), a(n)) + \eta \left[ r(n+1) + \gamma \max_{a} Q(x(n+1), a) - Q(x(n), a(n)) \right]. \qquad (19)
\]

At time $n$, an action $a(n)$ can be selected using methods such as $\epsilon$-greedy or the Boltzmann distribution, which are popular for the exploration-exploitation trade-off [30].
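For reference, below is a minimal sketch of these two action selection schemes, given a vector of $Q$ values for the available actions; the temperature of the Boltzmann rule and the default exploration rate are illustrative values, not ones prescribed by the paper.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.01, rng=np.random.default_rng()):
    # exploit the highest Q value with probability 1 - epsilon, otherwise explore at random
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def boltzmann(q_values, temperature=1.0, rng=np.random.default_rng()):
    # sample an action with probability proportional to exp(Q / temperature)
    logits = np.asarray(q_values, dtype=float) / temperature
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return int(rng.choice(len(q_values), p=p))
```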

When we consider the prediction $y$ as the action value function $Q^{\pi}$ with respect to a policy $\pi$, KTD($\lambda$) can approximate the value function $Q^{\pi}$ using a family of functions of the form
\[
Q(x(n), a = i) = f(x \mid a = i) = \langle f, \phi(x(n)) \rangle. \qquad (20)
\]
Here, $Q(x(n), a = i)$ denotes a state-action value given a state $x(n)$ at time $n$ and a discrete action $i$. Therefore, the update rule for $Q$-learning via kernel temporal difference ($Q$-KTD)($\lambda$) can be written as

\[
f \leftarrow f + \eta \sum_{n=1}^{m} \left[ r(n+1) + \gamma \max_{a} Q(x(n+1), a) - Q(x(n), a(n)) \right] \sum_{k=1}^{n} \lambda^{n-k} \phi(x(k)). \qquad (21)
\]

We can see that the temporal difference (TD) error at time $n$ includes reward and action value function terms. For single-step prediction problems ($m = 1$), (10) yields single updates for $Q$-KTD($\lambda$) of the form
\[
Q_{i}(x(n)) = \eta \sum_{j=1}^{n-1} e_{\mathrm{TD}_i}(j)\, I_{k}(j)\, \kappa(x(n), x(j)). \qquad (22)
\]

Here, $Q_i(x(n)) = Q(x(n), a = i)$, $e_{\mathrm{TD}_i}(n)$ denotes the TD error defined as $e_{\mathrm{TD}_i}(n) = r_i + \gamma Q_{i}(x(n+1)) - Q_i(x(n))$, and $I_k(n)$ is an indicator vector whose size is determined by the number of outputs (actions). Only the $k$th entry of the vector is set to 1 and the other entries are set to 0. The selection of the action unit $k$ at time $n$ can be based on a greedy method; therefore, only the weight (parameter vector) corresponding to the winning action gets updated. Recall that the reward $r_i$ corresponds to the action selected by the current policy with input $x(n)$, because it is assumed that this action causes the next input state $x(n+1)$.

Figure 2: The structure of $Q$-learning via kernel temporal difference($\lambda$).

The structure of $Q$-learning via KTD(0) is shown in Figure 2. The number of units (kernel evaluations) increases as more input data arrive. Each added unit is centered at one of the previous input locations $x(1), x(2), \ldots, x(n-1)$.
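A minimal sketch of this single-step $Q$-KTD(0) structure is given below: one growing kernel expansion shared across states, with a separate coefficient per action so that only the selected action's coefficient is updated, as in (22). The class name, default parameters, and data handling are our own illustrative choices, not the authors' implementation.

```python
import numpy as np

class QKTD0:
    """Q-learning via kernel temporal differences with lambda = 0 (single-step updates)."""

    def __init__(self, n_actions, eta=0.3, gamma=0.9, h=7.0):
        self.n_actions, self.eta, self.gamma, self.h = n_actions, eta, gamma, h
        self.centers = []        # previously seen states x(1), ..., x(n-1)
        self.coeffs = []         # one coefficient vector (size n_actions) per center

    def _kernel(self, a, b):
        return np.exp(-np.sum((a - b) ** 2) / (2.0 * self.h ** 2))

    def q_values(self, x):
        # Q_i(x) = sum_j coeffs[j][i] * kappa(centers[j], x), cf. (22)
        q = np.zeros(self.n_actions)
        for c, w in zip(self.centers, self.coeffs):
            q += w * self._kernel(c, np.asarray(x))
        return q

    def update(self, x, action, reward, x_next=None):
        # single-step TD error; for a terminal transition x_next is None
        q_next = 0.0 if x_next is None else np.max(self.q_values(x_next))
        td = reward + self.gamma * q_next - self.q_values(x)[action]
        w = np.zeros(self.n_actions)
        w[action] = self.eta * td            # only the winning action's weight is updated
        self.centers.append(np.asarray(x))   # a new unit centered at the current state
        self.coeffs.append(w)
        return td
```

In a decoding loop, one would call `q_values` on the current neural state, choose an action (for example, with the $\epsilon$-greedy rule sketched earlier), and then call `update` with the obtained reward and the next state (or `None` for a terminal, single-step trial).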

In the reinforcement learning brain machine interface (RLBMI) paradigm, kernel temporal difference($\lambda$) helps model the agent (see Figure 1). The action value function $Q$ can be approximated using KTD($\lambda$), for which the kernel based representations enhance the functional mapping capabilities of the system. Based on the estimated $Q$ values, a policy decides a proper action. Note that the policy corresponds to the learning policy, which changes over time in $Q$-learning.

5. Online Sparsification

One characteristic of nonparametric approaches is their inherently growing structure, which is usually linear in the number of input data points. This rate of growth becomes prohibitive for practical applications that handle increasing amounts of incoming data over time. Various methods have been proposed to alleviate this problem (see [31] and references therein). These methods, known as kernel sparsification methods, can be applied to the KTD algorithm to control the growth of the number of terms in the function expansion, also known as the filter size. Popular examples of kernel sparsification methods are approximate linear dependence (ALD) [19], the Surprise criterion [32], the Quantization approach [21], and the kernel distance based method [23]. The main idea of sparsification is to consider only a reduced set of samples, called the dictionary, to represent the function of interest. The computational complexity of ALD is $O(d^2)$, where $d$ is the size of the dictionary. For the other methods mentioned above, the complexity is $O(d)$.

Each of these methods has its own criterion to determine whether an incoming sample should be added to the current dictionary. The Surprise criterion [32] measures the subjective information of an exemplar $\{x, d\}$ with respect to a learning system $\Gamma$:
\[
S_{\Gamma}(x, d) = -\ln p(x, d \mid \Gamma). \qquad (23)
\]

Only samples with high values of Surprise are considered as candidates for the dictionary. In the case of the Quantization approach introduced in [21], the distance between a new input $x(n)$ and the existing dictionary elements $C(n-1)$ is evaluated. The new input sample is added to the dictionary if the distance between the new input $x(n)$ and the closest element in $C(n-1)$,
\[
\min_{x_i \in C(n-1)} \| x(n) - x_i \| > \epsilon_U, \qquad (24)
\]
is larger than the Quantization size $\epsilon_U$. Otherwise, the new input state $x(n)$ is absorbed by the closest existing unit. Very similar to the Quantization approach, the method presented in [23] applies a distance threshold criterion in the RKHS. The kernel distance based criterion, given a state dictionary


$D(n-1)$, adds a new unit when the new input state $x(n)$ satisfies the following condition:
\[
\min_{x_i \in D(n-1)} \| \phi(x(n)) - \phi(x_i) \|^{2} > \mu_1. \qquad (25)
\]
For some kernels, such as the Gaussian, the Quantization method and the kernel distance based criterion can be shown to be equivalent.
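A minimal sketch of the Quantization criterion (24) as a dictionary growth rule follows; only the growth decision is shown (in a quantized filter, a rejected sample would instead update the coefficient of its nearest center), and the function name is an illustrative choice.

```python
import numpy as np

def maybe_add_to_dictionary(x_new, dictionary, eps_u):
    """Add x_new to the dictionary only if it is farther than eps_u from every stored center."""
    x_new = np.asarray(x_new)
    if not dictionary:
        dictionary.append(x_new)
        return True
    distances = [np.linalg.norm(x_new - c) for c in dictionary]
    nearest = int(np.argmin(distances))
    if distances[nearest] > eps_u:          # quantization criterion, cf. (24)
        dictionary.append(x_new)
        return True
    return False                            # x_new is absorbed by dictionary[nearest]
```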

6. Simulations

Note that the KTD algorithm has been introduced for value function estimation. To evaluate the algorithm's nonlinear capability, we first examine the performance of KTD($\lambda$) on the problem of state value function estimation given a fixed policy $\pi$. We carry out experiments on a simple illustrative Markov chain initially described in [33]. This is a popular experiment involving an episodic task to test TD learning algorithms. The experiment is useful for illustrating linear as well as nonlinear functions of the state representations and shows how the state value function is estimated using the adaptive system.

6.1. Linear Case. Even though we emphasize the capability of KTD($\lambda$) as a nonlinear function approximator, under the appropriate kernel size KTD($\lambda$) should approximate linear functions on a region of interest as well. To test its efficacy, we observe the performance on a simple Markov chain (Figure 3). There are 13 states, numbered from 12 down to 0. Each trial starts at state 12 and terminates at state 0. Each state is represented by a 4-dimensional vector, and the rewards are assigned in such a way that the value function $V$ is a linear function of the states; namely, $V^{*}$ takes the values $[0, -2, -4, \ldots, -22, -24]$ at states $[0, 1, 2, \ldots, 11, 12]$. In the case of $V = w^{\top} x$, the optimal weights are $w^{*} = [-24, -16, -8, 0]$.

To assess the performance, the updated estimate of the state value function $\hat{V}(x)$ is compared to the optimal value function $V^{*}$ at the end of each trial. This is done by computing the RMS error of the value function over all states:

\[
\mathrm{RMS} = \sqrt{\frac{1}{n} \sum_{x \in \mathcal{X}} \left( V^{*}(x) - \hat{V}(x) \right)^{2}}, \qquad (26)
\]
where $n$ is the number of states, $n = 13$.

Stepsize scheduling is applied as follows:

\[
\eta(n) = \eta_0 \frac{a_0 + 1}{a_0 + n}, \quad n = 1, 2, \ldots, \qquad (27)
\]
where $\eta_0$ is the initial stepsize and $a_0$ is the annealing factor, which controls how fast the stepsize decreases. In this experiment, $a_0 = 100$ is applied. Furthermore, we assume that the policy $\pi$ is guaranteed to terminate, which means that the value function $V^{\pi}$ is well-behaved without using a discount factor $\gamma$ in (3); that is, $\gamma = 1$.

In KTD($\lambda$), we employ the Gaussian kernel
\[
\kappa(x(i), x(j)) = \exp\left( - \frac{\| x(i) - x(j) \|^{2}}{2 h^{2}} \right), \qquad (28)
\]

Figure 3: A 13-state Markov chain [33]. For states from 2 to 12, the state transition probabilities are 0.5 and the corresponding rewards are $-3$. State 1 has a transition probability of 1 to the terminal state 0 and a reward of $-2$. States 12, 8, 4, and 0 have the 4-dimensional state space representations $[1\;0\;0\;0]$, $[0\;1\;0\;0]$, $[0\;0\;1\;0]$, and $[0\;0\;0\;1]$, respectively. The representations of the other states are linear interpolations between the above vectors.

which is a universal kernel commonly encountered in practice. To find the optimal kernel size, we fix all the other free parameters at median values, $\lambda = 0.4$ and $\eta_0 = 0.5$, and compare the average RMS error over 10 Monte Carlo runs. For this specific experiment, smaller kernel sizes yield better performance since the state representations are finite. However, in general, applying too small a kernel size leads to overfitting and slow learning. In particular, choosing a very small kernel makes the algorithm behave very similarly to the table look-up method. Thus, we choose the kernel size $h = 0.2$, the largest kernel size for which we obtain mean RMS values similar to those of smaller kernel sizes.
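The evaluation protocol used throughout this section is simple to restate in code. Below are the RMS measure of (26) and the annealed stepsize of (27) written as small Python helpers; the argument names and default values are illustrative choices.

```python
import numpy as np

def rms_error(v_hat, v_star):
    # RMS over all states, cf. (26); v_hat and v_star are arrays indexed by state
    v_hat, v_star = np.asarray(v_hat), np.asarray(v_star)
    return np.sqrt(np.mean((v_star - v_hat) ** 2))

def stepsize(n, eta0=0.3, a0=100):
    # annealed stepsize, cf. (27): eta(n) = eta0 * (a0 + 1) / (a0 + n)
    return eta0 * (a0 + 1) / (a0 + n)
```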

After fixing the kernel size to $h = 0.2$, different combinations of eligibility trace rates $\lambda$ and initial step sizes $\eta_0$ are evaluated experimentally. Figure 4 shows the average performance over 10 Monte Carlo runs for 1000 trials.

All $\lambda$ values with the optimal stepsize show a good approximation to $V^{*}$ after 1000 trials. Notice that KTD($\lambda = 0$) shows slightly better performance than KTD($\lambda = 1$). This may be attributed to the local nature of KTD when using the Gaussian kernel. In addition, varying the stepsize has a relatively small effect on KTD($\lambda$). The Gaussian kernel, like other shift-invariant kernels, provides an implicitly normalized update rule, which is known to be less sensitive to the stepsize. Based on Figure 4, the optimal eligibility trace rate and initial stepsize, $\lambda = 0.6$ and $\eta_0 = 0.3$, are selected for KTD with kernel size $h = 0.2$.

The learning curve of KTD($\lambda$) is compared to that of the conventional TD algorithm, TD($\lambda$). The optimal parameters employed in both algorithms are based on the experimental evaluation. In TD($\lambda$), $\lambda = 1$ and $\eta_0 = 0.1$ are applied. The RMS error is averaged over 50 Monte Carlo runs for 1000 trials. Comparative learning curves are given in Figure 5.

In this experiment, we confirm the ability of KTD($\lambda$) to handle the function approximation problem when the fixed policy yields a state value function that is linear in the state representation. Both algorithms reach a mean RMS value of around 0.06. As expected, TD($\lambda$) converges faster to the optimal solution because of the linear nature of the problem. KTD($\lambda$) converges more slowly than TD($\lambda$), but it is also able to approximate the value function properly.

Figure 4: Performance comparison over different combinations of eligibility trace rates $\lambda$ and initial step sizes $\eta_0$ in KTD($\lambda$) with $h = 0.2$. The vertical line segments contain the mean RMS value after 100 trials (top marker), 500 trials (middle marker), and 1000 trials (bottom marker).

Figure 5: Learning curves of KTD($\lambda$) and TD($\lambda$). The solid lines show the mean RMS error and the dashed lines show the $\pm$ standard deviations over 50 Monte Carlo runs.

In this sense, the KTD algorithm is open to a wider class of problems than its linear counterpart.

6.2. Nonlinear Case. The previous section showed the performance of KTD($\lambda$) on the problem of estimating a state value function that is a linear function of the given state representation. The same problem can be turned into a nonlinear one by modifying the reward values in the chain such that the resulting state value function $V^{*}$ is no longer a linear function of the states.

Figure 6: A 13-state Markov chain. In states from 2 to 12, each state transition has probability 0.5, and state 1 has transition probability 1 to the absorbing state 0. Note that the optimal state value function can be represented as a nonlinear function of the states, and the corresponding reward values are assigned to each state.

The number of states and the state representations remain the same as in the previous section. However, the optimal value function $V^{*}$ becomes nonlinear with respect to the representation of the states; namely, $V^{*} = [0, -0.2, -0.6, -1.4, -3, -6.2, -12.6, -13.4, -13.5, -14.45, -15.975, -19.2125, -25.5938]$ for states 0 to 12. This implies that the reward values for each state are different from the ones given for the linear case (Figure 6).

Again, to evaluate the performance, after each trial is completed the estimated state value is compared to the optimal state value $V^{*}$ using the RMS error (26). For KTD($\lambda$), the Gaussian kernel (28) is applied and a kernel size $h = 0.2$ is chosen. Figure 7 shows the average RMS error over 10 Monte Carlo runs for 1000 trials.

The combination of $\lambda = 0.4$ and $\eta_0 = 0.3$ shows the best performance, but the $\lambda = 0$ case also performs well. Unlike TD($\lambda$) [6], there is no dominant value of $\lambda$ in KTD($\lambda$). Recall that it has been proved that convergence is guaranteed for linearly independent representations of the states, which is automatically fulfilled in KTD($\lambda$) when the kernel is strictly positive definite. Therefore, the differences are rather due to the convergence speed, which is controlled by the interaction between the step size and the eligibility trace.

The average RMS error over 50 Monte Carlo runs is compared with Gaussian process temporal difference (GPTD) [15] and TD($\lambda$) in Figure 8. The purpose of the GPTD implementation is to provide a comparison among kernelized value function approximations. Here, the applied optimal parameters are $\lambda = 0.4$, $\eta_0 = 0.3$, and $h = 0.2$ for KTD($\lambda$); $\lambda = 1$, $\sigma^2 = 0.5$, and $h = 0.2$ for GPTD; and $\lambda = 0.8$ and $\eta_0 = 0.1$ for TD($\lambda$).

The linear function approximation TD($\lambda$) (blue line) cannot estimate the optimal state values. KTD($\lambda$) outperforms the linear algorithm, as expected, since the Gaussian kernel is strictly positive definite. GPTD also learns the target state values, but it fails to reach error values as low as those of KTD. GPTD is sensitive to the selection of the noise covariance $\sigma^2$: if the value is small, the system becomes unstable, and larger values cause the learning to slow down.

Figure 7: Performance comparison over different combinations of $\lambda$ and the initial stepsize $\eta_0$ in KTD($\lambda$) with $h = 0.2$. The plotted segments contain the mean RMS value after 100 trials (top segment), 500 trials (middle segment), and 1000 trials (bottom segment).

Figure 8: Learning curves of KTD($\lambda$), GPTD, and TD($\lambda$). The solid lines show the mean RMS error and the dashed lines represent the $\pm$ standard deviation over 50 Monte Carlo runs.

GPTD models the residuals, the difference between the expected return and the actual return, as a Gaussian process. This assumption does not hold true for the Markov chain in Figure 6. As we can observe in Figure 8, KTD($\lambda$) reaches a mean RMS value of around 0.07, while the mean values for GPTD and TD($\lambda$) are around 0.2 and 1.8, respectively.

In these synthetic examples, we presented experimental results for approximating the state value function under a fixed policy. We observed that KTD($\lambda$) performs well on both linear and nonlinear function approximation problems. In addition, in the previous section we showed how the linear independence of the input state representations can affect the performance of the algorithms. The use of strictly positive definite kernels in KTD($\lambda$) implies the linear independence condition, and thus the algorithm converges for all $\lambda \in [0, 1]$. In the following section, we apply the extended KTD algorithm to estimate the action value function, which can be employed in finding a proper control policy for RLBMI tasks.

7. Experimental Results on Neural Decoding

In our RLBMI experiments, we map the monkey's neural signal to action directions (computer cursor/robot arm position). The agent starts in a naive state, but the subject has been trained to receive rewards from the environment. Once it reaches the assigned target, the system and the subject earn a reward, and the agent updates its neural state decoder. Through iteration, the agent learns how to correctly translate neural states into action directions.

7.1. Open-Loop RLBMI. In open-loop RLBMI experiments, the output of the agent does not directly change the state of the environment, because the experiments are performed with prerecorded data. The external device is updated based only on the actual monkey's physical response. In this sense, we only consider the monkey's neural states from successful trials to train the agent. The goal of these experiments is to evaluate the system's capability to predict the proper state to action mapping based on the monkey's neural states and to assess the viability of further closed-loop experiments.

7.1.1. Environment. The data employed in these experiments is provided by SUNY Downstate Medical Center. A female bonnet macaque is trained for a center-out reaching task allowing 8 action directions. After the subject attains about an 80% success rate, microelectrode arrays are implanted in the motor cortex (M1). Animal surgery is performed under the Institutional Animal Care and Use Committee (IACUC) regulations and assisted by the Division of Laboratory Animal Resources (DLAR) at SUNY Downstate Medical Center.

From the 96-channel recordings, a set of 185 units is obtained after sorting. The neural states are represented by the firing rates of each unit in a 100 ms window. There is a set of 8 possible targets and action directions. Every trial starts at the center point, and the distance from the center to each target is 4 cm; anything within a radius of 1 cm of the target point is considered a valid reach.

7.1.2. Agent. In the agent, $Q$-learning via kernel temporal difference ($Q$-KTD)($\lambda$) is applied to neural decoding. For $Q$-KTD($\lambda$), we employ the Gaussian kernel (28). After the neural states are preprocessed by normalizing their dynamic range to lie between $-1$ and $1$, they are input to the system. Based on the preprocessed neural states, the system predicts which direction the computer cursor will move. Each output unit represents one of the 8 possible directions, and among the 8 outputs one action is selected by the $\epsilon$-greedy method [34]. The action corresponding to the unit with the highest $Q$ value is selected with probability $1 - \epsilon$; otherwise, any other action is selected at random. The performance is evaluated by checking whether the updated position reaches the assigned target, and depending on the updated position a reward value is assigned to the system.

Table 1: Average success rates of $Q$-KTD in open-loop RLBMI.

Epochs      1     2     3     4     5     6     7
2 target   0.44  0.96  0.99  0.99  0.97  0.99  0.99
4 target   0.41  0.73  0.76  0.95  0.99  0.99  0.99
8 target   0.32  0.65  0.79  0.89  0.96  0.98  0.98

7.1.3. Results on Single Step Tasks. Here, the targets should be reached within a single step: rewards from the environment are received after a single step, and one action is performed by the agent per trial. The assignment of reward is based on the 1-0 distance to the target, that is, $\mathrm{dist}(x, d) = 0$ if $x = d$ and $\mathrm{dist}(x, d) = 1$ otherwise. Once the cursor reaches the assigned target, the agent gets a positive reward $+0.6$; otherwise, it receives a negative reward $-0.6$ [35]. An exploration rate $\epsilon = 0.01$ and a discount factor $\gamma = 0.9$ are applied. Also, we consider $\lambda = 0$ since our experiment performs single step updates per trial. In this experiment, the firing rates of the 185 units in 100 ms windows are time-embedded using a 6th order tap delay. This creates a representation space where each state is a vector with 1295 dimensions.

We start with the simplest version of the problem by considering only 2 targets (right and left). The total number of trials is 43 for the 2 targets. For $Q$-KTD, the kernel size $h$ is heuristically chosen based on the distribution of the mean squared distances between pairs of input states: let $s = E[\| x_i - x_j \|^{2}]$; then $h = \sqrt{s/2}$. For this particular data set, the above heuristic gives a kernel size $h = 7$. The stepsize $\eta = 0.3$ is selected based on the stability bound that was derived for the kernel least mean square algorithm [25]:

\[
\eta < \frac{N}{\mathrm{tr}[G_{\phi}]} = \frac{N}{\sum_{j=1}^{N} \kappa(x(j), x(j))} = 1, \qquad (29)
\]
where $G_{\phi}$ is the Gram matrix. After the 43 trials, we count the number of trials which received a positive reward, and the success rate is averaged over 50 Monte Carlo runs. The performance of the $Q$-KTD algorithm is compared with $Q$-learning via a time delayed neural network ($Q$-TDNN) and the online selective kernel-based temporal difference learning algorithm ($Q$-OSKTD) [23] in Figure 9. Note that TDNN is a conventional approach to function approximation and has already been applied to RLBMI experiments for neural decoding [1, 2]. OSKTD is a kernel-based temporal difference algorithm emphasizing online sparsification.
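Both quantities above can be computed directly from the data. The sketch below assumes `X` is an array of preprocessed neural state vectors (one per row); the $h = \sqrt{s/2}$ form of the heuristic is our reading of the rule stated above, and for the Gaussian kernel the bound of (29) evaluates to 1 since $\kappa(x, x) = 1$.

```python
import numpy as np

def gaussian_kernel(a, b, h):
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * h ** 2))

def kernel_size_heuristic(X):
    # s = E[||x_i - x_j||^2] over pairs of input states; h = sqrt(s / 2)
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.sqrt(np.mean(sq_dists) / 2.0)

def klms_stepsize_bound(X, h):
    # eta < N / tr(G_phi) = N / sum_j kappa(x_j, x_j), cf. (29)
    return X.shape[0] / sum(gaussian_kernel(x, x, h) for x in X)
```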

Both $Q$-KTD and $Q$-OSKTD reach around a 100% success rate after 2 epochs. In contrast, the average success rate of $Q$-TDNN increases slowly yet never reaches the same performance as $Q$-KTD.

Figure 9: Comparison of the average learning curves over 50 Monte Carlo runs among $Q$-TDNN, $Q$-OSKTD, and $Q$-KTD. Solid lines show the mean success rates and the dashed lines show the confidence interval based on one standard deviation.

In the case of $Q$-OSKTD, the value function updates require one more parameter, $\mu_2$, to decide the subspace. To validate the algorithm's capability to estimate a proper policy, we set the sparsified dictionary to the same size as the number of sample observations. In $Q$-OSKTD, we observed that the subspace selection parameter plays an important role in the speed of learning. It turns out that, for the above experiment, smaller subspaces allow faster learning. In the extreme case of $Q$-OSKTD where only the current state is affected, the updates become equivalent to the update rule of $Q$-KTD.

Since all the experimental parameters are fixed over the 50 Monte Carlo runs, the confidence interval for $Q$-KTD can be simply associated with the random effects introduced by the $\epsilon$-greedy method employed for action selection with exploration; hence the narrow interval. However, with $Q$-TDNN, a larger variation in performance is observed, which shows how the initialization influences the success of learning due to local minima. It is observed that $Q$-TDNN is able to approach the $Q$-KTD performance, but most of the time the system falls into local minima. This highlights one of the advantages of KTD compared to TDNN, which is its insensitivity to initialization.

Table 1 shows the average success rates over 50 Monte Carlo runs with respect to different numbers of targets.

Figure 10: Average success rates over 50 Monte Carlo runs with respect to different filter sizes. The vertical line segments are the mean success rates after 1 epoch (bottom markers), 2 epochs (middle markers), and 20 epochs (top markers).

The first row corresponds to the mean success rates displayed in Figure 9 (red solid line); it is included in Table 1 to ease comparison with the 4- and 8-target experiments. The 4-target task involves reaching the right, up, left, and down positions from the center. Note that in all tasks 8 directions are allowed at each step. The standard deviation at each epoch is around 0.02.

One characteristic of nonparametric approaches is the growing filter structure. Here, we observe how the filter size influences the overall performance of $Q$-KTD by applying the Surprise criterion [32] and Quantization [21] methods. In the case of the 2-target center-out reaching task, we should expect the filter size to become as large as 861 units after 20 epochs without any control of the filter size. Using the Surprise criterion, the filter size can be reduced to 87 centers with acceptable performance. However, Quantization allows the filter size to be reduced to 10 units while maintaining success rates above 90%. Figure 10 shows the effect of the filter size in the 2-target experiment using the Quantization approach. For filter sizes as small as 10 units, the average success rates remain stable. With 10 units, the algorithm shows a learning speed similar to that of the linearly growing filter size, with success rates above 90%. Note that quantization limits the capacity of the kernel filter since fewer units than samples are employed, and thus it helps to avoid overfitting.

In the 2-target center-out reaching task, quantized $Q$-KTD shows satisfactory results in terms of initialization and computational cost. Further analysis of $Q$-KTD is conducted on a larger number of targets: we increase the number of targets from 2 to 8. All experimental parameters are kept the same as for the 2-target experiment; the only change is the stepsize, $\eta = 0.5$. A total of 178 trials are used for the 8-target reaching task.

To gain more insight into the algorithm, we observe the interplay between the Quantization size $\epsilon_U$ and the kernel size $h$.

Figure 11: The effect of filter size control on the 8-target single-step center-out reaching task, for final filter sizes of 178, 133, 87, and 32. The average success rates are computed over 50 Monte Carlo runs after the 10th epoch.

Based on the distribution of squared distances between pairs of input states, various kernel sizes ($h = 0.5, 1, 1.5, 2, 3, 5, 7$) and Quantization sizes ($\epsilon_U = 1, 110, 120, 130$) are considered. The corresponding success rates for final filter sizes of 178, 133, 87, and 32 are displayed in Figure 11.

With a final filter size of 178 (blue line), the success rates are superior to those of any other filter size for every kernel size tested, since it contains all the input information. Especially for small kernel sizes ($h \le 2$), success rates above 96% are observed. Moreover, note that even after reduction of the state information (red line), the system still produces acceptable success rates for kernel sizes ranging from 0.5 to 2 (around 90% success rates).

Among the best performing kernel sizes, we favor the largest one since it provides better generalization guarantees. In this sense, a kernel size $h = 2$ can be selected, since this is the largest kernel size that considerably reduces the filter size and yields a neural state to action mapping that performs well (around 90% success rates). In the case of kernel size $h = 2$ with a final filter size of 178, the system reaches 100% success rates after 6 epochs with a maximum variance of 4%. As we can see from the number of units, higher representation capacity is required to obtain the desired performance as the task becomes more complex. Nevertheless, results on the 8-target center-out reaching task show that the method can effectively learn the brain state-action mapping for this task with a reasonable complexity.

7.1.4. Results on Multistep Tasks. Here, we develop a more realistic scenario: we extend the task to multistep and multitarget experiments. This case allows us to explore the role of the eligibility traces in $Q$-KTD($\lambda$). The price paid for this extension is that the eligibility trace rate $\lambda$ now needs to be selected according to the best observed performance.

Figure 12: Reward distribution for the right target. The black diamond is the initial position and the purple diamonds show the possible directions, including the assigned target direction (red diamond).

Testing based on the same experimental setup employed for the single step task, that is, with a discrete reward value assigned only at the target, causes extremely slow learning, since not enough guidance is given: the system requires long periods of exploration until it actually reaches the target. Therefore, we employ a continuous reward distribution around the selected target, defined by the following expression:

\[
r(s) =
\begin{cases}
p_{\mathrm{reward}}\, G(s) & \text{if } G(s) > 0.1, \\
n_{\mathrm{reward}} & \text{if } G(s) \le 0.1,
\end{cases} \qquad (30)
\]
where $G(s) = \exp\!\left[ -(s - \mu)^{\top} C_{\theta}^{-1} (s - \mu) \right]$, $s \in \mathbb{R}^{2}$ is the position of the cursor, $p_{\mathrm{reward}} = 1$, and $n_{\mathrm{reward}} = -0.6$. The mean vector $\mu$ corresponds to the selected target location, and the covariance matrix

\[
C_{\theta} = R_{\theta} \begin{pmatrix} 7.5 & 0 \\ 0 & 0.1 \end{pmatrix} R_{\theta}^{\top}, \qquad R_{\theta} = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix}, \qquad (31)
\]
which depends on the angle $\theta$ of the selected target as follows: for target indices one and five the angle is $0$, for two and six it is $-\pi/4$, for three and seven it is $\pi/2$, and for four and eight it is $\pi/4$. (Here, the target indexes follow the locations depicted in Figure 6 of [22].) Figure 12 shows the reward distribution for target index one. The same form of distribution is applied to the other directions, centered at the assigned target point.
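A sketch of the continuous reward of (30)-(31) for a single target follows. We write $G$ with a negative exponent so that it decays away from the target, which is how we read the distribution illustrated in Figure 12; the function signature is an illustrative choice.

```python
import numpy as np

def reward(s, target, theta, p_reward=1.0, n_reward=-0.6):
    """Continuous reward around a target located at `target` with orientation angle theta."""
    R = np.array([[np.cos(theta), np.sin(theta)],
                  [-np.sin(theta), np.cos(theta)]])
    C = R @ np.diag([7.5, 0.1]) @ R.T            # covariance matrix, cf. (31)
    d = np.asarray(s, dtype=float) - np.asarray(target, dtype=float)
    G = np.exp(-d @ np.linalg.solve(C, d))       # Gaussian-shaped proximity measure, cf. (30)
    return p_reward * G if G > 0.1 else n_reward
```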

Once the system reaches the assigned target, it earns a maximum reward of $+1$, and it receives partial rewards according to (30) during the approaching stage. When the system earns the maximum reward, the trial is classified as a successful trial. The maximum number of steps per trial is limited such that the cursor must approach the target in a straight line trajectory. Here, we also control the complexity of the task by allowing different numbers of targets and steps; namely, 2-step 4-target (right, up, left, and down) and 4-step 3-target (right, up, and down) experiments are performed. Increasing the number of steps per trial amounts to making smaller jumps for each action. After each epoch, the number of successful trials is counted for each target direction. Figure 13 shows the learning curves for each target and the average success rates.

A larger number of steps results in lower success rates. However, both cases (two and four steps) obtain an average success rate above 60% within 1 epoch. The performances show that all directions can achieve success rates above 70% after convergence, which encourages the application of the algorithm to online scenarios.

7.2. Closed-Loop RLBMI Experiments. In closed-loop RLBMI experiments, the behavioral task is a reaching task using a robotic arm. The decoder controls the robot arm's action direction by predicting the monkey's intent based on its neuronal activity. If the robot arm reaches the assigned target, a reward is given to both the monkey (food reward) and the decoder (positive value). Notice that the two intelligent systems learn coadaptively to accomplish the goal. These experiments are conducted in cooperation with the Neuroprosthetics Research Group at the University of Miami. The performance is evaluated in terms of task completion accuracy and speed. Furthermore, we provide a methodology to tease apart the influence of each one of the systems of the RLBMI on the overall performance.

7.2.1. Environment. During pretraining, a marmoset monkey was trained to perform a target reaching task, namely, moving a robot arm to two spatial locations, denoted as A trials and B trials. The monkey was taught to associate changes in motor activity with A trials and to produce static motor responses during B trials. Once a target is assigned, a beep signals the start of the trial. To control the robot during the user training phase, the monkey is required to steadily place its hand on a touch pad for 700∼1200 ms. This action produces a go beep that is followed by the activation of one of the two target LEDs (A trial: red light for the left direction; B trial: green light for the right direction). The robot arm goes to a home position, namely, the center position between the two targets. Its gripper shows an object: a food reward such as a waxworm or marshmallow for A trials, and an undesirable object (a wooden bead) for B trials. To move the robot to the A location, the monkey needed to reach out and touch a sensor within 2000 ms, and to make the robot reach the B target, the monkey needed to keep its arm motionless on the touch pad for 2500 ms. When the monkey successfully moved the robot to the correct target, the target LEDs would blink and the monkey would receive a food reward (for both the A and B targets).

After the monkey is trained to perform the assigned task properly, a microelectrode array (16-channel tungsten microelectrode arrays, Tucker Davis Technologies, FL) is surgically implanted under isoflurane anesthesia and sterile conditions. Neural states from the motor cortex (M1) are recorded; these neural states become the inputs to the neural decoder.

Figure 13: The learning curves for multistep multitarget tasks. (a) 2-step 4-target. (b) 4-step 3-target.

All surgical and animal care procedures were consistent with the National Research Council Guide for the Care and Use of Laboratory Animals and were approved by the University of Miami Institutional Animal Care and Use Committee.

In the closed-loop experiments, after the initial holding time that produces the go beep, the robotic arm's position is updated based solely on the monkey's neural states. Differently from the user pretraining sessions, the monkey is not required to perform any movement. During the real-time experiment, 14 neurons are obtained from 10 electrodes. The neural states are represented by the firing rates in a 2 sec window following the go signal.

7.2.2. Agent. For the BMI decoder, we use $Q$-learning via kernel temporal differences ($Q$-KTD)($\lambda$). One big difference between open-loop and closed-loop applications is the amount of accessible data: in the closed-loop experiments, we can only get information about the neural states that have been observed up to the present. However, in the previous offline experiments, normalization and kernel selection were conducted offline based on the entire data set. It is not possible to apply the same method to the online setting, since we only have information about the input states up to the present time. Normalization is a scaling procedure that interplays with the choice of the kernel size; proper selection of the kernel size brings proper scaling to the data. Thus, in contrast to the previous open-loop experiments, normalization of the input neural states is not applied, and the kernel size is automatically selected given the inputs.

The Gaussian kernel (28) is employed, and the kernel size $h$ is automatically selected based on the history of inputs. Note that in the closed-loop experiments the dynamic range of the states varies from experiment to experiment. Consequently, the kernel size needs to be readjusted each time a new experiment takes place, and it cannot be determined beforehand. At each time, the distances between the current state and the previously observed states are computed to obtain the output values, $Q$ in this case. Therefore, we use the distance values to select the kernel size as follows:

\[
h_{\mathrm{temp}}(n) = \sqrt{\frac{1}{2(n-1)} \sum_{i=1}^{n-1} \| x(i) - x(n) \|^{2}},
\]
\[
h(n) = \frac{1}{n} \left[ \sum_{i=1}^{n-1} h(i) + h_{\mathrm{temp}}(n) \right]. \qquad (32)
\]

Using the squared distance between pairs of previously seen input states, we can obtain an estimate of the mean distance. This value is also averaged along with the past kernel sizes to obtain the current kernel size.
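A sketch of this running kernel size computation (32), maintained online as new neural states arrive, is given below; the class layout and the bootstrap value used before any distances are available are our own illustrative choices.

```python
import numpy as np

class OnlineKernelSize:
    """Running kernel size estimate of (32), updated as each new state x(n) arrives."""

    def __init__(self):
        self.states = []     # x(1), ..., x(n-1)
        self.h_history = []  # h(1), ..., h(n-1)

    def update(self, x_new):
        x_new = np.asarray(x_new, dtype=float)
        if not self.states:                      # no distances available yet
            h = 1.0                              # arbitrary bootstrap value (assumption)
        else:
            n = len(self.states) + 1
            sq = [np.sum((x - x_new) ** 2) for x in self.states]
            h_temp = np.sqrt(np.sum(sq) / (2.0 * (n - 1)))       # h_temp(n), cf. (32)
            h = (np.sum(self.h_history) + h_temp) / n            # h(n), cf. (32)
        self.states.append(x_new)
        self.h_history.append(h)
        return h
```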

Moreover, we consider $\gamma = 1$ and $\lambda = 0$, since our experiments perform single step trials. A stepsize $\eta = 0.5$ is applied. The output represents the 2 possible directions (left and right), and the robot arm moves based on the estimated output from the decoder.

7.2.3. Results. The overall performance is evaluated by checking whether the robot arm reaches the assigned target. Once the robot arm reaches the target, the decoder gets a positive reward $+1$; otherwise, it receives a negative reward $-1$.

Table 2 shows the decoder performance over 4 days in terms of success rates. Each day corresponds to a separate experiment. On Day 1, the experiment had a total of 20 trials (10 A trials and 10 B trials). The overall success rate was 90%; only the first trial for each target was incorrectly assigned.

Figure 14: Performance of $Q$-learning via KTD in the closed-loop RLBMI controlled by a monkey for Day 1 (left) and Day 3 (right): the success ($+1$) or failure ($-1$) index of each trial (top), the change of the TD error (middle), and the change of the $Q$-values (bottom).

Table 2: Success rates of $Q$-KTD in closed-loop RLBMI.

        Total trial numbers (total A, B trials)    Success rates (%)
Day 1   20 (10, 10)                                90.00
Day 2   32 (26, 26)                                84.38
Day 3   53 (37, 36)                                77.36
Day 4   52 (37, 35)                                78.85

Note that on each day the same experimental setup was utilized, and the decoder was initialized in the same way each day; we did not use pretrained parameters to initialize the system. To understand the variation of the success rates across days, we look at the performance of Day 1 and Day 3. Figure 14 shows the decoder performance for the 2 experiments.

Although the success rate for Day 3 is not as high as that of Day 1, both experiments show that the algorithm learns an appropriate neural state to action map. Even though there is variation among the neural states within each day, the decoder adapts well to minimize the TD error, and the $Q$-values converge to the desired values for each action. Because this is a single step task and the reward $+1$ is assigned for a successful trial, the estimated action value $Q$ is desired to be close to $+1$.

It is observed that the TD error and $Q$-values oscillate; the drastic changes in the TD error or $Q$-value correspond to the missed trials. The overall performance can be evaluated by checking whether the robot arm reaches the desired target

Figure 15: Estimated policy for the projected neural states from Day 1 (left column) and Day 3 (right column): (a) after 3 trials, (b) after 3 trials, (c) after 10 trials, (d) after 30 trials, (e) after 20 trials, and (f) after 57 trials. The failed trials during the closed-loop experiment are marked as red stars (missed A trials) and green dots (missed B trials).


(the top plots in Figure 14). However, this assessment does not show what causes the changes in the system values. In addition, it is hard to know how the two separate intelligent systems interact during learning and how the neural states affect the overall performance.

Under the coadaptation scenario in the RLBMI architecture, it is obvious that if one system does not perform properly, it will cause detrimental effects on the performance of the other system. If the BMI decoder does not give proper updates to the robotic device, it will confuse the user conducting the task, and if the user gives improper state information or the translation is wrong, the resulting update may fail even though the BMI decoder was able to find the optimal mapping function.

Using the methodology introduced in [36], we can observe how the decoder effectively learns a good state to action mapping and how the neural states affect the prediction performance. Figure 15 shows how each participant (the agent and the user) influences the overall performance in both successful and missed trials, and how the agent adapts to the environment. By applying principal component analysis (PCA), the high-dimensional neural states can be visualized in two dimensions using the two largest principal components. In this two-dimensional space of projected neural states, we can visualize the estimated policy as well.

We observe the behavior of the two systems at the beginning, intermediate, and final stages of the experiment by using the neural states that have been observed as well as the decoder learned up to the given stage. It is evident that the decoder can predict nonlinear policies. Day 1 (left column in Figure 15) shows that the neural states from the two classes are well separable. It was noted during Day 3 that the monkey seemed less engaged in the task than on Day 1. This suggests the possibility that during some trials the monkey was distracted and may not have been producing a consistent set of neural outputs. We are also able to see this phenomenon in the plots (right column in Figure 15): most of the neural states that were misclassified appear to be closer to the states corresponding to the opposite target in the projected state space. However, the estimated policy shows that the system learns effectively. Note that the initially misclassified A trials (red stars in Figure 15(d), which are located near the estimated policy boundary) are assigned to the right direction once learning has been accomplished (Figure 15(f)). It is a remarkable fact that the system adapts to the environment online.

8. Conclusions

The advantages of KTD($\lambda$) in neural decoding problems were observed. The key features of this kernel-based learning algorithm are its capability for nonlinear function approximation and its convergence guarantees. We also examined the capability of the extended KTD algorithm ($Q$-KTD($\lambda$)) in both open-loop and closed-loop reinforcement learning brain machine interface (RLBMI) experiments to perform reaching tasks.

In the open-loop experiments, the results showed that $Q$-KTD($\lambda$) can effectively learn the brain state-action mapping and offers performance advantages over conventional nonlinear function approximation methods such as time-delay neural networks. We observed that $Q$-KTD($\lambda$) overcomes the main issues of conventional nonlinear function approximation methods, such as local minima and the need for proper initialization.

Results on the closed-loop RLBMI experiments showed that the algorithm succeeds in finding a proper mapping between neural states and desired actions. Its advantages are that it depends neither on the initialization nor on any prior information about the input states; moreover, parameters can be chosen on the fly based on the observed input states. We also observed how the two intelligent systems coadaptively learn in an online reaching task. The results showed that KTD is powerful for practical applications due to its nonlinear approximation capabilities in online learning.

The observation and analysis of KTD($\lambda$) give us a basic idea of how this algorithm behaves. However, in the case of $Q$-KTD($\lambda$), the convergence analysis remains challenging, since $Q$-learning contains both a learning policy and a greedy policy. For $Q$-KTD($\lambda$), the convergence proof for $Q$-learning using temporal difference (TD)($\lambda$) with linear function approximation in [37] can provide basic intuition about the role of function approximation in the convergence of $Q$-learning.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is partially supported by DARPA Contract N66001-10-C-2008. The authors would like to thank Pratik Chhatbar and Brandi Marsh for collecting the center-out reaching task data for the open loop experiments.

References

[1] J DiGiovanna B Mahmoudi J Fortes J C Principe and JC Sanchez ldquoCoadaptive brain-machine interface via reinforce-ment learningrdquo IEEE Transactions on Biomedical Engineeringvol 56 no 1 pp 54ndash64 2009

[2] BMahmoudi Integrating robotic actionwith biologic perceptiona brainmachine symbiosis theory [PhD dissertation] Universityof Florida Gainesville Fla USA 2010

[3] E A Pohlmeyer B Mahmoudi S Geng N W Prins and J CSanchez ldquoUsing reinforcement learning to provide stable brain-machine interface control despite neural input reorganizationrdquoPLoS ONE vol 9 no 1 Article ID e87253 2014

[4] S Matsuzaki Y Shiina and Y Wada ldquoAdaptive classificationfor brainmachine interface with reinforcement learningrdquo inProceedings of the 18th International Conference on NeuralInformation Processing vol 7062 pp 360ndash369 Shanghai ChinaNovember 2011

[5] M J Bryan S A Martin W Cheung and R P N RaoldquoProbabilistic co-adaptive brain-computer interfacingrdquo Journalof Neural Engineering vol 10 no 6 Article ID 066008 2013

Computational Intelligence and Neuroscience 17

[6] R. S. Sutton, "Learning to predict by the methods of temporal differences," Machine Learning, vol. 3, no. 1, pp. 9–44, 1988.

[7] J. A. Boyan, Learning evaluation functions for global optimization [Ph.D. dissertation], Carnegie Mellon University, 1998.

[8] S. J. Bradtke and A. G. Barto, "Linear least-squares algorithms for temporal difference learning," Machine Learning, vol. 22, pp. 33–57, 1996.

[9] A. Geramifard, M. Bowling, M. Zinkevich, and R. S. Sutton, "iLSTD: eligibility traces and convergence analysis," in Advances in Neural Information Processing Systems, pp. 441–448, 2007.

[10] R. S. Sutton, C. Szepesvari, and H. R. Maei, "A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation," in Proceedings of the 22nd Annual Conference on Neural Information Processing Systems (NIPS '08), pp. 1609–1616, MIT Press, December 2008.

[11] R. S. Sutton, H. R. Maei, D. Precup, et al., "Fast gradient-descent methods for temporal-difference learning with linear function approximation," in Proceedings of the 26th International Conference on Machine Learning (ICML '09), pp. 993–1000, June 2009.

[12] J. N. Tsitsiklis and B. Van Roy, "An analysis of temporal-difference learning with function approximation," IEEE Transactions on Automatic Control, vol. 42, no. 5, pp. 674–690, 1997.

[13] S. Haykin, Neural Networks and Learning Machines, Prentice Hall, 2009.

[14] B. Scholkopf and A. J. Smola, Learning with Kernels, MIT Press, 2002.

[15] Y. Engel, Algorithms and representations for reinforcement learning [Ph.D. dissertation], Hebrew University, 2005.

[16] X. Xu, T. Xie, D. Hu, and X. Lu, "Kernel least-squares temporal difference learning," International Journal of Information Technology, vol. 11, no. 9, pp. 54–63, 2005.

[17] J. Bae, P. Chhatbar, J. T. Francis, J. C. Sanchez, and J. C. Principe, "Reinforcement learning via kernel temporal difference," in Proceedings of the 33rd Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC '11), pp. 5662–5665, 2011.

[18] S. Zhao, From fixed to adaptive budget robust kernel adaptive filtering [Ph.D. dissertation], University of Florida, Gainesville, Fla, USA, 2012.

[19] Y. Engel, S. Mannor, and R. Meir, "The kernel recursive least-squares algorithm," IEEE Transactions on Signal Processing, vol. 52, no. 8, pp. 2275–2285, 2004.

[20] X. Xu, "A sparse kernel-based least-squares temporal difference algorithm for reinforcement learning," in Proceedings of the 2nd International Conference on Natural Computation, vol. 4221, pp. 47–56, 2006.

[21] B. Chen, S. Zhao, P. Zhu, and J. C. Principe, "Quantized kernel least mean square algorithm," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 1, pp. 22–32, 2012.

[22] J. Bae, L. S. Giraldo, P. Chhatbar, J. T. Francis, J. C. Sanchez, and J. C. Principe, "Stochastic kernel temporal difference for reinforcement learning," in Proceedings of the 21st IEEE International Workshop on Machine Learning for Signal Processing (MLSP '11), pp. 1–6, IEEE, September 2011.

[23] X. Chen, Y. Gao, and R. Wang, "Online selective kernel-based temporal difference learning," IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 12, pp. 1944–1956, 2013.

[24] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998.

[25] W. Liu, J. C. Principe, and S. Haykin, Kernel Adaptive Filtering: A Comprehensive Introduction, Wiley, 2010.

[26] J. Mercer, "Functions of positive and negative type and their connection with the theory of integral equations," Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 209, pp. 415–446, 1909.

[27] P. Dayan and T. J. Sejnowski, "TD(λ) converges with probability 1," Machine Learning, vol. 14, no. 3, pp. 295–301, 1994.

[28] H. J. Kushner and D. S. Clark, Stochastic Approximation Methods for Constrained and Unconstrained Systems, vol. 26 of Applied Mathematical Sciences, Springer, New York, NY, USA, 1978.

[29] C. J. C. H. Watkins, Learning from delayed rewards [Ph.D. dissertation], King's College, London, UK, 1989.

[30] C. Szepesvari, Algorithms for Reinforcement Learning, edited by R. J. Brachman and T. Dietterich, Morgan & Claypool, 2010.

[31] S. Zhao, B. Chen, P. Zhu, and J. C. Principe, "Fixed budget quantized kernel least-mean-square algorithm," Signal Processing, vol. 93, no. 9, pp. 2759–2770, 2013.

[32] W. Liu, I. Park, and J. C. Principe, "An information theoretic approach of designing sparse kernel adaptive filters," IEEE Transactions on Neural Networks, vol. 20, no. 12, pp. 1950–1961, 2009.

[33] J. A. Boyan, "Technical update: least-squares temporal difference learning," Machine Learning, vol. 49, pp. 233–246, 2002.

[34] C. J. C. H. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.

[35] J. C. Sanchez, A. Tarigoppula, J. S. Choi, et al., "Control of a center-out reaching task using a reinforcement learning Brain-Machine Interface," in Proceedings of the 5th International IEEE/EMBS Conference on Neural Engineering (NER '11), pp. 525–528, May 2011.

[36] J. Bae, L. G. Sanchez Giraldo, E. A. Pohlmeyer, J. C. Sanchez, and J. C. Principe, "A new method of concurrently visualizing states, values, and actions in reinforcement based brain machine interfaces," in Proceedings of the 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC '13), pp. 5402–5405, July 2013.

[37] F. S. Melo, S. P. Meyn, and M. I. Ribeiro, "An analysis of reinforcement learning with function approximation," in Proceedings of the 25th International Conference on Machine Learning, pp. 664–671, July 2008.


(x(1), d), (x(2), d), ..., (x(m), d), and making d ≜ y(m+1), we can obtain the update of the function f after the whole sequence of m inputs has been observed as

$$ f \leftarrow f + \sum_{n=1}^{m} \Delta f_{n} \qquad (6) $$

$$ = f + \eta \sum_{n=1}^{m} \left[ d - f(x(n)) \right] \phi(x(n)). \qquad (7) $$

Here, Δf_n = η[d − ⟨f, φ(x(n))⟩]φ(x(n)) are the instantaneous updates of the function f from the input data, based on the kernel expansion (5).

The key observation for extending the supervised learning approach to the TD method is that the difference between the desired and predicted output at time n can be written as

$$ d - y(n) = \sum_{k=n}^{m} \left( y(k+1) - y(k) \right), \qquad (8) $$

where y(m+1) ≜ d. Using this expansion in terms of the differences between sequential predictions, we can update the system at each time step. By replacing the error d − f(x(n)) in (7) using its relation with the temporal differences (8) and rearranging the equation as in [6], we obtain the following update:

$$ f \leftarrow f + \eta \sum_{n=1}^{m} \left[ f(x(n+1)) - f(x(n)) \right] \sum_{k=1}^{n} \phi(x(k)). \qquad (9) $$

In this case, all predictions are weighted equally. Using exponential weighting on recency yields the following update rule:

$$ f \leftarrow f + \eta \sum_{n=1}^{m} \left[ f(x(n+1)) - f(x(n)) \right] \sum_{k=1}^{n} \lambda^{n-k} \phi(x(k)). \qquad (10) $$

Here, λ represents an eligibility trace rate that is added to the averaging process over temporal differences in order to emphasize the most recently observed states and to deal efficiently with delayed rewards.

The above update rule (10) is called kernel temporal difference (KTD)(λ) [17]. The difference between the predictions of sequential inputs is called the temporal difference (TD) error:

$$ e_{\mathrm{TD}}(n) = f(x(n+1)) - f(x(n)). \qquad (11) $$

Note that the temporal differences f(x(n+1)) − f(x(n)) can be rewritten using the kernel expansions as ⟨f, φ(x(n+1))⟩ − ⟨f, φ(x(n))⟩. This yields the instantaneous update of the function f as Δf_n = η⟨f, φ(x(n+1)) − φ(x(n))⟩ Σ_{k=1}^{n} λ^{n−k} φ(x(k)). Using the RKHS properties, the evaluation of the function f at a given x can be calculated from the kernel expansion.

In reinforcement learning, the prediction y(n) = f(x(n)) can be considered as the value function (1) or (2); this is how the KTD algorithm provides a nonlinear function approximation to Bellman's equation. When the prediction y(n) represents the state value function, the TD error (11) is extended to a combination of a reward and sequential value function predictions. For instance, in the case of policy evaluation, the TD error is defined as

$$ e_{\mathrm{TD}}(n) = r(n+1) + \gamma V(x(n+1)) - V(x(n)). \qquad (12) $$
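To make the update rule concrete, the sketch below implements policy-evaluation KTD(λ) with a Gaussian kernel following (10)-(12): the value estimate is stored as a kernel expansion over visited states, and every TD error is propagated back along the eligibility traces. The class and variable names (`KTDLambda`, `centers`, `coeffs`, `traces`) are illustrative choices, not part of the original formulation.

```python
import numpy as np

class KTDLambda:
    """Sketch of KTD(lambda) for policy evaluation with a Gaussian kernel.

    The value estimate is V(x) = sum_k coeffs[k] * kappa(centers[k], x),
    and each TD error updates every stored center through its eligibility
    trace, mirroring update rule (10) with the TD error of (12).
    """

    def __init__(self, kernel_size=0.2, stepsize=0.3, lam=0.6, gamma=1.0):
        self.h, self.eta, self.lam, self.gamma = kernel_size, stepsize, lam, gamma
        self.centers = []   # states x(k) that anchor the kernel expansion
        self.coeffs = []    # expansion coefficients
        self.traces = []    # eligibility trace lambda^(n-k) for each center

    def kappa(self, xi, xj):
        return np.exp(-np.sum((np.asarray(xi) - np.asarray(xj)) ** 2) / (2 * self.h ** 2))

    def value(self, x):
        return sum(a * self.kappa(c, x) for a, c in zip(self.coeffs, self.centers))

    def update(self, x, reward, x_next, terminal):
        # Decay existing traces and add a new unit centered at the current state.
        self.traces = [self.lam * e for e in self.traces]
        self.centers.append(np.asarray(x, dtype=float))
        self.coeffs.append(0.0)
        self.traces.append(1.0)
        # TD error as in (12); the terminal state's value is taken as 0.
        v_next = 0.0 if terminal else self.value(x_next)
        td = reward + self.gamma * v_next - self.value(x)
        # Apply eta * td along the eligibility trace of every center.
        self.coeffs = [a + self.eta * td * e for a, e in zip(self.coeffs, self.traces)]
        return td
```

At the end of each trial, the eligibility traces would be reset to zero, while the accumulated centers and coefficients are kept for the next trial.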

3.3. Convergence of Kernel Temporal Difference(λ). It has been shown in [6, 27] that, for an absorbing Markov chain, TD(λ) converges with probability 1 under certain conditions. Recall that the conventional TD algorithm assumes the function class to be linearly parametrized, satisfying y = w^⊤x. KTD(λ) can be viewed as a linear function approximation in the RKHS; using this relation, convergence of KTD(λ) can be obtained as an extension of the convergence guarantees already established for TD(λ).

When λ = 1, by definition, the KTD(λ = 1) procedure is equivalent to the supervised learning method (7). KTD(1) yields the same per-sequence weight changes as the least squares solution, since (9) is derived directly from supervised learning by replacing the error term using (8). Thus, the convergence of KTD(1) can be established based on the convergence of its equivalent supervised learning formulation, which was proven in [25].

Proposition 1. The KLMS algorithm converges asymptotically in the mean sense to the optimal solution under the "small-step-size" condition.

Theorem 2. When the stepsize η_n satisfies η_n ≥ 0, Σ_{n=1}^{∞} η_n = ∞, and Σ_{n=1}^{∞} η_n² < ∞, KTD(1) converges asymptotically in the mean sense to the least squares solution.

Proof. Since, by (8), the sequence of TD errors can be replaced by a multistep prediction with error e(n) = d − y(n), the result of Proposition 1 also applies to this case.

In the case of λ < 1, as shown by [27], the convergence of linear TD(λ) can be proved based on the ordinary differential equation (ODE) method introduced in [28]. This result can be easily extended to KTD(λ) as follows. Let us consider the Markov estimation problem as in [6]. An absorbing Markov chain can be described by the terminal and nonterminal sets of states T and N, the transition probabilities p_ij between nonterminal states, the transition probabilities s_ij from nonterminal states to terminal states, the vectors x_i representing the nonterminal states, the expected terminal returns d_j from the jth terminal state, and the probabilities μ_i of starting at state i. Given an initial state i ∈ N, an absorbing Markov chain generates an observation sequence of m vectors x_{i_1}, x_{i_2}, ..., x_{i_m}, where the last element x_{i_m} of the sequence corresponds to a terminal state i_m ∈ T. The expected outcome d given a sequence starting at i ∈ N is given by

$$ e^{*}_{i} \equiv E[d \mid i] \qquad (13) $$

$$ = \sum_{j \in \mathcal{T}} s_{ij} d_{j} + \sum_{j \in \mathcal{N}} p_{ij} \sum_{k \in \mathcal{T}} s_{jk} d_{k} + \cdots \qquad (14) $$

$$ = \left[ \sum_{k=0}^{\infty} Q^{k} h \right]_{i} = \left[ (I - Q)^{-1} h \right]_{i}, \qquad (15) $$

where [x]_i denotes the ith element of the array x, Q is the transition matrix with entries [Q]_{ij} = p_{ij} for i, j ∈ N, and [h]_i = Σ_{j∈T} s_{ij} d_j for i ∈ N. In linear TD(λ), a sequence of vectors w_1, w_2, ... is generated; each vector w_n is obtained after a complete observation sequence, that is, a sequence starting at a state i ∈ N and ending at a state j ∈ T with the respective return d_j. Similar to linear TD(λ), in KTD(λ) we have a sequence of functions f_1, f_2, ... (vectors in a RKHS) for which we can also write a linear update of the mean estimates of the terminal return after n sequences have been observed. If f_n is the actual function estimate after sequence n and f_{n+1} is the expected function estimate after the next sequence, we have that

$$ f_{n+1}(X) = f_{n}(X) + \eta_{n+1} H \left( f_{n}(X) - e^{*} \right), \qquad (16) $$

where H = −KD[I − (1 − λ)Q(I − λQ)^{−1}], [K]_{ij} = κ(x_i, x_j) with i, j ∈ N, D is a diagonal matrix with [D]_{ii} the expected number of times state i is visited during a sequence, and f_n(X) is a column vector of function evaluations at the state representations such that [f_n(X)]_i = f_n(x_i) = ⟨f_n, φ(x_i)⟩. Analogously to [27], the mean estimates in (16) converge appropriately if H has a full set of eigenvalues with negative real parts, for which we need K to be full rank. For this to hold, the set of vectors {φ(x_i)}_{i∈N} must be linearly independent in the RKHS, which is exactly the case when the kernel κ is strictly positive definite, as shown in the following proposition.

Proposition 3. If κ: X × X → ℝ is a strictly positive definite kernel, then, for any finite set {x_i}_{i=1}^{N} ⊆ X of distinct elements, the set {φ(x_i)} is linearly independent.

Proof. If κ is strictly positive definite, then Σ α_i α_j κ(x_i, x_j) > 0 for any set {x_i}, where x_i ≠ x_j for all i ≠ j, and any α_i ∈ ℝ such that not all α_i = 0. Suppose there exists a set {x_i} for which the {φ(x_i)} are not linearly independent. Then there must be a set of coefficients α_i ∈ ℝ, not all equal to zero, such that Σ α_i φ(x_i) = 0, which implies that ‖Σ α_i φ(x_i)‖² = 0:

$$ 0 = \sum \alpha_{i} \alpha_{j} \langle \phi(x_{i}), \phi(x_{j}) \rangle = \sum \alpha_{i} \alpha_{j} \kappa(x_{i}, x_{j}), \qquad (17) $$

which contradicts the assumption.

The following theorem is the resulting extension of Theorem T in [27] to KTD(λ).

Theorem 4. For any absorbing Markov chain, for any distribution of starting probabilities μ_i such that there are no inaccessible states, for any outcome distributions with finite expected values d_j, for any strictly positive definite kernel κ, and for any set of observation vectors {x_i}, i ∈ N, such that x_i = x_j if and only if i = j, there exists an ε > 0 such that, if η_n = η with 0 < η < ε, then, for any initial function estimate, the predictions of KTD(λ) converge in expected value to the ideal predictions of (15). If f_n denotes the function estimate after experiencing n sequences, then

$$ \lim_{n \to \infty} E \left[ f_{n}(x_{i}) \right] = E[d \mid i] = \left[ (I - Q)^{-1} h \right]_{i}, \quad \forall i \in \mathcal{N}. \qquad (18) $$

4. Q-Learning via Kernel Temporal Differences(λ)

Since the value function represents the expected cumulative reward given a policy, a policy π is better than a policy π′ when π gives a greater expected return than π′; in other words, π ≥ π′ if and only if Q^π(x, a) ≥ Q^{π′}(x, a) for all x ∈ X and a ∈ A. Therefore, the optimal action value function can be written as Q*(x(n), a(n)) = max_π Q^π(x(n), a(n)). The estimation can be done online. To maximize the expected reward E[r(n+1) | x(n), a(n), x(n+1)], the one-step Q-learning update was introduced in [29]:

$$ Q(x(n), a(n)) \leftarrow Q(x(n), a(n)) + \eta \left[ r(n+1) + \gamma \max_{a} Q(x(n+1), a) - Q(x(n), a(n)) \right]. \qquad (19) $$

At time n, an action a(n) can be selected using methods such as ε-greedy or the Boltzmann distribution, which are popular for the exploration-exploitation trade-off [30].
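As a small illustration of these two selection schemes, the snippet below draws an action from a vector of estimated Q values either ε-greedily or from a Boltzmann (softmax) distribution; the temperature parameter and the function names are illustrative choices.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.01, rng=np.random.default_rng()):
    """Pick argmax with probability 1 - epsilon, otherwise a uniformly random action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def boltzmann(q_values, temperature=1.0, rng=np.random.default_rng()):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    z = np.asarray(q_values, dtype=float) / temperature
    p = np.exp(z - z.max())          # subtract the max for numerical stability
    p /= p.sum()
    return int(rng.choice(len(q_values), p=p))
```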

When we consider the prediction y as the action value function Q^π with respect to a policy π, KTD(λ) can approximate Q^π using a family of functions of the form

$$ Q(x(n), a = i) = f(x \mid a = i) = \langle f, \phi(x(n)) \rangle. \qquad (20) $$

Here, Q(x(n), a = i) denotes the state-action value given a state x(n) at time n and a discrete action i. Therefore, the update rule for Q-learning via kernel temporal difference (Q-KTD)(λ) can be written as

$$ f \leftarrow f + \eta \sum_{n=1}^{m} \left[ r(n+1) + \gamma \max_{a} Q(x(n+1), a) - Q(x(n), a(n)) \right] \sum_{k=1}^{n} \lambda^{n-k} \phi(x(k)). \qquad (21) $$

We can see that the temporal difference (TD) error at time n includes reward and action value function terms. For single-step prediction problems (m = 1), (10) yields single updates for Q-KTD(λ) of the form

$$ Q_{i}(x(n)) = \eta \sum_{j=1}^{n-1} e_{\mathrm{TD}_{i}}(j)\, I_{k}(j)\, \kappa\left( x(n), x(j) \right). \qquad (22) $$

Here, Q_i(x(n)) = Q(x(n), a = i), e_{TD_i}(n) denotes the TD error, defined as e_{TD_i}(n) = r_i + γ max_a Q(x(n+1), a) − Q_i(x(n)), and I_k(n) is an indicator vector whose size is determined by the number of outputs (actions).


Figure 2: The structure of Q-learning via kernel temporal difference(λ). (Diagram: the state vector x(n) is compared with the stored states x(1), ..., x(n−1) by kernel units whose weighted sums yield the Q_i(x(n)); an action is chosen by exploration or exploitation, the reward is calculated, and the selected Q value is updated.)

Only the kth entry of the vector is set to 1, and the other entries are set to 0. The selection of the action unit k at time n can be based on a greedy method; therefore, only the weight (parameter vector) corresponding to the winning action gets updated. Recall that the reward r_i corresponds to the action selected by the current policy with input x(n), because this action is assumed to cause the next input state x(n+1).

The structure of Q-learning via KTD(0) is shown in Figure 2. The number of units (kernel evaluations) increases as more input data arrive; each added unit is centered at one of the previous input locations x(1), x(2), ..., x(n−1).

In the reinforcement learning brain machine interface (RLBMI) paradigm, kernel temporal difference(λ) helps model the agent (see Figure 1). The action value function Q can be approximated using KTD(λ), for which the kernel-based representations enhance the functional mapping capabilities of the system. Based on the estimated Q values, a policy decides a proper action. Note that this policy corresponds to the learning policy, which changes over time in Q-learning.
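A minimal sketch of such an agent is given below, assuming the single-step case (λ = 0) used later in the experiments: one kernel expansion per discrete action (the indicator vector I_k of (22) selects which expansion receives the update), a Gaussian kernel, and ε-greedy selection over the estimated Q values. The class name `QKTDAgent` and its methods are illustrative, not the authors' implementation.

```python
import numpy as np

class QKTDAgent:
    """Sketch of Q-learning via KTD(0): one kernel expansion per discrete action."""

    def __init__(self, n_actions, kernel_size=7.0, stepsize=0.3, gamma=0.9, epsilon=0.01):
        self.n_actions, self.h = n_actions, kernel_size
        self.eta, self.gamma, self.epsilon = stepsize, gamma, epsilon
        self.centers = []                 # stored input states x(j)
        self.coeffs = []                  # one coefficient vector per center (length n_actions)
        self.rng = np.random.default_rng(0)

    def _kappa(self, xi, xj):
        return np.exp(-np.sum((xi - xj) ** 2) / (2 * self.h ** 2))

    def q_values(self, x):
        q = np.zeros(self.n_actions)
        x = np.asarray(x, dtype=float)
        for c, a in zip(self.centers, self.coeffs):
            q += a * self._kappa(c, x)
        return q

    def select_action(self, x):
        if self.rng.random() < self.epsilon:
            return int(self.rng.integers(self.n_actions))
        return int(np.argmax(self.q_values(x)))

    def update(self, x, action, reward, x_next=None):
        """Single-step Q-KTD update (22): only the winning action's expansion changes."""
        q_next = 0.0 if x_next is None else np.max(self.q_values(x_next))
        td = reward + self.gamma * q_next - self.q_values(x)[action]
        coeff = np.zeros(self.n_actions)
        coeff[action] = self.eta * td     # indicator vector I_k picks the selected action
        self.centers.append(np.asarray(x, dtype=float))
        self.coeffs.append(coeff)
        return td
```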

5. Online Sparsification

One characteristic of nonparametric approaches is their inherently growing structure, which is usually linear in the number of input data points. This rate of growth becomes prohibitive for practical applications that handle increasing amounts of incoming data over time. Various methods have been proposed to alleviate this problem (see [31] and references therein). These methods, known as kernel sparsification methods, can be applied to the KTD algorithm to control the growth of the number of terms in the function expansion, also known as the filter size. Popular examples of kernel sparsification methods are approximate linear dependence (ALD) [19], the Surprise criterion [32], the Quantization approach [21], and the kernel distance based method [23]. The main idea of sparsification is to consider only a reduced set of samples, called the dictionary, to represent the function of interest. The computational complexity of ALD is O(d²), where d is the size of the dictionary; for the other methods mentioned above, the complexity is O(d).

Each of these methods has its own criterion to determine whether an incoming sample should be added to the current dictionary. The Surprise criterion [32] measures the subjective information of an exemplar {x, d} with respect to a learning system Γ:

$$ S_{\Gamma}(x, d) = -\ln p(x, d \mid \Gamma). \qquad (23) $$

Only samples with high Surprise values are considered as candidates for the dictionary. In the case of the Quantization approach introduced in [21], the distance between a new input x(n) and the existing dictionary elements C(n−1) is evaluated; the new input sample is added to the dictionary if the distance between x(n) and the closest element in C(n−1) is larger than the Quantization size ε_U,

$$ \min_{x_{i} \in C(n-1)} \| x(n) - x_{i} \| > \epsilon_{U}. \qquad (24) $$

Otherwise, the new input state x(n) is absorbed by the closest existing unit. Very similar to the Quantization approach, the method presented in [23] applies a distance threshold criterion in the RKHS. The kernel distance based criterion, given a state dictionary D(n−1), adds a new unit when the new input state x(n) satisfies the following condition:

$$ \min_{x_{i} \in D(n-1)} \| \phi(x(n)) - \phi(x_{i}) \|^{2} > \mu_{1}. \qquad (25) $$

For some kernels, such as the Gaussian, the Quantization method and the kernel distance based criterion can be shown to be equivalent.
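A sketch of the Quantization criterion (24) applied to a growing dictionary is shown below; in a quantized kernel filter the coefficient of the nearest existing unit would be updated instead of adding a new center when the distance falls below ε_U. The function and variable names are illustrative.

```python
import numpy as np

def quantize(dictionary, x, eps_u):
    """Return (index, is_new): the unit that represents x under criterion (24).

    If the closest dictionary element is farther than eps_u, x is appended as a
    new unit; otherwise x is absorbed by (mapped onto) the closest existing unit.
    """
    x = np.asarray(x, dtype=float)
    if dictionary:
        dists = [np.linalg.norm(x - np.asarray(c)) for c in dictionary]
        nearest = int(np.argmin(dists))
        if dists[nearest] <= eps_u:
            return nearest, False         # absorbed: update the nearest unit's coefficient
    dictionary.append(x)                  # far from all units: grow the dictionary
    return len(dictionary) - 1, True
```

For a Gaussian kernel, thresholding the input-space distance ‖x(n) − x_i‖ is equivalent to thresholding the RKHS distance in (25), since ‖φ(x) − φ(x_i)‖² = 2 − 2κ(x, x_i) is a monotone function of the input-space distance.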

6. Simulations

Note that the KTD algorithm has been introduced for value function estimation. To evaluate the algorithm's nonlinear capability, we first examine the performance of KTD(λ) on the problem of state value function estimation given a fixed policy π. We carry out experiments on a simple illustrative Markov chain initially described in [33]. This is a popular episodic task for testing TD learning algorithms; it is useful for illustrating linear as well as nonlinear functions of the state representations, and it shows how the state value function is estimated by the adaptive system.

6.1. Linear Case. Even though we emphasize the capability of KTD(λ) as a nonlinear function approximator, under an appropriate kernel size KTD(λ) should approximate linear functions on a region of interest as well. To test its efficacy, we observe the performance on a simple Markov chain (Figure 3). There are 13 states, numbered from 12 to 0. Each trial starts at state 12 and terminates at state 0. Each state is represented by a 4-dimensional vector, and the rewards are assigned in such a way that the value function V is a linear function of the states; namely, V* takes the values [0, −2, −4, ..., −22, −24] at states [0, 1, 2, ..., 11, 12]. In the case of V = w^⊤x, the optimal weights are w* = [−24, −16, −8, 0].

To assess the performance, the updated estimate of the state value function V̂(x) is compared to the optimal value function V* at the end of each trial. This is done by computing the RMS error of the value function over all states:

$$ \mathrm{RMS} = \sqrt{ \frac{1}{n} \sum_{x \in \mathcal{X}} \left( V^{*}(x) - \hat{V}(x) \right)^{2} }, \qquad (26) $$

where n is the number of states, n = 13.

Stepsize scheduling is applied as follows:

$$ \eta(n) = \eta_{0} \, \frac{a_{0} + 1}{a_{0} + n}, \quad n = 1, 2, \ldots, \qquad (27) $$

where η_0 is the initial stepsize and a_0 is the annealing factor, which controls how fast the stepsize decreases. In this experiment, a_0 = 100 is used. Furthermore, we assume that the policy π is guaranteed to terminate, which means that the value function V^π is well behaved without using a discount factor γ in (3); that is, γ = 1.
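For reference, the evaluation metric (26) and the schedule (27) amount to the two small helpers below; the argument names are illustrative, and the value estimates are assumed to be supplied as arrays over all states.

```python
import numpy as np

def rms_error(v_optimal, v_estimated):
    """RMS error of the value estimate over all states, as in (26)."""
    v_optimal, v_estimated = np.asarray(v_optimal), np.asarray(v_estimated)
    return np.sqrt(np.mean((v_optimal - v_estimated) ** 2))

def annealed_stepsize(n, eta0=0.5, a0=100):
    """Stepsize schedule of (27): eta(n) = eta0 * (a0 + 1) / (a0 + n), for n = 1, 2, ..."""
    return eta0 * (a0 + 1) / (a0 + n)
```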

In KTD(λ), we employ the Gaussian kernel

$$ \kappa(x(i), x(j)) = \exp\left( -\frac{\| x(i) - x(j) \|^{2}}{2 h^{2}} \right), \qquad (28) $$

which is a universal kernel commonly encountered in practice.

Figure 3: A 13-state Markov chain [33]. For states from 2 to 12, the state transition probabilities are 0.5 and the corresponding rewards are −3. State 1 has a transition probability of 1 to the terminal state 0 and a reward of −2. States 12, 8, 4, and 0 have the 4-dimensional state space representations [1 0 0 0], [0 1 0 0], [0 0 1 0], and [0 0 0 1], respectively; the representations of the other states are linear interpolations between these vectors.
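The chain of Figure 3 is easy to reproduce for experimentation. The sketch below builds the interpolated 4-dimensional state representations from the caption and generates trajectories, assuming the usual dynamics of this benchmark chain (a step of one or two states toward the terminal state with equal probability), which is consistent with the 0.5 transition probabilities, the rewards, and the linear value function quoted above; the function names are ours.

```python
import numpy as np

def state_vector(s):
    """4-D representation: anchors at states 12, 8, 4, 0, with linear interpolation in between."""
    anchors = {12: np.array([1., 0., 0., 0.]), 8: np.array([0., 1., 0., 0.]),
               4: np.array([0., 0., 1., 0.]), 0: np.array([0., 0., 0., 1.])}
    if s in anchors:
        return anchors[s]
    hi = min(a for a in anchors if a > s)
    lo = max(a for a in anchors if a < s)
    w = (s - lo) / (hi - lo)
    return w * anchors[hi] + (1 - w) * anchors[lo]

def run_trial(rng):
    """Generate one episode as (state_vector, reward, next_state_vector, terminal) tuples."""
    s, transitions = 12, []
    while s > 0:
        if s == 1:
            s_next, reward = 0, -2.0
        else:
            s_next, reward = s - rng.integers(1, 3), -3.0   # move 1 or 2 states, prob. 0.5 each
        transitions.append((state_vector(s), reward, state_vector(s_next), s_next == 0))
        s = s_next
    return transitions
```

Transitions generated this way can be fed directly to the `KTDLambda.update` sketch shown earlier to reproduce the policy-evaluation setting of this section.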

To find the optimal kernel size, we fix all the other free parameters around median values, λ = 0.4 and η_0 = 0.5, and compare the average RMS error over 10 Monte Carlo runs. For this specific experiment, smaller kernel sizes yield better performance, since the state representations are finite. In general, however, too small a kernel size leads to overfitting and slow learning; in particular, choosing a very small kernel makes the algorithm behave very similarly to the table look-up method. Thus, we choose the kernel size h = 0.2, the largest kernel size for which we obtain mean RMS values similar to those obtained with smaller kernel sizes.

After fixing the kernel size to h = 0.2, different combinations of eligibility trace rates λ and initial step sizes η_0 are evaluated experimentally. Figure 4 shows the average performance over 10 Monte Carlo runs for 1000 trials.

All λ values with the optimal stepsize show a good approximation to V* after 1000 trials. Notice that KTD(λ = 0) shows slightly better performance than KTD(λ = 1); this may be attributed to the local nature of KTD when using the Gaussian kernel. In addition, varying the stepsize has a relatively small effect on KTD(λ): the Gaussian kernel, as well as other shift-invariant kernels, provides an implicitly normalized update rule, which is known to be less sensitive to the stepsize. Based on Figure 4, the optimal eligibility trace rate and initial stepsize values λ = 0.6 and η_0 = 0.3 are selected for KTD with kernel size h = 0.2.

The learning curve of KTD(λ) is compared to that of the conventional TD algorithm, TD(λ). The optimal parameters employed in both algorithms are based on the experimental evaluation; in TD(λ), λ = 1 and η_0 = 0.1 are applied. The RMS error is averaged over 50 Monte Carlo runs for 1000 trials. Comparative learning curves are given in Figure 5.

In this experiment, we confirm the ability of KTD(λ) to handle the function approximation problem when the fixed policy yields a state value function that is linear in the state representation. Both algorithms reach a mean RMS value of around 0.06. As expected, TD(λ) converges faster to the optimal solution because of the linear nature of the problem; KTD(λ) converges more slowly than TD(λ), but it is also able to approximate the value function properly.


Figure 4: Performance comparison over different combinations of eligibility trace rates λ and initial step sizes η_0 in KTD(λ) with h = 0.2. The vertical line segments contain the mean RMS value after 100 trials (top marker), 500 trials (middle marker), and 1000 trials (bottom marker). (Axes: λ versus RMS error of the value function over all states; curves for η_0 = 0.1 through 0.9.)

Figure 5: Learning curves of KTD(λ) and TD(λ). The solid lines show the mean RMS error and the dashed lines show the ± standard deviations over 50 Monte Carlo runs. (Axes: trial number versus RMS error of the value function over all states.)

In this sense, the KTD algorithm is open to a wider class of problems than its linear counterpart.

6.2. Nonlinear Case. The previous section showed the performance of KTD(λ) on the problem of estimating a state value function that is a linear function of the given state representation.

Figure 6: A 13-state Markov chain. In states from 2 to 12, each state transition has probability 0.5, and state 1 has transition probability 1 to the absorbing state 0. Note that the optimal state value function can be represented as a nonlinear function of the states, and corresponding reward values are assigned to each state.

The same problem can be turned into a nonlinear one by modifying the reward values in the chain so that the resulting state value function V* is no longer a linear function of the states.

The number of states and the state representations remain the same as in the previous section. However, the optimal value function V* becomes nonlinear with respect to the representation of the states; namely, V* = [0, −0.2, −0.6, −1.4, −3, −6.2, −12.6, −13.4, −13.5, −14.45, −15.975, −19.2125, −25.5938] for states 0 to 12. This implies that the reward values for each state are different from the ones given for the linear case (Figure 6).

Again, to evaluate the performance, after each trial is completed the estimated state value is compared to the optimal state value V* using the RMS error (26). For KTD(λ), the Gaussian kernel (28) is applied and a kernel size h = 0.2 is chosen. Figure 7 shows the average RMS error over 10 Monte Carlo runs for 1000 trials.

The combination of λ = 0.4 and η_0 = 0.3 shows the best performance, but the λ = 0 case also performs well. Unlike TD(λ) [6], there is no dominant value of λ in KTD(λ). Recall that convergence has been proved for linearly independent representations of the states, which is automatically fulfilled in KTD(λ) when the kernel is strictly positive definite. Therefore, the differences are rather due to the convergence speed, which is controlled by the interaction between the step size and the eligibility trace.

The average RMS error over 50 Monte Carlo runs is compared with Gaussian process temporal difference (GPTD) [15] and TD(λ) in Figure 8. The purpose of the GPTD implementation is to provide a comparison among kernelized value function approximations. Here, the applied optimal parameters are λ = 0.4, η_0 = 0.3, and h = 0.2 for KTD(λ); λ = 1, σ² = 0.5, and h = 0.2 for GPTD; and λ = 0.8 and η_0 = 0.1 for TD(λ).

The linear function approximation TD(λ) (blue line) cannot estimate the optimal state values. KTD(λ) outperforms the linear algorithm, as expected, since the Gaussian kernel is strictly positive definite. GPTD also learns the target state values, but it fails to reach error values as low as those of KTD. GPTD is sensitive to the selection of the noise covariance σ²: small values make the system unstable, and larger values slow the learning down. GPTD models the residuals, that is, the difference between the expected return and the actual return, as a Gaussian process.


Figure 7: Performance comparison over different combinations of λ and the initial stepsize η_0 in KTD(λ) with h = 0.2. The plotted segments are the mean RMS values after 100 trials (top segment), 500 trials (middle segment), and 1000 trials (bottom segment).

Figure 8: Learning curves of KTD(λ), GPTD, and TD(λ). The solid lines show the mean RMS error and the dashed lines represent the (±) standard deviation over 50 Monte Carlo runs.

This assumption does not hold true for the Markov chain in Figure 6. As we can observe in Figure 8, KTD(λ) reaches a mean RMS value of around 0.07, while the mean values of GPTD and TD(λ) are around 0.2 and 1.8, respectively.

In the synthetic examples, we presented experimental results for approximating the state value function under a fixed policy. We observed that KTD(λ) performs well on both linear and nonlinear function approximation problems. In addition, we showed how the linear independence of the input state representations can affect the performance of the algorithms. The use of strictly positive definite kernels in KTD(λ) implies the linear independence condition, and thus the algorithm converges for all λ ∈ [0, 1]. In the following section, we apply the extended KTD algorithm to estimate the action value function, which can be employed in finding a proper control policy for RLBMI tasks.

7. Experimental Results on Neural Decoding

In our RLBMI experiments, we map the monkey's neural signal to action directions (computer cursor or robot arm position). The agent starts in a naive state, but the subject has been trained to receive rewards from the environment. Once it reaches the assigned target, the system and the subject earn a reward, and the agent updates its neural state decoder. Through iteration, the agent learns how to correctly translate neural states into action directions.

7.1. Open-Loop RLBMI. In open-loop RLBMI experiments, the output of the agent does not directly change the state of the environment, because this is done with prerecorded data; the external device is updated based only on the monkey's actual physical response. In this sense, we only consider the monkey's neural states from successful trials to train the agent. The goal of these experiments is to evaluate the system's capability to predict the proper state to action mapping based on the monkey's neural states and to assess the viability of further closed-loop experiments.

7.1.1. Environment. The data employed in these experiments is provided by SUNY Downstate Medical Center. A female bonnet macaque is trained for a center-out reaching task allowing 8 action directions. After the subject attains about an 80% success rate, microelectrode arrays are implanted in the motor cortex (M1). Animal surgery is performed under the Institutional Animal Care and Use Committee (IACUC) regulations and assisted by the Division of Laboratory Animal Resources (DLAR) at SUNY Downstate Medical Center.

From the 96-channel recordings, a set of 185 units is obtained after sorting. The neural states are represented by the firing rates of each unit in 100 ms windows. There is a set of 8 possible targets and action directions. Every trial starts at the center point, and the distance from the center to each target is 4 cm; anything within a radius of 1 cm from the target point is considered a valid reach.

7.1.2. Agent. In the agent, Q-learning via kernel temporal difference (Q-KTD)(λ) is applied to neural decoding. For Q-KTD(λ), we employ the Gaussian kernel (28). After the neural states are preprocessed by normalizing their dynamic range to lie between −1 and 1, they are input to the system. Based on the preprocessed neural states, the system predicts in which direction the computer cursor will move.


Table 1: Average success rates of Q-KTD in open-loop RLBMI.

Epochs     1     2     3     4     5     6     7
2 target   0.44  0.96  0.99  0.99  0.97  0.99  0.99
4 target   0.41  0.73  0.76  0.95  0.99  0.99  0.99
8 target   0.32  0.65  0.79  0.89  0.96  0.98  0.98

Each output unit represents one of the 8 possible directions, and among the 8 outputs one action is selected by the ε-greedy method [34]: the action corresponding to the unit with the highest Q value is selected with probability 1 − ε; otherwise, any other action is selected at random. The performance is evaluated by checking whether the updated position reaches the assigned target, and, depending on the updated position, a reward value is assigned to the system.

7.1.3. Results on Single Step Tasks. Here, the targets should be reached within a single step: rewards from the environment are received after a single step, and one action is performed by the agent per trial. The assignment of reward is based on the 1-0 distance to the target, that is, dist(x, d) = 0 if x = d and dist(x, d) = 1 otherwise. Once the cursor reaches the assigned target, the agent gets a positive reward of +0.6; otherwise, it receives a negative reward of −0.6 [35]. An exploration rate ε = 0.01 and a discount factor γ = 0.9 are applied. Also, we consider λ = 0, since our experiment performs a single update per trial. In this experiment, the firing rates of the 185 units on 100 ms windows are time-embedded using a 6th order tap delay. This creates a representation space where each state is a vector with 1295 dimensions.

We start with the simplest version of the problem by considering only 2 targets (right and left). The total number of trials is 43 for the 2 targets. For Q-KTD, the kernel size h is heuristically chosen based on the distribution of the mean squared distances between pairs of input states: letting s = E[‖x_i − x_j‖²], we take h = √(s/2). For this particular data set, the above heuristic gives a kernel size h = 7. The stepsize η = 0.3 is selected based on the stability bound that was derived for the kernel least mean square algorithm [25]:

$$ \eta < \frac{N}{\operatorname{tr}\left[ G_{\phi} \right]} = \frac{N}{\sum_{j=1}^{N} \kappa(x(j), x(j))} = 1, \qquad (29) $$

where G_φ is the Gram matrix. After 43 trials, we count the number of trials that received a positive reward, and the success rate is averaged over 50 Monte Carlo runs. The performance of the Q-KTD algorithm is compared with Q-learning via time-delay neural network (Q-TDNN) and the online selective kernel-based temporal difference learning algorithm (Q-OSKTD) [23] in Figure 9. Note that TDNN is a conventional approach to function approximation and has already been applied to RLBMI experiments for neural decoding [1, 2]; OSKTD is a kernel-based temporal difference algorithm emphasizing online sparsification.
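The kernel size heuristic and the stability bound (29) quoted above can be written compactly as follows; `states` is assumed to be an array of preprocessed neural state vectors (one row per sample), and `kappa` is whatever kernel function is in use.

```python
import numpy as np

def heuristic_kernel_size(states):
    """h = sqrt(s / 2), with s the mean squared distance between distinct pairs of input states."""
    states = np.asarray(states, dtype=float)
    n = len(states)
    diffs = states[:, None, :] - states[None, :, :]
    s = np.sum(diffs ** 2) / (n * (n - 1))     # mean over distinct pairs (diagonal terms are zero)
    return np.sqrt(s / 2.0)

def klms_stepsize_bound(states, kappa):
    """Upper bound of (29): N / trace(G_phi) = N / sum_j kappa(x_j, x_j)."""
    return len(states) / sum(kappa(x, x) for x in states)
```

For the Gaussian kernel, κ(x, x) = 1, so the bound evaluates to exactly 1, and any stepsize below 1 (such as η = 0.3 above) satisfies it.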

Both Q-KTD and Q-OSKTD reach around a 100% success rate after 2 epochs. In contrast, the average success rate of Q-TDNN increases slowly, yet it never reaches the same performance as Q-KTD.

Figure 9: Comparison of the average learning curves over 50 Monte Carlo runs among Q-TDNN, Q-OSKTD, and Q-KTD. Solid lines show the mean success rates, and the dashed lines show the confidence interval based on one standard deviation.

In the case of Q-OSKTD, the value function updates require one more parameter, μ₂, to decide the subspace. To validate the algorithm's capability to estimate a proper policy, we set the sparsified dictionary to the same size as the number of observed samples. In Q-OSKTD, we observed that the subspace selection parameter plays an important role in terms of the speed of learning; it turns out that, for the above experiment, smaller subspaces allow faster learning. In the extreme case of Q-OSKTD where only the current state is affected, the updates become equivalent to the update rule of Q-KTD.

Since all the experimental parameters are fixed over the 50 Monte Carlo runs, the confidence interval for Q-KTD can simply be associated with the random effects introduced by the ε-greedy method employed for action selection with exploration; hence the narrow interval. With Q-TDNN, however, a larger variation in performance is observed, which shows how the initialization influences the success of learning through local minima: Q-TDNN is able to approximate the Q-KTD performance, but most of the time the system falls into local minima. This highlights one of the advantages of KTD compared to TDNN, namely, its insensitivity to initialization.

Table 1 shows average success rates over 50 Monte Carlo runs with respect to different numbers of targets.


Figure 10: Average success rates over 50 Monte Carlo runs with respect to different filter sizes. The vertical line segments are the mean success rates after 1 epoch (bottom markers), 2 epochs (middle markers), and 20 epochs (top markers).

The first row corresponds to the mean success rates displayed in Figure 9 (red solid line); it is included in Table 1 to ease comparison with the 4- and 8-target experiments. The 4-target task involves reaching the right, up, left, and down positions from the center. Note that in all tasks 8 directions are allowed at each step. The standard deviation at each epoch is around 0.02.

One characteristic of nonparametric approaches is the growing filter structure. Here, we observe how the filter size influences the overall performance of Q-KTD by applying the Surprise criterion [32] and Quantization [21] methods. In the case of the 2-target center-out reaching task, the filter size would grow to 861 units after 20 epochs without any control of the filter size. Using the Surprise criterion, the filter size can be reduced to 87 centers with acceptable performance; Quantization allows the filter size to be reduced to 10 units while maintaining success rates above 90%. Figure 10 shows the effect of the filter size in the 2-target experiment using the Quantization approach. For filter sizes as small as 10 units, the average success rates remain stable. With 10 units, the algorithm shows a learning speed similar to that of the linearly growing filter, with success rates above 90%. Note that quantization limits the capacity of the kernel filter, since fewer units than samples are employed, and thus it helps to avoid overfitting.

In the 2-target center-out reaching task, quantized Q-KTD shows satisfactory results in terms of initialization and computational cost. Further analysis of Q-KTD is conducted on a larger number of targets: we increase the number of targets from 2 to 8. All experimental parameters are kept the same as for the 2-target experiment; the only change is the stepsize, η = 0.5. A total of 178 trials are used for the 8-target reaching task.

To gain more insight into the algorithm, we observe the interplay between the Quantization size ε_U and the kernel size h.

Figure 11: The effect of filter size control on the 8-target single-step center-out reaching task. The average success rates are computed over 50 Monte Carlo runs after the 10th epoch. (Curves correspond to final filter sizes of 178, 133, 87, and 32; the horizontal axis is the kernel size.)

Based on the distribution of squared distances between pairs of input states, various kernel sizes (h = 0.5, 1, 1.5, 2, 3, 5, 7) and Quantization sizes (ε_U = 1, 110, 120, 130) are considered. The corresponding success rates for final filter sizes of 178, 133, 87, and 32 are displayed in Figure 11.

With a final filter size of 178 (blue line), the success rates are superior to those of any other filter size for every kernel size tested, since it contains all the input information. Especially for small kernel sizes (h ≤ 2), success rates above 96% are observed. Moreover, note that even after reduction of the state information (red line), the system still produces acceptable success rates for kernel sizes ranging from 0.5 to 2 (around 90% success rates).

Among the best performing kernel sizes, we favor the largest one, since it provides better generalization guarantees. In this sense, a kernel size h = 2 can be selected: this is the largest kernel size that considerably reduces the filter size and still yields a neural state to action mapping that performs well (around 90% success rate). In the case of kernel size h = 2 with a final filter size of 178, the system reaches a 100% success rate after 6 epochs with a maximum variance of 4%. As we can see from the number of units, a higher representation capacity is required to obtain the desired performance as the task becomes more complex. Nevertheless, the results on the 8-target center-out reaching task show that the method can effectively learn the brain state-action mapping for this task with reasonable complexity.

7.1.4. Results on Multistep Tasks. Here, we develop a more realistic scenario by extending the task to multistep and multitarget experiments. This case allows us to explore the role of the eligibility traces in Q-KTD(λ). The price paid for this extension is that the eligibility trace rate λ now needs to be selected according to the best observed performance.


Figure 12: Reward distribution for the right target. The black diamond is the initial position, and the purple diamonds show the possible directions, including the assigned target direction (red diamond).

Testing based on the same experimental setup employed for the single-step task, that is, a discrete reward value assigned only at the target, causes extremely slow learning, since not enough guidance is given; the system requires long periods of exploration before it actually reaches the target. Therefore, we employ a continuous reward distribution around the selected target, defined by the following expression:

$$ r(s) = \begin{cases} p_{\text{reward}}\, G(s), & \text{if } G(s) > 0.1, \\ n_{\text{reward}}, & \text{if } G(s) \leq 0.1, \end{cases} \qquad (30) $$

where G(s) = exp[−(s − μ)^⊤ C_θ^{−1} (s − μ)], s ∈ ℝ² is the position of the cursor, p_reward = 1, and n_reward = −0.6. The mean vector μ corresponds to the selected target location, and the covariance matrix is

$$ C_{\theta} = R_{\theta} \begin{pmatrix} 7.5 & 0 \\ 0 & 0.1 \end{pmatrix} R_{\theta}^{\top}, \qquad R_{\theta} = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix}, \qquad (31) $$

which depends on the angle θ of the selected target as follows: for target indexes one and five the angle is 0, for two and six it is −π/4, for three and seven it is π/2, and for four and eight it is π/4. (Here, the target indexes follow the locations depicted in Figure 6 of [22].) Figure 12 shows the reward distribution for target index one; the same form of distribution is applied to the other directions, centered at the assigned target point.
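A sketch of the reward computation in (30)-(31) is given below; the target location `mu` and cursor position `s` are assumed to be expressed in the same 2-D coordinates, and the sign convention inside the exponential follows the reconstruction above.

```python
import numpy as np

def reward(s, mu, theta, p_reward=1.0, n_reward=-0.6):
    """Continuous reward of (30): Gaussian-shaped around the target, negative far from it."""
    R = np.array([[np.cos(theta), np.sin(theta)],
                  [-np.sin(theta), np.cos(theta)]])
    C = R @ np.diag([7.5, 0.1]) @ R.T                 # covariance of (31), elongated toward the target
    d = np.asarray(s, dtype=float) - np.asarray(mu, dtype=float)
    G = np.exp(-d @ np.linalg.inv(C) @ d)             # G(s), equal to 1 exactly at the target
    return p_reward * G if G > 0.1 else n_reward
```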

Once the system reaches the assigned target, it earns a maximum reward of +1; it receives partial rewards according to (30) during the approach. When the system earns the maximum reward, the trial is classified as a successful trial. The maximum number of steps per trial is limited such that the cursor must approach the target in a straight-line trajectory. Here, we also control the complexity of the task by allowing different numbers of targets and steps; namely, 2-step 4-target (right, up, left, and down) and 4-step 3-target (right, up, and down) experiments are performed. Increasing the number of steps per trial amounts to making smaller jumps for each action. After each epoch, the number of successful trials is counted for each target direction. Figure 13 shows the learning curves for each target and the average success rates.

A larger number of steps results in lower success rates. However, both cases (two and four steps) obtain an average success rate above 60% after 1 epoch. The performances show that all directions can achieve success rates above 70% after convergence, which encourages the application of the algorithm to online scenarios.

7.2. Closed-Loop RLBMI Experiments. In closed-loop RLBMI experiments, the behavioral task is a reaching task using a robotic arm. The decoder controls the robot arm's action direction by predicting the monkey's intent based on its neuronal activity. If the robot arm reaches the assigned target, a reward is given to both the monkey (a food reward) and the decoder (a positive value). Notice that the two intelligent systems learn coadaptively to accomplish the goal. These experiments are conducted in cooperation with the Neuroprosthetics Research Group at the University of Miami. The performance is evaluated in terms of task completion accuracy and speed. Furthermore, we provide a methodology to tease apart the influence of each of the systems of the RLBMI on the overall performance.

7.2.1. Environment. During pretraining, a marmoset monkey was trained to perform a target reaching task, namely, moving a robot arm to two spatial locations, denoted as A trial and B trial. The monkey was taught to associate changes in motor activity with A trials and to produce static motor responses during B trials. Once a target is assigned, a beep signals the start of the trial. To control the robot during the user training phase, the monkey is required to steadily place its hand on a touch pad for 700~1200 ms. This action produces a go beep that is followed by the activation of one of the two target LEDs (A trial: red light for the left direction; B trial: green light for the right direction). The robot arm goes to a home position, namely, the center position between the two targets. Its gripper shows an object (a food reward such as a waxworm or marshmallow for an A trial, and an undesirable object, a wooden bead, for a B trial). To move the robot to the A location, the monkey needed to reach out and touch a sensor within 2000 ms; to make the robot reach the B target, the monkey needed to keep its arm motionless on the touch pad for 2500 ms. When the monkey successfully moved the robot to the correct target, the target LEDs would blink and the monkey would receive a food reward (for both the A and B targets).

After the monkey is trained to perform the assigned task properly, a microelectrode array (16-channel tungsten microelectrode arrays, Tucker Davis Technologies, FL) is surgically implanted under isoflurane anesthesia and sterile conditions. Neural states from the motor cortex (M1) are recorded; these neural states become the inputs to the neural decoder. All surgical and animal care procedures were consistent with the National Research Council Guide for the Care and Use of Laboratory Animals and were approved by the University of Miami Institutional Animal Care and Use Committee.


Figure 13: The learning curves for multistep multitarget tasks: (a) 2-step 4-target; (b) 4-step 3-target. (Axes: epochs versus success rates; curves show the average and the individual target directions.)

In the closed-loop experiments, after the initial holding time that produces the go beep, the robotic arm's position is updated based solely on the monkey's neural states; differently from the user pretraining sessions, the monkey is not required to perform any movement. During the real-time experiment, 14 neurons are obtained from 10 electrodes. The neural states are represented by the firing rates in a 2 sec window following the go signal.

7.2.2. Agent. For the BMI decoder, we use Q-learning via kernel temporal differences (Q-KTD)(λ). One big difference between open-loop and closed-loop applications is the amount of accessible data: in the closed-loop experiments, we can only get information about the neural states that have been observed up to the present. In the previous offline experiments, however, normalization and kernel selection were conducted offline based on the entire data set. It is not possible to apply the same method in the online setting, since we only have information about the input states up to the present time. Normalization is a scaling procedure that interplays with the choice of the kernel size; proper selection of the kernel size brings proper scaling to the data. Thus, in contrast to the previous open-loop experiments, normalization of the input neural states is not applied, and the kernel size is selected automatically given the inputs.

The Gaussian kernel (28) is employed, and the kernel size h is automatically selected based on the history of inputs. Note that in the closed-loop experiments the dynamic range of the states varies from experiment to experiment. Consequently, the kernel size needs to be readjusted each time a new experiment takes place and cannot be determined beforehand. At each time, the distances between the current state and the previously observed states are computed to obtain the output values (Q in this case); therefore, we use these distance values to select the kernel size as follows:

$$ h_{\text{temp}}(n) = \sqrt{ \frac{1}{2(n-1)} \sum_{i=1}^{n-1} \| x(i) - x(n) \|^{2} }, \qquad h(n) = \frac{1}{n} \left[ \sum_{i=1}^{n-1} h(i) + h_{\text{temp}}(n) \right]. \qquad (32) $$

Using the squared distances between the current input and the previously seen input states, we obtain an estimate of the mean distance; this value is then averaged along with the past kernel sizes to obtain the current kernel size.
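In code, the running kernel size of (32) can be maintained as follows; `history` (the previously observed neural states) and `past_h` (the kernel sizes used so far) are illustrative names, and at least one previous state is assumed to have been observed.

```python
import numpy as np

def online_kernel_size(history, past_h, x_new):
    """Running kernel size of (32): average of past sizes and the current distance-based estimate."""
    x_new = np.asarray(x_new, dtype=float)
    sq_dists = [np.sum((np.asarray(x, dtype=float) - x_new) ** 2) for x in history]
    h_temp = np.sqrt(np.sum(sq_dists) / (2 * len(history)))   # h_temp(n)
    h_n = (sum(past_h) + h_temp) / (len(past_h) + 1)          # h(n)
    past_h.append(h_n)
    return h_n
```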

Moreover, we consider γ = 1 and λ = 0, since our experiments perform single-step trials. A stepsize η = 0.5 is applied. The output represents the 2 possible directions (left and right), and the robot arm moves based on the estimated output from the decoder.

7.2.3. Results. The overall performance is evaluated by checking whether the robot arm reaches the assigned target. Once the robot arm reaches the target, the decoder gets a positive reward of +1; otherwise, it receives a negative reward of −1.

Table 2 shows the decoder performance over 4 days in terms of success rates. Each day corresponds to a separate experiment. On Day 1, the experiment had a total of 20 trials (10 A trials and 10 B trials); the overall success rate was 90%, and only the first trial for each target was incorrectly assigned.


Figure 14: Performance of Q-learning via KTD in the closed-loop RLBMI controlled by a monkey for Day 1 (left) and Day 3 (right): the success (+1) and failure (−1) index of each trial (top), the change of the TD error (middle), and the change of the Q-values (bottom).

Table 2: Success rates of Q-KTD in closed-loop RLBMI.

        Total trial numbers (A, B trials)    Success rates (%)
Day 1   20 (10, 10)                          90.00
Day 2   32 (26, 26)                          84.38
Day 3   53 (37, 36)                          77.36
Day 4   52 (37, 35)                          78.85

Note that the same experimental setup was utilized on each day, and the decoder was initialized in the same way each day; we did not use pretrained parameters to initialize the system. To understand the variation of the success rates across days, we look at the performance of Day 1 and Day 3. Figure 14 shows the decoder performance for these 2 experiments.

Although the success rate for Day 3 is not as high as that of Day 1, both experiments show that the algorithm learns an appropriate neural state to action map. Even though there is variation among the neural states within each day, the decoder adapts well to minimize the TD error, and the Q-values converge to the desired values for each action. Because this is a single-step task and the reward +1 is assigned for a successful trial, the estimated action value Q is expected to approach +1.

It is observed that the TD error and the Q-values oscillate; the drastic changes in the TD error or Q-value correspond to the missed trials. The overall performance can be evaluated by checking whether the robot arm reaches the desired target (the top plots in Figure 14).


[Figure 15 panels — Day 1 (left column): (a) after 3 trials, (c) after 10 trials, (e) after 20 trials; Day 3 (right column): (b) after 3 trials, (d) after 30 trials, (f) after 57 trials. Axes: first and second principal components; legend: estimated policy, A trials, B trials.]

Figure 15: Estimated policy for the projected neural states from Day 1 (left) and Day 3 (right). The failed trials during the closed-loop experiment are marked as red stars (missed A trials) and green dots (missed B trials).


(the top plots in Figure 14). However, this assessment does not show what causes the change in the system values. In addition, it is hard to know how the two separate intelligent systems interact during learning and how the neural states affect the overall performance.

Under the coadaptation scenario in the RLBMI architecture, it is obvious that if one system does not perform properly, it will cause detrimental effects on the performance of the other system. If the BMI decoder does not give proper updates to the robotic device, it will confuse the user conducting the task; and if the user gives improper state information, or the translation is wrong, the resulting update may fail even though the BMI decoder was able to find the optimal mapping function.

Using the methodology introduced in [36], we can observe how the decoder effectively learns a good state to action mapping and how the neural states affect the prediction performance. Figure 15 shows how each participant (the agent and the user) influences the overall performance in both successful and missed trials, and how the agent adapts to the environment. By applying principal component analysis (PCA), the high-dimensional neural states can be visualized in two dimensions using the two largest principal components. In this two-dimensional space of projected neural states, we can visualize the estimated policy as well.
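As a rough illustration of this kind of visualization, the sketch below projects the stored neural states with PCA and evaluates the learned Q-values over a grid in the projected plane (mapping grid points back to the original space through the inverse PCA transform). This is only an approximation of the procedure in [36]; the decoder object and kernel size are assumed to come from a training loop such as the sketch above, the 0/1 label coding for A/B trials is an assumption, and scikit-learn and matplotlib are used for convenience.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_projected_policy(decoder, states, labels, h, grid_res=100):
    """Project neural states to 2D with PCA and overlay the decoder's estimated policy."""
    X = np.asarray(states, dtype=float)
    labels = np.asarray(labels)               # 0 for A trials, 1 for B trials (assumed coding)
    pca = PCA(n_components=2).fit(X)
    Z = pca.transform(X)

    # Evaluate the learned policy on a grid in the projected plane by mapping
    # each grid point back to the original state space (an approximation).
    g1 = np.linspace(Z[:, 0].min(), Z[:, 0].max(), grid_res)
    g2 = np.linspace(Z[:, 1].min(), Z[:, 1].max(), grid_res)
    G1, G2 = np.meshgrid(g1, g2)
    grid = pca.inverse_transform(np.c_[G1.ravel(), G2.ravel()])
    policy = np.array([np.argmax(decoder.q_values(x, h)) for x in grid])

    plt.contourf(G1, G2, policy.reshape(G1.shape), alpha=0.3)   # estimated policy regions
    plt.scatter(Z[labels == 0, 0], Z[labels == 0, 1], marker='*', label='A trial')
    plt.scatter(Z[labels == 1, 0], Z[labels == 1, 1], marker='o', label='B trial')
    plt.xlabel('First component')
    plt.ylabel('Second component')
    plt.legend()
    plt.show()
```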

We observe the behavior of the two systems at the beginning, intermediate, and final stages of the experiment by using the neural states observed up to the given stage as well as the decoder learned up to that point. It is evident that the decoder can predict nonlinear policies. Day 1 (left column in Figure 15) shows that the neural states from the two classes are well separable. It was noted during Day 3 that the monkey seemed less engaged in the task than on Day 1. This suggests the possibility that during some trials the monkey was distracted and may not have been producing a consistent set of neural outputs. We are also able to see this phenomenon in the plots (right column in Figure 15): most of the misclassified neural states appear to be closer to the states corresponding to the opposite target in the projected state space. However, the estimated policy shows that the system effectively learns. Note that the initially misclassified A trials (red stars in Figure 15(d), which are located near the estimated policy boundary) are assigned to the correct direction once learning has been accomplished (Figure 15(f)). It is a remarkable fact that the system adapts to the environment online.

8. Conclusions

The advantages of KTD(λ) in neural decoding problems were observed. The key features of this kernel-based learning algorithm are its capability for nonlinear function approximation and its convergence guarantees. We also examined the capability of the extended KTD algorithm, Q-KTD(λ), in both open-loop and closed-loop reinforcement learning brain machine interface (RLBMI) experiments to perform reaching tasks.

In open-loop experiments, the results showed that Q-KTD(λ) can effectively learn the brain state-action mapping and offers performance advantages over conventional nonlinear function approximation methods such as time-delay neural networks. We observed that Q-KTD(λ) overcomes the main issues of conventional nonlinear function approximation methods, such as local minima and the need for proper initialization.

Results on closed-loop RLBMI experiments showed that the algorithm succeeds in finding a proper mapping between neural states and desired actions. Its advantages are that it depends neither on the initialization nor on any prior information about the input states. Also, parameters can be chosen on the fly based on the observed input states. Moreover, we observed how the two intelligent systems coadaptively learn in an online reaching task. The results showed that KTD is powerful for practical applications due to its nonlinear approximation capabilities in online learning.

The observation and analysis of KTD(λ) give us a basic idea of how this algorithm behaves. However, in the case of Q-KTD(λ), the convergence analysis remains challenging since Q-learning involves both a learning policy and a greedy policy. For Q-KTD(λ), the convergence proof for Q-learning using temporal difference TD(λ) with linear function approximation in [37] can provide basic intuition about the role of function approximation in the convergence of Q-learning.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is partially supported by DARPA Contract N66001-10-C-2008. The authors would like to thank Pratik Chhatbar and Brandi Marsh for collecting the center-out reaching task data for the open-loop experiments.

References

[1] J. DiGiovanna, B. Mahmoudi, J. Fortes, J. C. Principe, and J. C. Sanchez, "Coadaptive brain-machine interface via reinforcement learning," IEEE Transactions on Biomedical Engineering, vol. 56, no. 1, pp. 54–64, 2009.
[2] B. Mahmoudi, Integrating robotic action with biologic perception: a brain machine symbiosis theory [Ph.D. dissertation], University of Florida, Gainesville, Fla, USA, 2010.
[3] E. A. Pohlmeyer, B. Mahmoudi, S. Geng, N. W. Prins, and J. C. Sanchez, "Using reinforcement learning to provide stable brain-machine interface control despite neural input reorganization," PLoS ONE, vol. 9, no. 1, Article ID e87253, 2014.
[4] S. Matsuzaki, Y. Shiina, and Y. Wada, "Adaptive classification for brain machine interface with reinforcement learning," in Proceedings of the 18th International Conference on Neural Information Processing, vol. 7062, pp. 360–369, Shanghai, China, November 2011.
[5] M. J. Bryan, S. A. Martin, W. Cheung, and R. P. N. Rao, "Probabilistic co-adaptive brain-computer interfacing," Journal of Neural Engineering, vol. 10, no. 6, Article ID 066008, 2013.
[6] R. S. Sutton, "Learning to predict by the methods of temporal differences," Machine Learning, vol. 3, no. 1, pp. 9–44, 1988.
[7] J. A. Boyan, Learning evaluation functions for global optimization [Ph.D. dissertation], Carnegie Mellon University, 1998.
[8] S. J. Bradtke and A. G. Barto, "Linear least-squares algorithms for temporal difference learning," Machine Learning, vol. 22, pp. 33–57, 1996.
[9] A. Geramifard, M. Bowling, M. Zinkevich, and R. S. Sutton, "iLSTD: eligibility traces and convergence analysis," in Advances in Neural Information Processing Systems, pp. 441–448, 2007.
[10] R. S. Sutton, C. Szepesvari, and H. R. Maei, "A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation," in Proceedings of the 22nd Annual Conference on Neural Information Processing Systems (NIPS '08), pp. 1609–1616, MIT Press, December 2008.
[11] R. S. Sutton, H. R. Maei, D. Precup, et al., "Fast gradient-descent methods for temporal-difference learning with linear function approximation," in Proceedings of the 26th International Conference on Machine Learning (ICML '09), pp. 993–1000, June 2009.
[12] J. N. Tsitsiklis and B. Van Roy, "An analysis of temporal-difference learning with function approximation," IEEE Transactions on Automatic Control, vol. 42, no. 5, pp. 674–690, 1997.
[13] S. Haykin, Neural Networks and Learning Machines, Prentice Hall, 2009.
[14] B. Scholkopf and A. J. Smola, Learning with Kernels, MIT Press, 2002.
[15] Y. Engel, Algorithms and representations for reinforcement learning [Ph.D. dissertation], Hebrew University, 2005.
[16] X. Xu, T. Xie, D. Hu, and X. Lu, "Kernel least-squares temporal difference learning," International Journal of Information Technology, vol. 11, no. 9, pp. 54–63, 2005.
[17] J. Bae, P. Chhatbar, J. T. Francis, J. C. Sanchez, and J. C. Principe, "Reinforcement learning via kernel temporal difference," in Proceedings of the 33rd Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC '11), pp. 5662–5665, 2011.
[18] S. Zhao, From fixed to adaptive budget robust kernel adaptive filtering [Ph.D. dissertation], University of Florida, Gainesville, Fla, USA, 2012.
[19] Y. Engel, S. Mannor, and R. Meir, "The kernel recursive least-squares algorithm," IEEE Transactions on Signal Processing, vol. 52, no. 8, pp. 2275–2285, 2004.
[20] X. Xu, "A sparse kernel-based least-squares temporal difference algorithm for reinforcement learning," in Proceedings of the 2nd International Conference on Natural Computation, vol. 4221, pp. 47–56, 2006.
[21] B. Chen, S. Zhao, P. Zhu, and J. C. Principe, "Quantized kernel least mean square algorithm," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 1, pp. 22–32, 2012.
[22] J. Bae, L. S. Giraldo, P. Chhatbar, J. T. Francis, J. C. Sanchez, and J. C. Principe, "Stochastic kernel temporal difference for reinforcement learning," in Proceedings of the 21st IEEE International Workshop on Machine Learning for Signal Processing (MLSP '11), pp. 1–6, IEEE, September 2011.
[23] X. Chen, Y. Gao, and R. Wang, "Online selective kernel-based temporal difference learning," IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 12, pp. 1944–1956, 2013.
[24] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998.
[25] W. Liu, J. C. Principe, and S. Haykin, Kernel Adaptive Filtering: A Comprehensive Introduction, Wiley, 2010.
[26] J. Mercer, "Functions of positive and negative type and their connection with the theory of integral equations," Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 209, pp. 415–446, 1909.
[27] P. Dayan and T. J. Sejnowski, "TD(λ) converges with probability 1," Machine Learning, vol. 14, no. 3, pp. 295–301, 1994.
[28] H. J. Kushner and D. S. Clark, Stochastic Approximation Methods for Constrained and Unconstrained Systems, vol. 26 of Applied Mathematical Sciences, Springer, New York, NY, USA, 1978.
[29] C. J. C. H. Watkins, Learning from delayed rewards [Ph.D. dissertation], King's College, London, UK, 1989.
[30] C. Szepesvari, Algorithms for Reinforcement Learning, edited by R. J. Brachman and T. Dietterich, Morgan & Claypool, 2010.
[31] S. Zhao, B. Chen, P. Zhu, and J. C. Principe, "Fixed budget quantized kernel least-mean-square algorithm," Signal Processing, vol. 93, no. 9, pp. 2759–2770, 2013.
[32] W. Liu, I. Park, and J. C. Principe, "An information theoretic approach of designing sparse kernel adaptive filters," IEEE Transactions on Neural Networks, vol. 20, no. 12, pp. 1950–1961, 2009.
[33] J. A. Boyan, "Technical update: least-squares temporal difference learning," Machine Learning, vol. 49, pp. 233–246, 2002.
[34] C. J. C. H. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.
[35] J. C. Sanchez, A. Tarigoppula, J. S. Choi, et al., "Control of a center-out reaching task using a reinforcement learning Brain-Machine Interface," in Proceedings of the 5th International IEEE/EMBS Conference on Neural Engineering (NER '11), pp. 525–528, May 2011.
[36] J. Bae, L. G. Sanchez Giraldo, E. A. Pohlmeyer, J. C. Sanchez, and J. C. Principe, "A new method of concurrently visualizing states, values, and actions in reinforcement based brain machine interfaces," in Proceedings of the 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC '13), pp. 5402–5405, July 2013.
[37] F. S. Melo, S. P. Meyn, and M. I. Ribeiro, "An analysis of reinforcement learning with function approximation," in Proceedings of the 25th International Conference on Machine Learning, pp. 664–671, July 2008.

the kernel size needs to be re-adjusted each time a new exper-iment takes place and it cannot be determined beforehandAt each time the distances between the current state and thepreviously observed states are computed to obtain the outputvalues119876 in this caseTherefore we use the distance values toselect the kernel size as follows

ℎtemp (119899) = radic1

2 (119899 minus 1)

119899minus1

sum

119894=1

119909 (119894) minus 119909 (119899)2

ℎ (119899) =1

119899[

119899minus1

sum

119894=1

ℎ (119894) + ℎtemp (119899)]

(32)

Using the squared distance between pairs of previously seeninput states we can obtain an estimate of the mean distanceThis value is also averaged along with past kernel sizes toobtain the current kernel size

Moreover we consider 120574 = 1 and 120582 = 0 since ourexperiments perform single step trials Stepsize 120578 = 05 isapplied The output represents the 2 possible directions (leftand right) and the robot arm moves based on the estimatedoutput from the decoder

723 Results Theoverall performance is evaluated by check-ing whether the robot arm reaches the assigned target Oncethe robot arm reaches the target the decoder gets a positivereward +1 otherwise it receives negative reward minus1

Table 2 shows the decoder performance over 4 days interms of success rates Each day corresponds to a separateexperiment In Day 1 the experiment has a total of 20 trials(10A trials and 10 B trials)The overall success rate was 90Only the first trial for each target was incorrectly assigned

14 Computational Intelligence and Neuroscience

0 5 10 15 20

0

1

A trialB trial

A trialB trial

A trialB trial

0 5 10 15 20

0

1

TD er

ror

A trialB trial

A trialB trial

A trialB trial

0 5 10 15 20

0

1

Trial numbers

0

1

0 10 20 30 40 50

0

1

TD er

ror

0 10 20 30 40 50

0

1

Trial numbers

Trial numbers Trial numbers

Trial numbers Trial numbers0 10 20 30 40 50

minus1

minus1

minus1

minus1

minus1

Qva

lue

Qva

lue 05

minus05

minus2S(1)F

(minus1)

inde

x

S(1)F

(minus1)

inde

x

Figure 14 Performance of 119876-learning via KTD in the closed loop RLBMI controlled by a monkey for Day 1 (left) and Day 3 (right) thesuccess (+1) index and failure (minus1) index of each trial (top) the change of TD error (middle) and the change of 119876-values (down)

Table 2: Success rates of Q-KTD in closed-loop RLBMI.

         Total trial numbers (total A, B trials)    Success rates (%)
Day 1    20 (10, 10)                                90.00
Day 2    32 (26, 26)                                84.38
Day 3    53 (37, 36)                                77.36
Day 4    52 (37, 35)                                78.85

Note that on each day the same experimental setup was utilized, and the decoder was initialized in the same way; we did not use pretrained parameters to initialize the system. To understand the variation of the success rates across days, we look at the performance of Day 1 and Day 3. Figure 14 shows the decoder performance for these 2 experiments.

Although the success rate for Day 3 is not as high as for Day 1, both experiments show that the algorithm learns an appropriate neural state to action map. Even though there is variation among the neural states within each day, the decoder adapts well to minimize the TD error, and the Q-values converge to the desired values for each action. Because this is a single-step task and a reward of +1 is assigned for a successful trial, the estimated action value Q should be close to +1.

It is observed that the TD error and the Q-values oscillate; the drastic changes in the TD error or Q-value correspond to the missed trials. The overall performance can be evaluated by checking whether the robot arm reaches the desired target (the top plots in Figure 14).


[Figure 15 panels — Day 1 (left column): (a) after 3 trials, (c) after 10 trials, (e) after 20 trials; Day 3 (right column): (b) after 3 trials, (d) after 30 trials, (f) after 57 trials. Axes: first and second principal components; legend: estimated policy, A trials, B trials.]

Figure 15: Estimated policy for the projected neural states from Day 1 (left) and Day 3 (right). The failed trials during the closed-loop experiment are marked as red stars (missed A trials) and green dots (missed B trials).


However, this assessment does not show what causes the changes in the system values. In addition, it is hard to know how the two separate intelligent systems interact during learning and how the neural states affect the overall performance.

Under the coadaptation scenario in the RLBMI architecture, it is obvious that if one system does not perform properly, it will cause detrimental effects on the performance of the other system. If the BMI decoder does not give proper updates to the robotic device, it will confuse the user conducting the task, and if the user gives improper state information or the translation is wrong, the resulting update may fail even though the BMI decoder was able to find the optimal mapping function.

Using the methodology proposed in [36], we can observe how the decoder effectively learns a good state to action mapping and how the neural states affect the prediction performance. Figure 15 shows how each participant (the agent and the user) influences the overall performance in both successful and missed trials and how the agent adapts to the environment. By applying principal component analysis (PCA), the high-dimensional neural states can be visualized in two dimensions using the first two largest principal components. In this two-dimensional space of projected neural states, we can visualize the estimated policy as well.
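For reference, a projection of this kind can be sketched with a standard PCA implementation; scikit-learn is our choice here and the data are stand-ins, so this only illustrates the visualization step rather than reproducing the authors' analysis.

```python
import numpy as np
from sklearn.decomposition import PCA

# states: one row per trial, columns are the unit firing rates (stand-in data)
states = np.random.randn(57, 14)
pca = PCA(n_components=2)
projected = pca.fit_transform(states)   # columns: first and second principal components

# Scattering projected[:, 0] against projected[:, 1], colored by the decoder's
# selected action for each trial, gives the kind of view shown in Figure 15.
```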

We observe the behavior of the two systems at the beginning, intermediate, and final stages of the experiment by using the neural states that have been observed as well as the decoder learned up to the given stage. It is evident that the decoder can predict nonlinear policies. Day 1 (left column in Figure 15) shows that the neural states from the two classes are well separable. It was noted during Day 3 that the monkey seemed less engaged in the task than on Day 1. This suggests the possibility that during some trials the monkey was distracted and may not have been producing a consistent set of neural outputs. We can also see this phenomenon in the plots (right column in Figure 15): most of the neural states that were misclassified appear to be closer to the states corresponding to the opposite target in the projected state space. However, the estimated policy shows that the system effectively learns. Note that the initially misclassified A trials (red stars in Figure 15(d), which are located near the estimated policy boundary) are assigned to the correct direction once learning has been accomplished (Figure 15(f)). It is a remarkable fact that the system adapts to the environment online.

8. Conclusions

The advantages of KTD(λ) in neural decoding problems were observed. The key properties of this kernel-based learning algorithm are its capability for nonlinear function approximation and its convergence guarantees. We also examined the capability of the extended KTD algorithm (Q-KTD(λ)) in both open-loop and closed-loop reinforcement learning brain machine interface (RLBMI) experiments to perform reaching tasks.

In the open-loop experiments, results showed that Q-KTD(λ) can effectively learn the brain state-action mapping and offers performance advantages over conventional nonlinear function approximation methods such as time-delay neural networks. We observed that Q-KTD(λ) overcomes the main issues of conventional nonlinear function approximation methods, such as local minima and the need for proper initialization.

Results on the closed-loop RLBMI experiments showed that the algorithm succeeds in finding a proper mapping between neural states and desired actions. Its advantages are that it neither depends on the initialization nor requires any prior information about the input states; in addition, its parameters can be chosen on the fly based on the observed input states. Moreover, we observed how the two intelligent systems coadaptively learn in an online reaching task. The results showed that KTD is well suited for practical applications due to its nonlinear approximation capabilities in online learning.

The observation and analysis of KTD(λ) give us a basic idea of how this algorithm behaves. However, in the case of Q-KTD(λ), the convergence analysis remains challenging since Q-learning contains both a learning policy and a greedy policy. For Q-KTD(λ), the convergence proof for Q-learning using temporal difference (TD)(λ) with linear function approximation in [37] can provide basic intuition about the role of function approximation in the convergence of Q-learning.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is partially supported by DARPA Contract N66001-10-C-2008. The authors would like to thank Pratik Chhatbar and Brandi Marsh for collecting the center-out reaching task data for the open-loop experiments.

References

[1] J. DiGiovanna, B. Mahmoudi, J. Fortes, J. C. Principe, and J. C. Sanchez, "Coadaptive brain-machine interface via reinforcement learning," IEEE Transactions on Biomedical Engineering, vol. 56, no. 1, pp. 54–64, 2009.

[2] B. Mahmoudi, Integrating robotic action with biologic perception: a brain machine symbiosis theory [Ph.D. dissertation], University of Florida, Gainesville, Fla, USA, 2010.

[3] E. A. Pohlmeyer, B. Mahmoudi, S. Geng, N. W. Prins, and J. C. Sanchez, "Using reinforcement learning to provide stable brain-machine interface control despite neural input reorganization," PLoS ONE, vol. 9, no. 1, Article ID e87253, 2014.

[4] S. Matsuzaki, Y. Shiina, and Y. Wada, "Adaptive classification for brain machine interface with reinforcement learning," in Proceedings of the 18th International Conference on Neural Information Processing, vol. 7062, pp. 360–369, Shanghai, China, November 2011.

[5] M. J. Bryan, S. A. Martin, W. Cheung, and R. P. N. Rao, "Probabilistic co-adaptive brain-computer interfacing," Journal of Neural Engineering, vol. 10, no. 6, Article ID 066008, 2013.


[6] R. S. Sutton, "Learning to predict by the methods of temporal differences," Machine Learning, vol. 3, no. 1, pp. 9–44, 1988.

[7] J. A. Boyan, Learning evaluation functions for global optimization [Ph.D. dissertation], Carnegie Mellon University, 1998.

[8] S. J. Bradtke and A. G. Barto, "Linear least-squares algorithms for temporal difference learning," Machine Learning, vol. 22, pp. 33–57, 1996.

[9] A. Geramifard, M. Bowling, M. Zinkevich, and R. S. Sutton, "iLSTD: eligibility traces and convergence analysis," in Advances in Neural Information Processing Systems, pp. 441–448, 2007.

[10] R. S. Sutton, C. Szepesvari, and H. R. Maei, "A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation," in Proceedings of the 22nd Annual Conference on Neural Information Processing Systems (NIPS '08), pp. 1609–1616, MIT Press, December 2008.

[11] R. S. Sutton, H. R. Maei, D. Precup, et al., "Fast gradient-descent methods for temporal-difference learning with linear function approximation," in Proceedings of the 26th International Conference on Machine Learning (ICML '09), pp. 993–1000, June 2009.

[12] J. N. Tsitsiklis and B. Van Roy, "An analysis of temporal-difference learning with function approximation," IEEE Transactions on Automatic Control, vol. 42, no. 5, pp. 674–690, 1997.

[13] S. Haykin, Neural Networks and Learning Machines, Prentice Hall, 2009.

[14] B. Scholkopf and A. J. Smola, Learning with Kernels, MIT Press, 2002.

[15] Y. Engel, Algorithms and representations for reinforcement learning [Ph.D. dissertation], Hebrew University, 2005.

[16] X. Xu, T. Xie, D. Hu, and X. Lu, "Kernel least-squares temporal difference learning," International Journal of Information Technology, vol. 11, no. 9, pp. 54–63, 2005.

[17] J. Bae, P. Chhatbar, J. T. Francis, J. C. Sanchez, and J. C. Principe, "Reinforcement learning via kernel temporal difference," in Proceedings of the 33rd Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC '11), pp. 5662–5665, 2011.

[18] S. Zhao, From fixed to adaptive budget robust kernel adaptive filtering [Ph.D. dissertation], University of Florida, Gainesville, Fla, USA, 2012.

[19] Y. Engel, S. Mannor, and R. Meir, "The kernel recursive least-squares algorithm," IEEE Transactions on Signal Processing, vol. 52, no. 8, pp. 2275–2285, 2004.

[20] X. Xu, "A sparse kernel-based least-squares temporal difference algorithm for reinforcement learning," in Proceedings of the 2nd International Conference on Natural Computation, vol. 4221, pp. 47–56, 2006.

[21] B. Chen, S. Zhao, P. Zhu, and J. C. Principe, "Quantized kernel least mean square algorithm," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 1, pp. 22–32, 2012.

[22] J. Bae, L. S. Giraldo, P. Chhatbar, J. T. Francis, J. C. Sanchez, and J. C. Principe, "Stochastic kernel temporal difference for reinforcement learning," in Proceedings of the 21st IEEE International Workshop on Machine Learning for Signal Processing (MLSP '11), pp. 1–6, IEEE, September 2011.

[23] X. Chen, Y. Gao, and R. Wang, "Online selective kernel-based temporal difference learning," IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 12, pp. 1944–1956, 2013.

[24] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, New York, NY, USA, 1998.

[25] W. Liu, J. C. Principe, and S. Haykin, Kernel Adaptive Filtering: A Comprehensive Introduction, Wiley, 2010.

[26] J. Mercer, "Functions of positive and negative type and their connection with the theory of integral equations," Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 209, pp. 415–446, 1909.

[27] P. Dayan and T. J. Sejnowski, "TD(λ) converges with probability 1," Machine Learning, vol. 14, no. 3, pp. 295–301, 1994.

[28] H. J. Kushner and D. S. Clark, Stochastic Approximation Methods for Constrained and Unconstrained Systems, vol. 26 of Applied Mathematical Sciences, Springer, New York, NY, USA, 1978.

[29] C. J. C. H. Watkins, Learning from delayed rewards [Ph.D. dissertation], King's College, London, UK, 1989.

[30] C. Szepesvari, Algorithms for Reinforcement Learning, edited by R. J. Brachman and T. Dietterich, Morgan & Claypool, 2010.

[31] S. Zhao, B. Chen, P. Zhu, and J. C. Principe, "Fixed budget quantized kernel least-mean-square algorithm," Signal Processing, vol. 93, no. 9, pp. 2759–2770, 2013.

[32] W. Liu, I. Park, and J. C. Principe, "An information theoretic approach of designing sparse kernel adaptive filters," IEEE Transactions on Neural Networks, vol. 20, no. 12, pp. 1950–1961, 2009.

[33] J. A. Boyan, "Technical update: least-squares temporal difference learning," Machine Learning, vol. 49, pp. 233–246, 2002.

[34] C. J. C. H. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.

[35] J. C. Sanchez, A. Tarigoppula, J. S. Choi, et al., "Control of a center-out reaching task using a reinforcement learning Brain-Machine Interface," in Proceedings of the 5th International IEEE/EMBS Conference on Neural Engineering (NER '11), pp. 525–528, May 2011.

[36] J. Bae, L. G. Sanchez Giraldo, E. A. Pohlmeyer, J. C. Sanchez, and J. C. Principe, "A new method of concurrently visualizing states, values, and actions in reinforcement based brain machine interfaces," in Proceedings of the 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC '13), pp. 5402–5405, July 2013.

[37] F. S. Melo, S. P. Meyn, and M. I. Ribeiro, "An analysis of reinforcement learning with function approximation," in Proceedings of the 25th International Conference on Machine Learning, pp. 664–671, July 2008.



[8] S J Bradtke and A G Barto ldquoLinear least-squares algorithmsfor temporal difference learningrdquoMachine Learning vol 22 pp33ndash57 1996

[9] A Geramifard M Bowling M Zinkevich and R S Suttonldquoilstd eligibility traces and convergence analysisrdquo in Advancesin Neural Information Processing Systems pp 441ndash448 2007

[10] R S Sutton C Szepesvari and H R Maei ldquoA convergentO(n) algorithm for off-policy temporal-difference learningwithlinear function approximationrdquo in Proceedings of the 22ndAnnual Conference on Neural Information Processing Systems(NIPS rsquo08) pp 1609ndash1616 MIT Press December 2008

[11] R S Sutton H R Maei D Precup et al ldquoFast gradient-descent methods for temporal-difference learning with linearfunction approximationrdquo in Proceeding of the 26th InternationalConference On Machine Learning (ICML rsquo09) pp 993ndash1000June 2009

[12] J N Tsitsiklis and B Van Roy ldquoAn analysis of temporal-difference learning with function approximationrdquo IEEE Trans-actions on Automatic Control vol 42 no 5 pp 674ndash690 1997

[13] S Haykin Neural Networks and Learning Machines PrenticeHall 2009

[14] B Scholkopf and A J Smola Learning with Kernels MIT Press2002

[15] Y EngelAlgorithms and representations for reinforcement learn-ing [PhD dissertation] Hebrew University 2005

[16] X Xu T Xie D Hu and X Lu ldquoKernel least-squares temporaldifference learningrdquo International Journal of Information Tech-nology vol 11 no 9 pp 54ndash63 2005

[17] J Bae P Chhatbar J T Francis J C Sanchez and J C PrincipeldquoReinforcement learning via kernel temporal differencerdquo inProceedings of the 33rd Annual International Conference of theIEEE onEngineering inMedicine andBiology Society (EMBC 11)pp 5662ndash5665 2011

[18] S Zhao From fixed to adaptive budget robust kernel adaptivefiltering [PhD dissertation] University of Florida GainesvilleFla USA 2012

[19] Y Engel S Mannor and R Meir ldquoThe kernel recursive least-squares algorithmrdquo IEEE Transactions on Signal Processing vol52 no 8 pp 2275ndash2285 2004

[20] X Xu ldquoA sparse kernel-based least-squares temporal differencealgorithms for reinforcement learningrdquo inProceedings of the 2ndInternational Conference on Natural Computation vol 4221 pp47ndash56 2006

[21] B Chen S Zhao P Zhu and J C Principe ldquoQuantized kernelleast mean square algorithmrdquo IEEE Transactions on NeuralNetworks and Learning Systems vol 23 no 1 pp 22ndash32 2012

[22] J Bae L S Giraldo P Chhatbar J T Francis J C Sanchezand J C Principe ldquoStochastic kernel temporal difference forreinforcement learningrdquo in Proceedings of the 21st IEEE Inter-national Workshop on Machine Learning for Signal Processing(MLSP rsquo11) pp 1ndash6 IEEE September 2011

[23] X Chen Y Gao and R Wang ldquoOnline selective kernel-basedtemporal difference learningrdquo IEEE Transactions on NeuralNetworks and Learning Systems vol 24 no 12 pp 1944ndash19562013

[24] R S Rao and A G Barto Reinforcement Learning An Introduc-tion MIT Press New York NY USA 1998

[25] W Liu J C Principe and S Haykin Kernel Adaptive FilteringA Comprehensive Introduction Wiley 2010

[26] J Mercer ldquoFunctions of positive and negative type and theirconnection with the theory of integral equationsrdquo PhilosophicalTransactions of the Royal Society A Mathematical Physical andEngineering Sciences vol 209 pp 415ndash446 1909

[27] P Dayan and T J Sejnowski ldquoTD(120582) converges with probability1rdquoMachine Learning vol 14 no 3 pp 295ndash301 1994

[28] H J Kushner andD S Clark Stochastic ApproximationMethodsfor Constrained and Unconstrained Systems vol 26 of AppliedMathematical Sciences Springer New York NY USA 1978

[29] C J C H Watkins Learning from delayed rewards [PhDdissertation] Kingrsquos College London UK 1989

[30] C Szepesvari Algorithms for Reinforcement Learning edited byR J Branchman and T Dietterich Morgan amp Slaypool 2010

[31] S Zhao B Chen P Zhu and J C Prıncipe ldquoFixed budgetquantized kernel least-mean-square algorithmrdquo Signal Process-ing vol 93 no 9 pp 2759ndash2770 2013

[32] W Liu I Park and J C Prıncipe ldquoAn information theoreticapproach of designing sparse kernel adaptive filtersrdquo IEEETransactions on Neural Networks vol 20 no 12 pp 1950ndash19612009

[33] J A Boyan ldquoTechnical update least-squares temporal differ-ence learningrdquoMachine Learning vol 49 pp 233ndash246 2002

[34] C J C H Watkins and P Dayan ldquoQ-learningrdquo MachineLearning vol 8 no 3-4 pp 279ndash292 1992

[35] J C Sanchez A Tarigoppula J S Choi et al ldquoControl of acenter-out reaching task using a reinforcement learning Brain-Machine Interfacerdquo in Proceedings of the 5th InternationalIEEEEMBS Conference on Neural Engineering (NER rsquo11) pp525ndash528 May 2011

[36] J Bae L G Sanchez Giraldo E A Pohlmeyer J C Sanchezand J C Principe ldquoA new method of concurrently visualizingstates values and actions in reinforcement based brainmachineinterfacesrdquo in Proceedings of the 35th Annual InternationalConference of the IEEE Engineering in Medicine and BiologySociety (EMBC rsquo13) pp 5402ndash5405 July 2013

[37] F S Melo S P Meyn and M I Ribeiro ldquoAn analysisof reinforcement learning with function approximationrdquo inProceedings of the 25th International Conference on MachineLearning pp 664ndash671 July 2008

Computational Intelligence and Neuroscience 7

D(n − 1) adds a new unit when the new input state x(n) satisfies the following condition:

\[
\min_{x_i \in D(n-1)} \left\| \phi(x(n)) - \phi(x_i) \right\|^2 > \mu_1. \quad (25)
\]

For some kernels, such as the Gaussian, the quantization method and this kernel-distance-based criterion can be shown to be equivalent.
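As an illustration, the following is a minimal sketch of this dictionary-growth rule for a Gaussian kernel, using the identity ‖φ(x) − φ(x_i)‖² = 2 − 2κ(x, x_i) so that the feature-space distance in (25) can be computed from kernel evaluations. The function and variable names are ours, not from the paper.

```python
import numpy as np

def gaussian_kernel(x, y, h):
    """Gaussian kernel with size h, as in Eq. (28)."""
    return np.exp(-np.sum((np.asarray(x) - np.asarray(y)) ** 2) / (2.0 * h ** 2))

def maybe_add_center(dictionary, x_new, h, mu1):
    """Add x_new to the dictionary D(n-1) only if its minimum
    feature-space distance to the existing centers exceeds mu1 (Eq. (25))."""
    if not dictionary:
        dictionary.append(np.asarray(x_new))
        return True
    # For a Gaussian kernel, ||phi(x) - phi(x_i)||^2 = 2 - 2 * kappa(x, x_i).
    dists = [2.0 - 2.0 * gaussian_kernel(x_new, c, h) for c in dictionary]
    if min(dists) > mu1:
        dictionary.append(np.asarray(x_new))
        return True
    return False
```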

6. Simulations

Note that the KTD algorithm has been introduced for value function estimation. To evaluate the algorithm's nonlinear capability, we first examine the performance of KTD(λ) on the problem of state value function estimation given a fixed policy π. We carry out experiments on a simple illustrative Markov chain initially described in [33]. This is a popular episodic task for testing TD learning algorithms. The experiment is useful for illustrating both linear and nonlinear functions of the state representations and shows how the state value function is estimated by the adaptive system.
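For concreteness, the sketch below illustrates the general flavor of such a kernel-based value estimator, V̂(x) = Σ_i α_i κ(c_i, x), corrected by a TD(0)-style error on each observed transition. This is only an illustration of the functional form being evaluated in these simulations; the exact KTD(λ) recursion with eligibility traces is the one derived earlier in the paper, and the class and names below are ours.

```python
import numpy as np

class KernelValueFunction:
    """Nonparametric value estimate V(x) = sum_i alpha_i * kappa(c_i, x)."""

    def __init__(self, h, eta):
        self.h = h          # kernel size
        self.eta = eta      # step size
        self.centers = []   # stored input states
        self.alphas = []    # corresponding coefficients

    def kernel(self, x, y):
        return np.exp(-np.sum((np.asarray(x) - np.asarray(y)) ** 2) / (2 * self.h ** 2))

    def value(self, x):
        return sum(a * self.kernel(c, x) for c, a in zip(self.centers, self.alphas))

    def td_update(self, x, reward, x_next, gamma=1.0, terminal=False):
        """TD(0)-style correction: allocate a new unit scaled by the TD error."""
        v_next = 0.0 if terminal else self.value(x_next)
        delta = reward + gamma * v_next - self.value(x)
        self.centers.append(np.asarray(x))
        self.alphas.append(self.eta * delta)
        return delta
```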

6.1. Linear Case. Even though we emphasize the capability of KTD(λ) as a nonlinear function approximator, under an appropriate kernel size KTD(λ) should also approximate linear functions on a region of interest. To test its efficacy, we observe the performance on a simple Markov chain (Figure 3). There are 13 states, numbered from 12 down to 0. Each trial starts at state 12 and terminates at state 0. Each state is represented by a 4-dimensional vector, and the rewards are assigned in such a way that the value function V is a linear function of the states; namely, V* takes the values [0, −2, −4, ..., −22, −24] at states [0, 1, 2, ..., 11, 12]. In the case of V = w^⊤x, the optimal weights are w* = [−24, −16, −8, 0].

To assess the performance, the updated estimate of the state value function V̂(x) is compared to the optimal value function V* at the end of each trial. This is done by computing the RMS error of the value function over all states:

\[
\mathrm{RMS} = \sqrt{\frac{1}{n} \sum_{x \in \mathcal{X}} \left( V^*(x) - \hat{V}(x) \right)^2}, \quad (26)
\]

where n is the number of states, n = 13.

Step-size scheduling is applied as follows:

\[
\eta(n) = \eta_0 \, \frac{a_0 + 1}{a_0 + n}, \quad n = 1, 2, \ldots, \quad (27)
\]

where η0 is the initial step size and a0 is the annealing factor, which controls how fast the step size decreases. In this experiment, a0 = 100 is applied. Furthermore, we assume that the policy π is guaranteed to terminate, which means that the value function V^π is well behaved without using a discount factor γ in (3); that is, γ = 1.
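As a small illustration, the evaluation metric (26) and the step-size schedule (27) translate directly into a few lines of code; the function names are ours.

```python
import numpy as np

def rms_error(v_hat, v_star):
    """RMS error of the value estimates over all states, Eq. (26)."""
    v_hat, v_star = np.asarray(v_hat), np.asarray(v_star)
    n = len(v_star)  # n = 13 states in this chain
    return np.sqrt(np.sum((v_star - v_hat) ** 2) / n)

def step_size(n, eta0=0.5, a0=100):
    """Annealed step size, Eq. (27): eta(n) = eta0 * (a0 + 1) / (a0 + n).
    With a0 = 100 the step size decays slowly toward zero as n grows."""
    return eta0 * (a0 + 1) / (a0 + n)
```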

In KTD(λ), we employ the Gaussian kernel

\[
\kappa(x(i), x(j)) = \exp\left( -\frac{\|x(i) - x(j)\|^2}{2h^2} \right). \quad (28)
\]

Figure 3: A 13-state Markov chain [33]. For states from 2 to 12, the state transition probabilities are 0.5 and the corresponding rewards are −3. State 1 has a transition probability of 1 to the terminal state 0 and a reward of −2. States 12, 8, 4, and 0 have the 4-dimensional state space representations [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], and [0, 0, 0, 1], respectively; the representations of the other states are linear interpolations between these vectors.
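To make the setup concrete, here is a sketch of how the 13 state representations and the linear-case targets V* described above can be generated. The interpolation scheme follows the figure caption (anchor states 12, 8, 4, 0 mapped to the canonical basis vectors), and the helper names are ours.

```python
import numpy as np

def chain_state_representations():
    """4-dimensional representations of states 0..12 from Figure 3:
    states 12, 8, 4, 0 are the canonical basis vectors, and the
    intermediate states are linear interpolations between them."""
    anchors = {12: np.array([1., 0., 0., 0.]),
               8:  np.array([0., 1., 0., 0.]),
               4:  np.array([0., 0., 1., 0.]),
               0:  np.array([0., 0., 0., 1.])}
    reps = {}
    for hi, lo in [(12, 8), (8, 4), (4, 0)]:
        for s in range(lo, hi + 1):
            w = (s - lo) / (hi - lo)  # interpolation weight toward the upper anchor
            reps[s] = w * anchors[hi] + (1 - w) * anchors[lo]
    return np.array([reps[s] for s in range(13)])

# Linear-case targets: V*(s) = -2 s, i.e., [0, -2, -4, ..., -24].
v_star_linear = np.array([-2.0 * s for s in range(13)])
# With w* = [-24, -16, -8, 0], V*(s) = w* . x(s) holds for every state.
```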

The Gaussian kernel is a universal kernel commonly encountered in practice. To find the optimal kernel size, we fix all the other free parameters around median values, λ = 0.4 and η0 = 0.5, and compare the average RMS error over 10 Monte Carlo runs. For this specific experiment, smaller kernel sizes yield better performance since the state representations are finite. In general, however, applying too small a kernel size leads to overfitting and slow learning; in particular, choosing a very small kernel makes the algorithm behave much like a table look-up method. Thus, we choose h = 0.2, the largest kernel size for which we obtain mean RMS values similar to those of smaller kernel sizes.

After fixing the kernel size to h = 0.2, we evaluate different combinations of eligibility trace rates λ and initial step sizes η0. Figure 4 shows the average performance over 10 Monte Carlo runs for 1000 trials.

All λ values with an optimal step size show good approximation to V* after 1000 trials. Notice that KTD(λ = 0) shows slightly better performance than KTD(λ = 1); this may be attributed to the local nature of KTD when using the Gaussian kernel. In addition, varying the step size has a relatively small effect on KTD(λ): the Gaussian kernel, like other shift-invariant kernels, provides an implicitly normalized update rule, which is known to be less sensitive to the step size. Based on Figure 4, the optimal eligibility trace rate and initial step size, λ = 0.6 and η0 = 0.3, are selected for KTD with kernel size h = 0.2.

The learning curve of KTD(λ) is compared to that of the conventional TD algorithm, TD(λ). The optimal parameters employed in both algorithms are based on the experimental evaluation; in TD(λ), λ = 1 and η0 = 0.1 are applied. The RMS error is averaged over 50 Monte Carlo runs for 1000 trials. Comparative learning curves are given in Figure 5.

In this experiment, we confirm the ability of KTD(λ) to handle the function approximation problem when the fixed policy yields a state value function that is linear in the state representation. Both algorithms reach a mean RMS value of around 0.06. As expected, TD(λ) converges faster to the optimal solution because of the linear nature of the problem. KTD(λ) converges more slowly than TD(λ), but it is also able to approximate the value function properly.

Figure 4: Performance comparison over different combinations of eligibility trace rates λ and initial step sizes η0 in KTD(λ) with h = 0.2. The vertical line segments contain the mean RMS value after 100 trials (top marker), 500 trials (middle marker), and 1000 trials (bottom marker).

Figure 5: Learning curves of KTD(λ) and TD(λ). The solid lines show the mean RMS error and the dashed lines show the ± standard deviation over 50 Monte Carlo runs.

In this sense, the KTD algorithm is open to a wider class of problems than its linear counterpart.

6.2. Nonlinear Case. The previous section showed the performance of KTD(λ) on the problem of estimating a state value function that is a linear function of the given state representation.

Figure 6: A 13-state Markov chain. In states from 2 to 12, each state transition has probability 0.5, and state 1 has transition probability 1 to the absorbing state 0. Note that the optimal state value function can be represented as a nonlinear function of the states, and the corresponding reward values are assigned to each state.

The same problem can be turned into a nonlinear one by modifying the reward values in the chain such that the resulting state value function V* is no longer a linear function of the states.

The number of states and the state representations remain the same as in the previous section. However, the optimal value function V* becomes nonlinear with respect to the representation of the states; namely, V* = [0, −0.2, −0.6, −1.4, −3, −6.2, −12.6, −13.4, −13.5, −14.45, −15.975, −19.2125, −25.5938] for states 0 to 12. This implies that the reward values for each state are different from the ones given for the linear case (Figure 6).

Again, to evaluate the performance, after each trial is completed the estimated state value is compared to the optimal state value V* using the RMS error (26). For KTD(λ), the Gaussian kernel (28) is applied and a kernel size h = 0.2 is chosen. Figure 7 shows the average RMS error over 10 Monte Carlo runs for 1000 trials.

The combination of λ = 0.4 and η0 = 0.3 shows the best performance, but the λ = 0 case also performs well. Unlike TD(λ) [6], there is no dominant value of λ in KTD(λ). Recall that convergence is guaranteed for linearly independent representations of the states, a condition automatically fulfilled in KTD(λ) when the kernel is strictly positive definite. Therefore, the differences are rather due to the convergence speed, which is controlled by the interaction between the step size and the eligibility trace.

The average RMS error over 50 Monte Carlo runs is compared with Gaussian process temporal difference (GPTD) [15] and TD(λ) in Figure 8. The purpose of the GPTD implementation is to provide a comparison among kernelized value function approximations. Here, the applied optimal parameters are, for KTD(λ), λ = 0.4, η0 = 0.3, and h = 0.2; for GPTD, λ = 1, σ² = 0.5, and h = 0.2; and for TD(λ), λ = 0.8 and η0 = 0.1.

The linear function approximation TD(λ) (blue line) cannot estimate the optimal state values. KTD(λ) outperforms the linear algorithm, as expected, since the Gaussian kernel is strictly positive definite. GPTD also learns the target state values, but it fails to reach error values as low as those of KTD. GPTD is sensitive to the selection of the noise covariance σ²: if the value is small, the system becomes unstable, and larger values slow the learning down. GPTD models the residuals, the difference between the expected return and the actual return, as a Gaussian process.

Figure 7: Performance comparison over different combinations of λ and the initial step size η0 in KTD(λ) with h = 0.2. The plotted segments are the mean RMS values after 100 trials (top segment), 500 trials (middle segment), and 1000 trials (bottom segment).

Figure 8: Learning curves of KTD(λ), GPTD, and TD(λ), with the trial number shown on a logarithmic scale. The solid lines show the mean RMS error and the dashed lines represent the ± standard deviation over 50 Monte Carlo runs.

This assumption does not hold true for the Markov chain in Figure 6. As we can observe in Figure 8, KTD(λ) reaches a mean value around 0.07, while the mean values of GPTD and TD(λ) are around 0.2 and 1.8, respectively.

In these synthetic examples, we presented experimental results for approximating the state value function under a fixed policy. We observed that KTD(λ) performs well on both linear and nonlinear function approximation problems. In addition, we showed how the linear independence of the input state representations can affect the performance of the algorithms. The use of strictly positive definite kernels in KTD(λ) implies the linear independence condition, and thus the algorithm converges for all λ ∈ [0, 1]. In the following section, we apply the extended KTD algorithm to estimate the action value function, which can be employed to find a proper control policy for RLBMI tasks.

7. Experimental Results on Neural Decoding

In our RLBMI experiments, we map the monkey's neural signal to action directions (computer cursor/robot arm position). The agent starts in a naive state, but the subject has been trained to receive rewards from the environment. Once it reaches the assigned target, the system and the subject earn a reward, and the agent updates its neural state decoder. Through iteration, the agent learns how to correctly translate neural states into action directions.

7.1. Open-Loop RLBMI. In open-loop RLBMI experiments, the output of the agent does not directly change the state of the environment, because decoding is performed on prerecorded data; the external device is updated based only on the monkey's actual physical response. In this sense, we only consider the monkey's neural states from successful trials to train the agent. The goal of these experiments is to evaluate the system's capability to predict the proper state to action mapping based on the monkey's neural states and to assess the viability of further closed-loop experiments.

7.1.1. Environment. The data employed in these experiments were provided by SUNY Downstate Medical Center. A female bonnet macaque was trained on a center-out reaching task allowing 8 action directions. After the subject attained about an 80% success rate, microelectrode arrays were implanted in the motor cortex (M1). Animal surgery was performed under the Institutional Animal Care and Use Committee (IACUC) regulations and assisted by the Division of Laboratory Animal Resources (DLAT) at SUNY Downstate Medical Center.

From 96-channel recordings, a set of 185 units was obtained after sorting. The neural states are represented by the firing rates of each unit on a 100 ms window. There is a set of 8 possible targets and action directions. Every trial starts at the center point, and the distance from the center to each target is 4 cm; anything within a radius of 1 cm of the target point is considered a valid reach.
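A small sketch of the task geometry described above is given below. The paper specifies 8 targets at 4 cm from the center and a 1 cm valid-reach radius; the exact angular layout (45-degree increments) and all names are our assumptions for illustration.

```python
import numpy as np

CENTER = np.zeros(2)
TARGET_DISTANCE_CM = 4.0   # center-to-target distance
VALID_RADIUS_CM = 1.0      # a reach counts if within 1 cm of the target

# Assumed layout: 8 targets at 45-degree increments around the center.
TARGETS = np.array([[TARGET_DISTANCE_CM * np.cos(k * np.pi / 4),
                     TARGET_DISTANCE_CM * np.sin(k * np.pi / 4)] for k in range(8)])

def is_valid_reach(cursor_xy, target_idx):
    """True if the cursor position is within 1 cm of the assigned target."""
    return np.linalg.norm(np.asarray(cursor_xy) - TARGETS[target_idx]) <= VALID_RADIUS_CM
```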

7.1.2. Agent. In the agent, Q-learning via kernel temporal differences (Q-KTD(λ)) is applied to neural decoding. For Q-KTD(λ), we employ the Gaussian kernel (28). The neural states are preprocessed by normalizing their dynamic range to lie between −1 and 1 before they are input to the system. Based on the preprocessed neural states, the system predicts in which direction the computer cursor will move.

Table 1: Average success rates of Q-KTD in open-loop RLBMI.

Epochs      1     2     3     4     5     6     7
2 target   0.44  0.96  0.99  0.99  0.97  0.99  0.99
4 target   0.41  0.73  0.76  0.95  0.99  0.99  0.99
8 target   0.32  0.65  0.79  0.89  0.96  0.98  0.98

Each output unit represents one of the 8 possible directions, and among the 8 outputs one action is selected by the ε-greedy method [34]: the action corresponding to the unit with the highest Q value is selected with probability 1 − ε; otherwise, any other action is selected at random. The performance is evaluated by checking whether the updated position reaches the assigned target, and depending on the updated position a reward value is assigned to the system.

7.1.3. Results on Single Step Tasks. Here, the targets should be reached within a single step: rewards from the environment are received after a single step, and one action is performed by the agent per trial. The assignment of reward is based on the 1-0 distance to the target, that is, dist(x, d) = 0 if x = d and dist(x, d) = 1 otherwise. Once the cursor reaches the assigned target, the agent gets a positive reward of +0.6; otherwise it receives a negative reward of −0.6 [35]. An exploration rate ε = 0.01 and a discount factor γ = 0.9 are applied. Also, we consider λ = 0 since our experiment performs single step updates per trial. In this experiment, the firing rates of the 185 units on 100 ms windows are time-embedded using a 6th-order tap delay; this creates a representation space where each state is a vector with 1295 dimensions.
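The following sketch shows the ε-greedy action selection and the single-step reward assignment described above; the Q-value computation itself is left abstract, and the function names are ours.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.01, rng=np.random):
    """Pick the highest-Q action with probability 1 - epsilon;
    otherwise pick one of the remaining actions at random."""
    q_values = np.asarray(q_values)  # one Q output per direction (8 here)
    best = int(np.argmax(q_values))
    if rng.rand() < epsilon:
        others = [a for a in range(len(q_values)) if a != best]
        return int(rng.choice(others))
    return best

def single_step_reward(chosen_direction, target_direction):
    """1-0 distance based reward: +0.6 on a correct reach, -0.6 otherwise."""
    return 0.6 if chosen_direction == target_direction else -0.6
```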

We start with the simplest version of the problem by considering only 2 targets (right and left). The total number of trials is 43 for the 2-target case. For Q-KTD, the kernel size h is heuristically chosen based on the distribution of the mean squared distances between pairs of input states: let s = E[‖x_i − x_j‖²]; then h = √(s/2). For this particular data set, the above heuristic gives a kernel size h = 7. The step size η = 0.3 is selected based on the stability bound that was derived for the kernel least mean square algorithm [25]:

\[
\eta < \frac{N}{\operatorname{tr}[G_\phi]} = \frac{N}{\sum_{j=1}^{N} \kappa(x(j), x(j))} = 1, \quad (29)
\]

where G_φ is the Gram matrix.

After 43 trials, we count the number of trials that received a positive reward, and the success rate is averaged over 50 Monte Carlo runs. The performance of the Q-KTD algorithm is compared with Q-learning via time-delay neural network (Q-TDNN) and the online selective kernel-based temporal difference learning algorithm (Q-OSKTD) [23] in Figure 9. Note that TDNN is a conventional approach to function approximation and has already been applied to RLBMI experiments for neural decoding [1, 2]; OSKTD is a kernel-based temporal difference algorithm emphasizing online sparsification.
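The kernel-size heuristic and the step-size bound (29) above translate directly into a few lines of code; this is only a sketch with our own names.

```python
import numpy as np

def kernel_size_heuristic(states):
    """h = sqrt(s/2), with s the mean squared distance between pairs of input states
    (averaged over all ordered pairs, including i = j)."""
    states = np.asarray(states, dtype=float)
    diffs = states[:, None, :] - states[None, :, :]
    s = np.mean(np.sum(diffs ** 2, axis=-1))
    return np.sqrt(s / 2.0)

def step_size_upper_bound(states, kernel):
    """Stability bound eta < N / tr(G_phi) from [25]; for a Gaussian kernel
    kappa(x, x) = 1, so the bound evaluates to 1 as in Eq. (29)."""
    n = len(states)
    trace = sum(kernel(x, x) for x in states)
    return n / trace
```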

Both Q-KTD and Q-OSKTD reach around a 100% success rate after 2 epochs. In contrast, the average success rate of Q-TDNN increases slowly and never reaches the same performance as Q-KTD.

Figure 9: Comparison of the average learning curves over 50 Monte Carlo runs among Q-TDNN, Q-OSKTD, and Q-KTD. Solid lines show the mean success rates and dashed lines show the confidence interval based on one standard deviation.

In the case of Q-OSKTD, the value function updates require one more parameter, μ2, to decide the subspace. To validate the algorithm's capability to estimate a proper policy, we set the sparsified dictionary to the same size as the number of sample observations. In Q-OSKTD, we observed that the subspace selection parameter plays an important role in the speed of learning; it turns out that for the above experiment smaller subspaces allow faster learning. In the extreme case of Q-OSKTD where only the current state is affected, the updates become equivalent to the update rule of Q-KTD.

Since all the experimental parameters are fixed over the 50 Monte Carlo runs, the confidence interval for Q-KTD can be simply associated with the random effects introduced by the ε-greedy method employed for action selection with exploration, hence the narrow interval. However, with Q-TDNN a larger variation in performance is observed, which shows how the initialization influences the success of learning due to local minima: Q-TDNN is able to approach the Q-KTD performance, but most of the time the system falls into local minima. This highlights one of the advantages of KTD compared to TDNN, namely, its insensitivity to initialization.

Table 1 shows average success rates over 50 Monte Carlo runs with respect to different numbers of targets.

Figure 10: Average success rates over 50 Monte Carlo runs with respect to different filter sizes. The vertical line segments are the mean success rates after 1 epoch (bottom markers), 2 epochs (middle markers), and 20 epochs (top markers).

The first row corresponds to the mean success rates displayed in Figure 9 (red solid line); it is included in Table 1 to ease comparison with the 4- and 8-target experiments. The 4-target task involves reaching right, up, left, and down positions from the center. Note that in all tasks 8 directions are allowed at each step. The standard deviation at each epoch is around 0.02.

One characteristic of nonparametric approaches is the growing filter structure. Here, we observe how the filter size influences the overall performance of Q-KTD by applying the surprise criterion [32] and quantization [21] methods. In the case of the 2-target center-out reaching task, we should expect the filter size to grow to 861 units after 20 epochs without any control of the filter size. Using the surprise criterion, the filter size can be reduced to 87 centers with acceptable performance. Quantization, however, allows the filter size to be reduced to 10 units while maintaining success rates above 90%. Figure 10 shows the effect of filter size in the 2-target experiment using the quantization approach. For filter sizes as small as 10 units, the average success rates remain stable; with 10 units, the algorithm shows learning speed similar to that of the linearly growing filter, with success rates above 90%. Note that quantization limits the capacity of the kernel filter, since fewer units than samples are employed, and thus it helps to avoid overfitting.

In the 2-target center-out reaching task, quantized Q-KTD shows satisfactory results in terms of initialization and computational cost. Further analysis of Q-KTD is conducted on a larger number of targets: we increase the number of targets from 2 to 8. All experimental parameters are kept the same as for the 2-target experiment; the only change is the step size, η = 0.5. A total of 178 trials are used for the 8-target reaching task.

To gain more insight into the algorithm, we observe the interplay between the quantization size ε_U and the kernel size h.

Figure 11: The effect of filter size control on the 8-target single-step center-out reaching task, shown as success rate versus kernel size for final filter sizes of 178, 133, 87, and 32. The average success rates are computed over 50 Monte Carlo runs after the 10th epoch.

Based on the distribution of squared distances between pairs of input states, various kernel sizes (h = 0.5, 1, 1.5, 2, 3, 5, 7) and quantization sizes (ε_U = 1, 110, 120, 130) are considered. The corresponding success rates, for final filter sizes of 178, 133, 87, and 32, are displayed in Figure 11.

With a final filter size of 178 (blue line), the success rates are superior to those of any other filter size for every kernel size tested, since it contains all the input information. Especially for small kernel sizes (h ≤ 2), success rates above 96% are observed. Moreover, note that even after reduction of the state information (red line), the system still produces acceptable success rates (around 90%) for kernel sizes ranging from 0.5 to 2.

Among the best performing kernel sizes, we favor the largest one, since it provides better generalization guarantees. In this sense, a kernel size h = 2 can be selected: it is the largest kernel size that considerably reduces the filter size and still yields a neural state to action mapping that performs well (around 90% success rates). In the case of kernel size h = 2 with a final filter size of 178, the system reaches 100% success rates after 6 epochs with a maximum variance of 4%. As we can see from the number of units, higher representation capacity is required to obtain the desired performance as the task becomes more complex. Nevertheless, results on the 8-target center-out reaching task show that the method can effectively learn the brain state-action mapping for this task with reasonable complexity.

7.1.4. Results on Multistep Tasks. Here, we develop a more realistic scenario: we extend the task to multistep and multitarget experiments. This case allows us to explore the role of the eligibility traces in Q-KTD(λ). The price paid for this extension is that the eligibility trace rate λ now needs to be selected according to the best observed performance.

Figure 12: Reward distribution for the right target. The black diamond is the initial position and the purple diamonds show the possible directions, including the assigned target direction (red diamond).

Testing based on the same experimental setup employed for the single-step task, that is, a discrete reward value assigned only at the target, causes extremely slow learning since not enough guidance is given; the system requires long periods of exploration until it actually reaches the target. Therefore, we employ a continuous reward distribution around the selected target, defined by the following expression:

\[
r(s) =
\begin{cases}
p_{\text{reward}}\, G(s) & \text{if } G(s) > 0.1, \\
n_{\text{reward}} & \text{if } G(s) \le 0.1,
\end{cases} \quad (30)
\]

where G(s) = exp[−(s − μ)^⊤ C_θ^{-1} (s − μ)], s ∈ R² is the position of the cursor, p_reward = 1, and n_reward = −0.6. The mean vector μ corresponds to the selected target location, and the covariance matrix is

\[
C_\theta = R_\theta \begin{pmatrix} 7.5 & 0 \\ 0 & 0.1 \end{pmatrix} R_\theta^\top, \qquad
R_\theta = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix}, \quad (31)
\]

which depends on the angle θ of the selected target as follows: for target indices one and five the angle is 0, for two and six it is −π/4, for three and seven it is π/2, and for four and eight it is π/4. (Here, the target indices follow the locations depicted in Figure 6 of [22].) Figure 12 shows the reward distribution for target index one; the same form of distribution, centered at the assigned target point, is applied to the other directions.
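A sketch of the continuous reward (30)-(31) is given below, assuming the Gaussian-shaped profile G(s) = exp[−(s − μ)^⊤ C_θ^{-1}(s − μ)] discussed above. The target-index-to-angle mapping follows the text, and the helper names are ours.

```python
import numpy as np

P_REWARD, N_REWARD = 1.0, -0.6
TARGET_ANGLES = {1: 0.0, 5: 0.0,
                 2: -np.pi / 4, 6: -np.pi / 4,
                 3:  np.pi / 2, 7:  np.pi / 2,
                 4:  np.pi / 4, 8:  np.pi / 4}

def reward(s, mu, target_index):
    """Continuous reward around the assigned target, Eqs. (30)-(31)."""
    theta = TARGET_ANGLES[target_index]
    rot = np.array([[np.cos(theta),  np.sin(theta)],
                    [-np.sin(theta), np.cos(theta)]])      # R_theta
    cov = rot @ np.diag([7.5, 0.1]) @ rot.T                # C_theta, elongated toward the target
    d = np.asarray(s, dtype=float) - np.asarray(mu, dtype=float)
    g = np.exp(-d @ np.linalg.solve(cov, d))               # G(s)
    return P_REWARD * g if g > 0.1 else N_REWARD
```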

Once the system reaches the assigned target, it earns a maximum reward of +1; during the approaching stage it receives partial rewards according to (30). When the system earns the maximum reward, the trial is classified as a successful trial. The maximum number of steps per trial is limited such that the cursor must approach the target in a straight-line trajectory. Here, we also control the complexity of the task by allowing different numbers of targets and steps; namely, 2-step 4-target (right, up, left, and down) and 4-step 3-target (right, up, and down) experiments are performed. Increasing the number of steps per trial amounts to making smaller jumps with each action. After each epoch, the number of successful trials is counted for each target direction. Figure 13 shows the learning curves for each target and the average success rates.

A larger number of steps results in lower success rates. However, both cases (two and four steps) obtain an average success rate above 60% within 1 epoch. The performances show that all directions can achieve success rates above 70% after convergence, which encourages the application of the algorithm to online scenarios.

7.2. Closed-Loop RLBMI Experiments. In closed-loop RLBMI experiments, the behavioral task is a reaching task using a robotic arm. The decoder controls the robot arm's action direction by predicting the monkey's intent based on its neuronal activity. If the robot arm reaches the assigned target, a reward is given to both the monkey (food reward) and the decoder (positive value). Notice that the two intelligent systems learn coadaptively to accomplish the goal. These experiments are conducted in cooperation with the Neuroprosthetics Research Group at the University of Miami. The performance is evaluated in terms of task completion accuracy and speed. Furthermore, we provide a methodology to tease apart the influence of each of the RLBMI's systems on the overall performance.

7.2.1. Environment. During pretraining, a marmoset monkey was trained to perform a target reaching task, namely, moving a robot arm to two spatial locations denoted as A trial and B trial. The monkey was taught to associate changes in motor activity with A trials and to produce static motor responses during B trials. Once a target is assigned, a beep signals the start of the trial. To control the robot during the user training phase, the monkey is required to steadily place its hand on a touch pad for 700~1200 ms. This action produces a go beep that is followed by the activation of one of the two target LEDs (A trial: red light for the left direction; B trial: green light for the right direction). The robot arm goes to a home position, namely, the center position between the two targets. Its gripper shows an object (a food reward such as a waxworm or marshmallow for an A trial, and an undesirable object, a wooden bead, for a B trial). To move the robot to the A location, the monkey needed to reach out and touch a sensor within 2000 ms, and to make the robot reach the B target, the monkey needed to keep its arm motionless on the touch pad for 2500 ms. When the monkey successfully moved the robot to the correct target, the target LEDs would blink and the monkey would receive a food reward (for both the A and B targets).

After the monkey was trained to perform the assigned task properly, a microelectrode array (16-channel tungsten microelectrode array, Tucker-Davis Technologies, FL) was surgically implanted under isoflurane anesthesia and sterile conditions. Neural states from the motor cortex (M1) are recorded; these neural states become the inputs to the neural decoder. All surgical and animal care procedures were consistent with the National Research Council Guide for the Care and Use of Laboratory Animals and were approved by the University of Miami Institutional Animal Care and Use Committee.

Figure 13: The learning curves for the multistep multitarget tasks: (a) 2-step 4-target and (b) 4-step 3-target. Success rates per epoch are shown for each target direction and on average.

In the closed-loop experiments, after the initial holding time that produces the go beep, the robotic arm's position is updated based solely on the monkey's neural states; differently from the user pretraining sessions, the monkey is not required to perform any movement. During the real-time experiment, 14 neurons are obtained from 10 electrodes. The neural states are represented by the firing rates on a 2 sec window following the go signal.

7.2.2. Agent. For the BMI decoder, we use Q-learning via kernel temporal differences (Q-KTD(λ)). One big difference between open-loop and closed-loop applications is the amount of accessible data: in the closed-loop experiments, we can only get information about the neural states that have been observed up to the present. However, in the previous offline experiments, normalization and kernel selection were conducted offline based on the entire data set. It is not possible to apply the same method in the online setting, since we only have information about the input states up to the present time. Normalization is a scaling procedure that interplays with the choice of the kernel size, and proper selection of the kernel size brings proper scaling to the data. Thus, in contrast to the previous open-loop experiments, normalization of the input neural states is not applied, and the kernel size is automatically selected given the inputs.

The Gaussian kernel (28) is employed, and the kernel size h is automatically selected based on the history of inputs. Note that in the closed-loop experiments the dynamic range of the states varies from experiment to experiment. Consequently, the kernel size needs to be readjusted each time a new experiment takes place, and it cannot be determined beforehand. At each time step, the distances between the current state and the previously observed states are computed to obtain the output values (Q in this case). Therefore, we use these distance values to select the kernel size as follows:

\[
h_{\mathrm{temp}}(n) = \sqrt{\frac{1}{2(n-1)} \sum_{i=1}^{n-1} \left\| x(i) - x(n) \right\|^2},
\qquad
h(n) = \frac{1}{n} \left[ \sum_{i=1}^{n-1} h(i) + h_{\mathrm{temp}}(n) \right]. \quad (32)
\]

Using the squared distances between the current state and the previously seen input states, we obtain an estimate of the mean distance; this value is then averaged along with the past kernel sizes to obtain the current kernel size.
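A minimal sketch of this running kernel-size estimate (32) is shown below; the class and names are ours, and the bootstrap value used before any history exists is an assumption, since the paper does not specify h(1).

```python
import numpy as np

class OnlineKernelSize:
    """Running estimate of the kernel size from the history of inputs, Eq. (32)."""

    def __init__(self):
        self.states = []     # x(1), ..., x(n-1)
        self.h_history = []  # h(1), ..., h(n-1)

    def update(self, x_new):
        x_new = np.asarray(x_new, dtype=float)
        if self.states:
            sq = [np.sum((x_new - x) ** 2) for x in self.states]
            h_temp = np.sqrt(np.sum(sq) / (2.0 * len(self.states)))
            h = (np.sum(self.h_history) + h_temp) / (len(self.h_history) + 1)
        else:
            h = 1.0  # arbitrary bootstrap for the very first state (not specified in the paper)
        self.states.append(x_new)
        self.h_history.append(h)
        return h
```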

Moreover, we consider γ = 1 and λ = 0, since our experiments perform single-step trials. A step size η = 0.5 is applied. The output represents the 2 possible directions (left and right), and the robot arm moves based on the estimated output of the decoder.

7.2.3. Results. The overall performance is evaluated by checking whether the robot arm reaches the assigned target. Once the robot arm reaches the target, the decoder gets a positive reward of +1; otherwise it receives a negative reward of −1.

Table 2 shows the decoder performance over 4 days in terms of success rates; each day corresponds to a separate experiment. On Day 1, the experiment had a total of 20 trials (10 A trials and 10 B trials). The overall success rate was 90%; only the first trial for each target was incorrectly assigned.

Figure 14: Performance of Q-learning via KTD in the closed-loop RLBMI controlled by a monkey for Day 1 (left) and Day 3 (right): the success (+1) and failure (−1) index of each trial (top), the change of the TD error (middle), and the change of the Q values (bottom).

Table 2: Success rates of Q-KTD in closed-loop RLBMI.

       Total trial numbers (total A, B trial)   Success rates (%)
Day 1  20 (10, 10)                              90.00
Day 2  32 (26, 26)                              84.38
Day 3  53 (37, 36)                              77.36
Day 4  52 (37, 35)                              78.85

Note that the same experimental setup was utilized each day, and the decoder was initialized in the same way each day; we did not use pretrained parameters to initialize the system. To understand the variation of the success rates across days, we look at the performance of Day 1 and Day 3. Figure 14 shows the decoder performance for the 2 experiments.

Although the success rate for Day 3 is not as high as that of Day 1, both experiments show that the algorithm learns an appropriate neural state to action map. Even though there is variation among the neural states within each day, the decoder adapts well to minimize the TD error, and the Q values converge to the desired values for each action. Because this is a single step task and a reward of +1 is assigned for a successful trial, the estimated action value Q should be close to +1.

It is observed that the TD error and Q values oscillate; the drastic changes in the TD error or Q value correspond to the missed trials. The overall performance can be evaluated by checking whether the robot arm reaches the desired target (top plots in Figure 14).

Figure 15: Estimated policy for the projected neural states from Day 1 (left) and Day 3 (right) at the beginning, intermediate, and final stages of learning: (a) after 3, (c) 10, and (e) 20 trials for Day 1; (b) after 3, (d) 30, and (f) 57 trials for Day 3. The failed trials during the closed-loop experiment are marked as red stars (missed A trials) and green dots (missed B trials).


However, this assessment does not show what causes the change in the system values. In addition, it is hard to know how the two separate intelligent systems interact during learning and how the neural states affect the overall performance.

Under the coadaptation scenario in the RLBMI architecture, it is obvious that if one system does not perform properly, it will have detrimental effects on the performance of the other system. If the BMI decoder does not give proper updates to the robotic device, it will confuse the user conducting the task; and if the user gives improper state information, or the translation is wrong, the resulting update may fail even though the BMI decoder was able to find the optimal mapping function.

Using the methodology proposed in [36], we can observe how the decoder effectively learns a good state to action mapping and how the neural states affect the prediction performance. Figure 15 shows how each participant (the agent and the user) influences the overall performance in both successful and missed trials, and how the agent adapts to the environment. By applying principal component analysis (PCA), the high-dimensional neural states can be visualized in two dimensions using the two largest principal components. In this two-dimensional space of projected neural states, we can visualize the estimated policy as well.
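A hedged sketch of one way to produce such a plot is given below: project the neural states with PCA and color a grid of the projected plane by the decoder's greedy action, mapping grid points back to the full space through the PCA inverse transform. The decoder interface (a `q_values` callable) and all names are our assumptions; the exact procedure of [36] may differ in detail.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def visualize_policy(neural_states, labels, q_values, grid_res=200):
    """Project neural states to 2D with PCA and overlay the decoder's greedy action.

    q_values: callable mapping a full-dimensional neural state to a vector of
    Q values (one per action); assumed interface, not from the paper.
    """
    pca = PCA(n_components=2)
    projected = pca.fit_transform(neural_states)

    # Grid over the projected plane, mapped back to the original space.
    x = np.linspace(projected[:, 0].min(), projected[:, 0].max(), grid_res)
    y = np.linspace(projected[:, 1].min(), projected[:, 1].max(), grid_res)
    xx, yy = np.meshgrid(x, y)
    grid_full = pca.inverse_transform(np.column_stack([xx.ravel(), yy.ravel()]))

    actions = np.array([np.argmax(q_values(s)) for s in grid_full]).reshape(xx.shape)

    plt.contourf(xx, yy, actions, alpha=0.3)                  # estimated policy regions
    plt.scatter(projected[:, 0], projected[:, 1], c=labels)   # A/B trials
    plt.xlabel("First component")
    plt.ylabel("Second component")
    plt.show()
```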

We observe the behavior of the two systems at the beginning, intermediate, and final stages of the experiment by using the neural states that have been observed, as well as the learned decoder, up to the given stage. It is evident that the decoder can predict nonlinear policies. Day 1 (left column in Figure 15) shows that the neural states from the two classes are well separable. It was noted during Day 3 that the monkey seemed less engaged in the task than in Day 1; this suggests the possibility that during some trials the monkey was distracted and may not have been producing a consistent set of neural outputs. We are also able to see this phenomenon in the plots (right column in Figure 15): most of the neural states that were misclassified appear to be closer to the states corresponding to the opposite target in the projected state space. However, the estimated policy shows that the system effectively learns. Note that the initially misclassified A trials (red stars in Figure 15(d), which are located near the estimated policy boundary) are assigned to the right direction once learning has been accomplished (Figure 15(f)). It is remarkable that the system adapts to the environment online.

8. Conclusions

The advantages of KTD(λ) in neural decoding problems were observed. The key properties of this kernel-based learning algorithm are its capability for nonlinear function approximation and its convergence guarantees. We also examined the capability of the extended KTD algorithm (Q-KTD(λ)) in both open-loop and closed-loop reinforcement learning brain machine interface (RLBMI) experiments to perform reaching tasks.

In open-loop experiments, results showed that Q-KTD(λ) can effectively learn the brain state-action mapping and offers performance advantages over conventional nonlinear function approximation methods such as time-delay neural networks. We observed that Q-KTD(λ) overcomes the main issues of conventional nonlinear function approximation methods, such as local minima and proper initialization.

Results on closed-loop RLBMI experiments showed that the algorithm succeeds in finding a proper mapping between neural states and desired actions. Its advantages are that it neither depends on the initialization nor requires any prior information about the input states; also, parameters can be chosen on the fly based on the observed input states. Moreover, we observed how the two intelligent systems coadaptively learn in an online reaching task. The results showed that KTD is well suited for practical applications due to its nonlinear approximation capabilities in online learning.

The observation and analysis of KTD(λ) give us a basic idea of how this algorithm behaves. However, in the case of Q-KTD(λ), the convergence analysis remains challenging, since Q-learning contains both a learning policy and a greedy policy. For Q-KTD(λ), the convergence proof for Q-learning using temporal difference (TD)(λ) with linear function approximation in [37] can provide basic intuition for the role of function approximation in the convergence of Q-learning.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was partially supported by DARPA Contract N66001-10-C-2008. The authors would like to thank Pratik Chhatbar and Brandi Marsh for collecting the center-out reaching task data for the open-loop experiments.

References

[1] J. DiGiovanna, B. Mahmoudi, J. Fortes, J. C. Príncipe, and J. C. Sanchez, "Coadaptive brain-machine interface via reinforcement learning," IEEE Transactions on Biomedical Engineering, vol. 56, no. 1, pp. 54–64, 2009.

[2] B. Mahmoudi, Integrating robotic action with biologic perception: a brain machine symbiosis theory [Ph.D. dissertation], University of Florida, Gainesville, Fla, USA, 2010.

[3] E. A. Pohlmeyer, B. Mahmoudi, S. Geng, N. W. Prins, and J. C. Sanchez, "Using reinforcement learning to provide stable brain-machine interface control despite neural input reorganization," PLoS ONE, vol. 9, no. 1, Article ID e87253, 2014.

[4] S. Matsuzaki, Y. Shiina, and Y. Wada, "Adaptive classification for brain-machine interface with reinforcement learning," in Proceedings of the 18th International Conference on Neural Information Processing, vol. 7062, pp. 360–369, Shanghai, China, November 2011.

[5] M. J. Bryan, S. A. Martin, W. Cheung, and R. P. N. Rao, "Probabilistic co-adaptive brain-computer interfacing," Journal of Neural Engineering, vol. 10, no. 6, Article ID 066008, 2013.

[6] R. S. Sutton, "Learning to predict by the methods of temporal differences," Machine Learning, vol. 3, no. 1, pp. 9–44, 1988.

[7] J. A. Boyan, Learning evaluation functions for global optimization [Ph.D. dissertation], Carnegie Mellon University, 1998.

[8] S. J. Bradtke and A. G. Barto, "Linear least-squares algorithms for temporal difference learning," Machine Learning, vol. 22, pp. 33–57, 1996.

[9] A. Geramifard, M. Bowling, M. Zinkevich, and R. S. Sutton, "iLSTD: eligibility traces and convergence analysis," in Advances in Neural Information Processing Systems, pp. 441–448, 2007.

[10] R. S. Sutton, C. Szepesvári, and H. R. Maei, "A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation," in Proceedings of the 22nd Annual Conference on Neural Information Processing Systems (NIPS '08), pp. 1609–1616, MIT Press, December 2008.

[11] R. S. Sutton, H. R. Maei, D. Precup et al., "Fast gradient-descent methods for temporal-difference learning with linear function approximation," in Proceedings of the 26th International Conference on Machine Learning (ICML '09), pp. 993–1000, June 2009.

[12] J. N. Tsitsiklis and B. Van Roy, "An analysis of temporal-difference learning with function approximation," IEEE Transactions on Automatic Control, vol. 42, no. 5, pp. 674–690, 1997.

[13] S. Haykin, Neural Networks and Learning Machines, Prentice Hall, 2009.

[14] B. Schölkopf and A. J. Smola, Learning with Kernels, MIT Press, 2002.

[15] Y. Engel, Algorithms and representations for reinforcement learning [Ph.D. dissertation], Hebrew University, 2005.

[16] X. Xu, T. Xie, D. Hu, and X. Lu, "Kernel least-squares temporal difference learning," International Journal of Information Technology, vol. 11, no. 9, pp. 54–63, 2005.

[17] J. Bae, P. Chhatbar, J. T. Francis, J. C. Sanchez, and J. C. Príncipe, "Reinforcement learning via kernel temporal difference," in Proceedings of the 33rd Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC '11), pp. 5662–5665, 2011.

[18] S. Zhao, From fixed to adaptive budget robust kernel adaptive filtering [Ph.D. dissertation], University of Florida, Gainesville, Fla, USA, 2012.

[19] Y. Engel, S. Mannor, and R. Meir, "The kernel recursive least-squares algorithm," IEEE Transactions on Signal Processing, vol. 52, no. 8, pp. 2275–2285, 2004.

[20] X. Xu, "A sparse kernel-based least-squares temporal difference algorithm for reinforcement learning," in Proceedings of the 2nd International Conference on Natural Computation, vol. 4221, pp. 47–56, 2006.

[21] B. Chen, S. Zhao, P. Zhu, and J. C. Príncipe, "Quantized kernel least mean square algorithm," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 1, pp. 22–32, 2012.

[22] J. Bae, L. S. Giraldo, P. Chhatbar, J. T. Francis, J. C. Sanchez, and J. C. Príncipe, "Stochastic kernel temporal difference for reinforcement learning," in Proceedings of the 21st IEEE International Workshop on Machine Learning for Signal Processing (MLSP '11), pp. 1–6, IEEE, September 2011.

[23] X. Chen, Y. Gao, and R. Wang, "Online selective kernel-based temporal difference learning," IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 12, pp. 1944–1956, 2013.

[24] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, New York, NY, USA, 1998.

[25] W. Liu, J. C. Príncipe, and S. Haykin, Kernel Adaptive Filtering: A Comprehensive Introduction, Wiley, 2010.

[26] J. Mercer, "Functions of positive and negative type and their connection with the theory of integral equations," Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 209, pp. 415–446, 1909.

[27] P. Dayan and T. J. Sejnowski, "TD(λ) converges with probability 1," Machine Learning, vol. 14, no. 3, pp. 295–301, 1994.

[28] H. J. Kushner and D. S. Clark, Stochastic Approximation Methods for Constrained and Unconstrained Systems, vol. 26 of Applied Mathematical Sciences, Springer, New York, NY, USA, 1978.

[29] C. J. C. H. Watkins, Learning from delayed rewards [Ph.D. dissertation], King's College, London, UK, 1989.

[30] C. Szepesvári, Algorithms for Reinforcement Learning, edited by R. J. Brachman and T. Dietterich, Morgan & Claypool, 2010.

[31] S. Zhao, B. Chen, P. Zhu, and J. C. Príncipe, "Fixed budget quantized kernel least-mean-square algorithm," Signal Processing, vol. 93, no. 9, pp. 2759–2770, 2013.

[32] W. Liu, I. Park, and J. C. Príncipe, "An information theoretic approach of designing sparse kernel adaptive filters," IEEE Transactions on Neural Networks, vol. 20, no. 12, pp. 1950–1961, 2009.

[33] J. A. Boyan, "Technical update: least-squares temporal difference learning," Machine Learning, vol. 49, pp. 233–246, 2002.

[34] C. J. C. H. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.

[35] J. C. Sanchez, A. Tarigoppula, J. S. Choi et al., "Control of a center-out reaching task using a reinforcement learning Brain-Machine Interface," in Proceedings of the 5th International IEEE/EMBS Conference on Neural Engineering (NER '11), pp. 525–528, May 2011.

[36] J. Bae, L. G. Sanchez Giraldo, E. A. Pohlmeyer, J. C. Sanchez, and J. C. Príncipe, "A new method of concurrently visualizing states, values, and actions in reinforcement based brain machine interfaces," in Proceedings of the 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC '13), pp. 5402–5405, July 2013.

[37] F. S. Melo, S. P. Meyn, and M. I. Ribeiro, "An analysis of reinforcement learning with function approximation," in Proceedings of the 25th International Conference on Machine Learning, pp. 664–671, July 2008.

8 Computational Intelligence and Neuroscience

Figure 4: Performance comparison over different combinations of eligibility trace rates λ and initial step sizes η₀ in KTD(λ) with h = 0.2. The vertical line segments contain the mean RMS value after 100 trials (top marker), 500 trials (middle marker), and 1000 trials (bottom marker). (Axes: λ from 0 to 1 versus RMS error of the value function over all states; one curve per η₀ = 0.1, 0.2, ..., 0.9.)

Figure 5: Learning curve of KTD(λ) and TD(λ). The solid line shows the mean RMS error and the dashed line shows the ± standard deviations over 50 Monte Carlo runs. (Axes: trial number from 0 to 1000 versus RMS error of the value function over all states.)

the KTD algorithm is open to a wider class of problems than its linear counterpart.

6.2. Nonlinear Case. The previous section showed the performance of KTD(λ) on the problem of estimating a state value function that is a linear function of the given state representation. The same problem can be turned into a nonlinear one by modifying the reward values in the chain such that the resulting state value function V* is no longer a linear function of the states.

Figure 6: A 13-state Markov chain. In states 2 to 12, each state transition has probability 0.5, and state 1 has transition probability 1 to the absorbing state 0. Note that the optimal state value function can be represented as a nonlinear function of the states, and corresponding reward values are assigned to each state. (The figure also indicates the feature vector and reward value associated with the states.)

The number of states and the state representations remain the same as in the previous section. However, the optimal value function V* becomes nonlinear with respect to the representation of the states, namely V* = [0, −0.2, −0.6, −1.4, −3, −6.2, −12.6, −13.4, −13.5, −14.45, −15.975, −19.2125, −25.5938] for states 0 to 12. This implies that the reward values for each state are different from the ones given for the linear case (Figure 6).

Again, to evaluate the performance, after each trial is completed the estimated state value is compared to the optimal state value V* using the RMS error (26). For KTD(λ), the Gaussian kernel (28) is applied and a kernel size h = 0.2 is chosen. Figure 7 shows the average RMS error over 10 Monte Carlo runs for 1000 trials.
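
For reference, the target values used in this comparison follow directly from the chain's Bellman equations. The sketch below (Python/NumPy; the transition matrix P and reward vector r are placeholders to be filled in from Figure 6, so it is an illustration of the computation rather than the paper's exact setup) solves the undiscounted policy-evaluation system and evaluates the RMS error over all states (cf. (26)).

import numpy as np

def true_state_values(P, r):
    """Solve V = r + P V for an undiscounted absorbing chain.

    P[i, j]: transition probability from state i to state j (state 0 is absorbing),
    r[i]:    expected reward received when leaving state i.
    """
    n = len(r)
    A = np.eye(n) - P
    A[0, :] = 0.0
    A[0, 0] = 1.0            # pin the absorbing state: V(0) = 0
    b = np.array(r, dtype=float)
    b[0] = 0.0
    return np.linalg.solve(A, b)

def rms_error(V_hat, V_star):
    """RMS error of an estimated value function over all states."""
    return np.sqrt(np.mean((np.asarray(V_hat) - np.asarray(V_star)) ** 2))

# Placeholder usage: P and r would be read off the chain in Figure 6.
# V_star = true_state_values(P, r)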

The combination of λ = 0.4 and η₀ = 0.3 shows the best performance, but the λ = 0 case also performs well. Unlike TD(λ) [6], there is no dominant value of λ in KTD(λ). Recall that convergence has been proved for linearly independent representations of the states, which is automatically fulfilled in KTD(λ) when the kernel is strictly positive definite. Therefore, the differences are rather due to the convergence speed, which is controlled by the interaction between the step size and the eligibility trace.

The average RMS error over 50 Monte Carlo runs is compared with Gaussian process temporal difference (GPTD) [15] and TD(λ) in Figure 8. GPTD is included to provide a comparison among kernelized value function approximation methods. Here, the applied optimal parameters are, for KTD(λ), λ = 0.4, η₀ = 0.3, and h = 0.2; for GPTD, λ = 1, σ² = 0.5, and h = 0.2; and for TD(λ), λ = 0.8 and η₀ = 0.1.

The linear function approximation TD(λ) (blue line) cannot estimate the optimal state values. KTD(λ) outperforms the linear algorithm, as expected, since the Gaussian kernel is strictly positive definite. GPTD also learns the target state values, but it fails to reach error values as low as those of KTD. GPTD is sensitive to the selection of the covariance value of the noise σ²: if the value is small the system becomes unstable, and larger values cause the learning to slow down. GPTD models the residuals, the difference between the expected return and the actual return, as a Gaussian process. This assumption does not hold true for the Markov chain in Figure 6. As we can observe in Figure 8, KTD(λ) reaches a mean value around 0.07, while the mean values of GPTD and TD(λ) are around 0.2 and 1.8, respectively.

Figure 7: Performance comparison over different combinations of λ and the initial step size η₀ in KTD(λ) with h = 0.2. The plotted segment is the mean RMS value after 100 trials (top segment), 500 trials (middle segment), and 1000 trials (bottom segment).

Figure 8: Learning curves of KTD(λ), GPTD, and TD(λ). The solid lines show the mean RMS error and the dashed lines represent the (±) standard deviation over 50 Monte Carlo runs.

In the synthetic examples, we presented experimental results to approximate the state value function under a fixed policy. We observed that KTD(λ) performs well on both linear and nonlinear function approximation problems. In addition, in the previous section we showed how the linear independence of the input state representations can affect the performance of the algorithms. The use of strictly positive definite kernels in KTD(λ) implies the linear independence condition, and thus this algorithm converges for all λ ∈ [0, 1]. In the following section, we will apply the extended KTD algorithm to estimate the action value function, which can be employed in finding a proper control policy for RLBMI tasks.

7. Experimental Results on Neural Decoding

In our RLBMI experiments, we map the monkey's neural signal to action directions (computer cursor or robot arm position). The agent starts in a naive state, but the subject has been trained to receive rewards from the environment. Once it reaches the assigned target, the system and the subject earn a reward, and the agent updates its neural state decoder. Through iteration, the agent learns how to correctly translate neural states into action directions.

7.1. Open-Loop RLBMI. In open-loop RLBMI experiments, the output of the agent does not directly change the state of the environment, because decoding is performed on prerecorded data. The external device is updated based only on the monkey's actual physical response. In this sense, we only consider the monkey's neural states from successful trials to train the agent. The goal of these experiments is to evaluate the system's capability to predict the proper state to action mapping based on the monkey's neural states and to assess the viability of further closed-loop experiments.

7.1.1. Environment. The data employed in these experiments is provided by SUNY Downstate Medical Center. A female bonnet macaque is trained for a center-out reaching task allowing 8 action directions. After the subject attains about an 80% success rate, microelectrode arrays are implanted in the motor cortex (M1). Animal surgery is performed under the Institutional Animal Care and Use Committee (IACUC) regulations and assisted by the Division of Laboratory Animal Resources (DLAT) at SUNY Downstate Medical Center.

From the 96-channel recordings, a set of 185 units is obtained after sorting. The neural states are represented by the firing rates of each unit on a 100 ms window. There is a set of 8 possible targets and action directions. Every trial starts at the center point, and the distance from the center to each target is 4 cm; anything within a radius of 1 cm from the target point is considered a valid reach.

7.1.2. Agent. In the agent, Q-learning via kernel temporal difference (Q-KTD)(λ) is applied to neural decoding. For Q-KTD(λ) we employ the Gaussian kernel (28). After the neural states are preprocessed by normalizing their dynamic range to lie between −1 and 1, they are input to the system. Based on the preprocessed neural states, the system predicts which direction the computer cursor will move. Each output unit represents one of the 8 possible directions, and among the 8 outputs one action is selected by the ε-greedy method [34]: the action corresponding to the unit with the highest Q value gets selected with probability 1 − ε; otherwise, any other action is selected at random. The performance is evaluated by checking whether the updated position reaches the assigned target, and depending on the updated position a reward value is assigned to the system.

Table 1: Average success rates of Q-KTD in open-loop RLBMI.

Epochs     1     2     3     4     5     6     7
2 target   0.44  0.96  0.99  0.99  0.97  0.99  0.99
4 target   0.41  0.73  0.76  0.95  0.99  0.99  0.99
8 target   0.32  0.65  0.79  0.89  0.96  0.98  0.98
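
As a concrete illustration of the ε-greedy rule described in the agent paragraph above, here is a minimal sketch; the variable names are hypothetical, and q_values is assumed to already hold the eight Q outputs for the current neural state.

import numpy as np

def epsilon_greedy(q_values, epsilon=0.01, rng=None):
    """Pick the highest-Q action with probability 1 - epsilon; otherwise explore."""
    rng = rng or np.random.default_rng()
    q_values = np.asarray(q_values)
    greedy = int(np.argmax(q_values))
    if rng.random() >= epsilon:
        return greedy                                   # exploit the highest Q value
    others = [a for a in range(len(q_values)) if a != greedy]
    return int(rng.choice(others))                      # explore any other action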

7.1.3. Results on Single Step Tasks. Here, the targets should be reached within a single step: rewards from the environment are received after a single step, and one action is performed by the agent per trial. The assignment of reward is based on the 1-0 distance to the target, that is, dist(x, d) = 0 if x = d and dist(x, d) = 1 otherwise. Once the cursor reaches the assigned target, the agent gets a positive reward +0.6; otherwise, it receives a negative reward −0.6 [35]. An exploration rate ε = 0.01 and a discount factor γ = 0.9 are applied. Also, we consider λ = 0 since our experiment performs single step updates per trial. In this experiment, the firing rates of the 185 units on 100 ms windows are time-embedded using a 6th order tap delay. This creates a representation space where each state is a vector with 1295 dimensions.
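
The tap-delay embedding mentioned above can be sketched as follows (hypothetical layout: rates is an array of 100 ms firing-rate bins with one row per bin and 185 columns, one per unit); with a 6th order tap delay, each state stacks 7 consecutive bins, which gives the 1295-dimensional vectors.

import numpy as np

def tap_delay_state(rates, t, order=6):
    """Stack the current bin and the previous `order` bins into one state vector.

    Requires t >= order; with 185 units and order = 6 the result has 185 * 7 = 1295 entries.
    """
    taps = [rates[t - k] for k in range(order + 1)]   # bins t, t-1, ..., t-order
    return np.concatenate(taps)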

We start with the simplest version of the problem by considering only 2 targets (right and left). The total number of trials is 43 for the 2 targets. For Q-KTD, the kernel size h is heuristically chosen based on the distribution of the mean squared distances between pairs of input states: let s = E[‖x_i − x_j‖²]; then h = √(s/2). For this particular data set, the above heuristic gives a kernel size h = 7. The step size η = 0.3 is selected based on the stability bound that was derived for the kernel least mean square algorithm [25]:

\[
\eta < \frac{N}{\operatorname{tr}[G_{\phi}]} = \frac{N}{\sum_{j=1}^{N} \kappa\bigl(x(j), x(j)\bigr)} = 1, \tag{29}
\]

where G_φ is the Gram matrix.
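
A rough sketch of the kernel size heuristic and the step size bound (29) is given below (Python/NumPy; states is assumed to be an array of preprocessed neural state vectors, and the kernel is assumed to take the common Gaussian form exp(−‖x − y‖²/2h²), for which the bound evaluates to 1).

import numpy as np

def gaussian_kernel(x, y, h):
    # Assumed Gaussian form; the exact kernel used in the paper is given in (28).
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * h ** 2))

def kernel_size_heuristic(states):
    """h = sqrt(s/2), where s is the mean squared distance between pairs of states."""
    n = len(states)
    sq_dists = [np.sum((states[i] - states[j]) ** 2)
                for i in range(n) for j in range(i + 1, n)]
    return np.sqrt(np.mean(sq_dists) / 2.0)

def stepsize_bound(states, h):
    """Upper bound eta < N / tr[G_phi] = N / sum_j kappa(x(j), x(j)) from (29)."""
    trace = sum(gaussian_kernel(x, x, h) for x in states)   # equals N for a Gaussian kernel
    return len(states) / trace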

After 43 trials, we count the number of trials which received a positive reward, and the success rate is averaged over 50 Monte Carlo runs. The performance of the Q-KTD algorithm is compared with Q-learning via a time delayed neural net (Q-TDNN) and the online selective kernel-based temporal difference learning algorithm (Q-OSKTD) [23] in Figure 9. Note that TDNN is a conventional approach to function approximation and has already been applied to RLBMI experiments for neural decoding [1, 2]. OSKTD is a kernel-based temporal difference algorithm emphasizing online sparsification.

Both Q-KTD and Q-OSKTD reach around 100% success rates after 2 epochs. In contrast, the average success rate of Q-TDNN increases slowly yet never reaches the same performance as Q-KTD. In the case of Q-OSKTD, the value function updates require one more parameter, μ₂, to decide the subspace. To validate the algorithm's capability to estimate a proper policy, we set the sparsified dictionary to the same size as the number of sample observations. In Q-OSKTD, we observed that the subspace selection parameter plays an important role in terms of the speed of learning. It turns out that for the above experiment smaller subspaces allow faster learning. In the extreme case of Q-OSKTD where only the current state is affected, the updates become equivalent to the update rule of Q-KTD.

Figure 9: The comparison of average learning curves from 50 Monte Carlo runs among Q-TDNN, Q-OSKTD, and Q-KTD. Solid lines show the mean success rates and the dashed lines show the confidence interval based on one standard deviation. (Axes: epochs from 0 to 20 versus success rates.)

Since all the experimental parameters are fixed over the 50 Monte Carlo runs, the confidence interval for Q-KTD can be attributed simply to the random effects introduced by the ε-greedy method employed for action selection with exploration, hence the narrow interval. With Q-TDNN, however, a larger variation in performance is observed, which shows how strongly initialization influences the success of learning because of local minima: Q-TDNN is able to approach the Q-KTD performance, but most of the time the system falls into local minima. This highlights one of the advantages of KTD compared to TDNN, which is its insensitivity to initialization.

Table 1 shows average success rates over 50 Monte Carlo runs with respect to different numbers of targets. The first row corresponds to the mean success rates displayed in Figure 9 (red solid line); it is included in Table 1 to ease comparison with the 4- and 8-target experiments. The 4-target task involves reaching right, up, left, and down positions from the center. Note that in all tasks 8 directions are allowed at each step. The standard deviation at each epoch is around 0.02.

Figure 10: Average success rates over 50 Monte Carlo runs with respect to different filter sizes. The vertical line segments are the mean success rates after 1 epoch (bottom markers), 2 epochs (middle markers), and 20 epochs (top markers). (Axes: final filter size from 0 to 45 versus average success rates.)

One characteristic of nonparametric approaches is the growing filter structure. Here, we observe how the filter size influences the overall performance of Q-KTD by applying the Surprise criterion [32] and Quantization [21] methods. In the case of the 2-target center-out reaching task, without any control of the filter size we should expect the filter to grow to as many as 861 units after 20 epochs. Using the Surprise criterion, the filter size can be reduced to 87 centers with acceptable performance. However, Quantization allows the filter size to be reduced to 10 units while maintaining success rates above 90%. Figure 10 shows the effect of filter size in the 2-target experiment using the Quantization approach. For filter sizes as small as 10 units, the average success rates remain stable. With 10 units, the algorithm shows learning speed similar to that of the linearly growing filter, with success rates above 90%. Note that quantization limits the capacity of the kernel filter, since fewer units than samples are employed, and thus it helps to avoid over-fitting.
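
To illustrate how quantization curbs the growth of the filter, the sketch below follows the spirit of the quantized kernel least mean square rule of [21]: a new unit is added only when the incoming state is far enough from every existing center; otherwise the update is merged onto the nearest center. The use of the squared distance against ε_U here is an assumption made to stay consistent with the ε_U values reported later in this section.

import numpy as np

class QuantizedKernelExpansion:
    """Kernel expansion whose growth is limited by a quantization size eps_U."""
    def __init__(self, eps_U):
        self.eps_U = eps_U
        self.centers, self.coeffs = [], []

    def update(self, x, delta_weight):
        """Add delta_weight as a new center, or merge it onto the nearest existing one."""
        x = np.asarray(x, dtype=float)
        if self.centers:
            sq_dists = [np.sum((c - x) ** 2) for c in self.centers]
            j = int(np.argmin(sq_dists))
            if sq_dists[j] <= self.eps_U:
                self.coeffs[j] += delta_weight        # merge: no new unit is added
                return
        self.centers.append(x)
        self.coeffs.append(delta_weight)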

In the 2-target center-out reaching task, quantized Q-KTD shows satisfactory results in terms of initialization and computational cost. Further analysis of Q-KTD is conducted on a larger number of targets: we increase the number of targets from 2 to 8. All experimental parameters are kept the same as for the 2-target experiment; the only change is the step size, η = 0.5. A total of 178 trials are used for the 8-target reaching task.

To gain more insight about the algorithm, we observe the interplay between the quantization size ε_U and the kernel size h. Based on the distribution of squared distances between pairs of input states, various kernel sizes (h = 0.5, 1, 1.5, 2, 3, 5, 7) and quantization sizes (ε_U = 1, 110, 120, and 130) are considered. The corresponding success rates for final filter sizes of 178, 133, 87, and 32 are displayed in Figure 11.

Figure 11: The effect of filter size control on the 8-target single-step center-out reaching task. The average success rates are computed over 50 Monte Carlo runs after the 10th epoch. (Axes: kernel size from 0 to 7 versus success rates; one curve per final filter size of 178, 133, 87, and 32.)

With a final filter size of 178 (blue line), the success rates are superior to those of any other filter size for every kernel size tested, since it contains all the input information. Especially for small kernel sizes (h ≤ 2), success rates above 96% are observed. Moreover, note that even after reduction of the state information (red line), the system still produces acceptable success rates for kernel sizes ranging from 0.5 to 2 (around 90% success rates).

Among the best performing kernel sizes, we favor the largest one, since it provides better generalization guarantees. In this sense, a kernel size h = 2 can be selected, since this is the largest kernel size that considerably reduces the filter size and still yields a neural state to action mapping that performs well (around 90% success rates). In the case of kernel size h = 2 with a final filter size of 178, the system reaches 100% success rates after 6 epochs with a maximum variance of 4%. As we can see from the number of units, higher representation capacity is required to obtain the desired performance as the task becomes more complex. Nevertheless, results on the 8-target center-out reaching task show that the method can effectively learn the brain state-action mapping for this task with reasonable complexity.

7.1.4. Results on Multistep Tasks. Here, we develop a more realistic scenario: we extend the task to multistep and multitarget experiments. This case allows us to explore the role of the eligibility traces in Q-KTD(λ). The price paid for this extension is that the eligibility trace rate λ now needs to be selected according to the best observed performance.

Figure 12: Reward distribution for the right target. The black diamond is the initial position and the purple diamonds show the possible directions, including the assigned target direction (red diamond).

Testing based on the same experimental set-up employed for the single step task, that is, a discrete reward value assigned only at the target, causes extremely slow learning, since not enough guidance is given; the system requires long periods of exploration until it actually reaches the target. Therefore, we employ a continuous reward distribution around the selected target, defined by the following expression:

\[
r(s) =
\begin{cases}
p_{\text{reward}}\, G(s), & \text{if } G(s) > 0.1,\\
n_{\text{reward}}, & \text{if } G(s) \le 0.1,
\end{cases} \tag{30}
\]

where G(s) = exp[−(s − μ)ᵀ C_θ⁻¹ (s − μ)], s ∈ R² is the position of the cursor, p_reward = 1, and n_reward = −0.6. The mean vector μ corresponds to the selected target location, and the covariance matrix is

\[
C_{\theta} = R_{\theta}
\begin{pmatrix} 7.5 & 0 \\ 0 & 0.1 \end{pmatrix}
R_{\theta}^{\top},
\qquad
R_{\theta} =
\begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix}, \tag{31}
\]

which depends on the angle θ of the selected target as follows: for target indices one and five the angle is 0, for two and six it is −π/4, for three and seven it is π/2, and for four and eight it is π/4 (here the target indexes follow the locations depicted in Figure 6 of [22]). Figure 12 shows the reward distribution for target index one. The same form of distribution is applied to the other directions, centered at the assigned target point.
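
The following sketch evaluates the continuous reward of (30)-(31) for a 2D cursor position (Python/NumPy; the example target coordinates at the end are purely illustrative and not taken from the experiment).

import numpy as np

def continuous_reward(s, target, theta, p_reward=1.0, n_reward=-0.6):
    """Continuous reward of (30)-(31): a rotated Gaussian bump centered at the target."""
    R = np.array([[np.cos(theta),  np.sin(theta)],
                  [-np.sin(theta), np.cos(theta)]])
    C = R @ np.diag([7.5, 0.1]) @ R.T              # covariance elongated toward the target
    d = np.asarray(s, dtype=float) - np.asarray(target, dtype=float)
    G = np.exp(-d @ np.linalg.solve(C, d))         # G(s) = exp[-(s - mu)^T C^{-1} (s - mu)]
    return p_reward * G if G > 0.1 else n_reward

# Purely illustrative call for the right target (theta = 0); coordinates are made up.
# print(continuous_reward(s=[3.0, 23.0], target=[4.0, 23.0], theta=0.0))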

Once the system reaches the assigned target, it earns a maximum reward of +1; it receives partial rewards according to (30) during the approaching stage. When the system earns the maximum reward, the trial is classified as a successful trial. The maximum number of steps per trial is limited such that the cursor must approach the target in a straight line trajectory. Here, we also control the complexity of the task by allowing different numbers of targets and steps; namely, 2-step 4-target (right, up, left, and down) and 4-step 3-target (right, up, and down) experiments are performed. Increasing the number of steps per trial amounts to making smaller jumps with each action. After each epoch, the number of successful trials is counted for each target direction. Figure 13 shows the learning curves for each target and the average success rates.

A larger number of steps results in lower success rates. However, both cases (two and four steps) obtain an average success rate above 60% after 1 epoch. The performances show that all directions can achieve success rates above 70% after convergence, which encourages the application of the algorithm to online scenarios.

7.2. Closed-Loop RLBMI Experiments. In closed-loop RLBMI experiments, the behavioral task is a reaching task using a robotic arm. The decoder controls the robot arm's action direction by predicting the monkey's intent based on its neuronal activity. If the robot arm reaches the assigned target, a reward is given to both the monkey (food reward) and the decoder (positive value). Notice that the two intelligent systems learn coadaptively to accomplish the goal. These experiments are conducted in cooperation with the Neuroprosthetics Research Group at the University of Miami. The performance is evaluated in terms of task completion accuracy and speed. Furthermore, we provide a methodology to tease apart the influence of each one of the systems of the RLBMI on the overall performance.

7.2.1. Environment. During pretraining, a marmoset monkey was trained to perform a target reaching task, namely, moving a robot arm to two spatial locations, denoted as A trial and B trial. The monkey was taught to associate changes in motor activity with A trials and to produce static motor responses during B trials. Once a target is assigned, a beep signals the start of the trial. To control the robot during the user training phase, the monkey is required to steadily place its hand on a touch pad for 700–1200 ms. This action produces a go beep that is followed by the activation of one of the two target LEDs (A trial, red light for the left direction; B trial, green light for the right direction). The robot arm goes to a home position, namely, the center position between the two targets. Its gripper shows an object (a food reward such as a waxworm or marshmallow for A trials, or an undesirable object, a wooden bead, for B trials). To move the robot to the A location, the monkey needed to reach out and touch a sensor within 2000 ms; to make the robot reach the B target, the monkey needed to keep its arm motionless on the touch pad for 2500 ms. When the monkey successfully moved the robot to the correct target, the target LEDs would blink and the monkey would receive a food reward (for both the A and B targets).

After the monkey is trained to perform the assigned task properly, a microelectrode array (16-channel tungsten microelectrode arrays, Tucker Davis Technologies, FL) is surgically implanted under isoflurane anesthesia and sterile conditions. Neural states from the motor cortex (M1) are recorded; these neural states become the inputs to the neural decoder. All surgical and animal care procedures were consistent with the National Research Council Guide for the Care and Use of Laboratory Animals and were approved by the University of Miami Institutional Animal Care and Use Committee.

Figure 13: The learning curves for multistep multitarget tasks: (a) 2-step 4-target; (b) 4-step 3-target. (Axes: epochs from 0 to 10 versus success rates; curves show the average and each target direction.)

In the closed-loop experiments, after the initial holding time that produces the go beep, the robotic arm's position is updated based solely on the monkey's neural states. Differently from the user pretraining sessions, the monkey is not required to perform any movement. During the real-time experiment, 14 neurons are obtained from 10 electrodes. The neural states are represented by the firing rates on a 2 sec window following the go signal.

7.2.2. Agent. For the BMI decoder, we use Q-learning via kernel temporal differences (Q-KTD)(λ). One big difference between open-loop and closed-loop applications is the amount of accessible data: in the closed-loop experiments, we can only get information about the neural states that have been observed up to the present. In the previous offline experiments, however, normalization and kernel selection were conducted offline based on the entire data set. It is not possible to apply the same method to the online setting, since we only have information about the input states up to the present time. Normalization is a scaling procedure that interplays with the choice of the kernel size; proper selection of the kernel size brings proper scaling to the data. Thus, in contrast to the previous open-loop experiments, normalization of the input neural states is not applied, and the kernel size is automatically selected given the inputs.

The Gaussian kernel (28) is employed, and the kernel size h is automatically selected based on the history of inputs. Note that in the closed-loop experiments the dynamic range of the states varies from experiment to experiment. Consequently, the kernel size needs to be readjusted each time a new experiment takes place, and it cannot be determined beforehand. At each time, the distances between the current state and the previously observed states are computed to obtain the output values, Q in this case. Therefore, we use these distance values to select the kernel size as follows:

\[
h_{\text{temp}}(n) = \sqrt{\frac{1}{2(n-1)} \sum_{i=1}^{n-1} \bigl\| x(i) - x(n) \bigr\|^{2}},
\qquad
h(n) = \frac{1}{n}\Biggl[\sum_{i=1}^{n-1} h(i) + h_{\text{temp}}(n)\Biggr]. \tag{32}
\]

Using the squared distances between the current input and the previously seen input states, we obtain an estimate of the mean distance; this value is then averaged along with the past kernel sizes to obtain the current kernel size.
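
A minimal sketch of this running kernel size selection (32), assuming the states arrive one at a time as NumPy vectors (the default size used before any history exists is an arbitrary assumption):

import numpy as np

class OnlineKernelSize:
    """Running kernel-size selection following (32)."""
    def __init__(self):
        self.states = []        # x(1), x(2), ...
        self.sizes = []         # h(1), h(2), ...

    def update(self, x):
        x = np.asarray(x, dtype=float)
        if self.states:
            sq_dists = [np.sum((xi - x) ** 2) for xi in self.states]
            h_temp = np.sqrt(np.mean(sq_dists) / 2.0)            # h_temp(n) in (32)
            h_n = (sum(self.sizes) + h_temp) / (len(self.states) + 1)
        else:
            h_n = 1.0           # assumed default before any history is available
        self.states.append(x)
        self.sizes.append(h_n)
        return h_n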

Moreover, we consider γ = 1 and λ = 0, since our experiments perform single step trials. A step size η = 0.5 is applied. The output represents the 2 possible directions (left and right), and the robot arm moves based on the estimated output from the decoder.
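
For illustration only, the sketch below shows a single-step kernel Q update consistent with the settings just described (γ = 1, λ = 0, two actions, Gaussian kernel): the Q function for each action is kept as a kernel expansion, and each trial appends one unit weighted by η times the TD error. The exact Q-KTD(λ) update is the one defined earlier in the paper; this is only a simplified stand-in with hypothetical names.

import numpy as np

class SingleStepKernelQ:
    def __init__(self, eta=0.5, h=1.0, n_actions=2):
        self.eta, self.h = eta, h
        self.centers = []                                  # stored neural states
        self.coeffs = [[] for _ in range(n_actions)]       # one coefficient list per action

    def kernel(self, x, y):
        return np.exp(-np.sum((x - y) ** 2) / (2.0 * self.h ** 2))   # assumed Gaussian form

    def q_values(self, x):
        k = [self.kernel(c, x) for c in self.centers]
        return [float(np.dot(a_coeffs, k)) for a_coeffs in self.coeffs]

    def update(self, x, action, reward):
        # Single-step trial: the TD error reduces to the reward minus the current estimate.
        delta = reward - (self.q_values(x)[action] if self.centers else 0.0)
        self.centers.append(np.asarray(x, dtype=float))
        for a, a_coeffs in enumerate(self.coeffs):
            a_coeffs.append(self.eta * delta if a == action else 0.0)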

7.2.3. Results. The overall performance is evaluated by checking whether the robot arm reaches the assigned target. Once the robot arm reaches the target, the decoder gets a positive reward +1; otherwise, it receives a negative reward −1.

Table 2 shows the decoder performance over 4 days in terms of success rates. Each day corresponds to a separate experiment. On Day 1, the experiment has a total of 20 trials (10 A trials and 10 B trials). The overall success rate was 90%; only the first trial for each target was incorrectly assigned.


Figure 14: Performance of Q-learning via KTD in the closed-loop RLBMI controlled by a monkey for Day 1 (left) and Day 3 (right): the success (+1) and failure (−1) index of each trial (top), the change of the TD error (middle), and the change of the Q values (bottom). (Axes: trial numbers versus index, TD error, and Q value; A trials and B trials are plotted separately.)

Table 2: Success rates of Q-KTD in closed-loop RLBMI.

         Total trial numbers (total A, B trials)    Success rates (%)
Day 1    20 (10, 10)                                90.00
Day 2    32 (26, 26)                                84.38
Day 3    53 (37, 36)                                77.36
Day 4    52 (37, 35)                                78.85

Note that on each day the same experimental setup was utilized, and the decoder was initialized in the same way; we did not use pretrained parameters to initialize the system. To understand the variation of the success rates across days, we look at the performance of Day 1 and Day 3. Figure 14 shows the decoder performance for these 2 experiments.

Although the success rate for Day 3 is not as high as for Day 1, both experiments show that the algorithm learns an appropriate neural state to action map. Even though there is variation among the neural states within each day, the decoder adapts well to minimize the TD error, and the Q values converge to the desired values for each action. Because this is a single step task and the reward +1 is assigned for a successful trial, the estimated action value Q is expected to approach +1.

It is observed that the TD error and Q values oscillate; the drastic changes in the TD error or Q value correspond to the missed trials. The overall performance can be evaluated by checking whether the robot arm reaches the desired target (the top plots in Figure 14). However, this assessment does not show what causes the change in the system values. In addition, it is hard to know how the two separate intelligent systems interact during learning and how the neural states affect the overall performance.

Figure 15: Estimated policy for the projected neural states from Day 1 (left column: (a) after 3 trials, (c) after 10 trials, (e) after 20 trials) and Day 3 (right column: (b) after 3 trials, (d) after 30 trials, (f) after 57 trials). The failed trials during the closed-loop experiment are marked as red stars (missed A trials) and green dots (missed B trials). (Axes: first versus second principal component.)

Under the coadaptation scenario in the RLBMI architecture, it is obvious that if one system does not perform properly, it will cause detrimental effects on the performance of the other system. If the BMI decoder does not give proper updates to the robotic device, it will confuse the user conducting the task; and if the user gives improper state information or the translation is wrong, the resulting update may fail even though the BMI decoder was able to find the optimal mapping function.

Using the methodology introduced in [36], we can observe how the decoder effectively learns a good state to action mapping and how the neural states affect the prediction performance. Figure 15 shows how each participant (the agent and the user) influences the overall performance in both successful and missed trials, and how the agent adapts to the environment. By applying principal component analysis (PCA), the high-dimensional neural states can be visualized in two dimensions using the two largest principal components. In this two-dimensional space of projected neural states, we can visualize the estimated policy as well.
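
A minimal sketch of this visualization step (Python/NumPy; decoder is assumed to be any object exposing a q_values(state) method, such as the illustrative class sketched earlier, and states is the list of recorded neural state vectors):

import numpy as np

def project_and_label(states, decoder):
    """Project neural states onto their two largest principal components and
    attach the decoder's greedy action for each state (e.g., 0 = A/left, 1 = B/right)."""
    X = np.asarray(states, dtype=float)
    Xc = X - X.mean(axis=0)                        # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    projected = Xc @ Vt[:2].T                      # first two principal components
    actions = [int(np.argmax(decoder.q_values(x))) for x in X]
    return projected, actions

# To draw the estimated policy in this 2D view, grid points in the projected space can be
# mapped back through Vt[:2] to the original state space and fed to the decoder.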

We observe the behavior of the two systems at the beginning, intermediate, and final stages of the experiment by using the neural states that have been observed, as well as the decoder learned, up to the given stage. It is evident that the decoder can predict nonlinear policies. Day 1 (left column in Figure 15) shows that the neural states from the two classes are well separable. It was noted during Day 3 that the monkey seemed less engaged in the task than in Day 1. This suggests the possibility that during some trials the monkey was distracted and may not have been producing a consistent set of neural outputs. We are also able to see this phenomenon in the plots (right column in Figure 15): most of the neural states that were misclassified appear to be closer to the states corresponding to the opposite target in the projected state space. However, the estimated policy shows that the system effectively learns. Note that the initially misclassified A trials (red stars in Figure 15(d), which are located near the estimated policy boundary) are assigned to the correct direction once learning has been accomplished (Figure 15(f)). It is a remarkable fact that the system adapts to the environment online.

8. Conclusions

The advantages of KTD(λ) in neural decoding problems were observed. The key properties of this kernel-based learning algorithm are its capability for nonlinear function approximation and its convergence guarantees. We also examined the capability of the extended KTD algorithm (Q-KTD(λ)) to perform reaching tasks in both open-loop and closed-loop reinforcement learning brain machine interface (RLBMI) experiments.

In open-loop experiments, results showed that Q-KTD(λ) can effectively learn the brain state-action mapping and offers performance advantages over conventional nonlinear function approximation methods such as time-delay neural nets. We observed that Q-KTD(λ) overcomes the main issues of conventional nonlinear function approximation methods, such as local minima and the need for proper initialization.

Results on closed-loop RLBMI experiments showed that the algorithm succeeds in finding a proper mapping between neural states and desired actions. Its advantages are that it depends neither on initialization nor on any prior information about the input states; in addition, parameters can be chosen on the fly based on the observed input states. Moreover, we observed how the two intelligent systems coadaptively learn in an online reaching task. The results showed that KTD is powerful for practical applications due to its nonlinear approximation capabilities in online learning.

The observation and analysis of KTD(λ) give us a basic idea of how this algorithm behaves. However, in the case of Q-KTD(λ), the convergence analysis remains challenging, since Q-learning contains both a learning policy and a greedy policy. For Q-KTD(λ), the convergence proof for Q-learning using temporal difference (TD)(λ) with linear function approximation in [37] can provide basic intuition about the role of function approximation in the convergence of Q-learning.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is partially supported by DARPA Contract N66001-10-C-2008. The authors would like to thank Pratik Chhatbar and Brandi Marsh for collecting the center-out reaching task data for the open loop experiments.

References

[1] J. DiGiovanna, B. Mahmoudi, J. Fortes, J. C. Principe, and J. C. Sanchez, "Coadaptive brain-machine interface via reinforcement learning," IEEE Transactions on Biomedical Engineering, vol. 56, no. 1, pp. 54–64, 2009.

[2] B. Mahmoudi, Integrating robotic action with biologic perception: a brain machine symbiosis theory [Ph.D. dissertation], University of Florida, Gainesville, Fla, USA, 2010.

[3] E. A. Pohlmeyer, B. Mahmoudi, S. Geng, N. W. Prins, and J. C. Sanchez, "Using reinforcement learning to provide stable brain-machine interface control despite neural input reorganization," PLoS ONE, vol. 9, no. 1, Article ID e87253, 2014.

[4] S. Matsuzaki, Y. Shiina, and Y. Wada, "Adaptive classification for brain machine interface with reinforcement learning," in Proceedings of the 18th International Conference on Neural Information Processing, vol. 7062, pp. 360–369, Shanghai, China, November 2011.

[5] M. J. Bryan, S. A. Martin, W. Cheung, and R. P. N. Rao, "Probabilistic co-adaptive brain-computer interfacing," Journal of Neural Engineering, vol. 10, no. 6, Article ID 066008, 2013.

[6] R. S. Sutton, "Learning to predict by the methods of temporal differences," Machine Learning, vol. 3, no. 1, pp. 9–44, 1988.

[7] J. A. Boyan, Learning evaluation functions for global optimization [Ph.D. dissertation], Carnegie Mellon University, 1998.

[8] S. J. Bradtke and A. G. Barto, "Linear least-squares algorithms for temporal difference learning," Machine Learning, vol. 22, pp. 33–57, 1996.

[9] A. Geramifard, M. Bowling, M. Zinkevich, and R. S. Sutton, "iLSTD: eligibility traces and convergence analysis," in Advances in Neural Information Processing Systems, pp. 441–448, 2007.

[10] R. S. Sutton, C. Szepesvari, and H. R. Maei, "A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation," in Proceedings of the 22nd Annual Conference on Neural Information Processing Systems (NIPS '08), pp. 1609–1616, MIT Press, December 2008.

[11] R. S. Sutton, H. R. Maei, D. Precup, et al., "Fast gradient-descent methods for temporal-difference learning with linear function approximation," in Proceedings of the 26th International Conference on Machine Learning (ICML '09), pp. 993–1000, June 2009.

[12] J. N. Tsitsiklis and B. Van Roy, "An analysis of temporal-difference learning with function approximation," IEEE Transactions on Automatic Control, vol. 42, no. 5, pp. 674–690, 1997.

[13] S. Haykin, Neural Networks and Learning Machines, Prentice Hall, 2009.

[14] B. Scholkopf and A. J. Smola, Learning with Kernels, MIT Press, 2002.

[15] Y. Engel, Algorithms and representations for reinforcement learning [Ph.D. dissertation], Hebrew University, 2005.

[16] X. Xu, T. Xie, D. Hu, and X. Lu, "Kernel least-squares temporal difference learning," International Journal of Information Technology, vol. 11, no. 9, pp. 54–63, 2005.

[17] J. Bae, P. Chhatbar, J. T. Francis, J. C. Sanchez, and J. C. Principe, "Reinforcement learning via kernel temporal difference," in Proceedings of the 33rd Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC '11), pp. 5662–5665, 2011.

[18] S. Zhao, From fixed to adaptive budget robust kernel adaptive filtering [Ph.D. dissertation], University of Florida, Gainesville, Fla, USA, 2012.

[19] Y. Engel, S. Mannor, and R. Meir, "The kernel recursive least-squares algorithm," IEEE Transactions on Signal Processing, vol. 52, no. 8, pp. 2275–2285, 2004.

[20] X. Xu, "A sparse kernel-based least-squares temporal difference algorithm for reinforcement learning," in Proceedings of the 2nd International Conference on Natural Computation, vol. 4221, pp. 47–56, 2006.

[21] B. Chen, S. Zhao, P. Zhu, and J. C. Principe, "Quantized kernel least mean square algorithm," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 1, pp. 22–32, 2012.

[22] J. Bae, L. S. Giraldo, P. Chhatbar, J. T. Francis, J. C. Sanchez, and J. C. Principe, "Stochastic kernel temporal difference for reinforcement learning," in Proceedings of the 21st IEEE International Workshop on Machine Learning for Signal Processing (MLSP '11), pp. 1–6, IEEE, September 2011.

[23] X. Chen, Y. Gao, and R. Wang, "Online selective kernel-based temporal difference learning," IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 12, pp. 1944–1956, 2013.

[24] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, New York, NY, USA, 1998.

[25] W. Liu, J. C. Principe, and S. Haykin, Kernel Adaptive Filtering: A Comprehensive Introduction, Wiley, 2010.

[26] J. Mercer, "Functions of positive and negative type and their connection with the theory of integral equations," Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 209, pp. 415–446, 1909.

[27] P. Dayan and T. J. Sejnowski, "TD(λ) converges with probability 1," Machine Learning, vol. 14, no. 3, pp. 295–301, 1994.

[28] H. J. Kushner and D. S. Clark, Stochastic Approximation Methods for Constrained and Unconstrained Systems, vol. 26 of Applied Mathematical Sciences, Springer, New York, NY, USA, 1978.

[29] C. J. C. H. Watkins, Learning from delayed rewards [Ph.D. dissertation], King's College, London, UK, 1989.

[30] C. Szepesvari, Algorithms for Reinforcement Learning, edited by R. J. Brachman and T. Dietterich, Morgan & Claypool, 2010.

[31] S. Zhao, B. Chen, P. Zhu, and J. C. Principe, "Fixed budget quantized kernel least-mean-square algorithm," Signal Processing, vol. 93, no. 9, pp. 2759–2770, 2013.

[32] W. Liu, I. Park, and J. C. Principe, "An information theoretic approach of designing sparse kernel adaptive filters," IEEE Transactions on Neural Networks, vol. 20, no. 12, pp. 1950–1961, 2009.

[33] J. A. Boyan, "Technical update: least-squares temporal difference learning," Machine Learning, vol. 49, pp. 233–246, 2002.

[34] C. J. C. H. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.

[35] J. C. Sanchez, A. Tarigoppula, J. S. Choi, et al., "Control of a center-out reaching task using a reinforcement learning Brain-Machine Interface," in Proceedings of the 5th International IEEE/EMBS Conference on Neural Engineering (NER '11), pp. 525–528, May 2011.

[36] J. Bae, L. G. Sanchez Giraldo, E. A. Pohlmeyer, J. C. Sanchez, and J. C. Principe, "A new method of concurrently visualizing states, values, and actions in reinforcement based brain machine interfaces," in Proceedings of the 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC '13), pp. 5402–5405, July 2013.

[37] F. S. Melo, S. P. Meyn, and M. I. Ribeiro, "An analysis of reinforcement learning with function approximation," in Proceedings of the 25th International Conference on Machine Learning, pp. 664–671, July 2008.

Computational Intelligence and Neuroscience 9

1205780 = 01

1205780 = 02

1205780 = 03

1205780 = 04

1205780 = 05

1205780 = 06

1205780 = 07

1205780 = 08

1205780 = 09

0 02 04 06 08 1

120582

RMS

erro

r of v

alue

func

tion

over

all s

tate

s

05

045

04

035

03

025

02

015

01

005

0

Figure 7 Performance comparison over different combinations of120582 and the initial stepsize 120578

0in KTD(120582) with ℎ = 02 The plotted

segment is the mean RMS value after 100 trials (top segment) 500trials (middle segment) and 1000 trials (bottom segment)

0

1

2

3

Trial number

RMS

erro

r of v

alue

func

tion

over

all s

tate

s

KTDGPTDTD

25

15

05

101 102 103

Figure 8 Learning curves of KTD(120582) GPTD and TD(120582)The solidlines show the mean RMS error and the dashed lines represent the(+minus) standard deviation over 50Monte Carlo runs

expected return and actual return as a Gaussian processThis assumption does not hold true for the Markov chain inFigure 6 As we can observe in Figure 8 KTD(120582) reaches tothe mean value around 007 and the mean value of GPTDand TD(120582) are around 02 and 18 respectively

In the synthetic examples we presented experimentalresults to approximate the state value function under a fixedpolicy We observed that KTD(120582) performs well on bothlinear and nonlinear function approximation problems Inaddition in the previous section we showed how the linearindependence of the input state representations can affectthe performance of algorithms The use of strictly positivedefinite kernels in KTD(120582) implies the linear independencecondition and thus this algorithm converges for all 120582 isin [0 1]In the following section we will apply the extended KTDalgorithm to estimate the action value function which can beemployed in finding a proper control policy for RLBMI tasks

7 Experimental Results on Neural Decoding

In our RLBMI experiments we map the monkeyrsquos neuralsignal to action-directions (computer cursorrobot arm posi-tion) The agent starts at a naive state but the subject hasbeen trained to receive rewards from the environment Onceit reaches the assigned target the system and the subjectearn a reward and the agent updates its neural state decoderThrough iteration the agent learns how to correctly translateneural states into action-directions

71 Open-Loop RLBMI In open-loop RLBMI experimentsthe output of the agent does not directly change the stateof the environment because this is done with prerecordeddata The external device is updated based only on the actualmonkeyrsquos physical response In this sense we only considerthe monkeyrsquos neural state from successful trials to train theagentThe goal of these experiments is to evaluate the systemrsquoscapability to predict the proper state to actionmapping basedon the monkeyrsquos neural states and to assess the viability offurther closed-loop experiments

711 Environment The data employed in these experimentsis provided by SUNY Downstate Medical Center A femalebonnet macaque is trained for a center-out reaching taskallowing 8 action-directions After the subject attains about80 success rate microelectrode arrays are implanted inthe motor cortex (M1) Animal surgery is performed underthe Institutional Animal Care and Use Committee (IACUC)regulations and assisted by theDivision of LaboratoryAnimalResources (DLAT) at SUNY Downstate Medical Center

From 96-channel recordings a set of 185 units areobtained after sorting The neural states are represented bythe firing rates of each unit on 100ms window There is a setof 8 possible targets and action directions Every trial startsat the center point and the distance from the center to eachtarget is 4 cm anythingwithin a radius of 1 cm from the targetpoint is considered as a valid reach

712 Agent In the agent 119876-learning via kernel temporaldifference (119876-KTD)(120582) is applied to neural decoding For 119876-KTD(120582) we employ theGaussian kernel (28) After the neuralstates are preprocessed by normalizing their dynamic rangeto lie between minus1 and 1 they are input to the system Basedon the preprocessed neural states the system predicts which

10 Computational Intelligence and Neuroscience

Table 1 Average success rates of 119876-KTD in open-loop RLBMI

Epochs 1 2 3 4 5 6 72 target 044 096 099 099 097 099 0994 target 041 073 076 095 099 099 0998 target 032 065 079 089 096 098 098

direction the computer cursor will move Each output unitrepresents one of the 8 possible directions and among the 8outputs one action is selected by the 120598-greedy method [34]The action corresponding to the unit with the highest119876 valuegets selected with probability 1 minus 120598 Otherwise any otheraction is selected at randomThe performance is evaluated bychecking whether the updated position reaches the assignedtarget and depending on the updated position a reward valueis assigned to the system

713 Results on Single Step Tasks Here the targets should bereached within a single step rewards from the environmentare received after a single step and one action is performedby the agent per trial The assignment of reward is based onthe 1-0 distance to the target that is dist(119909 119889) = 0 if 119909 = 119889and dist(119909 119889) = 1 otherwise Once the cursor reaches theassigned target the agent gets a positive reward +06 else itreceives negative reward minus06 [35] Exploration rate 120598 = 001and discount factor 120574 = 09 are applied Also we consider 120582 =0 since our experiment performs single step updates per trialIn this experiment the firing rates of the 185 units on 100mswindows are time-embedded using 6th order tap delay Thiscreates a representation spacewhere each state is a vectorwith1295 dimensions

We start with the simplest version of the problem byconsidering only 2-targets (right and left) The total numberof trials is 43 for the 2-targets For 119876-KTD the kernel size ℎis heuristically chosen based on the distribution of the meansquared distances between pairs of input states let 119904 = 119864[119909

119894minus

1199091198952] then ℎ = radic1199042 For this particular data set the above

heuristic gives a kernel size ℎ = 7 The stepsize 120578 = 03 isselected based on the stability bound that was derived for thekernel least mean square algorithm [25]

120578 lt119873

tr [119866120601]=

119873

sum119873

119895=1120581 (119909 (119895) 119909 (119895))

= 1 (29)

where 119866120601is the gram matrix After 43 trials we count the

number of trials which received a positive reward and thesuccess rate is averaged over 50 Monte Carlo runs Theperformance of the 119876-KTD algorithm is compared with 119876-learning via time delayed neural net (119876-TDNN) and theonline selective kernel-based temporal difference learningalgorithm (119876-OSKTD) [23] in Figure 9 Note that TDNNis a conventional approach to function approximation andhas already been applied to RLBMI experiments for neuraldecoding [1 2] OSKTD is a kernel-based temporal differencealgorithm emphasizing on the online sparsifications

Both 119876-KTD and 119876-OSKTD reach around 100 successrate after 2 epochs In contrast the average success rateof 119876-TDNN slowly increases yet never reaches the same

0 5 10 15 200

01

02

03

04

05

06

07

08

09

1

Epochs

Succ

ess r

ates

Q-TDNNQ-OSKTDQ-KTD

Figure 9 The comparison of average learning curves from 50

Monte Carlo runs among 119876-TDNN 119876-OSKTD and 119876-KTD Solidlines show the mean success rates and the dashed lines show theconfidence interval based on one standard deviation

performance as 119876-KTD In the case of 119876-OSKTD the valuefunction updates require one more parameter 120583

2to decide

the subspace To validate the algorithmrsquos capability to estimateproper policy we set the sparsified dictionary as the samesize as the number of sample observations In 119876-OSKTDwe observed that the subspace selection parameter plays animportant role in terms of the speed of learning It turns outthat for the above experiment smaller subspaces allow fasterlearning In the extreme case of 119876-OSKTD where only thecurrent state is affected the updates become equivalent to theupdate rule of 119876-KTD

Since all the experimental parameters are fixed over 50Monte Carlo runs the confidence interval for 119876-KTD canbe simply associated with the random effects introducedby the 120598-greedy method employed for action selection withexploration thus the narrow interval However with 119876-TDNN a larger variation of performance is observed whichshows how the initialization due to local minima influencesthe success of learning it is observed that 119876-TDNN is ableto approximate the 119876-KTD performance but most of thetimes the system falls on local minima This highlights oneof the advantages of KTD compared to TDNN which is theinsensitivity to initialization

Table 1 shows average success rates over 50 Monte Carloruns with respect to different number of targets The first

Computational Intelligence and Neuroscience 11

0 5 10 15 20 25 30 35 40 450

1

Final filter size

Aver

age s

ucce

ss ra

tes

01

02

03

04

05

06

07

08

09

Figure 10 Average success rates over 50 Monte Carlo runs withrespect to different filter sizes The vertical line segments are themean success rates after 1 epoch (bottommarkers) 2 epochs (middlemarkers) and 20 epochs (top markers)

row corresponds to the mean success rates displayed onFigure 9 (red solid line)This is included in the Table 1 to easecomparisonwith 4 and 8-target experimentsThe 4 target taskinvolves reaching right up left and down positions from thecenter Note that in all tasks 8 directions are allowed at eachstep The standard deviation of each epoch is around 002

One characteristic of nonparametric approaches is thegrowing filter structure Here we observe how filter sizeinfluences the overall performance in 119876-KTD by applyingSurprise criterion [32] and Quantization [21] methods Inthe case of the 2-target center-out reaching task we shouldexpect the filter size to become as large as 861 units after20 epochs without any control of the filter size Using theSurprise criterion the filter size can be reduced to 87 centerswith acceptable performance However Quantization allowsthe filter size to be reduced to 10 units while maintainingperformance above 90 for success rates Figure 10 showsthe effect of filter size in the 2-target experiment usingthe Quantization approach For filter sizes as small as 10units the average success rates remain stable With 10 unitsthe algorithm shows similar learning speed to the linearlygrowing filter size with success rates above 90 Note thatquantization limits the capacity of the kernel filter since lessunits than samples are employed and thus it helps to avoidover-fitting

In the 2-target center-out reaching task quantized 119876-KTD shows satisfactory results in terms of initialization andcomputational cost Further analysis of 119876-KTD is conductedon a larger number of targets We increase the number oftargets from 2 to 8 All experimental parameters are keptthe same as for the 2-target experiment The only change isstep-size 120578 = 05 The 178 trials are applied for the 8-targetreaching task

To gain more insight about the algorithm we observethe interplay between Quantization size 120598

119880and kernel size ℎ

Based on the distribution of squared distances between pairs

0 1 2 3 4 5 6 70

1

Kernel sizes

Succ

ess r

ates

01

02

03

04

05

06

07

08

09

Final filter size = 178

Final filter size = 133

Final filter size = 87

Final filter size = 32

Figure 11 The effect of filter size control on 8-target single-stepcenter-out reaching task The average success rates are computedover 50Monte Carlo runs after the 10th epoch

of input states various kernel sizes (ℎ = 05 1 15 2 3 5 7)andQuantization sizes (120598

119880= 1 110 120 130) are considered

The corresponding success rates for final filter sizes of 178133 87 and 32 are displayed in Figure 11

With a final filter size of 178 (blue line) the success ratesare superior to any other filter sizes for every kernel sizestested since it contains all input information Especially forsmall kernel sizes (ℎ le 2) success rates above 96 areobservedMoreover note that even after reduction of the stateinformation (red line) the system still produces acceptablesuccess rates for kernel sizes ranging from 05 to 2 (around90 success rates)

Among the best performing kernel sizes we favor thelargest one since it provides better generalization guaranteesIn this sense a kernel size ℎ = 2 can be selected since this isthe largest kernel size that considerably reduces the filter sizeand yields a neural state to actionmapping that performs well(around 90 of success rates) In the case of kernel size ℎ = 2with final filter size of 178 the system reaches 100 successrates after 6 epochs with a maximum variance of 4 Aswe can see from the number of units higher representationcapacity is required to obtain the desired performance as thetask becomes more complex Nevertheless results on the 8-target center-out reaching task show that the method caneffectively learn the brain state-action mapping for this taskwith a reasonable complexity

714 Results on Multistep Tasks Here we develop a morerealistic scenario we extend the task to multistep and mul-titarget experiments This case allows us to explore the roleof the eligibility traces in 119876-KTD(120582) The price paid for thisextension is that now the eligibility trace rate 120582 selectionneeds to be carried out according to the best observedperformance Testing based on the same experimental set

12 Computational Intelligence and Neuroscience

0 1 2 3 4 518

19

20

21

22

23

24

25

26

27

28

02

03

04

05

06

07

08

09

minus1minus2minus3minus4minus5

X

Y

minus06

Figure 12 Reward distribution for right target The black diamondis the initial position and the purple diamond shows the possibledirections including the assigned target direction (red diamond)

up employed for the single step task that is a discretereward value is assigned at the target causes extremely slowlearning since not enough guidance is given The systemrequires long periods of exploration until it actually reachesthe target Therefore we employ a continuous reward distri-bution around the selected target defined by the followingexpression

119903 (119904) =

119901reward119866 (119904) if 119866 (119904) gt 01

119899reward if 119866 (119904) le 01(30)

where119866(119904) = exp[(119904minus120583)⊤Cminus1120579(119904minus120583)] 119904 isin R2 is the position of

the cursor 119901reward = 1 and 119899reward = minus06 The mean vector 120583corresponds to the selected target location and the covariancematrix

C120579= R120579(75 0

0 01)R⊤120579 R

120579= (

cos 120579 sin 120579minus sin 120579 cos 120579

) (31)

which depends on the angle 120579 of the selected target as followsfor target index one and five the angle is 0 two and six are forminus1205874 three and seven are for 1205872 and four and eight are for1205874 (Here the target indexes follow the location depicted onFigure 6 in [22]) Figure 12 shows the reward distribution fortarget index one The same form of distribution is applied tothe other directions centred at the assigned target point

Once the system reaches the assigned target, it earns the maximum reward of +1; during the approaching stage, it receives partial rewards according to (30). When the system earns the maximum reward, the trial is classified as successful. The maximum number of steps per trial is limited such that the cursor must approach the target in a straight-line trajectory. Here we also control the complexity of the task by allowing different numbers of targets and steps. Namely, 2-step 4-target (right, up, left, and down) and 4-step

3-target (right, up, and down) experiments are performed. Increasing the number of steps per trial amounts to making smaller jumps for each action. After each epoch, the number of successful trials is counted for each target direction. Figure 13 shows the learning curves for each target and the average success rates.

A larger number of steps results in lower success rates. However, both cases (two and four steps) obtain an average success rate above 60% after the first epoch. The performances show that all directions can achieve success rates above 70% after convergence, which encourages the application of the algorithm to online scenarios.

7.2. Closed-Loop RLBMI Experiments. In closed-loop RLBMI experiments, the behavioral task is a reaching task using a robotic arm. The decoder controls the robot arm's direction of action by predicting the monkey's intent based on its neuronal activity. If the robot arm reaches the assigned target, a reward is given to both the monkey (a food reward) and the decoder (a positive value). Notice that the two intelligent systems learn coadaptively to accomplish the goal. These experiments are conducted in cooperation with the Neuroprosthetics Research Group at the University of Miami. The performance is evaluated in terms of task completion accuracy and speed. Furthermore, we provide a methodology to tease apart the influence of each of the RLBMI's systems on the overall performance.

7.2.1. Environment. During pretraining, a marmoset monkey was trained to perform a target reaching task, namely, moving a robot arm to two spatial locations denoted as A trial and B trial. The monkey was taught to associate changes in motor activity with A trials and to produce static motor responses during B trials. Once a target is assigned, a beep signals the start of the trial. To control the robot during the user training phase, the monkey is required to steadily place its hand on a touch pad for 700∼1200 ms. This action produces a go beep that is followed by the activation of one of the two target LEDs (A trial: red light for the left direction; B trial: green light for the right direction). The robot arm goes to a home position, namely, the center position between the two targets. Its gripper shows an object (a food reward such as a waxworm or marshmallow for A trials and an undesirable object, a wooden bead, for B trials). To move the robot to the A location, the monkey needed to reach out and touch a sensor within 2000 ms; to make the robot reach the B target, the monkey needed to keep its arm motionless on the touch pad for 2500 ms. When the monkey successfully moved the robot to the correct target, the target LEDs would blink and the monkey would receive a food reward (for both the A and B targets).

After the monkey is trained to perform the assigned task properly, a microelectrode array (16-channel tungsten microelectrode array, Tucker-Davis Technologies, FL) is surgically implanted under isoflurane anesthesia and sterile conditions. Neural states from the motor cortex (M1) are recorded; these neural states become the inputs to the neural decoder. All surgical and animal care procedures were


consistent with the National Research Council Guide for the Care and Use of Laboratory Animals and were approved by the University of Miami Institutional Animal Care and Use Committee.

Figure 13: The learning curves for multistep multitarget tasks: (a) 2-step 4-target; (b) 4-step 3-target.

In the closed-loop experiments, after the initial holding time that produces the go beep, the robotic arm's position is updated based solely on the monkey's neural states. Differently from the user pretraining sessions, the monkey is not required to perform any movement. During the real-time experiments, 14 neurons are obtained from 10 electrodes. The neural states are represented by the firing rates in a 2 sec window following the go signal.

7.2.2. Agent. For the BMI decoder, we use Q-learning via kernel temporal differences (Q-KTD(λ)). One big difference between open-loop and closed-loop applications is the amount of accessible data: in the closed-loop experiments, we only have information about the neural states observed up to the present. In the previous offline experiments, in contrast, normalization and kernel selection were conducted offline based on the entire data set; the same method cannot be applied in the online setting. Normalization is a scaling procedure that interplays with the choice of the kernel size, and proper selection of the kernel size brings proper scaling to the data. Thus, in contrast to the previous open-loop experiments, normalization of the input neural states is not applied, and the kernel size is automatically selected given the inputs.

The Gaussian kernel (28) is employed, and the kernel size h is automatically selected based on the history of inputs. Note that in the closed-loop experiments the dynamic range of the states varies from experiment to experiment. Consequently, the kernel size needs to be readjusted each time a new experiment takes place, and it cannot be determined beforehand. At each time step, the distances between the current state and the previously observed states are computed to obtain the output values (Q in this case). Therefore, we use these distance values to select the kernel size as follows:

$$
h_{\mathrm{temp}}(n) = \sqrt{\frac{1}{2(n-1)} \sum_{i=1}^{n-1} \lVert x(i) - x(n) \rVert^{2}},
\qquad
h(n) = \frac{1}{n}\left[\sum_{i=1}^{n-1} h(i) + h_{\mathrm{temp}}(n)\right].
\tag{32}
$$

Using the squared distances between the current input and the previously seen input states, we obtain an estimate of the mean distance. This value is then averaged along with the past kernel sizes to obtain the current kernel size.
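A minimal sketch of the running update in (32) is given below, assuming states arrive one at a time; the class name, the variable names, and the default kernel size used for the very first state (where (32) is undefined) are our own choices, not part of the original implementation.

```python
import numpy as np

# Sketch of the online kernel-size rule (32); names and the first-state default are assumptions.
class OnlineKernelSize:
    def __init__(self, initial_h=1.0):
        self.states = []        # previously observed neural states x(1), ..., x(n-1)
        self.h_history = []     # past kernel sizes h(1), ..., h(n-1)
        self.initial_h = initial_h

    def update(self, x_n):
        x_n = np.asarray(x_n, dtype=float)
        if not self.states:
            h_n = self.initial_h                      # (32) is undefined for n = 1
        else:
            sq_dists = [np.sum((x_i - x_n) ** 2) for x_i in self.states]
            h_temp = np.sqrt(np.sum(sq_dists) / (2.0 * len(self.states)))
            h_n = (np.sum(self.h_history) + h_temp) / (len(self.h_history) + 1)
        self.states.append(x_n)
        self.h_history.append(h_n)
        return h_n
```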

Moreover, we consider γ = 1 and λ = 0 since our experiments perform single-step trials. A stepsize η = 0.5 is applied. The output represents the 2 possible directions (left and right), and the robot arm moves based on the estimated output from the decoder.

7.2.3. Results. The overall performance is evaluated by checking whether the robot arm reaches the assigned target. Once the robot arm reaches the target, the decoder gets a positive reward +1; otherwise, it receives a negative reward −1.

Table 2 shows the decoder performance over 4 days in terms of success rates. Each day corresponds to a separate experiment. On Day 1, the experiment had a total of 20 trials (10 A trials and 10 B trials). The overall success rate was 90%; only the first trial for each target was incorrectly assigned.



Figure 14: Performance of Q-learning via KTD in the closed-loop RLBMI controlled by a monkey for Day 1 (left) and Day 3 (right): the success (+1) or failure (−1) index of each trial (top), the TD error (middle), and the Q-values (bottom).

Table 2: Success rates of Q-KTD in closed-loop RLBMI.

        Total trial numbers (total A, B trials)    Success rates (%)
Day 1   20 (10, 10)                                90.00
Day 2   32 (26, 26)                                84.38
Day 3   53 (37, 36)                                77.36
Day 4   52 (37, 35)                                78.85

Note that on each day the same experimental setup was utilized. The decoder was initialized in the same way each day; we did not use pretrained parameters to initialize the system. To understand the variation of the success rates across days, we look at the performance of Day 1 and Day 3. Figure 14 shows the decoder performance for these 2 experiments.

Although the success rate for Day 3 is not as high as that of Day 1, both experiments show that the algorithm learns an appropriate neural state to action map. Even though there is variation among the neural states within each day, the decoder adapts well to minimize the TD error, and the Q-values converge to the desired values for each action. Because this is a single-step task and the reward +1 is assigned for a successful trial, the estimated action value Q should be close to +1.

It is observed that the TD error and the Q-values oscillate; the drastic changes of the TD error or Q-value correspond to the missed trials.


Figure 15: Estimated policy for the projected neural states (first two principal components) from Day 1 (left column: (a) after 3 trials, (c) after 10 trials, (e) after 20 trials) and Day 3 (right column: (b) after 3 trials, (d) after 30 trials, (f) after 57 trials). The failed trials during the closed-loop experiment are marked as red stars (missed A trials) and green dots (missed B trials).


The overall performance can be evaluated by checking whether the robot arm reaches the desired target (the top plots in Figure 14). However, this assessment does not show what causes the changes in the system values. In addition, it is hard to know how the two separate intelligent systems interact during learning and how the neural states affect the overall performance.

Under the coadaptation scenario in the RLBMI architecture, it is obvious that if one system does not perform properly, it will have detrimental effects on the performance of the other. If the BMI decoder does not give proper updates to the robotic device, it will confuse the user conducting the task; if the user gives improper state information or the translation is wrong, the resulting update may fail even though the BMI decoder was able to find the optimal mapping function.

Using the methodology introduced in [36], we can observe how the decoder effectively learns a good state to action mapping and how the neural states affect the prediction performance. Figure 15 shows how each participant (the agent and the user) influences the overall performance in both successful and missed trials and how the agent adapts to the environment. By applying principal component analysis (PCA), the high-dimensional neural states can be visualized in two dimensions using the two largest principal components. In this two-dimensional space of projected neural states, we can visualize the estimated policy as well.
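One way such a plot could be produced is sketched below under our own assumptions (this is not the procedure of [36]): project the observed neural states onto the first two principal components, then back-project a 2D grid into the state space and color each grid point by the greedy action of an assumed Q-function `q_values(state)`.

```python
import numpy as np

# Illustrative sketch: PCA projection of neural states plus a greedy-policy map
# on the projected plane. `q_values(state)` is an assumed callable returning
# one Q value per action; it is not part of the original paper's code.
def pca_project(states, n_components=2):
    X = np.asarray(states, dtype=float)
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)  # principal directions
    components = Vt[:n_components]
    return (X - mean) @ components.T, mean, components

def greedy_policy_on_grid(q_values, mean, components, xlim, ylim, n=100):
    xs, ys = np.linspace(*xlim, n), np.linspace(*ylim, n)
    policy = np.zeros((n, n), dtype=int)
    for i, y in enumerate(ys):
        for j, x in enumerate(xs):
            # back-project the 2D grid point into the original state space
            state = mean + x * components[0] + y * components[1]
            policy[i, j] = int(np.argmax(q_values(state)))
    return xs, ys, policy   # e.g., draw with matplotlib's pcolormesh
```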

We observe the behavior of the two systems at the beginning, intermediate, and final stages of the experiment by using the neural states observed up to the given stage as well as the decoder learned up to that point. It is evident that the decoder can predict nonlinear policies. Day 1 (left column in Figure 15) shows that the neural states from the two classes are well separable. It was noted that during Day 3 the monkey seemed less engaged in the task than in Day 1. This suggests the possibility that during some trials the monkey was distracted and may not have been producing a consistent set of neural outputs. We can also see this phenomenon in the plots (right column in Figure 15): most of the misclassified neural states appear to be closer to the states corresponding to the opposite target in the projected state space. However, the estimated policy shows that the system learns effectively. Note that the initially misclassified A trials (red stars in Figure 15(d), which are located near the estimated policy boundary) are assigned to the correct direction once learning has been accomplished (Figure 15(f)). It is remarkable that the system adapts to the environment online.

8. Conclusions

The advantages of KTD(λ) in neural decoding problems were observed. The key properties of this kernel-based learning algorithm are its capability for nonlinear function approximation and its convergence guarantees. We also examined the capability of the extended KTD algorithm (Q-KTD(λ)) in both open-loop and closed-loop reinforcement learning brain machine interface (RLBMI) experiments performing reaching tasks.

In open-loop experiments, the results showed that Q-KTD(λ) can effectively learn the brain state-action mapping and offers performance advantages over conventional nonlinear function approximation methods such as time-delay neural networks. We observed that Q-KTD(λ) overcomes main issues of conventional nonlinear function approximation methods, such as local minima and the need for proper initialization.

Results on closed-loop RLBMI experiments showed that the algorithm succeeds in finding a proper mapping between neural states and desired actions. Its advantages are that it neither depends on the initialization nor requires any prior information about the input states; in addition, parameters can be chosen on the fly based on the observed input states. Moreover, we observed how the two intelligent systems coadaptively learn in an online reaching task. The results showed that KTD is well suited for practical applications due to its nonlinear approximation capabilities in online learning.

The observation and analysis of KTD(λ) give us a basic idea of how this algorithm behaves. However, in the case of Q-KTD(λ), the convergence analysis remains challenging since Q-learning contains both a learning policy and a greedy policy. For Q-KTD(λ), the convergence proof for Q-learning using temporal difference TD(λ) with linear function approximation in [37] can provide basic intuition about the role of function approximation in the convergence of Q-learning.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is partially supported by DARPA Contract N66001-10-C-2008. The authors would like to thank Pratik Chhatbar and Brandi Marsh for collecting the center-out reaching task data for the open-loop experiments.

References

[1] J. DiGiovanna, B. Mahmoudi, J. Fortes, J. C. Principe, and J. C. Sanchez, "Coadaptive brain-machine interface via reinforcement learning," IEEE Transactions on Biomedical Engineering, vol. 56, no. 1, pp. 54–64, 2009.
[2] B. Mahmoudi, Integrating robotic action with biologic perception: a brain machine symbiosis theory [Ph.D. dissertation], University of Florida, Gainesville, Fla, USA, 2010.
[3] E. A. Pohlmeyer, B. Mahmoudi, S. Geng, N. W. Prins, and J. C. Sanchez, "Using reinforcement learning to provide stable brain-machine interface control despite neural input reorganization," PLoS ONE, vol. 9, no. 1, Article ID e87253, 2014.
[4] S. Matsuzaki, Y. Shiina, and Y. Wada, "Adaptive classification for brain machine interface with reinforcement learning," in Proceedings of the 18th International Conference on Neural Information Processing, vol. 7062, pp. 360–369, Shanghai, China, November 2011.
[5] M. J. Bryan, S. A. Martin, W. Cheung, and R. P. N. Rao, "Probabilistic co-adaptive brain-computer interfacing," Journal of Neural Engineering, vol. 10, no. 6, Article ID 066008, 2013.


[6] R. S. Sutton, "Learning to predict by the methods of temporal differences," Machine Learning, vol. 3, no. 1, pp. 9–44, 1988.
[7] J. A. Boyan, Learning evaluation functions for global optimization [Ph.D. dissertation], Carnegie Mellon University, 1998.
[8] S. J. Bradtke and A. G. Barto, "Linear least-squares algorithms for temporal difference learning," Machine Learning, vol. 22, pp. 33–57, 1996.
[9] A. Geramifard, M. Bowling, M. Zinkevich, and R. S. Sutton, "iLSTD: eligibility traces and convergence analysis," in Advances in Neural Information Processing Systems, pp. 441–448, 2007.
[10] R. S. Sutton, C. Szepesvari, and H. R. Maei, "A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation," in Proceedings of the 22nd Annual Conference on Neural Information Processing Systems (NIPS '08), pp. 1609–1616, MIT Press, December 2008.
[11] R. S. Sutton, H. R. Maei, D. Precup, et al., "Fast gradient-descent methods for temporal-difference learning with linear function approximation," in Proceedings of the 26th International Conference on Machine Learning (ICML '09), pp. 993–1000, June 2009.
[12] J. N. Tsitsiklis and B. Van Roy, "An analysis of temporal-difference learning with function approximation," IEEE Transactions on Automatic Control, vol. 42, no. 5, pp. 674–690, 1997.
[13] S. Haykin, Neural Networks and Learning Machines, Prentice Hall, 2009.
[14] B. Scholkopf and A. J. Smola, Learning with Kernels, MIT Press, 2002.
[15] Y. Engel, Algorithms and representations for reinforcement learning [Ph.D. dissertation], Hebrew University, 2005.
[16] X. Xu, T. Xie, D. Hu, and X. Lu, "Kernel least-squares temporal difference learning," International Journal of Information Technology, vol. 11, no. 9, pp. 54–63, 2005.
[17] J. Bae, P. Chhatbar, J. T. Francis, J. C. Sanchez, and J. C. Principe, "Reinforcement learning via kernel temporal difference," in Proceedings of the 33rd Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC '11), pp. 5662–5665, 2011.
[18] S. Zhao, From fixed to adaptive budget robust kernel adaptive filtering [Ph.D. dissertation], University of Florida, Gainesville, Fla, USA, 2012.
[19] Y. Engel, S. Mannor, and R. Meir, "The kernel recursive least-squares algorithm," IEEE Transactions on Signal Processing, vol. 52, no. 8, pp. 2275–2285, 2004.
[20] X. Xu, "A sparse kernel-based least-squares temporal difference algorithm for reinforcement learning," in Proceedings of the 2nd International Conference on Natural Computation, vol. 4221, pp. 47–56, 2006.
[21] B. Chen, S. Zhao, P. Zhu, and J. C. Principe, "Quantized kernel least mean square algorithm," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 1, pp. 22–32, 2012.
[22] J. Bae, L. S. Giraldo, P. Chhatbar, J. T. Francis, J. C. Sanchez, and J. C. Principe, "Stochastic kernel temporal difference for reinforcement learning," in Proceedings of the 21st IEEE International Workshop on Machine Learning for Signal Processing (MLSP '11), pp. 1–6, IEEE, September 2011.
[23] X. Chen, Y. Gao, and R. Wang, "Online selective kernel-based temporal difference learning," IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 12, pp. 1944–1956, 2013.
[24] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, New York, NY, USA, 1998.
[25] W. Liu, J. C. Principe, and S. Haykin, Kernel Adaptive Filtering: A Comprehensive Introduction, Wiley, 2010.
[26] J. Mercer, "Functions of positive and negative type and their connection with the theory of integral equations," Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 209, pp. 415–446, 1909.
[27] P. Dayan and T. J. Sejnowski, "TD(λ) converges with probability 1," Machine Learning, vol. 14, no. 3, pp. 295–301, 1994.
[28] H. J. Kushner and D. S. Clark, Stochastic Approximation Methods for Constrained and Unconstrained Systems, vol. 26 of Applied Mathematical Sciences, Springer, New York, NY, USA, 1978.
[29] C. J. C. H. Watkins, Learning from delayed rewards [Ph.D. dissertation], King's College, London, UK, 1989.
[30] C. Szepesvari, Algorithms for Reinforcement Learning, edited by R. J. Brachman and T. Dietterich, Morgan & Claypool, 2010.
[31] S. Zhao, B. Chen, P. Zhu, and J. C. Principe, "Fixed budget quantized kernel least-mean-square algorithm," Signal Processing, vol. 93, no. 9, pp. 2759–2770, 2013.
[32] W. Liu, I. Park, and J. C. Principe, "An information theoretic approach of designing sparse kernel adaptive filters," IEEE Transactions on Neural Networks, vol. 20, no. 12, pp. 1950–1961, 2009.
[33] J. A. Boyan, "Technical update: least-squares temporal difference learning," Machine Learning, vol. 49, pp. 233–246, 2002.
[34] C. J. C. H. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.
[35] J. C. Sanchez, A. Tarigoppula, J. S. Choi, et al., "Control of a center-out reaching task using a reinforcement learning Brain-Machine Interface," in Proceedings of the 5th International IEEE/EMBS Conference on Neural Engineering (NER '11), pp. 525–528, May 2011.
[36] J. Bae, L. G. Sanchez Giraldo, E. A. Pohlmeyer, J. C. Sanchez, and J. C. Principe, "A new method of concurrently visualizing states, values, and actions in reinforcement based brain machine interfaces," in Proceedings of the 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC '13), pp. 5402–5405, July 2013.
[37] F. S. Melo, S. P. Meyn, and M. I. Ribeiro, "An analysis of reinforcement learning with function approximation," in Proceedings of the 25th International Conference on Machine Learning, pp. 664–671, July 2008.


Figure 11 The effect of filter size control on 8-target single-stepcenter-out reaching task The average success rates are computedover 50Monte Carlo runs after the 10th epoch

of input states various kernel sizes (ℎ = 05 1 15 2 3 5 7)andQuantization sizes (120598

119880= 1 110 120 130) are considered

The corresponding success rates for final filter sizes of 178133 87 and 32 are displayed in Figure 11

With a final filter size of 178 (blue line) the success ratesare superior to any other filter sizes for every kernel sizestested since it contains all input information Especially forsmall kernel sizes (ℎ le 2) success rates above 96 areobservedMoreover note that even after reduction of the stateinformation (red line) the system still produces acceptablesuccess rates for kernel sizes ranging from 05 to 2 (around90 success rates)

Among the best performing kernel sizes we favor thelargest one since it provides better generalization guaranteesIn this sense a kernel size ℎ = 2 can be selected since this isthe largest kernel size that considerably reduces the filter sizeand yields a neural state to actionmapping that performs well(around 90 of success rates) In the case of kernel size ℎ = 2with final filter size of 178 the system reaches 100 successrates after 6 epochs with a maximum variance of 4 Aswe can see from the number of units higher representationcapacity is required to obtain the desired performance as thetask becomes more complex Nevertheless results on the 8-target center-out reaching task show that the method caneffectively learn the brain state-action mapping for this taskwith a reasonable complexity

714 Results on Multistep Tasks Here we develop a morerealistic scenario we extend the task to multistep and mul-titarget experiments This case allows us to explore the roleof the eligibility traces in 119876-KTD(120582) The price paid for thisextension is that now the eligibility trace rate 120582 selectionneeds to be carried out according to the best observedperformance Testing based on the same experimental set

12 Computational Intelligence and Neuroscience

0 1 2 3 4 518

19

20

21

22

23

24

25

26

27

28

02

03

04

05

06

07

08

09

minus1minus2minus3minus4minus5

X

Y

minus06

Figure 12 Reward distribution for right target The black diamondis the initial position and the purple diamond shows the possibledirections including the assigned target direction (red diamond)

up employed for the single step task that is a discretereward value is assigned at the target causes extremely slowlearning since not enough guidance is given The systemrequires long periods of exploration until it actually reachesthe target Therefore we employ a continuous reward distri-bution around the selected target defined by the followingexpression

119903 (119904) =

119901reward119866 (119904) if 119866 (119904) gt 01

119899reward if 119866 (119904) le 01(30)

where119866(119904) = exp[(119904minus120583)⊤Cminus1120579(119904minus120583)] 119904 isin R2 is the position of

the cursor 119901reward = 1 and 119899reward = minus06 The mean vector 120583corresponds to the selected target location and the covariancematrix

C120579= R120579(75 0

0 01)R⊤120579 R

120579= (

cos 120579 sin 120579minus sin 120579 cos 120579

) (31)

which depends on the angle 120579 of the selected target as followsfor target index one and five the angle is 0 two and six are forminus1205874 three and seven are for 1205872 and four and eight are for1205874 (Here the target indexes follow the location depicted onFigure 6 in [22]) Figure 12 shows the reward distribution fortarget index one The same form of distribution is applied tothe other directions centred at the assigned target point

Once the system reaches the assigned target the systemearns a maximum reward of +1 and receives partial rewardsaccording to (30) during the approaching stage When thesystem earns the maximum reward the trial is classified asa successful trial The maximum number of steps per trialis limited such that the cursor must approach the target in astraight line trajectory Here we also control the complexityof the task by allowing different number of targets and stepsNamely 2-step 4-target (right up left and down) and 4-step

3-target (right up and down) experiments are performedIncreasing the number of steps per trial amounts to makingsmaller jumps according to each action After each epochthe number of successful trials is counted for each targetdirection Figure 13 shows the learning curves for each targetand the average success rates

Larger number of steps results in lower success ratesHowever the two cases (two and four steps) obtain anaverage success rate above 60for 1 epochTheperformancesshow all directions can achieve success rates above 70after convergence which encourage the application of thealgorithm to online scenarios

72 Closed-Loop RLBMI Experiments In closed loop RLBMIexperiments the behavioral task is a reaching task using arobotic arm The decoder controls the robot armrsquos actiondirection by predicting the monkeyrsquos intent based on itsneuronal activity If the robot arm reaches the assigned targeta reward is given to both the monkey (food reward) andthe decoder (positive value) Notice that the two intelligentsystems learn coadaptively to accomplish the goal Theseexperiments are conducted in cooperation with the Neu-roprosthetics Research Group at the University of MiamiThe performance is evaluated in terms of task completionaccuracy and speed Furthermore we provide amethodologyto tease apart the influence of each one of the systems of theRLBMI in the overall performance

721 Environment During pretraining a marmoset monkeywas trained to perform a target reaching task namelymovinga robot arm to two spatial locations denoted as A trial and Btrial The monkey was taught to associate changes in motoractivity during A trials and produce static motor responsesduring B trials Once a target is assigned a beep signalsthe start of the trial To control the robot during the usertraining phase the monkey is required to steadily place itshand on a touch pad for 700sim1200msThis action produces ago beep that is followed by the activation of one of the twotarget LEDs (A trial red light for left direction or B trialgreen light for right direction)The robot arm goes to a homeposition namely the center position between the two targetsIts gripper shows an object (food reward such as waxwormor marshmallow for A trial and undesirable object (woodenbead) for B trial) To move the robot to the A locationthe monkey needed to reach out and touch a sensor within2000ms and to make the robot reach to the B target themonkey needed to keep its arm motionless on the touch padfor 2500msWhen the monkey successfully moved the robotto the correct target the target LEDs would blink and themonkey would receive a food reward (for both the A and Btargets)

After the monkey is trained to perform the assignedtask properly a microelectrode array (16-channel tungstenmicroelectrode arrays Tucker Davis Technologies FL) issurgically implanted under isoflurane anesthesia and sterileconditions Neural states from the motor cortex (M1) arerecorded These neural states become the inputs to theneural decoder All surgical and animal care procedures were

Computational Intelligence and Neuroscience 13

0 2 4 6 8 100

1

Epochs

AverageRightUp

LeftDown

01

02

03

04

05

06

07

08

09

Succ

ess r

ates

(a) 2-step 4-target

0 2 4 6 8 100

1

Epochs

Succ

ess r

ates

AverageRight Up

Down

01

02

03

04

05

06

07

08

09

(b) 4-step 3-target

Figure 13 The learning curves for multistep multitarget tasks

consistent with the National Research Council Guide for theCare and Use of Laboratory Animals and were approved bythe University of Miami Institutional Animal Care and UseCommittee

In the closed-loop experiments after the initial holdingtime that produces the go beep the robotic armrsquos positionis updated based solely on the monkeyrsquos neural statesDifferently from the user pretraining sessions the monkeyis not required to perform any movement During the real-time experiment 14 neurons are obtained from 10 electrodesThe neural states are represented by the firing rates on a 2 secwindow following the go signal

722 Agent For the BMI decoder we use 119876-learning viakernel Temporal Differences (119876-KTD)(120582) One big differ-ence between open-loop and closed-loop applications is theamount of accessible data in the closed-loop experiments wecan only get information about the neural states that havebeen observed up to the present However in the previousoffline experiments normalization and kernel selection wereconducted offline based on the entire data set It is notpossible to apply the same method to the online settingsince we only have information about the input states upto the present time Normalization is a scaling procedurethat interplays with the choice of the kernel size Properselection of the kernel size brings proper scaling to the dataThus in contrast to the previous open-loop experimentsnormalization of the input neural states is not applied andthe kernel size is automatically selected given the inputs

The Gaussian kernel (28) is employed and the kernel sizeℎ is automatically selected based on the history of inputsNotethat in the closed-loop experiments the dynamic range ofstates varies from experiment to experiment Consequently

the kernel size needs to be re-adjusted each time a new exper-iment takes place and it cannot be determined beforehandAt each time the distances between the current state and thepreviously observed states are computed to obtain the outputvalues119876 in this caseTherefore we use the distance values toselect the kernel size as follows

ℎtemp (119899) = radic1

2 (119899 minus 1)

119899minus1

sum

119894=1

119909 (119894) minus 119909 (119899)2

ℎ (119899) =1

119899[

119899minus1

sum

119894=1

ℎ (119894) + ℎtemp (119899)]

(32)

Using the squared distance between pairs of previously seeninput states we can obtain an estimate of the mean distanceThis value is also averaged along with past kernel sizes toobtain the current kernel size

Moreover we consider 120574 = 1 and 120582 = 0 since ourexperiments perform single step trials Stepsize 120578 = 05 isapplied The output represents the 2 possible directions (leftand right) and the robot arm moves based on the estimatedoutput from the decoder

723 Results Theoverall performance is evaluated by check-ing whether the robot arm reaches the assigned target Oncethe robot arm reaches the target the decoder gets a positivereward +1 otherwise it receives negative reward minus1

Table 2 shows the decoder performance over 4 days interms of success rates Each day corresponds to a separateexperiment In Day 1 the experiment has a total of 20 trials(10A trials and 10 B trials)The overall success rate was 90Only the first trial for each target was incorrectly assigned

14 Computational Intelligence and Neuroscience

0 5 10 15 20

0

1

A trialB trial

A trialB trial

A trialB trial

0 5 10 15 20

0

1

TD er

ror

A trialB trial

A trialB trial

A trialB trial

0 5 10 15 20

0

1

Trial numbers

0

1

0 10 20 30 40 50

0

1

TD er

ror

0 10 20 30 40 50

0

1

Trial numbers

Trial numbers Trial numbers

Trial numbers Trial numbers0 10 20 30 40 50

minus1

minus1

minus1

minus1

minus1

Qva

lue

Qva

lue 05

minus05

minus2S(1)F

(minus1)

inde

x

S(1)F

(minus1)

inde

x

Figure 14 Performance of 119876-learning via KTD in the closed loop RLBMI controlled by a monkey for Day 1 (left) and Day 3 (right) thesuccess (+1) index and failure (minus1) index of each trial (top) the change of TD error (middle) and the change of 119876-values (down)

Table 2 Success rates of 119876-KTD in closed-loop RLBMI

Total trial numbers(total A B trial)

Success rates()

Day 1 20 (10 10) 9000Day 2 32 (26 26) 8438Day 3 53 (37 36) 7736Day 4 52 (37 35) 7885

Note that at each day the same experimental set upwas utilized The decoder was initialized in the same way ateach day We did not use pretrained parameters to initializethe system To understand the variation of the success ratesacross days we look at the performance of Day 1 and

Day 3 Figure 14 shows the decoder performance for the 2experiments

Although the success rate for Day 3 is not as high asDay 1 both experiments show that the algorithm learns anappropriate neural state to action map Even though thereis variation among the neural states within each day thedecoder adapts well to minimize the TD error and the 119876-values converge to the desired values for each action Becausethis is a single step task and the reward +1 is assigned for asuccessful trial it is desired for the estimated action value 119876to be close to +1

It is observed that the TD error and 119876-values oscillateThe drastic change of TD error or119876-value corresponds to themissed trials The overall performance can be evaluated bychecking whether the robot arm reaches the desired target

Computational Intelligence and Neuroscience 15

0 20 40 60 80

0

20

First component

Seco

nd co

mpo

nent minus20

minus40

minus60

minus20minus40minus60

minus80

minus100

minus120

(a) After 3 trials

0

0

50 100 150 200

20

First component

Seco

nd co

mpo

nent

minus20

minus40

minus60

minus80

minus100

minus120

minus140

minus160

minus180

minus50

(b) After 3 trials

First component

Seco

nd co

mpo

nent

0 20 40 60 80

0

20

minus20

minus40

minus60

minus20minus40minus60

minus80

minus100

minus120

(c) After 10 trials

0

0

50 100 150 200

20

First component

Seco

nd co

mpo

nent

minus20

minus40

minus60

minus80

minus100

minus120

minus140

minus160

minus180

minus50

(d) After 30 trials

0

First component

Seco

nd co

mpo

nent

0 20 40 60 80

20

minus20

minus40

minus60

minus20minus40minus60

minus80

minus100

minus120

PolicyA trialB trial

(e) After 20 trials

0

0

50 100 150 200

20

First component

Seco

nd co

mpo

nent

minus20

minus40

minus60

minus80

minus100

minus120

minus140

minus160

minus180

minus50

PolicyA trialB trial

(f) After 57 trials

Figure 15 Estimated policy for the projected neural states from Day 1 (left) and Day 3 (right) The failed trials during the closed loopexperiment are marked as red stars (missed A trials) and green dots (missed B trials)

16 Computational Intelligence and Neuroscience

(the top plots in Figure 14) However this assessment doesnot show what causes the change in the system values Inaddition it is hard to know how the two separate intelligentsystems interact during learning and how neural states affectthe overall performance

Under the coadaptation scenario in the RLBMI archi-tecture it is obvious that if one system does not performproperly it will cause detrimental effects on the performanceof the other system If the BMI decoder does not giveproper updates to the robotic device it will confuse the userconducting the task and if the user gives improper stateinformation or the translation is wrong the resulting updatemay fail even though the BMI decoder was able to find theoptimal mapping function

Using the proposed methodology introduced in [36] wecan observe how the decoder effectively learns a good state toaction mapping and how neural states affect the predictionperformance Figure 15 shows how each participant (theagent and the user) influences the overall performance inboth successful and missed trials and how the agent adaptsthe environment By applying principal component analysis(PCA) the high-dimensional neural states can be visualizedin two dimensions using the first two largest principalcomponents In this two-dimensional space of projectedneural states we can visualize the estimated policy as well

We observe the behavior of two systems at the beginningintermediate and final stages of the experiment by usingthe neural states that have been observed as well as thelearned decoder up to the given stage It is evident that thedecoder can predict nonlinear policies Day 1 (left columnin Figure 15) shows that the neural states from the twoclasses are well separable It was noted during Day 3 thatthe monkey seemed less engaged in the task than in Day1 This suggests the possibility that during some trials themonkey was distracted and may not have been producing aconsistent set of neural outputs We are also able to see thisphenomenon from the plots (right column in Figure 15) Wecan see that most of the neural states that were misclassifiedappear to be closer to the states corresponding to the oppositetarget in the projected state space However the estimatedpolicy shows that the system effectively learns Note that theinitially misclassified A trials (red stars in Figure 15(d) whichare located near the estimated policy boundary) are assignedto the right direction when learning has been accomplished(Figure 15(f)) It is a remarkable fact that the system adapts tothe environment online

8 Conclusions

The advantages of KTD(120582) in neural decoding problems wereobserved The key observations of this kernel-based learningalgorithm are its capabilities for nonlinear function approx-imation and its convergence guarantees We also examinedthe capability of the extended KTD algorithm (119876-KTD(120582))in both open-loop and closed-loop reinforcement learningbrain machine interface (RLBMI) experiments to performreaching tasks

In open-loop experiments results showed that 119876-KTD(120582) can effectively learn the brain state-action mappingand offer performance advantages over conventional non-linear function approximation methods such as time-delayneural nets We observed that 119876-KTD(120582) overcomes mainissues of conventional nonlinear function approximationmethods such as local minima and proper initialization

Results on closed-loop RLBMI experiments showed thatthe algorithm succeeds in finding a proper mapping betweenneural states and desired actions Its advantages are that itdoes not depend on the initialization neither require anyprior information about input states Also parameters canbe chosen on the fly based on the observed input statesMoreover we observed how the two intelligent systems coa-daptively learn in an online reaching taskThe results showedthat KTD is powerful for practical applications due to itsnonlinear approximation capabilities in online learning

The observation and analysis of KTD(120582) give us a basicidea of how this algorithm behaves However in the caseof 119876-KTD(120582) the convergence analysis remains challengingsince 119876-learning contains both a learning policy and agreedy policy For 119876-KTD(120582) the convergence proof for119876-learning using temporal difference (TD)(120582) with linearfunction approximation in [37] can provide a basic intuitionfor the role of function approximation on the convergence of119876-learning

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

Thiswork is partially supported byDARPAContractN66001-10-C-2008 The authors would like to thank Pratik Chhatbarand Brandi Marsh for collecting the center-out reaching taskdata for the open loop experiments

References

[1] J DiGiovanna B Mahmoudi J Fortes J C Principe and JC Sanchez ldquoCoadaptive brain-machine interface via reinforce-ment learningrdquo IEEE Transactions on Biomedical Engineeringvol 56 no 1 pp 54ndash64 2009

[2] BMahmoudi Integrating robotic actionwith biologic perceptiona brainmachine symbiosis theory [PhD dissertation] Universityof Florida Gainesville Fla USA 2010

[3] E A Pohlmeyer B Mahmoudi S Geng N W Prins and J CSanchez ldquoUsing reinforcement learning to provide stable brain-machine interface control despite neural input reorganizationrdquoPLoS ONE vol 9 no 1 Article ID e87253 2014

[4] S Matsuzaki Y Shiina and Y Wada ldquoAdaptive classificationfor brainmachine interface with reinforcement learningrdquo inProceedings of the 18th International Conference on NeuralInformation Processing vol 7062 pp 360ndash369 Shanghai ChinaNovember 2011

[5] M J Bryan S A Martin W Cheung and R P N RaoldquoProbabilistic co-adaptive brain-computer interfacingrdquo Journalof Neural Engineering vol 10 no 6 Article ID 066008 2013

Computational Intelligence and Neuroscience 17

[6] R S Sutton ldquoLearning to predict by the methods of temporaldifferencesrdquoMachine Learning vol 3 no 1 pp 9ndash44 1988

[7] J A Boyan Learning evaluation functions for global optimiza-tion [PhD dissertation] Carnegie Mellon University 1998

[8] S J Bradtke and A G Barto ldquoLinear least-squares algorithmsfor temporal difference learningrdquoMachine Learning vol 22 pp33ndash57 1996

[9] A Geramifard M Bowling M Zinkevich and R S Suttonldquoilstd eligibility traces and convergence analysisrdquo in Advancesin Neural Information Processing Systems pp 441ndash448 2007

[10] R S Sutton C Szepesvari and H R Maei ldquoA convergentO(n) algorithm for off-policy temporal-difference learningwithlinear function approximationrdquo in Proceedings of the 22ndAnnual Conference on Neural Information Processing Systems(NIPS rsquo08) pp 1609ndash1616 MIT Press December 2008

[11] R S Sutton H R Maei D Precup et al ldquoFast gradient-descent methods for temporal-difference learning with linearfunction approximationrdquo in Proceeding of the 26th InternationalConference On Machine Learning (ICML rsquo09) pp 993ndash1000June 2009

[12] J N Tsitsiklis and B Van Roy ldquoAn analysis of temporal-difference learning with function approximationrdquo IEEE Trans-actions on Automatic Control vol 42 no 5 pp 674ndash690 1997

[13] S Haykin Neural Networks and Learning Machines PrenticeHall 2009

[14] B Scholkopf and A J Smola Learning with Kernels MIT Press2002

[15] Y EngelAlgorithms and representations for reinforcement learn-ing [PhD dissertation] Hebrew University 2005

[16] X Xu T Xie D Hu and X Lu ldquoKernel least-squares temporaldifference learningrdquo International Journal of Information Tech-nology vol 11 no 9 pp 54ndash63 2005

[17] J Bae P Chhatbar J T Francis J C Sanchez and J C PrincipeldquoReinforcement learning via kernel temporal differencerdquo inProceedings of the 33rd Annual International Conference of theIEEE onEngineering inMedicine andBiology Society (EMBC 11)pp 5662ndash5665 2011

[18] S Zhao From fixed to adaptive budget robust kernel adaptivefiltering [PhD dissertation] University of Florida GainesvilleFla USA 2012

[19] Y Engel S Mannor and R Meir ldquoThe kernel recursive least-squares algorithmrdquo IEEE Transactions on Signal Processing vol52 no 8 pp 2275ndash2285 2004

[20] X Xu ldquoA sparse kernel-based least-squares temporal differencealgorithms for reinforcement learningrdquo inProceedings of the 2ndInternational Conference on Natural Computation vol 4221 pp47ndash56 2006

[21] B Chen S Zhao P Zhu and J C Principe ldquoQuantized kernelleast mean square algorithmrdquo IEEE Transactions on NeuralNetworks and Learning Systems vol 23 no 1 pp 22ndash32 2012

[22] J Bae L S Giraldo P Chhatbar J T Francis J C Sanchezand J C Principe ldquoStochastic kernel temporal difference forreinforcement learningrdquo in Proceedings of the 21st IEEE Inter-national Workshop on Machine Learning for Signal Processing(MLSP rsquo11) pp 1ndash6 IEEE September 2011

[23] X Chen Y Gao and R Wang ldquoOnline selective kernel-basedtemporal difference learningrdquo IEEE Transactions on NeuralNetworks and Learning Systems vol 24 no 12 pp 1944ndash19562013

[24] R S Rao and A G Barto Reinforcement Learning An Introduc-tion MIT Press New York NY USA 1998

[25] W Liu J C Principe and S Haykin Kernel Adaptive FilteringA Comprehensive Introduction Wiley 2010

[26] J Mercer ldquoFunctions of positive and negative type and theirconnection with the theory of integral equationsrdquo PhilosophicalTransactions of the Royal Society A Mathematical Physical andEngineering Sciences vol 209 pp 415ndash446 1909

[27] P Dayan and T J Sejnowski ldquoTD(120582) converges with probability1rdquoMachine Learning vol 14 no 3 pp 295ndash301 1994

[28] H J Kushner andD S Clark Stochastic ApproximationMethodsfor Constrained and Unconstrained Systems vol 26 of AppliedMathematical Sciences Springer New York NY USA 1978

[29] C J C H Watkins Learning from delayed rewards [PhDdissertation] Kingrsquos College London UK 1989

[30] C Szepesvari Algorithms for Reinforcement Learning edited byR J Branchman and T Dietterich Morgan amp Slaypool 2010

[31] S Zhao B Chen P Zhu and J C Prıncipe ldquoFixed budgetquantized kernel least-mean-square algorithmrdquo Signal Process-ing vol 93 no 9 pp 2759ndash2770 2013

[32] W Liu I Park and J C Prıncipe ldquoAn information theoreticapproach of designing sparse kernel adaptive filtersrdquo IEEETransactions on Neural Networks vol 20 no 12 pp 1950ndash19612009

[33] J A Boyan ldquoTechnical update least-squares temporal differ-ence learningrdquoMachine Learning vol 49 pp 233ndash246 2002

[34] C J C H Watkins and P Dayan ldquoQ-learningrdquo MachineLearning vol 8 no 3-4 pp 279ndash292 1992

[35] J C Sanchez A Tarigoppula J S Choi et al ldquoControl of acenter-out reaching task using a reinforcement learning Brain-Machine Interfacerdquo in Proceedings of the 5th InternationalIEEEEMBS Conference on Neural Engineering (NER rsquo11) pp525ndash528 May 2011

[36] J Bae L G Sanchez Giraldo E A Pohlmeyer J C Sanchezand J C Principe ldquoA new method of concurrently visualizingstates values and actions in reinforcement based brainmachineinterfacesrdquo in Proceedings of the 35th Annual InternationalConference of the IEEE Engineering in Medicine and BiologySociety (EMBC rsquo13) pp 5402ndash5405 July 2013

[37] F S Melo S P Meyn and M I Ribeiro ldquoAn analysisof reinforcement learning with function approximationrdquo inProceedings of the 25th International Conference on MachineLearning pp 664ndash671 July 2008

Computational Intelligence and Neuroscience 11

0 5 10 15 20 25 30 35 40 450

1

Final filter size

Aver

age s

ucce

ss ra

tes

01

02

03

04

05

06

07

08

09

Figure 10 Average success rates over 50 Monte Carlo runs withrespect to different filter sizes The vertical line segments are themean success rates after 1 epoch (bottommarkers) 2 epochs (middlemarkers) and 20 epochs (top markers)

row corresponds to the mean success rates displayed onFigure 9 (red solid line)This is included in the Table 1 to easecomparisonwith 4 and 8-target experimentsThe 4 target taskinvolves reaching right up left and down positions from thecenter Note that in all tasks 8 directions are allowed at eachstep The standard deviation of each epoch is around 002

One characteristic of nonparametric approaches is thegrowing filter structure Here we observe how filter sizeinfluences the overall performance in 119876-KTD by applyingSurprise criterion [32] and Quantization [21] methods Inthe case of the 2-target center-out reaching task we shouldexpect the filter size to become as large as 861 units after20 epochs without any control of the filter size Using theSurprise criterion the filter size can be reduced to 87 centerswith acceptable performance However Quantization allowsthe filter size to be reduced to 10 units while maintainingperformance above 90 for success rates Figure 10 showsthe effect of filter size in the 2-target experiment usingthe Quantization approach For filter sizes as small as 10units the average success rates remain stable With 10 unitsthe algorithm shows similar learning speed to the linearlygrowing filter size with success rates above 90 Note thatquantization limits the capacity of the kernel filter since lessunits than samples are employed and thus it helps to avoidover-fitting

In the 2-target center-out reaching task, quantized Q-KTD shows satisfactory results in terms of initialization and computational cost. Further analysis of Q-KTD is conducted on a larger number of targets: we increase the number of targets from 2 to 8. All experimental parameters are kept the same as in the 2-target experiment; the only change is the step size η = 0.5. A total of 178 trials are used for the 8-target reaching task.

To gain more insight into the algorithm, we observe the interplay between the quantization size ε_U and the kernel size h. Based on the distribution of squared distances between pairs of input states, various kernel sizes (h = 0.5, 1, 1.5, 2, 3, 5, 7) and quantization sizes (ε_U = 1, 110, 120, 130) are considered. The corresponding success rates for final filter sizes of 178, 133, 87, and 32 are displayed in Figure 11.

Figure 11: The effect of filter size control on the 8-target single-step center-out reaching task (success rate versus kernel size, one curve per final filter size: 178, 133, 87, and 32). The average success rates are computed over 50 Monte Carlo runs after the 10th epoch.

With a final filter size of 178 (blue line), the success rates are superior to those of all other filter sizes for every kernel size tested, since all input information is retained. In particular, for small kernel sizes (h ≤ 2), success rates above 96% are observed. Moreover, note that even after reduction of the state information (red line), the system still produces acceptable success rates (around 90%) for kernel sizes ranging from 0.5 to 2.

Among the best performing kernel sizes, we favor the largest one, since it provides better generalization guarantees. In this sense, a kernel size h = 2 can be selected, since it is the largest kernel size that considerably reduces the filter size and still yields a neural state to action mapping that performs well (around 90% success rates). In the case of kernel size h = 2 with a final filter size of 178, the system reaches 100% success rates after 6 epochs with a maximum variance of 4. As we can see from the number of units, higher representation capacity is required to obtain the desired performance as the task becomes more complex. Nevertheless, the results on the 8-target center-out reaching task show that the method can effectively learn the brain state-action mapping for this task with reasonable complexity.

7.1.4. Results on Multistep Tasks. Here we develop a more realistic scenario: we extend the task to multistep and multitarget experiments. This setting allows us to explore the role of the eligibility traces in Q-KTD(λ). The price paid for this extension is that the eligibility trace rate λ now needs to be selected according to the best observed performance. Testing based on the same experimental setup employed for the single-step task, that is, a discrete reward value assigned only at the target, causes extremely slow learning, since not enough guidance is given.


Figure 12: Reward distribution for the right target. The black diamond is the initial position and the purple diamonds show the possible directions, including the assigned target direction (red diamond).

The system requires long periods of exploration until it actually reaches the target. Therefore, we employ a continuous reward distribution around the selected target, defined by the following expression:

$$
r(s) = \begin{cases} p_{\mathrm{reward}}\, G(s), & \text{if } G(s) > 0.1,\\ n_{\mathrm{reward}}, & \text{if } G(s) \le 0.1, \end{cases} \qquad (30)
$$

where G(s) = exp[−(s − μ)ᵀ C_θ⁻¹ (s − μ)], s ∈ R² is the position of the cursor, p_reward = 1, and n_reward = −0.6. The mean vector μ corresponds to the selected target location, and the covariance matrix is

$$
C_\theta = R_\theta \begin{pmatrix} 7.5 & 0 \\ 0 & 0.1 \end{pmatrix} R_\theta^{\top}, \qquad R_\theta = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix}, \qquad (31)
$$

which depends on the angle θ of the selected target as follows: for target indexes one and five the angle is 0, for two and six it is −π/4, for three and seven it is π/2, and for four and eight it is π/4 (here the target indexes follow the locations depicted in Figure 6 of [22]). Figure 12 shows the reward distribution for target index one. The same form of distribution is applied to the other directions, centered at the assigned target point.
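The following sketch evaluates the reward of (30)-(31) for a cursor position. It is illustrative only: the minus sign in the exponent and the diagonal entries (7.5, 0.1) reflect our reading of the source and should be treated as assumptions, and all names are ours.

```python
import numpy as np

def reward(s, target, theta, p_reward=1.0, n_reward=-0.6):
    """Continuous reward of (30)-(31): a Gaussian-shaped bump elongated
    along the target direction (angle theta), with a flat penalty outside
    the 0.1 level set of G."""
    R = np.array([[np.cos(theta), np.sin(theta)],
                  [-np.sin(theta), np.cos(theta)]])
    C = R @ np.diag([7.5, 0.1]) @ R.T          # assumed covariance diagonal
    d = np.asarray(s, dtype=float) - np.asarray(target, dtype=float)
    G = np.exp(-d @ np.linalg.solve(C, d))     # G(s) in (30)
    return p_reward * G if G > 0.1 else n_reward
```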

Once the system reaches the assigned target, it earns a maximum reward of +1, and it receives partial rewards according to (30) during the approaching stage. When the system earns the maximum reward, the trial is classified as successful. The maximum number of steps per trial is limited such that the cursor must approach the target along a straight-line trajectory. Here we also control the complexity of the task by allowing different numbers of targets and steps; namely, 2-step 4-target (right, up, left, and down) and 4-step 3-target (right, up, and down) experiments are performed. Increasing the number of steps per trial amounts to making smaller jumps with each action. After each epoch, the number of successful trials is counted for each target direction. Figure 13 shows the learning curves for each target and the average success rates.

A larger number of steps results in lower success rates. However, both cases (two and four steps) obtain an average success rate above 60% after 1 epoch. The performances show that all directions can achieve success rates above 70% after convergence, which encourages the application of the algorithm to online scenarios.

7.2. Closed-Loop RLBMI Experiments. In the closed-loop RLBMI experiments, the behavioral task is a reaching task using a robotic arm. The decoder controls the robot arm's action direction by predicting the monkey's intent based on its neuronal activity. If the robot arm reaches the assigned target, a reward is given to both the monkey (food reward) and the decoder (positive value). Notice that the two intelligent systems learn coadaptively to accomplish the goal. These experiments are conducted in cooperation with the Neuroprosthetics Research Group at the University of Miami. The performance is evaluated in terms of task completion accuracy and speed. Furthermore, we provide a methodology to tease apart the influence of each one of the systems of the RLBMI on the overall performance.

7.2.1. Environment. During pretraining, a marmoset monkey was trained to perform a target reaching task, namely, moving a robot arm to two spatial locations, denoted as A trial and B trial. The monkey was taught to associate changes in motor activity with A trials and to produce static motor responses during B trials. Once a target is assigned, a beep signals the start of the trial. To control the robot during the user training phase, the monkey is required to steadily place its hand on a touch pad for 700∼1200 ms. This action produces a go beep that is followed by the activation of one of the two target LEDs (A trial: red light for the left direction; B trial: green light for the right direction). The robot arm goes to a home position, namely, the center position between the two targets. Its gripper shows an object (a food reward such as a waxworm or marshmallow for an A trial, or an undesirable object, a wooden bead, for a B trial). To move the robot to the A location, the monkey needed to reach out and touch a sensor within 2000 ms, and to make the robot reach the B target, the monkey needed to keep its arm motionless on the touch pad for 2500 ms. When the monkey successfully moved the robot to the correct target, the target LEDs would blink and the monkey would receive a food reward (for both the A and B targets).

After the monkey is trained to perform the assigned task properly, a microelectrode array (16-channel tungsten microelectrode array, Tucker-Davis Technologies, FL) is surgically implanted under isoflurane anesthesia and sterile conditions. Neural states from the motor cortex (M1) are recorded; these neural states become the inputs to the neural decoder.


Figure 13: The learning curves for the multistep multitarget tasks: (a) 2-step 4-target; (b) 4-step 3-target. Each panel shows the success rate per epoch for each target direction and the average.

All surgical and animal care procedures were consistent with the National Research Council Guide for the Care and Use of Laboratory Animals and were approved by the University of Miami Institutional Animal Care and Use Committee.

In the closed-loop experiments, after the initial holding time that produces the go beep, the robotic arm's position is updated based solely on the monkey's neural states. Differently from the user pretraining sessions, the monkey is not required to perform any movement. During the real-time experiment, 14 neurons are obtained from 10 electrodes. The neural states are represented by the firing rates in a 2 sec window following the go signal.
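As an illustration, a neural state of this form could be assembled as below (names and data layout are assumptions; spike_times holds one array of spike timestamps per recorded unit).

```python
import numpy as np

def neural_state(spike_times, go_time, window=2.0):
    """Firing-rate state vector: for each unit, count spikes in the 2-s
    window following the go signal and divide by the window length."""
    rates = [np.sum((t >= go_time) & (t < go_time + window)) / window
             for t in spike_times]
    return np.asarray(rates)                # one entry per recorded neuron
```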

7.2.2. Agent. For the BMI decoder, we use Q-learning via kernel temporal differences (Q-KTD(λ)). One big difference between open-loop and closed-loop applications is the amount of accessible data: in the closed-loop experiments we can only use information about the neural states that have been observed up to the present. In the previous offline experiments, by contrast, normalization and kernel selection were conducted offline based on the entire data set. It is not possible to apply the same method in the online setting, since we only have information about the input states up to the present time. Normalization is a scaling procedure that interplays with the choice of the kernel size; proper selection of the kernel size brings proper scaling to the data. Thus, in contrast to the previous open-loop experiments, normalization of the input neural states is not applied, and the kernel size is automatically selected given the inputs.

The Gaussian kernel (28) is employed, and the kernel size h is automatically selected based on the history of inputs. Note that in the closed-loop experiments the dynamic range of the states varies from experiment to experiment. Consequently, the kernel size needs to be readjusted each time a new experiment takes place, and it cannot be determined beforehand. At each time step, the distances between the current state and the previously observed states are computed to obtain the output values (Q in this case). Therefore, we use these distance values to select the kernel size as follows:

$$
h_{\mathrm{temp}}(n) = \sqrt{\frac{1}{2(n-1)} \sum_{i=1}^{n-1} \lVert x(i) - x(n) \rVert^{2}}, \qquad
h(n) = \frac{1}{n}\left[\sum_{i=1}^{n-1} h(i) + h_{\mathrm{temp}}(n)\right]. \qquad (32)
$$

Using the squared distances between pairs of previously seen input states, we can obtain an estimate of the mean distance. This value is then averaged along with the past kernel sizes to obtain the current kernel size.
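A minimal sketch of this running kernel-size rule (32) is given below; the handling of the very first state, for which no distances exist yet, is not specified in the text and is an assumption here, as are the class and variable names.

```python
import numpy as np

class OnlineKernelSize:
    """Running kernel-size estimate of (32): h_temp(n) comes from the mean
    squared distance between the current state and all past states, and
    h(n) averages h_temp(n) with the previously used kernel sizes."""
    def __init__(self):
        self.states = []
        self.h_hist = []

    def update(self, x):
        x = np.asarray(x, dtype=float)
        if not self.states:                  # first state: no distances yet
            self.states.append(x)
            return None
        n = len(self.states) + 1
        sq = [np.sum((xi - x) ** 2) for xi in self.states]
        h_temp = np.sqrt(np.sum(sq) / (2.0 * (n - 1)))
        h = (np.sum(self.h_hist) + h_temp) / n   # past h's sum to 0 when empty
        self.h_hist.append(h)
        self.states.append(x)
        return h
```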

Moreover, we set γ = 1 and λ = 0, since our experiments use single-step trials. A step size η = 0.5 is applied. The output represents the 2 possible directions (left and right), and the robot arm moves based on the estimated output from the decoder.
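For concreteness, the sketch below outlines one way a single-step Q-KTD decoder with two actions could be organized. It assumes the Gaussian kernel of (28) in the common form exp(−‖x − x′‖²/(2h²)), an ε-greedy exploration step, and the generic kernel TD update that adds one unit per trial with coefficient ηδ; none of these implementation details should be read as the exact code used in the experiments.

```python
import numpy as np

def gaussian_kernel(x, c, h):
    return np.exp(-np.sum((x - c) ** 2) / (2.0 * h ** 2))

class SingleStepQKTD:
    """Sketch of a single-step Q-KTD decoder with two actions (left/right).
    Q(s, a) is a kernel expansion over past states; after the reward is
    observed, the TD error adds a new unit credited to the chosen action."""
    def __init__(self, n_actions=2, eta=0.5):
        self.centers = []                    # past neural states
        self.coeffs = []                     # one coefficient vector per center
        self.n_actions = n_actions
        self.eta = eta

    def q_values(self, x, h):
        q = np.zeros(self.n_actions)
        for c, a in zip(self.centers, self.coeffs):
            q += a * gaussian_kernel(x, c, h)
        return q

    def step(self, x, h, reward_fn, epsilon=0.1):
        q = self.q_values(x, h)
        if np.random.rand() < epsilon:       # epsilon-greedy exploration
            action = np.random.randint(self.n_actions)
        else:
            action = int(np.argmax(q))
        r = reward_fn(action)                # environment feedback: +1 or -1
        delta = r - q[action]                # single-step TD error (terminal trial)
        coeff = np.zeros(self.n_actions)
        coeff[action] = self.eta * delta     # new unit for the chosen action only
        self.centers.append(np.asarray(x, dtype=float).copy())
        self.coeffs.append(coeff)
        return action, delta
```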

7.2.3. Results. The overall performance is evaluated by checking whether the robot arm reaches the assigned target. Once the robot arm reaches the target, the decoder gets a positive reward +1; otherwise, it receives a negative reward −1.

Table 2 shows the decoder performance over 4 days in terms of success rates. Each day corresponds to a separate experiment. On Day 1, the experiment has a total of 20 trials (10 A trials and 10 B trials). The overall success rate was 90%; only the first trial for each target was incorrectly assigned.


Figure 14: Performance of Q-learning via KTD in the closed-loop RLBMI controlled by a monkey for Day 1 (left) and Day 3 (right): the success (+1) and failure (−1) index of each trial (top), the TD error (middle), and the Q-values (bottom), plotted against trial number for A and B trials.

Table 2: Success rates of Q-KTD in closed-loop RLBMI.

        Total trial numbers (total; A, B trials)    Success rates (%)
Day 1   20 (10, 10)                                 90.00
Day 2   32 (26, 26)                                 84.38
Day 3   53 (37, 36)                                 77.36
Day 4   52 (37, 35)                                 78.85

Note that on each day the same experimental setup was utilized, and the decoder was initialized in the same way each day; we did not use pretrained parameters to initialize the system. To understand the variation of the success rates across days, we look at the performance of Day 1 and Day 3. Figure 14 shows the decoder performance for these two experiments.

Although the success rate for Day 3 is not as high as that for Day 1, both experiments show that the algorithm learns an appropriate neural state to action map. Even though there is variation among the neural states within each day, the decoder adapts well to minimize the TD error, and the Q-values converge to the desired values for each action. Because this is a single-step task and the reward +1 is assigned for a successful trial, the estimated action value Q should be close to +1.

It is observed that the TD error and the Q-values oscillate; the drastic changes in the TD error or Q-value correspond to the missed trials. The overall performance can be evaluated by checking whether the robot arm reaches the desired target


Figure 15: Estimated policy for the projected neural states (first two principal components) from Day 1 (left column: (a) after 3 trials, (c) after 10 trials, (e) after 20 trials) and Day 3 (right column: (b) after 3 trials, (d) after 30 trials, (f) after 57 trials). The failed trials during the closed-loop experiment are marked as red stars (missed A trials) and green dots (missed B trials).


(the top plots in Figure 14). However, this assessment does not show what causes the change in the system values. In addition, it is hard to know how the two separate intelligent systems interact during learning and how the neural states affect the overall performance.

Under the coadaptation scenario in the RLBMI architecture, it is obvious that if one system does not perform properly, it will have detrimental effects on the performance of the other system. If the BMI decoder does not give proper updates to the robotic device, it will confuse the user conducting the task, and if the user gives improper state information or the translation is wrong, the resulting update may fail even though the BMI decoder was able to find the optimal mapping function.

Using the methodology introduced in [36], we can observe how the decoder effectively learns a good state to action mapping and how the neural states affect the prediction performance. Figure 15 shows how each participant (the agent and the user) influences the overall performance in both successful and missed trials, and how the agent adapts to the environment. By applying principal component analysis (PCA), the high-dimensional neural states can be visualized in two dimensions using the two largest principal components. In this two-dimensional space of projected neural states, we can visualize the estimated policy as well.
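The sketch below illustrates one way such a view could be produced, under the assumption that the policy surface is obtained by back-projecting a grid in the two-dimensional PCA plane and querying the learned decoder (here the q_values method of the decoder sketch above); the actual procedure of [36] may differ in its details.

```python
import numpy as np

def pca_policy_view(states, q_decoder, h, grid_res=100):
    """Project observed neural states onto their first two principal
    components, then label a grid in that plane by the action the decoder
    prefers at the corresponding back-projected state. The back-projection
    uses only two components, so it approximates the true policy."""
    X = np.asarray(states, dtype=float)
    mu = X.mean(axis=0)
    Xc = X - mu
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:2]                               # (2, dim) principal directions
    Z = Xc @ W.T                             # projected neural states
    z1 = np.linspace(Z[:, 0].min(), Z[:, 0].max(), grid_res)
    z2 = np.linspace(Z[:, 1].min(), Z[:, 1].max(), grid_res)
    policy = np.zeros((grid_res, grid_res), dtype=int)
    for i, a in enumerate(z1):
        for j, b in enumerate(z2):
            x = mu + a * W[0] + b * W[1]     # back-project the grid point
            policy[j, i] = int(np.argmax(q_decoder.q_values(x, h)))
    return Z, (z1, z2), policy               # scatter Z and contour policy to plot
```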

We observe the behavior of the two systems at the beginning, intermediate, and final stages of the experiment by using the neural states observed, as well as the decoder learned, up to the given stage. It is evident that the decoder can represent nonlinear policies. Day 1 (left column in Figure 15) shows that the neural states from the two classes are well separable. It was noted during Day 3 that the monkey seemed less engaged in the task than on Day 1. This suggests the possibility that during some trials the monkey was distracted and may not have been producing a consistent set of neural outputs. We can also see this phenomenon in the plots (right column in Figure 15): most of the misclassified neural states appear to be closer to the states corresponding to the opposite target in the projected state space. However, the estimated policy shows that the system effectively learns. Note that the initially misclassified A trials (red stars in Figure 15(d), which are located near the estimated policy boundary) are assigned to the correct direction once learning has been accomplished (Figure 15(f)). It is a remarkable fact that the system adapts to the environment online.

8. Conclusions

The advantages of KTD(λ) in neural decoding problems were observed. The key properties of this kernel-based learning algorithm are its capability for nonlinear function approximation and its convergence guarantees. We also examined the capability of the extended KTD algorithm (Q-KTD(λ)) in both open-loop and closed-loop reinforcement learning brain machine interface (RLBMI) experiments performing reaching tasks.

In the open-loop experiments, the results showed that Q-KTD(λ) can effectively learn the brain state-action mapping and offers performance advantages over conventional nonlinear function approximation methods such as time-delay neural networks. We observed that Q-KTD(λ) overcomes the main issues of conventional nonlinear function approximation methods, such as local minima and the need for proper initialization.

The results of the closed-loop RLBMI experiments showed that the algorithm succeeds in finding a proper mapping between neural states and desired actions. Its advantages are that it depends neither on the initialization nor on any prior information about the input states. Also, parameters can be chosen on the fly based on the observed input states. Moreover, we observed how the two intelligent systems coadaptively learn in an online reaching task. The results showed that KTD is well suited for practical applications due to its nonlinear approximation capabilities in online learning.

The observation and analysis of KTD(λ) give us a basic idea of how this algorithm behaves. However, in the case of Q-KTD(λ), the convergence analysis remains challenging, since Q-learning involves both a learning policy and a greedy policy. For Q-KTD(λ), the convergence proof for Q-learning using temporal differences (TD(λ)) with linear function approximation in [37] provides basic intuition about the role of function approximation in the convergence of Q-learning.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is partially supported by DARPA Contract N66001-10-C-2008. The authors would like to thank Pratik Chhatbar and Brandi Marsh for collecting the center-out reaching task data for the open-loop experiments.

References

[1] J. DiGiovanna, B. Mahmoudi, J. Fortes, J. C. Príncipe, and J. C. Sanchez, "Coadaptive brain-machine interface via reinforcement learning," IEEE Transactions on Biomedical Engineering, vol. 56, no. 1, pp. 54–64, 2009.

[2] B. Mahmoudi, Integrating robotic action with biologic perception: a brain machine symbiosis theory [Ph.D. dissertation], University of Florida, Gainesville, Fla, USA, 2010.

[3] E. A. Pohlmeyer, B. Mahmoudi, S. Geng, N. W. Prins, and J. C. Sanchez, "Using reinforcement learning to provide stable brain-machine interface control despite neural input reorganization," PLoS ONE, vol. 9, no. 1, Article ID e87253, 2014.

[4] S. Matsuzaki, Y. Shiina, and Y. Wada, "Adaptive classification for brain-machine interface with reinforcement learning," in Proceedings of the 18th International Conference on Neural Information Processing, vol. 7062, pp. 360–369, Shanghai, China, November 2011.

[5] M. J. Bryan, S. A. Martin, W. Cheung, and R. P. N. Rao, "Probabilistic co-adaptive brain-computer interfacing," Journal of Neural Engineering, vol. 10, no. 6, Article ID 066008, 2013.

[6] R. S. Sutton, "Learning to predict by the methods of temporal differences," Machine Learning, vol. 3, no. 1, pp. 9–44, 1988.

[7] J. A. Boyan, Learning evaluation functions for global optimization [Ph.D. dissertation], Carnegie Mellon University, 1998.

[8] S. J. Bradtke and A. G. Barto, "Linear least-squares algorithms for temporal difference learning," Machine Learning, vol. 22, pp. 33–57, 1996.

[9] A. Geramifard, M. Bowling, M. Zinkevich, and R. S. Sutton, "iLSTD: eligibility traces and convergence analysis," in Advances in Neural Information Processing Systems, pp. 441–448, 2007.

[10] R. S. Sutton, C. Szepesvári, and H. R. Maei, "A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation," in Proceedings of the 22nd Annual Conference on Neural Information Processing Systems (NIPS '08), pp. 1609–1616, MIT Press, December 2008.

[11] R. S. Sutton, H. R. Maei, D. Precup, et al., "Fast gradient-descent methods for temporal-difference learning with linear function approximation," in Proceedings of the 26th International Conference on Machine Learning (ICML '09), pp. 993–1000, June 2009.

[12] J. N. Tsitsiklis and B. Van Roy, "An analysis of temporal-difference learning with function approximation," IEEE Transactions on Automatic Control, vol. 42, no. 5, pp. 674–690, 1997.

[13] S. Haykin, Neural Networks and Learning Machines, Prentice Hall, 2009.

[14] B. Schölkopf and A. J. Smola, Learning with Kernels, MIT Press, 2002.

[15] Y. Engel, Algorithms and representations for reinforcement learning [Ph.D. dissertation], Hebrew University, 2005.

[16] X. Xu, T. Xie, D. Hu, and X. Lu, "Kernel least-squares temporal difference learning," International Journal of Information Technology, vol. 11, no. 9, pp. 54–63, 2005.

[17] J. Bae, P. Chhatbar, J. T. Francis, J. C. Sanchez, and J. C. Príncipe, "Reinforcement learning via kernel temporal difference," in Proceedings of the 33rd Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC '11), pp. 5662–5665, 2011.

[18] S. Zhao, From fixed to adaptive budget robust kernel adaptive filtering [Ph.D. dissertation], University of Florida, Gainesville, Fla, USA, 2012.

[19] Y. Engel, S. Mannor, and R. Meir, "The kernel recursive least-squares algorithm," IEEE Transactions on Signal Processing, vol. 52, no. 8, pp. 2275–2285, 2004.

[20] X. Xu, "A sparse kernel-based least-squares temporal difference algorithm for reinforcement learning," in Proceedings of the 2nd International Conference on Natural Computation, vol. 4221, pp. 47–56, 2006.

[21] B. Chen, S. Zhao, P. Zhu, and J. C. Príncipe, "Quantized kernel least mean square algorithm," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 1, pp. 22–32, 2012.

[22] J. Bae, L. S. Giraldo, P. Chhatbar, J. T. Francis, J. C. Sanchez, and J. C. Príncipe, "Stochastic kernel temporal difference for reinforcement learning," in Proceedings of the 21st IEEE International Workshop on Machine Learning for Signal Processing (MLSP '11), pp. 1–6, IEEE, September 2011.

[23] X. Chen, Y. Gao, and R. Wang, "Online selective kernel-based temporal difference learning," IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 12, pp. 1944–1956, 2013.

[24] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998.

[25] W. Liu, J. C. Príncipe, and S. Haykin, Kernel Adaptive Filtering: A Comprehensive Introduction, Wiley, 2010.

[26] J. Mercer, "Functions of positive and negative type and their connection with the theory of integral equations," Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 209, pp. 415–446, 1909.

[27] P. Dayan and T. J. Sejnowski, "TD(λ) converges with probability 1," Machine Learning, vol. 14, no. 3, pp. 295–301, 1994.

[28] H. J. Kushner and D. S. Clark, Stochastic Approximation Methods for Constrained and Unconstrained Systems, vol. 26 of Applied Mathematical Sciences, Springer, New York, NY, USA, 1978.

[29] C. J. C. H. Watkins, Learning from delayed rewards [Ph.D. dissertation], King's College, London, UK, 1989.

[30] C. Szepesvári, Algorithms for Reinforcement Learning, edited by R. J. Brachman and T. Dietterich, Morgan & Claypool, 2010.

[31] S. Zhao, B. Chen, P. Zhu, and J. C. Príncipe, "Fixed budget quantized kernel least-mean-square algorithm," Signal Processing, vol. 93, no. 9, pp. 2759–2770, 2013.

[32] W. Liu, I. Park, and J. C. Príncipe, "An information theoretic approach of designing sparse kernel adaptive filters," IEEE Transactions on Neural Networks, vol. 20, no. 12, pp. 1950–1961, 2009.

[33] J. A. Boyan, "Technical update: least-squares temporal difference learning," Machine Learning, vol. 49, pp. 233–246, 2002.

[34] C. J. C. H. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.

[35] J. C. Sanchez, A. Tarigoppula, J. S. Choi, et al., "Control of a center-out reaching task using a reinforcement learning Brain-Machine Interface," in Proceedings of the 5th International IEEE/EMBS Conference on Neural Engineering (NER '11), pp. 525–528, May 2011.

[36] J. Bae, L. G. Sanchez Giraldo, E. A. Pohlmeyer, J. C. Sanchez, and J. C. Príncipe, "A new method of concurrently visualizing states, values, and actions in reinforcement based brain machine interfaces," in Proceedings of the 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC '13), pp. 5402–5405, July 2013.

[37] F. S. Melo, S. P. Meyn, and M. I. Ribeiro, "An analysis of reinforcement learning with function approximation," in Proceedings of the 25th International Conference on Machine Learning, pp. 664–671, July 2008.

12 Computational Intelligence and Neuroscience

0 1 2 3 4 518

19

20

21

22

23

24

25

26

27

28

02

03

04

05

06

07

08

09

minus1minus2minus3minus4minus5

X

Y

minus06

Figure 12 Reward distribution for right target The black diamondis the initial position and the purple diamond shows the possibledirections including the assigned target direction (red diamond)

up employed for the single step task that is a discretereward value is assigned at the target causes extremely slowlearning since not enough guidance is given The systemrequires long periods of exploration until it actually reachesthe target Therefore we employ a continuous reward distri-bution around the selected target defined by the followingexpression

119903 (119904) =

119901reward119866 (119904) if 119866 (119904) gt 01

119899reward if 119866 (119904) le 01(30)

where119866(119904) = exp[(119904minus120583)⊤Cminus1120579(119904minus120583)] 119904 isin R2 is the position of

the cursor 119901reward = 1 and 119899reward = minus06 The mean vector 120583corresponds to the selected target location and the covariancematrix

C120579= R120579(75 0

0 01)R⊤120579 R

120579= (

cos 120579 sin 120579minus sin 120579 cos 120579

) (31)

which depends on the angle 120579 of the selected target as followsfor target index one and five the angle is 0 two and six are forminus1205874 three and seven are for 1205872 and four and eight are for1205874 (Here the target indexes follow the location depicted onFigure 6 in [22]) Figure 12 shows the reward distribution fortarget index one The same form of distribution is applied tothe other directions centred at the assigned target point

Once the system reaches the assigned target the systemearns a maximum reward of +1 and receives partial rewardsaccording to (30) during the approaching stage When thesystem earns the maximum reward the trial is classified asa successful trial The maximum number of steps per trialis limited such that the cursor must approach the target in astraight line trajectory Here we also control the complexityof the task by allowing different number of targets and stepsNamely 2-step 4-target (right up left and down) and 4-step

3-target (right up and down) experiments are performedIncreasing the number of steps per trial amounts to makingsmaller jumps according to each action After each epochthe number of successful trials is counted for each targetdirection Figure 13 shows the learning curves for each targetand the average success rates

Larger number of steps results in lower success ratesHowever the two cases (two and four steps) obtain anaverage success rate above 60for 1 epochTheperformancesshow all directions can achieve success rates above 70after convergence which encourage the application of thealgorithm to online scenarios

72 Closed-Loop RLBMI Experiments In closed loop RLBMIexperiments the behavioral task is a reaching task using arobotic arm The decoder controls the robot armrsquos actiondirection by predicting the monkeyrsquos intent based on itsneuronal activity If the robot arm reaches the assigned targeta reward is given to both the monkey (food reward) andthe decoder (positive value) Notice that the two intelligentsystems learn coadaptively to accomplish the goal Theseexperiments are conducted in cooperation with the Neu-roprosthetics Research Group at the University of MiamiThe performance is evaluated in terms of task completionaccuracy and speed Furthermore we provide amethodologyto tease apart the influence of each one of the systems of theRLBMI in the overall performance

721 Environment During pretraining a marmoset monkeywas trained to perform a target reaching task namelymovinga robot arm to two spatial locations denoted as A trial and Btrial The monkey was taught to associate changes in motoractivity during A trials and produce static motor responsesduring B trials Once a target is assigned a beep signalsthe start of the trial To control the robot during the usertraining phase the monkey is required to steadily place itshand on a touch pad for 700sim1200msThis action produces ago beep that is followed by the activation of one of the twotarget LEDs (A trial red light for left direction or B trialgreen light for right direction)The robot arm goes to a homeposition namely the center position between the two targetsIts gripper shows an object (food reward such as waxwormor marshmallow for A trial and undesirable object (woodenbead) for B trial) To move the robot to the A locationthe monkey needed to reach out and touch a sensor within2000ms and to make the robot reach to the B target themonkey needed to keep its arm motionless on the touch padfor 2500msWhen the monkey successfully moved the robotto the correct target the target LEDs would blink and themonkey would receive a food reward (for both the A and Btargets)

After the monkey is trained to perform the assignedtask properly a microelectrode array (16-channel tungstenmicroelectrode arrays Tucker Davis Technologies FL) issurgically implanted under isoflurane anesthesia and sterileconditions Neural states from the motor cortex (M1) arerecorded These neural states become the inputs to theneural decoder All surgical and animal care procedures were

Computational Intelligence and Neuroscience 13

0 2 4 6 8 100

1

Epochs

AverageRightUp

LeftDown

01

02

03

04

05

06

07

08

09

Succ

ess r

ates

(a) 2-step 4-target

0 2 4 6 8 100

1

Epochs

Succ

ess r

ates

AverageRight Up

Down

01

02

03

04

05

06

07

08

09

(b) 4-step 3-target

Figure 13 The learning curves for multistep multitarget tasks

consistent with the National Research Council Guide for theCare and Use of Laboratory Animals and were approved bythe University of Miami Institutional Animal Care and UseCommittee

In the closed-loop experiments after the initial holdingtime that produces the go beep the robotic armrsquos positionis updated based solely on the monkeyrsquos neural statesDifferently from the user pretraining sessions the monkeyis not required to perform any movement During the real-time experiment 14 neurons are obtained from 10 electrodesThe neural states are represented by the firing rates on a 2 secwindow following the go signal

722 Agent For the BMI decoder we use 119876-learning viakernel Temporal Differences (119876-KTD)(120582) One big differ-ence between open-loop and closed-loop applications is theamount of accessible data in the closed-loop experiments wecan only get information about the neural states that havebeen observed up to the present However in the previousoffline experiments normalization and kernel selection wereconducted offline based on the entire data set It is notpossible to apply the same method to the online settingsince we only have information about the input states upto the present time Normalization is a scaling procedurethat interplays with the choice of the kernel size Properselection of the kernel size brings proper scaling to the dataThus in contrast to the previous open-loop experimentsnormalization of the input neural states is not applied andthe kernel size is automatically selected given the inputs

The Gaussian kernel (28) is employed and the kernel sizeℎ is automatically selected based on the history of inputsNotethat in the closed-loop experiments the dynamic range ofstates varies from experiment to experiment Consequently

the kernel size needs to be re-adjusted each time a new exper-iment takes place and it cannot be determined beforehandAt each time the distances between the current state and thepreviously observed states are computed to obtain the outputvalues119876 in this caseTherefore we use the distance values toselect the kernel size as follows

ℎtemp (119899) = radic1

2 (119899 minus 1)

119899minus1

sum

119894=1

119909 (119894) minus 119909 (119899)2

ℎ (119899) =1

119899[

119899minus1

sum

119894=1

ℎ (119894) + ℎtemp (119899)]

(32)

Using the squared distance between pairs of previously seeninput states we can obtain an estimate of the mean distanceThis value is also averaged along with past kernel sizes toobtain the current kernel size

Moreover we consider 120574 = 1 and 120582 = 0 since ourexperiments perform single step trials Stepsize 120578 = 05 isapplied The output represents the 2 possible directions (leftand right) and the robot arm moves based on the estimatedoutput from the decoder

723 Results Theoverall performance is evaluated by check-ing whether the robot arm reaches the assigned target Oncethe robot arm reaches the target the decoder gets a positivereward +1 otherwise it receives negative reward minus1

Table 2 shows the decoder performance over 4 days interms of success rates Each day corresponds to a separateexperiment In Day 1 the experiment has a total of 20 trials(10A trials and 10 B trials)The overall success rate was 90Only the first trial for each target was incorrectly assigned

14 Computational Intelligence and Neuroscience

0 5 10 15 20

0

1

A trialB trial

A trialB trial

A trialB trial

0 5 10 15 20

0

1

TD er

ror

A trialB trial

A trialB trial

A trialB trial

0 5 10 15 20

0

1

Trial numbers

0

1

0 10 20 30 40 50

0

1

TD er

ror

0 10 20 30 40 50

0

1

Trial numbers

Trial numbers Trial numbers

Trial numbers Trial numbers0 10 20 30 40 50

minus1

minus1

minus1

minus1

minus1

Qva

lue

Qva

lue 05

minus05

minus2S(1)F

(minus1)

inde

x

S(1)F

(minus1)

inde

x

Figure 14 Performance of 119876-learning via KTD in the closed loop RLBMI controlled by a monkey for Day 1 (left) and Day 3 (right) thesuccess (+1) index and failure (minus1) index of each trial (top) the change of TD error (middle) and the change of 119876-values (down)

Table 2 Success rates of 119876-KTD in closed-loop RLBMI

Total trial numbers(total A B trial)

Success rates()

Day 1 20 (10 10) 9000Day 2 32 (26 26) 8438Day 3 53 (37 36) 7736Day 4 52 (37 35) 7885

Note that at each day the same experimental set upwas utilized The decoder was initialized in the same way ateach day We did not use pretrained parameters to initializethe system To understand the variation of the success ratesacross days we look at the performance of Day 1 and

Day 3 Figure 14 shows the decoder performance for the 2experiments

Although the success rate for Day 3 is not as high asDay 1 both experiments show that the algorithm learns anappropriate neural state to action map Even though thereis variation among the neural states within each day thedecoder adapts well to minimize the TD error and the 119876-values converge to the desired values for each action Becausethis is a single step task and the reward +1 is assigned for asuccessful trial it is desired for the estimated action value 119876to be close to +1

It is observed that the TD error and 119876-values oscillateThe drastic change of TD error or119876-value corresponds to themissed trials The overall performance can be evaluated bychecking whether the robot arm reaches the desired target

Computational Intelligence and Neuroscience 15

0 20 40 60 80

0

20

First component

Seco

nd co

mpo

nent minus20

minus40

minus60

minus20minus40minus60

minus80

minus100

minus120

(a) After 3 trials

0

0

50 100 150 200

20

First component

Seco

nd co

mpo

nent

minus20

minus40

minus60

minus80

minus100

minus120

minus140

minus160

minus180

minus50

(b) After 3 trials

First component

Seco

nd co

mpo

nent

0 20 40 60 80

0

20

minus20

minus40

minus60

minus20minus40minus60

minus80

minus100

minus120

(c) After 10 trials

0

0

50 100 150 200

20

First component

Seco

nd co

mpo

nent

minus20

minus40

minus60

minus80

minus100

minus120

minus140

minus160

minus180

minus50

(d) After 30 trials

0

First component

Seco

nd co

mpo

nent

0 20 40 60 80

20

minus20

minus40

minus60

minus20minus40minus60

minus80

minus100

minus120

PolicyA trialB trial

(e) After 20 trials

0

0

50 100 150 200

20

First component

Seco

nd co

mpo

nent

minus20

minus40

minus60

minus80

minus100

minus120

minus140

minus160

minus180

minus50

PolicyA trialB trial

(f) After 57 trials

Figure 15 Estimated policy for the projected neural states from Day 1 (left) and Day 3 (right) The failed trials during the closed loopexperiment are marked as red stars (missed A trials) and green dots (missed B trials)

16 Computational Intelligence and Neuroscience

(the top plots in Figure 14) However this assessment doesnot show what causes the change in the system values Inaddition it is hard to know how the two separate intelligentsystems interact during learning and how neural states affectthe overall performance

Under the coadaptation scenario in the RLBMI archi-tecture it is obvious that if one system does not performproperly it will cause detrimental effects on the performanceof the other system If the BMI decoder does not giveproper updates to the robotic device it will confuse the userconducting the task and if the user gives improper stateinformation or the translation is wrong the resulting updatemay fail even though the BMI decoder was able to find theoptimal mapping function

Using the proposed methodology introduced in [36] wecan observe how the decoder effectively learns a good state toaction mapping and how neural states affect the predictionperformance Figure 15 shows how each participant (theagent and the user) influences the overall performance inboth successful and missed trials and how the agent adaptsthe environment By applying principal component analysis(PCA) the high-dimensional neural states can be visualizedin two dimensions using the first two largest principalcomponents In this two-dimensional space of projectedneural states we can visualize the estimated policy as well

We observe the behavior of two systems at the beginningintermediate and final stages of the experiment by usingthe neural states that have been observed as well as thelearned decoder up to the given stage It is evident that thedecoder can predict nonlinear policies Day 1 (left columnin Figure 15) shows that the neural states from the twoclasses are well separable It was noted during Day 3 thatthe monkey seemed less engaged in the task than in Day1 This suggests the possibility that during some trials themonkey was distracted and may not have been producing aconsistent set of neural outputs We are also able to see thisphenomenon from the plots (right column in Figure 15) Wecan see that most of the neural states that were misclassifiedappear to be closer to the states corresponding to the oppositetarget in the projected state space However the estimatedpolicy shows that the system effectively learns Note that theinitially misclassified A trials (red stars in Figure 15(d) whichare located near the estimated policy boundary) are assignedto the right direction when learning has been accomplished(Figure 15(f)) It is a remarkable fact that the system adapts tothe environment online

8 Conclusions

The advantages of KTD(120582) in neural decoding problems wereobserved The key observations of this kernel-based learningalgorithm are its capabilities for nonlinear function approx-imation and its convergence guarantees We also examinedthe capability of the extended KTD algorithm (119876-KTD(120582))in both open-loop and closed-loop reinforcement learningbrain machine interface (RLBMI) experiments to performreaching tasks

In open-loop experiments results showed that 119876-KTD(120582) can effectively learn the brain state-action mappingand offer performance advantages over conventional non-linear function approximation methods such as time-delayneural nets We observed that 119876-KTD(120582) overcomes mainissues of conventional nonlinear function approximationmethods such as local minima and proper initialization

Results on closed-loop RLBMI experiments showed thatthe algorithm succeeds in finding a proper mapping betweenneural states and desired actions Its advantages are that itdoes not depend on the initialization neither require anyprior information about input states Also parameters canbe chosen on the fly based on the observed input statesMoreover we observed how the two intelligent systems coa-daptively learn in an online reaching taskThe results showedthat KTD is powerful for practical applications due to itsnonlinear approximation capabilities in online learning

The observation and analysis of KTD(120582) give us a basicidea of how this algorithm behaves However in the caseof 119876-KTD(120582) the convergence analysis remains challengingsince 119876-learning contains both a learning policy and agreedy policy For 119876-KTD(120582) the convergence proof for119876-learning using temporal difference (TD)(120582) with linearfunction approximation in [37] can provide a basic intuitionfor the role of function approximation on the convergence of119876-learning

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

Thiswork is partially supported byDARPAContractN66001-10-C-2008 The authors would like to thank Pratik Chhatbarand Brandi Marsh for collecting the center-out reaching taskdata for the open loop experiments

References

[1] J DiGiovanna B Mahmoudi J Fortes J C Principe and JC Sanchez ldquoCoadaptive brain-machine interface via reinforce-ment learningrdquo IEEE Transactions on Biomedical Engineeringvol 56 no 1 pp 54ndash64 2009

[2] BMahmoudi Integrating robotic actionwith biologic perceptiona brainmachine symbiosis theory [PhD dissertation] Universityof Florida Gainesville Fla USA 2010

[3] E A Pohlmeyer B Mahmoudi S Geng N W Prins and J CSanchez ldquoUsing reinforcement learning to provide stable brain-machine interface control despite neural input reorganizationrdquoPLoS ONE vol 9 no 1 Article ID e87253 2014

[4] S Matsuzaki Y Shiina and Y Wada ldquoAdaptive classificationfor brainmachine interface with reinforcement learningrdquo inProceedings of the 18th International Conference on NeuralInformation Processing vol 7062 pp 360ndash369 Shanghai ChinaNovember 2011

[5] M J Bryan S A Martin W Cheung and R P N RaoldquoProbabilistic co-adaptive brain-computer interfacingrdquo Journalof Neural Engineering vol 10 no 6 Article ID 066008 2013

Computational Intelligence and Neuroscience 17

[6] R S Sutton ldquoLearning to predict by the methods of temporaldifferencesrdquoMachine Learning vol 3 no 1 pp 9ndash44 1988

[7] J A Boyan Learning evaluation functions for global optimiza-tion [PhD dissertation] Carnegie Mellon University 1998

[8] S J Bradtke and A G Barto ldquoLinear least-squares algorithmsfor temporal difference learningrdquoMachine Learning vol 22 pp33ndash57 1996

[9] A Geramifard M Bowling M Zinkevich and R S Suttonldquoilstd eligibility traces and convergence analysisrdquo in Advancesin Neural Information Processing Systems pp 441ndash448 2007

[10] R S Sutton C Szepesvari and H R Maei ldquoA convergentO(n) algorithm for off-policy temporal-difference learningwithlinear function approximationrdquo in Proceedings of the 22ndAnnual Conference on Neural Information Processing Systems(NIPS rsquo08) pp 1609ndash1616 MIT Press December 2008

[11] R S Sutton H R Maei D Precup et al ldquoFast gradient-descent methods for temporal-difference learning with linearfunction approximationrdquo in Proceeding of the 26th InternationalConference On Machine Learning (ICML rsquo09) pp 993ndash1000June 2009

[12] J N Tsitsiklis and B Van Roy ldquoAn analysis of temporal-difference learning with function approximationrdquo IEEE Trans-actions on Automatic Control vol 42 no 5 pp 674ndash690 1997

[13] S Haykin Neural Networks and Learning Machines PrenticeHall 2009

[14] B Scholkopf and A J Smola Learning with Kernels MIT Press2002

[15] Y EngelAlgorithms and representations for reinforcement learn-ing [PhD dissertation] Hebrew University 2005

[16] X Xu T Xie D Hu and X Lu ldquoKernel least-squares temporaldifference learningrdquo International Journal of Information Tech-nology vol 11 no 9 pp 54ndash63 2005

[17] J Bae P Chhatbar J T Francis J C Sanchez and J C PrincipeldquoReinforcement learning via kernel temporal differencerdquo inProceedings of the 33rd Annual International Conference of theIEEE onEngineering inMedicine andBiology Society (EMBC 11)pp 5662ndash5665 2011

[18] S Zhao From fixed to adaptive budget robust kernel adaptivefiltering [PhD dissertation] University of Florida GainesvilleFla USA 2012

[19] Y Engel S Mannor and R Meir ldquoThe kernel recursive least-squares algorithmrdquo IEEE Transactions on Signal Processing vol52 no 8 pp 2275ndash2285 2004

[20] X Xu ldquoA sparse kernel-based least-squares temporal differencealgorithms for reinforcement learningrdquo inProceedings of the 2ndInternational Conference on Natural Computation vol 4221 pp47ndash56 2006

[21] B Chen S Zhao P Zhu and J C Principe ldquoQuantized kernelleast mean square algorithmrdquo IEEE Transactions on NeuralNetworks and Learning Systems vol 23 no 1 pp 22ndash32 2012

[22] J Bae L S Giraldo P Chhatbar J T Francis J C Sanchezand J C Principe ldquoStochastic kernel temporal difference forreinforcement learningrdquo in Proceedings of the 21st IEEE Inter-national Workshop on Machine Learning for Signal Processing(MLSP rsquo11) pp 1ndash6 IEEE September 2011

[23] X Chen Y Gao and R Wang ldquoOnline selective kernel-basedtemporal difference learningrdquo IEEE Transactions on NeuralNetworks and Learning Systems vol 24 no 12 pp 1944ndash19562013

[24] R S Rao and A G Barto Reinforcement Learning An Introduc-tion MIT Press New York NY USA 1998

[25] W Liu J C Principe and S Haykin Kernel Adaptive FilteringA Comprehensive Introduction Wiley 2010

[26] J Mercer ldquoFunctions of positive and negative type and theirconnection with the theory of integral equationsrdquo PhilosophicalTransactions of the Royal Society A Mathematical Physical andEngineering Sciences vol 209 pp 415ndash446 1909

[27] P Dayan and T J Sejnowski ldquoTD(120582) converges with probability1rdquoMachine Learning vol 14 no 3 pp 295ndash301 1994

[28] H J Kushner andD S Clark Stochastic ApproximationMethodsfor Constrained and Unconstrained Systems vol 26 of AppliedMathematical Sciences Springer New York NY USA 1978

[29] C J C H Watkins Learning from delayed rewards [PhDdissertation] Kingrsquos College London UK 1989

[30] C Szepesvari Algorithms for Reinforcement Learning edited byR J Branchman and T Dietterich Morgan amp Slaypool 2010

[31] S Zhao B Chen P Zhu and J C Prıncipe ldquoFixed budgetquantized kernel least-mean-square algorithmrdquo Signal Process-ing vol 93 no 9 pp 2759ndash2770 2013

[32] W Liu I Park and J C Prıncipe ldquoAn information theoreticapproach of designing sparse kernel adaptive filtersrdquo IEEETransactions on Neural Networks vol 20 no 12 pp 1950ndash19612009

[33] J A Boyan ldquoTechnical update least-squares temporal differ-ence learningrdquoMachine Learning vol 49 pp 233ndash246 2002

[34] C J C H Watkins and P Dayan ldquoQ-learningrdquo MachineLearning vol 8 no 3-4 pp 279ndash292 1992

[35] J C Sanchez A Tarigoppula J S Choi et al ldquoControl of acenter-out reaching task using a reinforcement learning Brain-Machine Interfacerdquo in Proceedings of the 5th InternationalIEEEEMBS Conference on Neural Engineering (NER rsquo11) pp525ndash528 May 2011

[36] J Bae L G Sanchez Giraldo E A Pohlmeyer J C Sanchezand J C Principe ldquoA new method of concurrently visualizingstates values and actions in reinforcement based brainmachineinterfacesrdquo in Proceedings of the 35th Annual InternationalConference of the IEEE Engineering in Medicine and BiologySociety (EMBC rsquo13) pp 5402ndash5405 July 2013

[37] F S Melo S P Meyn and M I Ribeiro ldquoAn analysisof reinforcement learning with function approximationrdquo inProceedings of the 25th International Conference on MachineLearning pp 664ndash671 July 2008

Computational Intelligence and Neuroscience 13

0 2 4 6 8 100

1

Epochs

AverageRightUp

LeftDown

01

02

03

04

05

06

07

08

09

Succ

ess r

ates

(a) 2-step 4-target

0 2 4 6 8 100

1

Epochs

Succ

ess r

ates

AverageRight Up

Down

01

02

03

04

05

06

07

08

09

(b) 4-step 3-target

Figure 13 The learning curves for multistep multitarget tasks

consistent with the National Research Council Guide for theCare and Use of Laboratory Animals and were approved bythe University of Miami Institutional Animal Care and UseCommittee

In the closed-loop experiments after the initial holdingtime that produces the go beep the robotic armrsquos positionis updated based solely on the monkeyrsquos neural statesDifferently from the user pretraining sessions the monkeyis not required to perform any movement During the real-time experiment 14 neurons are obtained from 10 electrodesThe neural states are represented by the firing rates on a 2 secwindow following the go signal

722 Agent For the BMI decoder we use 119876-learning viakernel Temporal Differences (119876-KTD)(120582) One big differ-ence between open-loop and closed-loop applications is theamount of accessible data in the closed-loop experiments wecan only get information about the neural states that havebeen observed up to the present However in the previousoffline experiments normalization and kernel selection wereconducted offline based on the entire data set It is notpossible to apply the same method to the online settingsince we only have information about the input states upto the present time Normalization is a scaling procedurethat interplays with the choice of the kernel size Properselection of the kernel size brings proper scaling to the dataThus in contrast to the previous open-loop experimentsnormalization of the input neural states is not applied andthe kernel size is automatically selected given the inputs

The Gaussian kernel (28) is employed and the kernel sizeℎ is automatically selected based on the history of inputsNotethat in the closed-loop experiments the dynamic range ofstates varies from experiment to experiment Consequently

the kernel size needs to be re-adjusted each time a new exper-iment takes place and it cannot be determined beforehandAt each time the distances between the current state and thepreviously observed states are computed to obtain the outputvalues119876 in this caseTherefore we use the distance values toselect the kernel size as follows

ℎtemp (119899) = radic1

2 (119899 minus 1)

119899minus1

sum

119894=1

119909 (119894) minus 119909 (119899)2

ℎ (119899) =1

119899[

119899minus1

sum

119894=1

ℎ (119894) + ℎtemp (119899)]

(32)

Using the squared distance between pairs of previously seeninput states we can obtain an estimate of the mean distanceThis value is also averaged along with past kernel sizes toobtain the current kernel size

Moreover we consider 120574 = 1 and 120582 = 0 since ourexperiments perform single step trials Stepsize 120578 = 05 isapplied The output represents the 2 possible directions (leftand right) and the robot arm moves based on the estimatedoutput from the decoder

723 Results Theoverall performance is evaluated by check-ing whether the robot arm reaches the assigned target Oncethe robot arm reaches the target the decoder gets a positivereward +1 otherwise it receives negative reward minus1

Table 2 shows the decoder performance over 4 days interms of success rates Each day corresponds to a separateexperiment In Day 1 the experiment has a total of 20 trials(10A trials and 10 B trials)The overall success rate was 90Only the first trial for each target was incorrectly assigned

14 Computational Intelligence and Neuroscience

0 5 10 15 20

0

1

A trialB trial

A trialB trial

A trialB trial

0 5 10 15 20

0

1

TD er

ror

A trialB trial

A trialB trial

A trialB trial

0 5 10 15 20

0

1

Trial numbers

0

1

0 10 20 30 40 50

0

1

TD er

ror

0 10 20 30 40 50

0

1

Trial numbers

Trial numbers Trial numbers

Trial numbers Trial numbers0 10 20 30 40 50

minus1

minus1

minus1

minus1

minus1

Qva

lue

Qva

lue 05

minus05

minus2S(1)F

(minus1)

inde

x

S(1)F

(minus1)

inde

x

Figure 14 Performance of 119876-learning via KTD in the closed loop RLBMI controlled by a monkey for Day 1 (left) and Day 3 (right) thesuccess (+1) index and failure (minus1) index of each trial (top) the change of TD error (middle) and the change of 119876-values (down)

Table 2 Success rates of 119876-KTD in closed-loop RLBMI

Total trial numbers(total A B trial)

Success rates()

Day 1 20 (10 10) 9000Day 2 32 (26 26) 8438Day 3 53 (37 36) 7736Day 4 52 (37 35) 7885

Note that at each day the same experimental set upwas utilized The decoder was initialized in the same way ateach day We did not use pretrained parameters to initializethe system To understand the variation of the success ratesacross days we look at the performance of Day 1 and

Day 3 Figure 14 shows the decoder performance for the 2experiments

Although the success rate for Day 3 is not as high asDay 1 both experiments show that the algorithm learns anappropriate neural state to action map Even though thereis variation among the neural states within each day thedecoder adapts well to minimize the TD error and the 119876-values converge to the desired values for each action Becausethis is a single step task and the reward +1 is assigned for asuccessful trial it is desired for the estimated action value 119876to be close to +1

It is observed that the TD error and 119876-values oscillateThe drastic change of TD error or119876-value corresponds to themissed trials The overall performance can be evaluated bychecking whether the robot arm reaches the desired target

Figure 15: Estimated policy for the projected neural states from Day 1 (left) and Day 3 (right). Panels: (a) after 3 trials, (b) after 3 trials, (c) after 10 trials, (d) after 30 trials, (e) after 20 trials, and (f) after 57 trials; the axes are the first and second principal components. The failed trials during the closed-loop experiment are marked as red stars (missed A trials) and green dots (missed B trials).


(the top plots in Figure 14). However, this assessment does not show what causes the change in the system values. In addition, it is hard to know how the two separate intelligent systems interact during learning and how neural states affect the overall performance.

Under the coadaptation scenario in the RLBMI architecture, it is obvious that if one system does not perform properly, it will cause detrimental effects on the performance of the other system. If the BMI decoder does not give proper updates to the robotic device, it will confuse the user conducting the task; and if the user gives improper state information or the translation is wrong, the resulting update may fail even though the BMI decoder was able to find the optimal mapping function.

Using the methodology introduced in [36], we can observe how the decoder effectively learns a good state-to-action mapping and how neural states affect the prediction performance. Figure 15 shows how each participant (the agent and the user) influences the overall performance in both successful and missed trials and how the agent adapts to the environment. By applying principal component analysis (PCA), the high-dimensional neural states can be visualized in two dimensions using the two largest principal components. In this two-dimensional space of projected neural states, we can visualize the estimated policy as well.
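
A minimal sketch of this kind of visualization is shown below, using scikit-learn's PCA and matplotlib; the function name, grid resolution, and back-projection trick for drawing the policy regions are our own assumptions and only approximate the approach of [36].

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_states_and_policy(states, labels, q_function):
    """Project neural states onto the two largest principal components and
    overlay the greedy-action regions of a learned Q-function.

    states:     neural states, shape (n_trials, d)
    labels:     0/1 array marking A trials and B trials (used only for coloring)
    q_function: callable mapping a d-dimensional state to a vector of Q-values
    """
    states = np.asarray(states, dtype=float)
    labels = np.asarray(labels)
    pca = PCA(n_components=2)
    z = pca.fit_transform(states)

    # Evaluate the greedy action on a grid in the projected plane by mapping
    # grid points back to the original state space (only an approximation of
    # the policy boundary, since PCA discards the remaining components).
    g1, g2 = np.meshgrid(np.linspace(z[:, 0].min(), z[:, 0].max(), 100),
                         np.linspace(z[:, 1].min(), z[:, 1].max(), 100))
    grid = np.c_[g1.ravel(), g2.ravel()]
    actions = np.array([np.argmax(q_function(x))
                        for x in pca.inverse_transform(grid)])

    plt.contourf(g1, g2, actions.reshape(g1.shape), alpha=0.3)   # estimated policy
    plt.scatter(z[labels == 0, 0], z[labels == 0, 1], marker='*', label='A trial')
    plt.scatter(z[labels == 1, 0], z[labels == 1, 1], marker='o', label='B trial')
    plt.xlabel('First component')
    plt.ylabel('Second component')
    plt.legend()
    plt.show()
```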

We observe the behavior of the two systems at the beginning, intermediate, and final stages of the experiment by using the neural states observed up to the given stage together with the decoder learned up to that point. It is evident that the decoder can learn nonlinear policies. Day 1 (left column in Figure 15) shows that the neural states from the two classes are well separable. It was noted during Day 3 that the monkey seemed less engaged in the task than in Day 1. This suggests the possibility that during some trials the monkey was distracted and may not have been producing a consistent set of neural outputs. We can also see this phenomenon in the plots (right column in Figure 15): most of the neural states that were misclassified appear closer to the states corresponding to the opposite target in the projected state space. Nevertheless, the estimated policy shows that the system learns effectively. Note that the initially misclassified A trials (red stars in Figure 15(d), which are located near the estimated policy boundary) are assigned to the correct direction once learning has been accomplished (Figure 15(f)). It is remarkable that the system adapts to the environment online.

8. Conclusions

The advantages of KTD(λ) in neural decoding problems were observed. The key properties of this kernel-based learning algorithm are its capability for nonlinear function approximation and its convergence guarantees. We also examined the capability of the extended KTD algorithm (Q-KTD(λ)) in both open-loop and closed-loop reinforcement learning brain machine interface (RLBMI) experiments to perform reaching tasks.

In open-loop experiments, results showed that Q-KTD(λ) can effectively learn the brain state-action mapping and offer performance advantages over conventional nonlinear function approximation methods such as time-delay neural networks. We observed that Q-KTD(λ) overcomes main issues of conventional nonlinear function approximation methods, such as local minima and sensitivity to initialization.

Results on closed-loop RLBMI experiments showed that the algorithm succeeds in finding a proper mapping between neural states and desired actions. Its advantages are that it neither depends on the initialization nor requires any prior information about the input states. Also, parameters can be chosen on the fly based on the observed input states. Moreover, we observed how the two intelligent systems coadaptively learn in an online reaching task. The results showed that KTD is powerful for practical applications due to its nonlinear approximation capabilities in online learning.

The observation and analysis of KTD(λ) give us a basic idea of how this algorithm behaves. However, in the case of Q-KTD(λ), the convergence analysis remains challenging since Q-learning contains both a learning policy and a greedy policy. For Q-KTD(λ), the convergence proof for Q-learning using temporal difference TD(λ) with linear function approximation in [37] can provide basic intuition for the role of function approximation in the convergence of Q-learning.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is partially supported by DARPA Contract N66001-10-C-2008. The authors would like to thank Pratik Chhatbar and Brandi Marsh for collecting the center-out reaching task data for the open-loop experiments.

References

[1] J. DiGiovanna, B. Mahmoudi, J. Fortes, J. C. Principe, and J. C. Sanchez, "Coadaptive brain-machine interface via reinforcement learning," IEEE Transactions on Biomedical Engineering, vol. 56, no. 1, pp. 54–64, 2009.

[2] B. Mahmoudi, Integrating robotic action with biologic perception: a brain machine symbiosis theory [Ph.D. dissertation], University of Florida, Gainesville, Fla, USA, 2010.

[3] E. A. Pohlmeyer, B. Mahmoudi, S. Geng, N. W. Prins, and J. C. Sanchez, "Using reinforcement learning to provide stable brain-machine interface control despite neural input reorganization," PLoS ONE, vol. 9, no. 1, Article ID e87253, 2014.

[4] S. Matsuzaki, Y. Shiina, and Y. Wada, "Adaptive classification for brain machine interface with reinforcement learning," in Proceedings of the 18th International Conference on Neural Information Processing, vol. 7062, pp. 360–369, Shanghai, China, November 2011.

[5] M. J. Bryan, S. A. Martin, W. Cheung, and R. P. N. Rao, "Probabilistic co-adaptive brain-computer interfacing," Journal of Neural Engineering, vol. 10, no. 6, Article ID 066008, 2013.

[6] R. S. Sutton, "Learning to predict by the methods of temporal differences," Machine Learning, vol. 3, no. 1, pp. 9–44, 1988.

[7] J. A. Boyan, Learning evaluation functions for global optimization [Ph.D. dissertation], Carnegie Mellon University, 1998.

[8] S. J. Bradtke and A. G. Barto, "Linear least-squares algorithms for temporal difference learning," Machine Learning, vol. 22, pp. 33–57, 1996.

[9] A. Geramifard, M. Bowling, M. Zinkevich, and R. S. Sutton, "iLSTD: eligibility traces and convergence analysis," in Advances in Neural Information Processing Systems, pp. 441–448, 2007.

[10] R. S. Sutton, C. Szepesvari, and H. R. Maei, "A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation," in Proceedings of the 22nd Annual Conference on Neural Information Processing Systems (NIPS '08), pp. 1609–1616, MIT Press, December 2008.

[11] R. S. Sutton, H. R. Maei, D. Precup et al., "Fast gradient-descent methods for temporal-difference learning with linear function approximation," in Proceedings of the 26th International Conference on Machine Learning (ICML '09), pp. 993–1000, June 2009.

[12] J. N. Tsitsiklis and B. Van Roy, "An analysis of temporal-difference learning with function approximation," IEEE Transactions on Automatic Control, vol. 42, no. 5, pp. 674–690, 1997.

[13] S. Haykin, Neural Networks and Learning Machines, Prentice Hall, 2009.

[14] B. Scholkopf and A. J. Smola, Learning with Kernels, MIT Press, 2002.

[15] Y. Engel, Algorithms and representations for reinforcement learning [Ph.D. dissertation], Hebrew University, 2005.

[16] X. Xu, T. Xie, D. Hu, and X. Lu, "Kernel least-squares temporal difference learning," International Journal of Information Technology, vol. 11, no. 9, pp. 54–63, 2005.

[17] J. Bae, P. Chhatbar, J. T. Francis, J. C. Sanchez, and J. C. Principe, "Reinforcement learning via kernel temporal difference," in Proceedings of the 33rd Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC '11), pp. 5662–5665, 2011.

[18] S. Zhao, From fixed to adaptive budget robust kernel adaptive filtering [Ph.D. dissertation], University of Florida, Gainesville, Fla, USA, 2012.

[19] Y. Engel, S. Mannor, and R. Meir, "The kernel recursive least-squares algorithm," IEEE Transactions on Signal Processing, vol. 52, no. 8, pp. 2275–2285, 2004.

[20] X. Xu, "A sparse kernel-based least-squares temporal difference algorithm for reinforcement learning," in Proceedings of the 2nd International Conference on Natural Computation, vol. 4221, pp. 47–56, 2006.

[21] B. Chen, S. Zhao, P. Zhu, and J. C. Principe, "Quantized kernel least mean square algorithm," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 1, pp. 22–32, 2012.

[22] J. Bae, L. S. Giraldo, P. Chhatbar, J. T. Francis, J. C. Sanchez, and J. C. Principe, "Stochastic kernel temporal difference for reinforcement learning," in Proceedings of the 21st IEEE International Workshop on Machine Learning for Signal Processing (MLSP '11), pp. 1–6, IEEE, September 2011.

[23] X. Chen, Y. Gao, and R. Wang, "Online selective kernel-based temporal difference learning," IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 12, pp. 1944–1956, 2013.

[24] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, New York, NY, USA, 1998.

[25] W. Liu, J. C. Principe, and S. Haykin, Kernel Adaptive Filtering: A Comprehensive Introduction, Wiley, 2010.

[26] J. Mercer, "Functions of positive and negative type, and their connection with the theory of integral equations," Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 209, pp. 415–446, 1909.

[27] P. Dayan and T. J. Sejnowski, "TD(λ) converges with probability 1," Machine Learning, vol. 14, no. 3, pp. 295–301, 1994.

[28] H. J. Kushner and D. S. Clark, Stochastic Approximation Methods for Constrained and Unconstrained Systems, vol. 26 of Applied Mathematical Sciences, Springer, New York, NY, USA, 1978.

[29] C. J. C. H. Watkins, Learning from delayed rewards [Ph.D. dissertation], King's College, London, UK, 1989.

[30] C. Szepesvari, Algorithms for Reinforcement Learning, edited by R. J. Brachman and T. Dietterich, Morgan & Claypool, 2010.

[31] S. Zhao, B. Chen, P. Zhu, and J. C. Príncipe, "Fixed budget quantized kernel least-mean-square algorithm," Signal Processing, vol. 93, no. 9, pp. 2759–2770, 2013.

[32] W. Liu, I. Park, and J. C. Príncipe, "An information theoretic approach of designing sparse kernel adaptive filters," IEEE Transactions on Neural Networks, vol. 20, no. 12, pp. 1950–1961, 2009.

[33] J. A. Boyan, "Technical update: least-squares temporal difference learning," Machine Learning, vol. 49, pp. 233–246, 2002.

[34] C. J. C. H. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.

[35] J. C. Sanchez, A. Tarigoppula, J. S. Choi et al., "Control of a center-out reaching task using a reinforcement learning Brain-Machine Interface," in Proceedings of the 5th International IEEE/EMBS Conference on Neural Engineering (NER '11), pp. 525–528, May 2011.

[36] J. Bae, L. G. Sanchez Giraldo, E. A. Pohlmeyer, J. C. Sanchez, and J. C. Principe, "A new method of concurrently visualizing states, values, and actions in reinforcement based brain machine interfaces," in Proceedings of the 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC '13), pp. 5402–5405, July 2013.

[37] F. S. Melo, S. P. Meyn, and M. I. Ribeiro, "An analysis of reinforcement learning with function approximation," in Proceedings of the 25th International Conference on Machine Learning, pp. 664–671, July 2008.
