Date post: 21-Jul-2020
Training Spiking ConvNets by STDP and Gradient Descent

Abstract: This paper proposes a new method for training multi-layer spiking convolutional neural networks (CNNs). Training a multi-layer spiking network poses difficulties because the output spikes do not have derivatives and the commonly used backpropagation method for non-spiking networks is not easily applied. Our method uses a novel version of layered spike-timing-dependent plasticity (STDP) that incorporates supervised and unsupervised components. Our method starts with conventional learning methods and converts them to spatio-temporally local rules suited for spiking neural networks (SNNs).

The training process uses two components for unsupervised feature extraction and supervised classification. The first component is a new STDP rule for spike-based representation learning which trains convolutional filters. The second introduces a new STDP-based supervised learning rule for spike pattern classification via an approximation to gradient descent. Stacking these components implements a novel spiking CNN of integrate-and-fire (IF) neurons with performance comparable to state-of-the-art deep SNNs. The experimental results show the success of the proposed model for MNIST handwritten digit classification. Our network architecture is the only high-performance spiking CNN which provides bio-inspired STDP rules in a hierarchy of feature extraction and classification in an entirely spike-based framework.

I. INTRODUCTION

Spiking neural networks (SNNs) and their adaptive synapses contribute to a new approach in cognition, decision making, and learning [1], [2], [3]. Information in SNNs is transferred by action potentials or spike trains released from each neuron.

This study introduces a novel approach to train multi-layer spiking convolutional neural networks (CNNs). Training a multi-layer spiking network is difficult because the discrete spikes released from spiking neurons do not have derivatives. Thus, the popular backpropagation algorithm for non-spiking networks is not easily applied. Our method uses a novel version of layered spike-timing-dependent plasticity (STDP) that incorporates supervised and unsupervised components. Our method starts with conventional learning methods and converts them to spatio-temporally local rules suited for SNNs.

By spatio-temporally local we mean the following. Local in time means that the information to modify the synapse is recent, say within 4 msec (or less) of the postsynaptic spike that triggers the STDP. By local in space, we mean that the information used to modify the synaptic weight is, in principle, available at the presynaptic terminal and the postsynaptic cell membrane. These constraints are intended to improve the biological plausibility of the model.

Although spiking networks are considered in theory to have higher computational power than rate-coded networks [4], [5], it remains a challenge to train them, especially with multi-layer learning. There are few spiking networks with multiple trainable layers although the brain’s ability to train such networks clearly demonstrates the viability of this type of architecture. Most previous studies are limited in having only one trainable layer. Specifically, the first ‘deep’ spiking network had only one layer of unsupervised learning [6], along with a supervised readout layer. More recent work has used large networks to tackle larger problems but is still restricted to one unsupervised learning layer [7], [8].

In the current literature, a deep spiking network can be implemented either by converting trained rate-coded neural networks to a spiking platform (ANN-to-SNN) [9], [10], [11], [12], [13], [14], [15] or by using spike-based, online learning methods. The former approach avoids the problem of training an SNN and instead trains a conventional neural network. The latter approach is reflected in very recent studies. Such work has built deep SNNs either by developing bio-inspired, layer-wise learning rules [16], [17], [18] or by using stochastic gradient descent [19], [20], [21], [22], which takes advantage of a neural activation function approximation of the spiking neuron’s membrane potential.

The above-mentioned gradient descent studies used the neural membrane potential as a substitute for an activation function. Unlike those studies, our method for supervised learning is based on the idea that an integrate-and-fire (IF) neuron can approximate a rectified linear unit (ReLU) activation function [23]. This enables us to perform approximate gradient descent on the output layer and on any fully connected hidden layers within our architecture.

We also incorporate unsupervised learning for feature discovery in the convolutional layers. Our previous work [24] showed that a hierarchy of representation learning and STDP rules can extract visual features for classification. The limitations of the previous work were: 1) the representation learning was not temporally local; and 2) although the network architecture could extract visual features, the classifier was not an SNN but, instead, a non-spiking support-vector machine.

978-1-5090-6014-6/18/$31.00 ©2018 IEEE

Fig. 1. Spiking CNN. This architecture shows two convolutional/pooling layers followed by a fully connected, multi-layer SNN. Orange chevrons indicate trainable parameters and blue chevrons do not undergo learning (max-pooling).

The present paper proposes multi-layer learning for developing a spiking CNN using novel spike-based, spatio-temporally local learning rules for both the feature extraction and classification components. Our new result is a pure spiking CNN because it avoids a non-spiking classification layer.

This paper proposes two sets of learning rules applied to a spiking representation learning network and a fully connected, multi-layer SNN. The former offers an unsupervised learning scheme while the latter develops a supervised learning technique for classification. The kernels trained by the representation learning component are used for the convolutional layers. The supervised learning rule approximates gradient descent using STDP to provide a spike-based, spatio-temporally local classifier. The gradient descent can be used in hidden layers of the fully connected portion of the network to approximate backpropagation. The convolutional layers and the spiking classifier (fully connected SNN) are stacked layer-wise.

II. NETWORK ARCHITECTURE

The components needed to build a spiking CNN that uses a greedy, layer-wise learning scheme are a spiking representation unsupervised learning model and a fully connected SNN equipped with supervised learning, as shown in Fig. 1. The network consists of two convolutional layers, two pooling layers, and a fully connected SNN for classification. The convolutional layers and the fully connected SNN are trained layer-wise while the pooling layers that follow the convolutional layers only sub-sample the most active neurons in a square neighborhood of the leaky integrate-and-fire (LIF) neurons in the feature maps (analogous to the max-pooling operation in traditional CNNs [25], [26], [27]). The last layer contains ten neurons to classify handwritten digits zero through nine.

Inputs to the network are of size $n \times m \times c_i$, where $n$ is the input width, $m$ is the input height, and $c_i$ is the depth of the input at a given layer $i$. The convolution filters at layer $i$ map $p \times p \times c_i$ pixels from the feature maps in layer $i$ ($i = 1$ for the input layer) to $c_{i+1}$ feature maps in the $(i+1)$th layer. These filters are trained using STDP-based representation learning explained in Section III-A. The features extracted from the convolutional/pooling layers are flattened and fed into the multi-layer SNN equipped with an STDP-based supervised learning algorithm (Section III-B). The fully connected SNN will then activate a neuron in the final layer yielding the predicted classification of the current input image.
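As a rough illustration of the pooling step described in this section, the following Python sketch keeps the most active neuron in each square neighborhood of a feature map. It is only a sketch under stated assumptions: activity is measured by spike count over the presentation interval, the windows do not overlap, and the window side `k = 2` and the function name `max_pool_spike_counts` are hypothetical choices that the text does not specify.

```python
def max_pool_spike_counts(counts, k=2):
    """Keep the largest spike count in each non-overlapping k x k window.

    counts: 2-D list (rows x cols) of spike counts for one feature map.
    k:      pooling window side (assumed value; not stated in the text).
    """
    rows, cols = len(counts), len(counts[0])
    pooled = []
    for r in range(0, rows - k + 1, k):
        pooled_row = []
        for c in range(0, cols - k + 1, k):
            window = [counts[r + dr][c + dc]
                      for dr in range(k) for dc in range(k)]
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled


# Toy 4 x 4 feature map of spike counts (made-up numbers).
feature_map = [
    [0, 3, 1, 0],
    [2, 1, 0, 4],
    [0, 0, 5, 1],
    [1, 0, 2, 2],
]
pooled = max_pool_spike_counts(feature_map, k=2)
print(pooled)
```

No weights are involved in this step, which matches the statement that the pooling layers do not undergo learning.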

III. LEARNING RULES

The learning rules include two components. The first learning component provides unsupervised representation learning embedded in a single-layer, feedforward SNN. The second component proposes an approximation of backpropagation using spatio-temporally local STDP rules in multi-layer SNNs.

A. Representation Learning using STDP

This section proposes STDP-type rules embedded in a single-layer SNN for spatio-temporal feature coding inspired by previous work [28], [29], [30]. The Foldiak model [29] first introduced a set of three learning rules: Hebbian, anti-Hebbian, and homeostatic. These rules achieved representation coding in a non-spiking neural network. Zylberberg et al. [30] later modified Foldiak’s plasticity rules in order to derive a sparse representation model with SNNs named SAILnet. SAILnet utilized spiking neurons for its representation layer and maintained spatial locality in its plasticity rules. However, its learning rules were not temporally local and its inputs used pixel intensities, not spike trains. Without the use of spike times, the question of training using an approach that is spatio-temporally local and spike-based (as in STDP [31]) remains unresolved. Our proposed representation learning introduces learning rules that are local in both time and space to implement an approximation of clustering-based vector quantization [32] using an SNN while controlling the sparseness and independence of visual codes. We also introduce a new threshold adjustment rule using a winner-takes-all (WTA) circuit to maintain independence and sparsity.

1) Spiking Representation Coding: Our model adopts a constrained optimization approach to develop learning rules to be embedded in an SNN as shown in Fig. 2. The representation layer encodes a $p \times p$ image patch ($p \times p$ spike trains) using $D$ spike trains generated by postsynaptic neurons, $z_j$, in the representation layer.

We derive plasticity rules that operate over a stimulus presentation interval $T$ (non-local) and then take the limit as $T$ tends to one local time step to derive event-based rules. The objective function using the vector quantization criterion is shown below.

$$F(x_i, w^c_{ji}) = y_j (x_i - w^c_{ji})^2, \qquad y_j = \sum_i x_i w^c_{ji} \tag{1}$$

The parameters $x_i$, $y_j$, and $w^c_{ji} \ge 0$ are normalized input pixel intensities in the range [0, 1], the linear output activation, and the synaptic weight, respectively. Eq. 1 shows a vector quantization criterion that is scaled by the output neuron’s activity ($y_j$). The output neuron’s activity scales the weight update rule according to the neuron’s response to the input pattern ($x_i$). We assume that the input and output values can be converted to the spike counts over $T$ ms.


Fig. 2. Spiking representation network. A $p \times p$ image patch is encoded by $D$ spike trains in the representation layer. $W$ shows the synaptic weight sets corresponding to the $D$ kernels.

In response to a stimulus, a subset of neurons in the representation layer is activated to encode the input. To represent the stimuli by uncorrelated codes, the neurons should be activated independently and sparsely. That is, the representation layer can use a WTA neural implementation. This criterion can be achieved by a soft constraint such that

$$g(x) = \sum_j z_j \le 1 \;\Rightarrow\; 1 - \Big(\sum_j z_j\Big) \ge 0 \tag{2}$$

$z_j$ denotes the binary state of unit $j$ after the $T$ ms presentation interval, where $z_j = 1$ if unit $j$ fires at least once. The firing status of a particular neuron can be controlled by its threshold, $\theta^c$. Therefore, this constraint can be addressed by a threshold adjustment rule. Note: the superscript ‘c’ indicates a convolutional layer.

The goal is to minimize the objective function (Eq. 1) while maintaining the constraint (Eq. 2). This can be achieved by introducing a Lagrangian function

$$L(x_i, y_j, z; w^c_{ji}, \alpha) = \underbrace{y_j (x_i - w^c_{ji})^2}_{\text{Objective Function}} - \underbrace{\alpha \Big(1 - \sum_j z_j\Big)}_{\text{Constraint}} \tag{3}$$

where $\alpha$ is a Lagrange multiplier. Minimizing the first component of Eq. 3 creates a coding module that represents the input as a new feature vector which clusters the data via the synaptic weights. Minimizing the second component supports the sparsity and independence of the representation to yield a WTA network where one neuron fires upon stimulus presentation. This is accomplished by adapting the neuron’s threshold, $\theta^c = -\alpha$. The optimum of the Lagrangian function can be obtained by performing gradient descent on its derivatives

$$\frac{\partial L}{\partial w^c_{ji}} = -2 y_j (x_i - w^c_{ji}) \tag{4}$$

$$\frac{\partial L}{\partial \theta^c} = -\Big(\sum_j z_j - 1\Big) \tag{5}$$

From gradient descent on Eq. 4 (reversing the sign on the derivative), we obtain

$$\Delta w^c_{ji} \propto y_j (x_i - w^c_{ji}) \tag{6}$$

The information needed in Eq. 6 is not yet temporally local. $x_i$ denotes the rescaled pixel intensity and it does not represent the input spike train. To re-encode a pixel intensity, $x_i$, to a spike train, $G_i$, we use uniformly distributed spikes (each spike train has a different random lag) with the rate of normalized pixel intensity in the range [0, 1]. The maximum number of spikes (for a completely white pixel) for a $T = 40$ ms interval is 40. Additionally, $y_j$ is a positive value (spike count) denoting the neuron’s activation in response to a stimulus presentation and is not available at the synapse, $w_{ji}$. The value $y_j$ can be re-expressed as $H_j$ representing the output spike train of neuron $j$. Spike trains $G_i$ and $H_j$ are formulated by the sum of Dirac functions as shown in Eq. 7.

$$G_i(t) = \sum_{t^f \in S^f_i} \delta(t - t^f), \qquad H_j(t) = \sum_{t^f \in R^f_j} \delta(t - t^f) \tag{7}$$

$S^f_i$ and $R^f_j$ are the sets of presynaptic and postsynaptic spike times. After coding $x_i$ and $y_j$ by spike trains $G_i$ and $H_j$, respectively, we seek to propose a local STDP learning rule following Eq. 6. When $x_i$ and $y_j$ are coded by spike trains over $T$ ms, the synaptic change in continuous time is given by

$$\Delta w^c_{ji} \propto \Big[\int_0^T H_j(t')\, dt'\Big] \Big[\frac{1}{K} \int_0^T G_i(t')\, dt' - w^c_{ji}\Big] \tag{8}$$

where $K$ is a normalizer denoting the maximum number of presynaptic spikes in the $T$ ms interval. Over a short time period ($t \in [t', t' + \gamma)$, $\gamma < 1$ ms, so that $K = 1$), the weight adjustment at time $t$ is calculated by

$$\Delta w^c_{ji}(t) \propto r_j(t) \big(s_i(t) - w^c_{ji}(t)\big) \tag{9}$$

$r_j(t)$ shows the firing status of neuron $j$ at time $t$ ($r_j(t) \in \{0, 1\}$). $s_i(t)$ specifies the presynaptic spike emitted from neuron $i$ in the time interval $(t - \eta, t]$. In our experiments $\eta = 1$ ms. The synaptic weight is changed only when a postsynaptic spike occurs ($r_j(t) = 1$). Finally, the learning rule is formulated (upon firing of output neuron $j$) as follows

$$\Delta w^c_{ji}(t) \propto s_i(t) - w^c_{ji}(t) \tag{10}$$

where $w^c_{ji} \ge 0$. This learning rule is applied when an output neuron fires. The weight change is related to the presynaptic spike times received by the output neurons. This idea is very similar to STDP, a popular local learning rule in biologically plausible SNNs. In this STDP rule (Eq. 10), the current synaptic weight controls the weight change. For instance, if $w^c_{ji} \in [0, 1]$, the smaller weights undergo larger LTP and smaller LTD, and vice versa. Eq. 11 shows the final STDP rule derived from Eq. 10. The weights fall in the range [0, 1] and are initialized randomly between 0 and 1.

$$\Delta w^c_{ji} = \begin{cases} a \cdot (1 - w^c_{ji}), & \text{if } s_i = 1 \\ a \cdot (-w^c_{ji}), & \text{if } s_i = 0 \end{cases} \tag{11}$$
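The STDP rule of Eq. 11 can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: the function name `stdp_update`, the list-based data layout, and the learning rate value `a = 0.1` are assumptions made for the example.

```python
def stdp_update(weights, pre_spikes, a=0.05):
    """Return updated afferent weights of a neuron that has just fired (Eq. 11).

    weights:    current weights w_ji, each in [0, 1]
    pre_spikes: s_i in {0, 1}; 1 if input i spiked in the last eta ms
    a:          learning rate (assumed value, not taken from the paper)
    """
    updated = []
    for w, s in zip(weights, pre_spikes):
        if s == 1:
            dw = a * (1.0 - w)   # LTP: move the weight toward 1
        else:
            dw = a * (-w)        # LTD: move the weight toward 0
        updated.append(w + dw)
    return updated


weights = [0.2, 0.8, 0.5]
pre_spikes = [1, 0, 1]
new_weights = stdp_update(weights, pre_spikes, a=0.1)
print(new_weights)
```

Because each update moves a weight a fraction `a` of the way toward 0 or 1, weights that start in [0, 1] stay in [0, 1] for any `a` in [0, 1].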

The STDP rule triggers if the postsynaptic neuron fires ($r_j(t) = 1$). To implement inhibition in the representation layer during training, a spiking softmax function calculating the firing probability [24], [33, p. 181] is used (Eq. 12). If the neuron’s firing probability, $A_j(t)$, reaches the threshold, $\theta^c$, it fires and its membrane potential ($U_j(t)$) resets to zero.

$$A_j(t) = \frac{\exp\big(U_j(t)\big)}{\sum_{k=1}^{D} \exp\big(U_k(t)\big)} \tag{12}$$
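A minimal Python sketch of Eq. 12 and the firing condition follows. The function names, the example potentials, and the threshold value 0.5 are hypothetical; subtracting the maximum potential before exponentiating is a standard numerical-stability step that leaves the ratios in Eq. 12 unchanged.

```python
import math


def firing_probabilities(potentials):
    """Softmax over membrane potentials U_j(t), as in Eq. 12."""
    m = max(potentials)  # stability shift; cancels in the ratio
    exps = [math.exp(u - m) for u in potentials]
    total = sum(exps)
    return [e / total for e in exps]


def apply_threshold(potentials, theta):
    """Fire every neuron whose probability reaches theta and reset it to zero."""
    probs = firing_probabilities(potentials)
    fired = [p >= theta for p in probs]
    reset = [0.0 if f else u for u, f in zip(potentials, fired)]
    return fired, reset


potentials = [1.0, 0.5, 3.0]   # made-up membrane potentials for D = 3 neurons
probs = firing_probabilities(potentials)
fired, reset = apply_threshold(potentials, theta=0.5)
print(probs, fired, reset)
```

Since the probabilities sum to one, a neuron with a clearly larger potential suppresses the others, which is how the softmax acts as inhibition here.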

The second adaptation rule is referred to as the threshold learning rule. Eq. 5 is used to implement a learning rule for adjusting the threshold, $\theta^c$. The threshold learning rule shown in Eq. 13 provides an independent and sparse feature representation. The threshold is the same for all $D$ neurons in the representation layer.

$$\Delta \theta^c = b\,(m_z - 1) \tag{13}$$

where $b$ is the learning rate and $m_z$ is the number of neurons in the representation layer firing in $T$ ms. This rule adjusts the threshold such that only one neuron fires in response to a stimulus. This criterion provides a framework to extract independent features in a sparse representation. In the experiments, the initial threshold is set to 0.15.
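The threshold rule of Eq. 13 can be sketched as follows. The initial threshold 0.15 is taken from the text; the function name `update_threshold` and the learning rate `b = 0.01` are assumptions for illustration only.

```python
def update_threshold(theta, fired_flags, b=0.01):
    """Adjust the shared threshold after one stimulus presentation (Eq. 13).

    theta:       current threshold shared by all D neurons
    fired_flags: one boolean per neuron, True if it fired within T ms
    b:           learning rate (assumed value, not given in the text)
    """
    m_z = sum(1 for f in fired_flags if f)   # number of neurons that fired
    return theta + b * (m_z - 1)


theta = 0.15
after_three = update_threshold(theta, [True, True, True, False])     # too many fired
after_none = update_threshold(theta, [False, False, False, False])   # nobody fired
after_one = update_threshold(theta, [False, True, False, False])     # WTA satisfied
print(after_three, after_none, after_one)
```

The threshold rises when more than one neuron fires, falls when none fires, and stays put when exactly one fires, which is the winner-takes-all behavior the constraint in Eq. 2 asks for.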

A version of the proposed representation learning has shown good performance (in terms of reconstruction loss and sparsity) for the visual spike coding of natural images in our recent study [34]. This approach outperformed the existing spiking representation learning models [35], [36].

B. Backpropagation Approximation by STDP

Spiking neurons communicate via discrete spike events. An important question in implementing spiking networks is how they can be trained under supervision without a differentiable activation function. This section proposes novel multi-layer, supervised learning rules to train networks of integrate-and-fire (IF) neurons. The proposed approach uses bio-inspired STDP and high-performance backpropagation (gradient descent) rules. We showed in our recent study [23] that IF neurons approximate rectified linear units (ReLU). This approximation is the basis of the proposed STDP-based supervised learning named BP-STDP. BP-STDP introduces a novel, temporally local learning approach specified by an STDP/anti-STDP rule derived from the backpropagation weight change rules that can be applied at each time step.

The proposed learning rules are inspired by the backpropagation update rules reported for neural networks that utilize a ReLU activation function. Figure 3 shows the network architectures and parameters used to describe the conventional and spiking neural networks in this paper. The main difference between these two networks is their data communication, where the neural network (left) receives and generates real numbers and the SNN (right) receives and generates spike trains in $T$ ms time intervals.

Fig. 3. The two-layer conventional (left) and spiking (right) network architectures. The SNN receives spike trains representing input feature values in $T$ ms. The learning rules and the network status in the SNN are specified by an additional term as time ($t$). The formulas and parameters are discussed in Eqs. 14 through 26.

Non-spiking neural networks equipped with gradient descent (GD) solve an optimization problem in which the squared difference between the desired, $d$, and output, $o$, values is minimized [37], [33]. A common objective function computed for $M$ output neurons receiving $N$ training samples is shown in Eq. 14.

$$E = \frac{1}{N} \sum_{k=1}^{N} \sum_{i=1}^{M} (d_{k,i} - o_{k,i})^2 \tag{14}$$

The weight change formula (using GD with a learning rate of $\mu$) for a linear output neuron, $i$, receiving $H$ inputs, $o_h$, for a single training sample is obtained from

$$E = \Big(d_i - \sum_h o_h w_{ih}\Big)^2 \;\rightarrow\; \frac{\partial E}{\partial w_{ih}} = -2 (d_i - o_i) \cdot o_h \tag{15}$$

By reversing the sign on the derivative, we have

$$\Delta w_{ih} \propto -\frac{\partial E}{\partial w_{ih}} \;\rightarrow\; \Delta w_{ih} = \mu (d_i - o_i)\, o_h \tag{16}$$
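The derivative in Eq. 15 and the update in Eq. 16 can be checked numerically. The sketch below, written for illustration with made-up weights, inputs, target, and learning rate, compares the analytic gradient with a central finite difference and then applies one update step.

```python
def output(weights, inputs):
    """Linear output neuron: o_i = sum_h o_h * w_ih."""
    return sum(w * x for w, x in zip(weights, inputs))


def error(weights, inputs, d):
    """Squared error of Eq. 15 for a single sample."""
    return (d - output(weights, inputs)) ** 2


def analytic_gradient(weights, inputs, d):
    """dE/dw_ih = -2 (d_i - o_i) o_h, as in Eq. 15."""
    o = output(weights, inputs)
    return [-2.0 * (d - o) * x for x in inputs]


def numeric_gradient(weights, inputs, d, eps=1e-6):
    """Central finite-difference estimate of dE/dw_ih."""
    grads = []
    for h in range(len(weights)):
        up, down = list(weights), list(weights)
        up[h] += eps
        down[h] -= eps
        grads.append((error(up, inputs, d) - error(down, inputs, d)) / (2 * eps))
    return grads


weights = [0.5, -0.3, 0.8]   # made-up values
inputs = [1.0, 2.0, 0.5]
d, mu = 1.0, 0.05

g_analytic = analytic_gradient(weights, inputs, d)
g_numeric = numeric_gradient(weights, inputs, d)

# One step of Eq. 16: delta_w = mu * (d - o) * o_h
o = output(weights, inputs)
stepped = [w + mu * (d - o) * x for w, x in zip(weights, inputs)]
error_before = error(weights, inputs, d)
error_after = error(stepped, inputs, d)
print(g_analytic, g_numeric, error_before, error_after)
```

The factor of 2 in Eq. 15 is absorbed into the learning rate in Eq. 16, so the step direction is the same as plain gradient descent.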

By treating $d_i$, $o_i$, and $o_h$ as the spike counts of spike trains $L_i$, $G_i$, and $G_h$ [34] (see Eq. 17), respectively, the weight change defined above can be re-expressed such that it computes the synaptic weight update in an SNN. Eq. 18 shows this update rule after $T = 50$ ms.

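$$G_i(t) = \sum_{t^p_i \in \{r_i(t) = 1\}} \delta(t - t^p_i) \tag{17a}$$

$$L_i(t) = \sum_{t^q_i \in \{z_i(t) = 1\}} \delta(t - t^q_i) \tag{17b}$$

$$G_h(t) = \sum_{t^p_h \in \{s_h(t) = 1\}} \delta(t - t^p_h) \tag{17c}$$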

$$\Delta w_{ih} = \mu \int_0^T \big(L_i(t') - G_i(t')\big)\, dt' \cdot \int_0^T G_h(t')\, dt' \tag{18}$$

However, the weight change rule in Eq. 18 is not temporally local. To make the learning rule local in time (similar to the derivation of Eq. 9 from Eq. 8), we break the time interval, $T$, into sub-intervals such that each sub-interval contains at most one spike. Hence, over a short time period, the learning rule in Eq. 18 becomes Eq. 19.

$$\Delta w_{ih}(t) \propto \mu \big(z_i(t) - r_i(t)\big)\, s_h(t) \tag{19}$$

To implement the formula above, a combination of STDP and anti-STDP is used. The proposed rule updates the synaptic weights using a teacher signal to switch between STDP and anti-STDP. That is, the target neuron undergoes STDP and the non-target ones undergo anti-STDP. The desired spike trains, $z$, are defined based on the input’s label. Therefore, the target neuron is represented by a spike train with maximum spike frequency ($\beta$) and the non-target neurons are silent. Also, the learning rule triggers at desired spike times, $z_i(t)$ (the desired spike times are the same for all the target neurons). Eq. 20 shows the weight change that is applied to the output layer of our supervised SNN.

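$$\Delta w_{ih}(t) = \mu \cdot \xi_i(t) \sum_{t'=t-\epsilon}^{t} s_h(t') \tag{20}$$

$$\xi_i(t) = \begin{cases} 1, & z_i(t) = 1,\ r_i \ne 1 \text{ in } [t-\epsilon, t] \\ -1, & z_i(t) = 0,\ r_i = 1 \text{ in } [t-\epsilon, t] \\ 0, & \text{otherwise} \end{cases} \tag{21}$$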

Then, the synaptic weights of the output layer are updated by

$$w_{ih}(t) = w_{ih}(t) + \Delta w_{ih}(t) \tag{22}$$

The target neuron is determined by the teaching signal $z_i(t)$, where $z_i(t) = 1$ denotes the target neuron and $z_i(t) = 0$ denotes non-target neurons. The weight change scenario in the output layer starts with the desired spike times. At desired spike time $t$, the target neuron should fire within the STDP window $[t - \epsilon, t]$. Otherwise, the synaptic weights are increased in proportion to the presynaptic activity (mostly zero or one spike) in the same time interval. The presynaptic activity is denoted by $\sum_{t'=t-\epsilon}^{t} s_h(t')$, which counts the presynaptic spikes in the $[t - \epsilon, t]$ interval. On the other hand, the non-target neurons upon firing undergo weight depression in the same way. This scenario is inspired by traditional GD while supporting spatio-temporally local learning.
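The output-layer rule of Eqs. 20 through 22 can be sketched in Python as follows. This is an illustration under stated assumptions, not the released implementation: spike trains are stored as 0/1 lists with one entry per time step, the window length `eps = 4` steps and the learning rate `mu = 0.1` are made-up values, and the helper names `window_count`, `xi`, and `update_output_weights` are hypothetical. The caller is expected to invoke the update at the desired spike times.

```python
def window_count(spikes, t, eps):
    """Count spikes in the inclusive window [t - eps, t]."""
    start = max(0, t - eps)
    return sum(spikes[start:t + 1])


def xi(z_i, r_i, t, eps):
    """Error signal of Eq. 21 for one output neuron at time t."""
    fired = window_count(r_i, t, eps) > 0
    if z_i[t] == 1 and not fired:
        return 1      # target neuron stayed silent: potentiate (STDP)
    if z_i[t] == 0 and fired:
        return -1     # non-target neuron fired: depress (anti-STDP)
    return 0


def update_output_weights(w, z, r, s, t, mu=0.001, eps=4):
    """Apply Eqs. 20 and 22 in place; w[i][h] connects hidden h to output i."""
    for i in range(len(w)):
        err = xi(z[i], r[i], t, eps)
        if err == 0:
            continue
        for h in range(len(w[i])):
            w[i][h] += mu * err * window_count(s[h], t, eps)
    return w


# Toy example: 2 output neurons, 3 hidden neurons, 6 time steps.
z = [[0, 0, 0, 0, 0, 1], [0, 0, 0, 0, 0, 0]]   # desired trains (neuron 0 is the target)
r = [[0, 0, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0]]   # actual output trains
s = [[0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0], [0, 0, 1, 0, 0, 1]]  # hidden trains
w = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
w = update_output_weights(w, z, r, s, t=5, mu=0.1, eps=4)
print(w)
```

In the toy run the silent target neuron has its weights raised in proportion to recent presynaptic spikes, and the non-target neuron that fired has the same weights lowered.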

The above learning rule works for a single-layer SNN trained with supervision. To train a multi-layer SNN, we use the same idea, which is inspired by the traditional backpropagation rules. The backpropagation weight change rule applied to a hidden layer of ReLU neurons is shown in Eq. 23.

$$\Delta w_{hj} = \mu \cdot \Big(\sum_i \xi_i w_{ih}\Big) \cdot o_j \cdot [o_h > 0] \tag{23}$$

$\xi_i$ denotes the difference between the desired and output values ($d_i - o_i$). In our SNN, $\xi_i$ is approximated by $\xi_i(t)$ (Eq. 21). The value $[o_h > 0]$ specifies the derivative of ReLU neurons in the hidden layer. Using the approximation of the IF neurons to the ReLU neurons, similar to the output layer (Eq. 18), the weight change formula can be re-expressed in terms of spike counts in a multi-layer SNN as shown in Eq. 24.

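$$\Delta w_{hj} = \mu \int_0^T \Big(\sum_i \xi_i(t')\, w_{ih}(t')\Big)\, dt' \cdot \int_0^T \Big(\sum_{t^p_j} \delta(t' - t^p_j)\Big)\, dt' \cdot \Big(\Big[\int_0^T \sum_{t^p_h} \delta(t' - t^p_h)\, dt'\Big] > 0\Big) \tag{24}$$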

After dividing $T$ into short sub-intervals $[t - \epsilon, t]$, the temporally local rule for updating the hidden synaptic weights is formulated as follows.

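$$\Delta w_{hj}(t) = \begin{cases} \mu \cdot \sum_i \xi_i(t)\, w_{ih}(t) \cdot a(t), & s_h = 1 \text{ in } [t-\epsilon, t] \\ 0, & \text{otherwise} \end{cases} \qquad a(t) = \sum_{t'=t-\epsilon}^{t} s_j(t') \tag{25}$$

Finally, the synaptic weights of the hidden layer are updated by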

$$w_{hj}(t) = w_{hj}(t) + \Delta w_{hj}(t) \tag{26}$$

The above learning rule can be non-zero when the hidden neuron $h$ fires (postsynaptic spike occurrence). Thus, the weights are updated according to the presynaptic ($s_j(t)$) and postsynaptic ($s_h(t)$) spike times, analogous to the standard STDP rule. Additionally, the derivative of ReLU ($o_h > 0$) is analogous to the spike generation in the IF neurons (see the condition in Eq. 25). Following this scenario for the spatio-temporally local synaptic weight change rule, we can build a multi-layer SNN equipped with the STDP-based backpropagation algorithm, named BP-STDP.
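The hidden-layer rule of Eqs. 25 and 26 can be sketched in the same style. As before, this is a hedged illustration: spike trains are 0/1 lists with one entry per time step, the error values `xi_t` are assumed to have been computed with Eq. 21, and the names `window_count` and `update_hidden_weights` as well as the values of `mu` and `eps` are hypothetical.

```python
def window_count(spikes, t, eps):
    """Count spikes in the inclusive window [t - eps, t]."""
    start = max(0, t - eps)
    return sum(spikes[start:t + 1])


def update_hidden_weights(w_hj, w_ih, xi_t, s_h, s_j, t, mu=0.001, eps=4):
    """Apply Eqs. 25 and 26 in place.

    w_hj[h][j]: weight from input j to hidden neuron h
    w_ih[i][h]: weight from hidden neuron h to output neuron i
    xi_t[i]:    output error xi_i(t) from Eq. 21
    s_h, s_j:   hidden and input spike trains (0/1 lists)
    """
    for h in range(len(w_hj)):
        if window_count(s_h[h], t, eps) == 0:
            continue  # silent hidden neuron: analogue of a zero ReLU derivative
        back_err = sum(xi_t[i] * w_ih[i][h] for i in range(len(w_ih)))
        for j in range(len(w_hj[h])):
            w_hj[h][j] += mu * back_err * window_count(s_j[j], t, eps)
    return w_hj


# Toy example: 2 inputs, 2 hidden neurons, 2 outputs, 6 time steps.
w_ih = [[0.5, -0.2], [0.1, 0.4]]
xi_t = [1, -1]
s_h = [[0, 0, 0, 1, 0, 0], [0, 0, 0, 0, 0, 0]]   # only hidden neuron 0 fired
s_j = [[0, 0, 1, 0, 1, 0], [0, 0, 0, 0, 0, 0]]   # input 0 spiked twice in the window
w_hj = [[0.0, 0.0], [0.0, 0.0]]
w_hj = update_hidden_weights(w_hj, w_ih, xi_t, s_h, s_j, t=5, mu=0.1, eps=4)
print(w_hj)
```

Only the hidden neuron that fired within the window is updated, and the size of its update depends on the error propagated back through the output weights and on recent input spikes.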

Implementation: To implement the spiking CNN, Eqs. 11, 12, 13, 20, 21, 22, 24, 25, and 26 were used. The implementation code for this paper is available on GitHub at https://github.com/tavanaei/Spiking-CNN (from May 2018).

IV. EXPERIMENTS AND RESULTS

To evaluate the proposed spiking CNN, we ran three experiments. The first experiment tested BP-STDP on a two-layer fully connected spiking network to solve the spiking XOR problem. The remaining two experiments used the MNIST dataset [38] consisting of 60k training and 10k testing images of handwritten digits with 28 × 28 grayscale pixels.

A. BP-STDP: XOR Evaluation

The BP-STDP algorithm is initially tested on the exclusive-OR (XOR) problem to show its ability to solve linearly inseparable problems. The dataset contains four input data points {(0.2, 0.2), (0.2, 1), (1, 0.2), (1, 1)} with labels {0, 1, 1, 0}. We used 0.2 instead of 0 to activate the IF neurons (and release spikes). The network architecture consists of 2 input, $H = 20$ hidden, and 2 output IF neurons. Each input neuron releases spike trains corresponding to the input values such that the value 1 is represented by a spike train with the maximum spike rate (250 Hz). The weights for the experiments were initialized randomly by sampling from a Gaussian distribution with mean µ = 0 and standard deviation σ = 1.

Figure 4A shows the training process where each box represents the two output neurons’ activities with respect to the four input spike patterns determining {0, 1, 1, 0} classes. After around 150 training iterations, the output neurons become selective to the input categories. Figure 4B shows the learning convergence progress using the cost function defined in Eq. 27.


Fig. 4. A: Spike trains released by output neurons in response to four pairs of input trains representing {(0.2, 0.2), (0.2, 1), (1, 0.2), (1, 1)} values through 500 iterations. We used high spike rates for better visualization. B: MSE of the XOR learning process (left: through 500 iterations, right: after training). $\theta_h$, $\theta_o$, and $U_0$ are the hidden threshold, the output threshold, and the resting potential, respectively.

This figure shows that the proper learning rates, $\mu$, fall in the range [0.0005, 0.01].

$$\mathrm{MSE} = \frac{1}{N} \sum_{k=1}^{N} \Big(\frac{1}{T} \sum_{t=1}^{T} \xi^k(t)\Big)^2, \qquad \xi^k(t) = \sum_i \xi^k_i(t) \tag{27}$$

In Eq. 27, $N$ and $\xi^k_i(t)$ denote the training batch size and the error value of output neuron $i$ in response to sample $k$.
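For concreteness, the cost in Eq. 27 can be computed as in the sketch below. The nested-list layout `errors[k][t][i]` and the function name `spike_mse` are assumptions for the example, and the error values are made up. The sketch follows the equation as transcribed here, so signed errors of different output neurons at the same time step can cancel.

```python
def spike_mse(errors):
    """Cost of Eq. 27; errors[k][t][i] holds xi_i^k(t) in {-1, 0, 1}."""
    n = len(errors)
    total = 0.0
    for sample in errors:
        t_steps = len(sample)
        # xi^k(t) = sum_i xi_i^k(t), then average over the T time steps
        mean_err = sum(sum(step) for step in sample) / t_steps
        total += mean_err ** 2
    return total / n


# Toy batch: N = 2 samples, T = 4 time steps, 2 output neurons.
errors = [
    [[1, 0], [0, 0], [1, -1], [0, 0]],   # sample 1: per-step sums are 1, 0, 0, 0
    [[0, 0], [0, 0], [0, 0], [0, 0]],    # sample 2: no errors
]
mse = spike_mse(errors)
print(mse)
```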

B. MNIST Experiments

For the MNIST experiments, the input spike trains were generated with random lags and with spike rates proportional to the normalized pixel values in the range [0, 1]. The first experiment evaluates the BP-STDP algorithm. The second experiment shows the convolutional layers’ role in extracting visual features and the final classification using the spiking CNN.
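One possible reading of this input encoding is sketched below: evenly spaced spikes whose number is proportional to the pixel intensity, shifted by a random lag. The 1 ms time step, the rounding of the spike count, the default `T = 40` (taken from the representation-learning description in Section III-A), and the function name `encode_pixel` are assumptions; the paper's released code may encode inputs differently.

```python
import random


def encode_pixel(x, T=40, rng=random):
    """Return a 0/1 spike train of length T with about x * T evenly spaced spikes.

    x:   normalized pixel intensity in [0, 1]
    T:   number of 1 ms time steps (assumed discretization)
    rng: object with a random() method, e.g. random.Random(seed)
    """
    n_spikes = int(round(x * T))
    train = [0] * T
    if n_spikes == 0:
        return train
    interval = T / n_spikes             # spacing between consecutive spikes
    lag = rng.random() * interval       # random lag in [0, interval)
    for k in range(n_spikes):
        idx = int(lag + k * interval)
        train[min(idx, T - 1)] = 1      # guard against the last index
    return train


white = encode_pixel(1.0, rng=random.Random(0))   # completely white pixel
gray = encode_pixel(0.5, rng=random.Random(0))
black = encode_pixel(0.0, rng=random.Random(0))
print(sum(white), sum(gray), sum(black))
```

With `T = 40` a completely white pixel produces 40 spikes, matching the maximum stated in Section III-A.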

TABLE I. MNIST CLASSIFICATION PERFORMANCE OF THE PROPOSED BP-STDP APPLIED TO SNNS IN COMPARISON WITH TRADITIONAL BACKPROPAGATION APPLIED TO CONVENTIONAL NEURAL NETWORKS.

| Model | H1=300 | H1=1000 | H1=500; H2=150 |
|---|---|---|---|
| ANN [39] | 95.3 | 95.5 | 97.1 |
| ANN [Distortions] [39] | 96.4 | 96.2 | 97.6 |
| BP-STDP | 95.7 | 96.6 ± 0.1 | 97.2 ± 0.07 |

1) BP-STDP: MNIST Evaluation: The SNN with one or more hidden layers, 784 input, and 10 output IF neurons equipped with BP-STDP was compared to conventional neural networks equipped with backpropagation. Table I compares the proposed supervised learning method (BP-STDP) with the traditional backpropagation algorithm (GD) where both train 2-layer and 3-layer networks. This comparison confirms the success of the bio-inspired BP-STDP rule applied to the temporal SNN architecture. This spiking network architecture equipped with the BP-STDP learning algorithm performs as accurately as the traditional backpropagation algorithm used in artificial NNs. Thus, it can be used to classify the spiking feature maps generated by convolutional layers.

2) Spiking CNN: MNIST Evaluation: At each level of the convolutional hierarchy, the input images are processed and represented by discriminative features. Each layer consists of filters (kernels) trained by the representation learning method in Section III-A. The convolution of each filter over received inputs generates feature maps that convey information to subsequent layers.

Fig. 5 shows the learned kernels and feature maps of the spiking convolutional layers trained by the STDP-based representation learning. The first layer’s kernels extract primary visual features. This layer is selective to different orientations represented in the image patches (5×5 windows). The second layer extracts complex features by convolving the generated feature maps with D2 = 16 kernel cubes of size 5×5×D1, where D1 = 16. The hyperparameters are shown in this figure as well.

The output neurons represent digits zero through nine and emit spikes in response to the features provided by the earlier layers. As shown in Fig. 5, only the target neuron fires spikes while the other neurons are silent. The spiking activity of the output neural layer in Fig. 5 illustrates the spiking neural activities in the T = 50 ms interval for each digit.
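One simple way to read a digit out of such output activity is to pick the neuron that fired the most spikes during the interval. The sketch below assumes this spike-count readout and a binary array of shape (T, 10) with one row per time step; the function name `predict_digit` and the 1 ms step are hypothetical and are not taken from the paper.

```python
import numpy as np


def predict_digit(output_spikes):
    """Return the index of the output neuron with the highest spike count.

    output_spikes: binary array of shape (T, 10), one column per digit neuron.
    Ties are resolved in favour of the lowest index (np.argmax behaviour).
    """
    counts = np.asarray(output_spikes).sum(axis=0)
    return int(np.argmax(counts))


# Toy example: over 50 time steps only the neuron for digit 3 fires.
toy_output = np.zeros((50, 10), dtype=np.uint8)
toy_output[::5, 3] = 1
toy_prediction = predict_digit(toy_output)
```

When only the target neuron fires, as in Fig. 5, any count-based readout gives the same answer; the choice of readout only matters for ambiguous responses.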

Fig. 5. Spiking CNN training (using N = 30k digits) and evaluation procedure. The convolutional layers are trained by the STDP-based representation learning model. Supervised learning (BP-STDP) is applied to the feature maps generated by either the first or the second convolutional/pooling layer. The spiking activity of output neurons in response to handwritten digits is the final output of the model. lp shows the pooling stride.

Fig. 6. The proposed spiking CNN’s performance on MNIST. The network architecture consists of either one or two convolutional/pooling layers (D1 and D2), a fully connected SNN with H1 = 1500, and 10 output spiking neurons. Maximum accuracy > 98.5% has been achieved.

To assess the effect of the convolutional layers on classification performance, two architectures with one and two convolutional/pooling layers, respectively, were evaluated. These convolutional/pooling layers were followed by an SNN with H1 = 1500 hidden neurons equipped with BP-STDP. The numbers of spiking neurons in the feature maps of the first and second convolutional/pooling layers are 14×14×D1 and 7×7×D2, respectively. Fig. 6 shows the performance of the proposed spiking CNNs with D1 = {8, 16, 32, 64} primary convolutional filters (feature maps) and D2 = {16, 32, 64} convolutional filters stacked on the primary feature maps. These results demonstrate the success of the STDP-based representation learning in visual feature extraction using spatial convolution. After applying the convolutional layers, the accuracy rate of the SNN was improved significantly (two-tailed t-test: p-value < 0.0005) by about 1.4% in the range [95, 100]. However, the second convolutional layer did not improve the performance in comparison with the networks of only one convolutional layer. This results from the limited capacity of spiking neurons for representing small differences between neural activities in our approach.
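The feature-map sizes 14×14×D1 and 7×7×D2 quoted above follow from simple arithmetic on the 28×28 MNIST input. The sketch below assumes that each convolution preserves the spatial size and that each pooling layer uses a stride of 2; both are assumptions chosen to reproduce the stated sizes, and the helper name `feature_map_neurons` is hypothetical.

```python
def feature_map_neurons(input_side, pool_stride, depth):
    """Return (side, neuron_count) of the pooled feature maps.

    Assumes a size-preserving convolution followed by non-overlapping
    pooling with the given stride.
    """
    side = input_side // pool_stride
    return side, side * side * depth


MNIST_SIDE = 28  # MNIST images are 28x28 pixels

# Neuron counts for the filter counts explored in Fig. 6.
layer1 = {d1: feature_map_neurons(MNIST_SIDE, 2, d1) for d1 in (8, 16, 32, 64)}
layer2 = {d2: feature_map_neurons(14, 2, d2) for d2 in (16, 32, 64)}

for d1, (side, count) in layer1.items():
    print(f"D1={d1}: {side}x{side}x{d1} = {count} neurons")
for d2, (side, count) in layer2.items():
    print(f"D2={d2}: {side}x{side}x{d2} = {count} neurons")
```

With D1 = 16 the first stage feeds 3136 spiking neurons to the fully connected SNN, whereas a second stage with D2 = 16 feeds only 784, the same count as the raw 28×28 input.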

Our spiking CNN showed 98.6% accuracy for MNIST classification. This performance is comparable with the state-of-the-art SNNs while our method introduces hardware friendly, STDP-like learning rules for both the feature extraction (convolutional layers) and classification (fully connected SNN) components. Additionally, this approach gets one step closer to the biologically plausible pattern recognition scheme that occurs in the brain. Table II compares our model’s performance with recent learning approaches in SNNs. As shown in this table, the gradient descent method has performed better in [19] and [17], but it uses the derivative of neurons’ membrane potentials and it does not implement STDP. Our method in comparison with [24] and [16] develops an STDP-based classifier while they use an SVM (a non-spiking classifier) to classify the extracted features.

TABLE II. CLASSIFICATION PERFORMANCES OF RECENT SNNS ON MNIST. THE ITALIC FONT (AND ‘*’) DENOTES THE STDP-BASED SNNS.

| Model | Description | Acc. (%) |
|---|---|---|
| O’Connor [20] | Deep SNN. Stochastic gradient descent | 96.40 |
| O’Connor [20] | Deep SNN. Fractional stochastic GD | 97.93 |
| Lee [19] | Deep SNN. Backprop. using membrane potential | 98.88 |
| Eliasmith [40] | SNN. Spaun brain model | 94.00 |
| Lee [19] | Spiking CNN. Backprop. using membrane potential | 99.31 |
| Panda [17] | Spiking CNN. Convolutional autoencoder: Backprop | 99.05 |
| Diehl [41] | 2-layer SNN. *STDP; Example-based classifier | 95.00 |
| Tavanaei [24] | Spiking CNN. *Sparse coding and STDP, SVM classifier | 98.36 |
| Kheradpisheh [16] | Spiking CNN. *Layer wise STDP, SVM classifier | 98.40 |
| Neftci [42] | Spiking RBM. *Contrastive divergence for IF neurons | 91.90 |
| Our Model | Spiking CNN. *STDP rep. learning and BP-STDP | 98.60 |

V. CONCLUSION

We have developed and tested new STDP-based, spatio-temporally local learning rules for both unsupervised representation learning (feature discovery) and supervised learning to train spiking CNNs (ConvNets). Although STDP-based, the supervised learning was shown to approximate gradient descent and was successfully tested on a simple SNN to learn the non-linearly separable problem, exclusive-OR, with gradient descent learning in both the hidden and output layers.

We also tested the combined supervised and unsupervised rules in two convolutional architectures. The experimental results showed good performance on the MNIST dataset. Our reported accuracy rate of 98.6% on the MNIST dataset outperformed existing STDP-based, multi-layer (deep) SNNs.

We were surprised to see that using more than one convolutional layer did not improve MNIST performance. That is, our best-performing CNN used only one convolutional layer. We expected the second layer to learn more discriminative features, and since this indicates a failure in hierarchical representation learning, our future work will study why this did not happen.


Another limitation of our work is in the max-pooling layer which selects the most active neuron (with maximum spike rate) in a neighborhood of IF neurons in the feature maps. This approach slows down the training and evaluation phases and is not biologically plausible. Our future work will introduce a new pooling layer which subsamples spiking neurons in a temporal scheme using neural latency.
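The rate-based max-pooling described above can be sketched as follows. This is an illustrative reading of the description, not the authors' code: it assumes a single feature map stored as a binary array of shape (T, H, W), non-overlapping square windows, and that the winning neuron's whole spike train is passed on. The function name `rate_max_pool` is hypothetical.

```python
import numpy as np


def rate_max_pool(spikes, stride=2):
    """Pool a spiking feature map by keeping the most active neuron per window.

    spikes: binary array of shape (T, H, W).
    Returns an array of shape (T, H // stride, W // stride) holding, for each
    window, the spike train of the neuron with the highest spike count.
    """
    spikes = np.asarray(spikes)
    t, h, w = spikes.shape
    ho, wo = h // stride, w // stride
    # Split the map into non-overlapping stride x stride windows.
    blocks = spikes[:, :ho * stride, :wo * stride]
    blocks = blocks.reshape(t, ho, stride, wo, stride)
    blocks = blocks.transpose(0, 1, 3, 2, 4).reshape(t, ho, wo, stride * stride)
    # Spike counts over the whole interval decide the winner of each window.
    counts = blocks.sum(axis=0)
    winners = counts.argmax(axis=-1)
    pooled = np.take_along_axis(blocks, winners[None, :, :, None], axis=-1)
    return pooled[..., 0]


# Toy example: in the top-left window, neuron (1, 1) fires at every step
# and neuron (0, 0) fires at every second step.
toy_map = np.zeros((10, 4, 4), dtype=np.uint8)
toy_map[:, 1, 1] = 1
toy_map[::2, 0, 0] = 1
toy_pooled = rate_max_pool(toy_map)
```

The sketch makes the stated cost visible: the winner can only be chosen after the spike counts for the full interval are known, so pooling cannot proceed spike by spike.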

REFERENCES

[1] S. Ghosh-Dastidar and H. Adeli, “Spiking neural networks,” International journal of neural systems, vol. 19, no. 04, pp. 295–308, 2009.

[2] N. Kasabov, K. Dhoble, N. Nuntalid, and G. Indiveri, “Dynamic evolving spiking neural networks for on-line spatio- and spectro-temporal pattern recognition,” Neural Networks, vol. 41, pp. 188–201, 2013.

[3] N. K. Kasabov, “Neucube: A spiking neural network architecture for mapping, learning and understanding of spatio-temporal brain data,” Neural Networks, vol. 52, pp. 62–76, 2014.

[4] W. Maass, “Networks of spiking neurons: the third generation of neural network models,” Neural Networks, vol. 10, no. 9, pp. 1659–1671, 1997.

[5] W. Maass, “To spike or not to spike: that is the question,” Proceedings of the IEEE, vol. 103, no. 12, pp. 2219–2224, 2015.

[6] T. Masquelier and S. J. Thorpe, “Unsupervised learning of visual features through spike timing dependent plasticity,” PLoS Comput Biol, vol. 3, no. 2, p. e31, 2007.

[7] M. Beyeler, N. D. Dutt, and J. L. Krichmar, “Categorization and decision-making in a neurobiologically plausible spiking network using a STDP-like learning rule,” Neural Networks, vol. 48, pp. 109–124, 2013.

[8] S. R. Kheradpisheh, M. Ganjtabesh, and T. Masquelier, “Bio-inspired unsupervised learning of visual features leads to robust invariant object recognition,” Neurocomputing, vol. 205, pp. 382–392, 2016.

[9] P. U. Diehl, D. Neil, J. Binas, M. Cook, S.-C. Liu, and M. Pfeiffer, “Fast-classifying, high-accuracy spiking deep networks through weight and threshold balancing,” in Neural Networks (IJCNN), 2015 International Joint Conference on. IEEE, 2015, pp. 1–8.

[10] S. K. Esser, R. Appuswamy, P. Merolla, J. V. Arthur, and D. S. Modha, “Backpropagation for energy-efficient neuromorphic computing,” in Advances in Neural Information Processing Systems, 2015, pp. 1117–1125.

[11] B. Ruckauer, I.-A. Lungu, Y. Hu, M. Pfeiffer, and S.-C. Liu, “Conversion of continuous-valued deep networks to efficient event-driven networks for image classification,” Front. Neurosci. 11: 682. doi: 10.3389/fnins, 2017.

[12] D. Neil, M. Pfeiffer, and S.-C. Liu, “Learning to be efficient: Algorithms for training low-latency, low-compute deep spiking neural networks,” in Proceedings of the 31st Annual ACM Symposium on Applied Computing. ACM, 2016, pp. 293–298.

[13] E. Hunsberger and C. Eliasmith, “Training spiking deep networks for neuromorphic hardware,” arXiv preprint arXiv:1611.05141, 2016.

[14] Y. Cao, Y. Chen, and D. Khosla, “Spiking deep convolutional neural networks for energy-efficient object recognition,” International Journal of Computer Vision, vol. 113, no. 1, pp. 54–66, 2015.

[15] P. O’Connor, D. Neil, S.-C. Liu, T. Delbruck, and M. Pfeiffer, “Real-time classification and sensor fusion with a spiking deep belief network,” Frontiers in neuroscience, vol. 7, 2013.

[16] S. R. Kheradpisheh, M. Ganjtabesh, S. J. Thorpe, and T. Masquelier, “STDP-based spiking deep neural networks for object recognition,” arXiv preprint arXiv:1611.01421, 2016.

[17] P. Panda and K. Roy, “Unsupervised regenerative learning of hierarchical features in spiking deep networks for object recognition,” in Neural Networks (IJCNN), 2016 International Joint Conference on. IEEE, 2016, pp. 299–306.

[18] A. Tavanaei and A. S. Maida, “Bio-inspired spiking convolutional neural network using layer-wise sparse coding and STDP learning,” arXiv preprint arXiv:1611.03000, 2016.

[19] J. H. Lee, T. Delbruck, and M. Pfeiffer, “Training deep spiking neural networks using backpropagation,” Frontiers in neuroscience, vol. 10, 2016.

[20] P. O’Connor and M. Welling, “Deep spiking networks,” arXiv preprint arXiv:1602.08323, 2016.

[21] Y. Wu, L. Deng, G. Li, J. Zhu, and L. Shi, “Spatio-temporal backpropagation for training high-performance spiking neural networks,” arXiv preprint arXiv:1706.02609, 2017.

[22] D. Huh and T. J. Sejnowski, “Gradient descent for spiking neural networks,” arXiv preprint arXiv:1706.04698, 2017.

[23] A. Tavanaei and A. S. Maida, “Bp-STDP: Approximating backpropagation using spike timing dependent plasticity,” arXiv preprint arXiv:1711.04214, 2017.

[24] A. Tavanaei and A. S. Maida, “Multi-layer unsupervised learning in a spiking convolutional neural network,” in Neural Networks (IJCNN), 2017 International Joint Conference on. IEEE, 2017, pp. 2023–2030.

[25] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.

[26] J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural Networks, vol. 61, pp. 85–117, 2015.

[27] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.

[28] P. Foldiak, “Adaptive network for optimal linear feature extraction,” in Neural Networks (IJCNN), 1989 International Joint Conference on, vol. 1. IEEE, 1989, pp. 401–405.

[29] P. Foldiak, “Forming sparse representations by local anti-Hebbian learning,” Biological Cybernetics, vol. 64, no. 2, pp. 165–170, 1990.

[30] J. Zylberberg, J. T. Murphy, and M. R. DeWeese, “A sparse coding model with synaptically local plasticity and spiking neurons can account for the diverse shapes of V1 simple cell receptive fields,” PLoS Comput Biol, vol. 7, no. 10, p. e1002250, 2011.

[31] H. Markram, W. Gerstner, and P. J. Sjostrom, “Spike-timing-dependent plasticity: a comprehensive overview,” Frontiers in synaptic neuroscience, vol. 4, 2012.

[32] A. Coates and A. Y. Ng, “Learning feature representations with k-means,” in Neural networks: Tricks of the trade. Springer, 2012, pp. 561–580.

[33] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning. MIT press, 2016.

[34] A. Tavanaei, T. Masquelier, and A. Maida, “Representation learning using event-based STDP,” arXiv preprint arXiv:1706.06699, 2017.

[35] K. S. Burbank, “Mirrored STDP implements autoencoder learning in a network of spiking neurons,” PLoS computational biology, vol. 11, no. 12, p. e1004566, 2015.

[36] P. D. King, J. Zylberberg, and M. R. DeWeese, “Inhibitory interneurons decorrelate excitatory cells to drive sparse code formation in a spiking model of V1,” Journal of Neuroscience, vol. 33, no. 13, pp. 5475–5485, 2013.

[37] C. M. Bishop, Pattern recognition and machine learning. Springer, 2006.

[38] Y. LeCun, C. Cortes, and C. J. Burges, “The MNIST database,” URL http://yann.lecun.com/exdb/mnist, 1998.

[39] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.

[40] C. Eliasmith, T. C. Stewart, X. Choo, T. Bekolay, T. DeWolf, Y. Tang, and D. Rasmussen, “A large-scale model of the functioning brain,” Science, vol. 338, no. 6111, pp. 1202–1205, 2012.

[41] P. U. Diehl and M. Cook, “Unsupervised learning of digit recognition using spike-timing-dependent plasticity,” Frontiers in computational neuroscience, vol. 9, 2015.

[42] E. Neftci, S. Das, B. Pedroni, K. Kreutz-Delgado, and G. Cauwenberghs, “Event-driven contrastive divergence for spiking neuromorphic systems,” Frontiers in neuroscience, vol. 7, p. 272, 2014.

