
J Comput Neurosci (2011) 30:699–720, DOI 10.1007/s10827-010-0287-7

Capacity analysis in multi-state synaptic models: a retrieval probability perspective

Yibi Huang · Yali Amit

Received: 13 April 2010 / Revised: 20 September 2010 / Accepted: 5 October 2010 / Published online: 27 October 2010
© Springer Science+Business Media, LLC 2010

Abstract We define the memory capacity of networks of binary neurons with finite-state synapses in terms of retrieval probabilities of learned patterns under standard asynchronous dynamics with a predetermined threshold. The threshold is set to control the proportion of non-selective neurons that fire. An optimal inhibition level is chosen to stabilize network behavior. For any local learning rule we provide a computationally efficient and highly accurate approximation to the retrieval probability of a pattern as a function of its age. The method is applied to the sequential models (Fusi and Abbott, Nat Neurosci 10:485–493, 2007) and meta-plasticity models (Fusi et al., Neuron 45(4):599–611, 2005; Leibold and Kempter, Cereb Cortex 18:67–77, 2008). We show that as the number of synaptic states increases, the capacity, as defined here, either plateaus or decreases. In the few cases where multi-state models exceed the capacity of binary synapse models the improvement is small.

Action Editor: Mark van Rossum

Electronic supplementary material The online version of this article (doi:10.1007/s10827-010-0287-7) contains supplementary material, which is available to authorized users.

Supported in part by NSF ITR DMS-0706816.

Y. Huang (B) Department of Statistics, University of Chicago, 5734 S University Ave, Chicago, IL 60637, USA, e-mail: [email protected]

Y. Amit Departments of Statistics and Computer Science, University of Chicago, Chicago, IL, USA, e-mail: [email protected]

Keywords Hebbian learning · Network dynamics · Inhibition · Sparse coding

1 Introduction

Inspired by the delay activity observed in numerous delayed match to sample experiments (see Fuster 1995; Miyashita and Hayashi 2000; Wang 2001 for reviews), the stable states of neural network dynamics are considered to be a promising candidate mechanism underlying working memory (see Del Giudice et al. 2003 for a review on this conjecture). The simplest model for synaptic modification leading to memory retrieval can be traced to Willshaw et al. (1969), where synapses between two 'on' binary neurons are simply set to 1 and the others to 0. This network is not dynamic in that learning is performed once for all the patterns. A stochastic modification which allows for dynamic learning and gradual erasure of older patterns from the memory trace was proposed in Amit and Fusi (1994). The network is trained with an uninterrupted sequence of uncorrelated binary stimuli and the synapses are binary variables that transition between the potentiated and depressed states based on the activity of the pre- and post-synaptic neurons of a presented stimulus. This model represents the simplest form of dynamic learning with purely local Hebbian modification of the synapses. Stability of the learned patterns is determined in terms of the signal-to-noise ratio (SNR) of the fields of selective/non-selective neurons for a pattern learned in the past. No explicit readout mechanism for accessing the information stored in the synapses is addressed. The learning process is analyzed using Markov chain theory, showing that the memory trace of old patterns decays exponentially fast with the number of learned patterns.


For any fixed setting of the parameters (coding level and learning rates) the network capacity was shown to grow at most logarithmically in the number of neurons in the network. Higher capacity, up to nearly quadratic in the number of neurons, is achieved by adjusting the coding level and the transition rates to the size of the network.

Fusi et al. (2005) and Fusi and Abbott (2007) extended the analysis to multi-state synapses and hidden synaptic states. Again capacity is studied in terms of the SNR, which can be viewed as the ideal observer point of view. In this context two different but related quantities are studied: the storage capacity on one hand and the information content per synapse on the other, that is, how much information is stored in the synapses assuming it can somehow be fully recovered.

One possible readout mechanism involves pattern retrieval in some form of dynamics. In a recurrent network pattern retrieval means the stored pattern is a stable state of the dynamics. In a two-layer feedforward network pattern retrieval means the activity of a pattern in the cue layer is sufficient to evoke activity of another pattern in the target layer, but there is no self-sustained activity. In either case, a threshold must be chosen. Requiring the SNR above a certain level is necessary but not sufficient for retrieval in both cases. Leibold and Kempter (2008) analyzed a number of learning models in the feedforward setting in terms of retrieval probabilities with a preset threshold. In Barret and van Rossum (2008) there is an attempt to identify the optimal synaptic modification rule. The analysis is limited to a feedforward network with one binary output neuron and the goal is to distinguish between two categories, learned and unlearned patterns. Capacity is related to the error rate of this two-class problem.

Retrieval is more difficult in the recurrent setting than in the feedforward setting. The threshold must on the one hand control the number of non-selective neurons that fire and on the other hand guarantee that a large percent of the selective neurons persist in firing. In recent work, Amit and Huang (2010) and Romani et al. (2008), a mechanism to determine this threshold has been proposed, in terms of the network parameters, that enables stable retrieval of patterns in asynchronous dynamics. Stability of the recurrent network, especially in the presence of varied coding levels, requires the introduction of inhibition. The retrieval probability computation as introduced in Amit and Huang (2010) requires a more careful analysis of the noise, i.e. the variance of the fields, which involves covariances of the synapses. This is then used in a normal approximation to the fields (similar to Leibold and Kempter 2008 and Barret and van Rossum 2008). In addition, the tail of a certain binomial distribution is computed, arising from the approximating assumption that the fields of selective neurons can be considered independent. As shown in Amit and Huang (2010) and in simulations below, these approximations are remarkably accurate when compared to simulation and allow us to obtain precise retrieval probabilities as a function of pattern age. This yields a precise capacity prediction for any size network.

We note that once memory readout is defined as a particular outcome of neural dynamics, whether a particular outcome in a feedforward network, or stability of a pattern in a recurrent network, the notion of information storage in the synapses is not relevant. The synaptic states are simply means to enable the correct readout and are not addressed directly in any way.

In this paper, the retrieval probability framework is used to analyze multi-state synaptic models with binary neurons. In the two-state case, the eigenvalues of the Markov transition matrix can be written explicitly. Except for a few special cases, it is difficult to solve for the eigenvalues of the Markov chain in the multi-state case, not to mention expressing them in terms of the learning parameters. However, apart from explicitly expressing the powers of the transition matrix, much of the analysis in the two-state model can be extended to the multi-state case. A general method to predict the capacity of any local finite multi-state synapse model is given, including the hidden-state, or so-called meta-plasticity, models. We apply this method to the sequential models in Fusi and Abbott (2007), and the cascade models and the serial-state models in Ben Dayan Rubin and Fusi (2007), Fusi et al. (2005) and Leibold and Kempter (2008).

As mentioned above, Barret and van Rossum (2008) studied the problem of optimizing capacity, in the feedforward setting, over an entire family of potentiation rules. Here we primarily focus on the more conservative model of Hebbian learning, i.e. if pre- and post-synaptic neurons are on, potentiation occurs with some probability; if the pre-synaptic neuron is on and the post-synaptic neuron is off, depression occurs with some probability. In this framework we show the importance of optimizing the ratio of potentiation and depression probabilities. This one-parameter optimization yields, in the recurrent setting, results that are qualitatively similar to those of Barret and van Rossum (2008).

We note that a detailed analysis of retrieval in recurrent networks with binary synapses and the more complex and realistic integrate-and-fire neurons can be found in Amit and Brunel (1997a, b) and Curti et al. (2004). For reviews see Amit and Mongillo (2003) and Brunel (2003).


The emphasis there is the behavior of the time-continuous stochastic dynamical system and its stable states, mainly using mean field approximations. Not much can be found on maximal retrieval capacity for such networks.

Our main results are summarized as follows.

1. We show that standard asynchronous dynamics with inhibition and a properly determined threshold yields stable retrieval in networks of binary neurons with discrete synapses. The threshold is set to control the proportion of non-selective neurons that fire. An optimal inhibition level is chosen to stabilize network behavior.

2. For any local learning rule we provide a computationally efficient and highly accurate approximation to the retrieval probability of a pattern as a function of its age. Capacity is defined as the expected number of retrievable patterns over all ages and is equivalent to the sum of the retrieval probabilities over all ages.

3. Many prior studies have pointed out that capacity increases with the sparseness of patterns, but sparseness must be restricted (Amit and Fusi 1994; Amit and Huang 2010), especially for multi-state synapse models (Leibold and Kempter 2008). Indeed, increasing sparseness reduces the initial SNR, and below a certain level no memory trace is left. However, this can be remedied by increasing the ratio between the rate of depression and potentiation.

4. For the seven families of models we consider, as the number of synaptic states increases, the retrieval capacities either drop to zero or plateau at a certain level.

5. With a proper choice of parameters, some of the multi-state models slightly outperform the two-state synapse models; however, in the low coding regimes where capacity is large the difference is negligible. Dramatic improvement in retrieval capacity beyond two-state models is not observed. In each model retrieval capacity is optimized by decreasing the coding level down to some critical level, while optimizing the ratio of potentiation and depression.

The article is organized as follows. Section 2 provides a general framework for all multi-state synapse models, including the mean, variance, and distribution of the field (Section 2.2), threshold selection and inhibition (Section 2.3), and an approximate formula for retrieval probability as a function of pattern age (Section 2.4). The accuracy of the predicted probabilities is demonstrated on a large network of 80,000 neurons.

The retrieval probability analysis is then applied to two sequential models in Section 3.2. We also show, numerically, the asymptotic quadratic behavior of capacity as the network size increases for one of these models. In Section 3.3 we treat cascade models and compare them to the two-state model. The probability approximations are further validated with simulation in Section 4. For simplicity, we demonstrate the method for fully connected networks with stimuli of a single coding level. With a slight modification, the method can be applied to randomly connected networks (Section 5.1), multiple-level coding stimuli (Section 5.2), and feed-forward networks (Section 5.3). In Section 6 we discuss the question of observed higher coding levels in the brain, which in Fusi and Abbott (2007) was one of the motivations for introducing multi-state models. We also recap the main conclusions of the analysis.

2 Multi-state and hidden-state synapse models

2.1 Framework

Synaptic states In a multi-state synapse model, synapses assume a finite number of states {α1, α2, . . . , αM}. Let w = (w(α1), . . . , w(αM)) be the vector of efficacies corresponding to the M states. When more than one state corresponds to the same efficacy level, the states are called hidden states or meta states, as they are not directly observable. Synapses moving between hidden states do not necessarily change efficacy, but they may modify the probability of changing the efficacy. Such variation in synaptic plasticity is called meta-plasticity (Abraham and Bear 1996).

We represent the state of the synapse with pre-synaptic neuron j and post-synaptic neuron i using an indicator vector

J_ij = (J_ij,1, . . . , J_ij,M),

where J_ij,m = 0/1 according to whether the synapse is at state α_m. The efficacy of the synapse W_ij is thus the vector product W_ij = J_ij w^T.

Stimulus/Pattern For a fully-connected network of N neurons, the input stimulus is coded as ξ = (ξ_1, . . . , ξ_N),

where ξ_i = 0/1 represents the firing status of neuron i when stimulus ξ is present. For a given stimulus ξ the selective neurons are those with ξ_i = 1; the others are called non-selective. The sequence of training stimuli is denoted ξ^(1), ξ^(2), . . .. In this paper, the two terms "stimulus" and "pattern" are used interchangeably.


Learning rule Learning is assumed local. A synapse transitions from state α_m to α_m′ with probability q^{kk′}_{mm′}, which depends only on the firing status k′, k = 0/1 of the pre- and post-synaptic neuron. Let Q11 be the M × M matrix

Q11 = (q^{11}_{mm′})_{m,m′ = 1,...,M}   (1)

and Q01, Q10, Q00 are defined accordingly.

Network dynamics We use simple asynchronous dynamics, updating one randomly selected neuron at a time. If the current state of the network is denoted as x^(t) = (x^(t)_1, . . . , x^(t)_N) and if neuron i is being updated, we have

x^(t+1)_i = 1 if Σ_{j≠i} W_ij x^(t)_j − η Σ_j x^(t)_j > θ, and 0 otherwise,   (2)

and x^(t+1)_j = x^(t)_j for all other neurons. We assume that learning does not occur during the dynamics, i.e., {W_ij} is fixed. The choices of the threshold θ and the inhibition factor η are detailed in Section 2.3. The quantity

h_i(x; W) = Σ_{j≠i} W_ij x_j − η Σ_j x_j

is called the field of neuron i, in which W = (W_ij) is the synaptic efficacy matrix, and Σ_{j≠i} W_ij x_j is the synaptic input to neuron i. Note an additional inhibitory input −η Σ_j x_j is added to the conventional form of the field. The reason is detailed in Section 2.3.

Let J^(p)_ij be the state of the synapse from neuron j to i after learning ξ^(1), ξ^(2), . . . , ξ^(p), and let W^(p) = (W^(p)_ij) = (J^(p)_ij w^T). Whether the first pattern ξ^(1) can be retrieved depends on whether it is a stable fixed point of the dynamics with W = W^(p), i.e., that h_i(ξ^(1); W^(p)) > θ if ξ^(1)_i = 1, and < θ if ξ^(1)_i = 0. Practically, the retrieved pattern does not have to be exactly the same as the stored pattern: if, initialized with ξ^(1), the network stabilizes at a pattern sufficiently close to ξ^(1), then the retrieval is considered successful. Table 1 summarizes the notation used in the paper.
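As an illustration of the update rule in Eq. (2), the following sketch performs one asynchronous update step. It assumes numpy; the function name and the random-number handling are our own choices, not part of the original model specification.

import numpy as np

def async_step(x, W, eta, theta, rng):
    """One asynchronous update of Eq. (2): pick a random neuron i and reset
    its state according to its field h_i(x; W)."""
    i = rng.integers(len(x))
    field = W[i] @ x - W[i, i] * x[i] - eta * x.sum()   # sum over j != i, minus inhibition
    x[i] = 1 if field > theta else 0
    return x

# Usage (given some weight matrix W and 0/1 activity vector x):
# rng = np.random.default_rng(0)
# for _ in range(10 * len(x)):
#     x = async_step(x, W, eta, theta, rng)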

2.2 Mean, variance, and distribution of the field

For notational simplicity, we write h_i(ξ^(1); W^(p)) as h^(p)_i, i.e.,

h^(p)_i = Σ_{j≠i} W^(p)_ij ξ^(1)_j − η Σ_j ξ^(1)_j.

Let

M^(p)_x := E[h^(p)_i | ξ^(1)_i = x],
(R^(p)_x)^2 := Var(h^(p)_i | ξ^(1)_i = x)

be the mean and variance of the field of selective (x = 1) and non-selective (x = 0) neurons.

Table 1 Notation

f: Coding level
m: Number of synaptic states
η: Inhibition factor
θ: Threshold
C_{δ,f}: (1 − δf/(1 − f))-quantile of the standard normal
J_ij: Indicator vector for the state of the synapse from j to i
W^(p)_ij: Synaptic weight from j to i at step p
q+, q−: Potentiation and depression probability factors
τ: Potentiation to depression ratio
ξ^(p): p'th learned pattern
h^(p)_i: The field on neuron i at step p produced by ξ^(1)
μ^(p)_x, σ^(p)_x: Mean and SD of W^(p)_ij when ξ^(1)_i = x = 0/1, ξ^(1)_j = 1
ρ^(p)_x: Covariance of W^(p)_ij, W^(p)_ik when ξ^(1)_i = x, ξ^(1)_j = ξ^(1)_k = 1
π: Stationary distribution of the synaptic state J
M^(p)_x: Mean of the field at step p when ξ^(1)_i = x = 0/1
R^(p)_x: SD of the field at step p when ξ^(1)_i = x = 0/1
δ: Fraction of non-selective neurons allowed to fire
ε: Allowed error rate in retrieval
P^(p)_ε: Retrieval probability with error ε at step p

For all variables with a p superscript, when the superscript is removed we refer to the asymptotic value p = ∞.

For simplicity, assume the training stimuli are random patterns of a single coding level f, that is, neurons will be selective independently with probability f (see Section 5.2 for the multiple-level coding case). Conditional on the number of selective neurons Σ_j ξ^(1)_j = n of the first learned pattern, M^(p)_x and (R^(p)_x)^2 can be evaluated through the following four steps.

1. In terms of the four Q-matrices (Eq. (1)), define

P1 = f Q11 + (1 − f) Q10,   P0 = f Q01 + (1 − f) Q00,   (3)

and the two transition matrices

P = f P1 + (1 − f) P0,   (4)

S = f P1 ⊗ P1 + (1 − f) P0 ⊗ P0.   (5)

Here ⊗ denotes the Kronecker product (see Appendix A.1). Find the stationary distributions π and γ of the transition matrices P and S:

π P = π,   γ S = γ.

2. For x = 0, 1, calculate the p-step distributions

π^(p)_x := E[J^(p)_ij | ξ^(1)_i = x, ξ^(1)_j = 1],
γ^(p)_x := E[J^(p)_ij ⊗ J^(p)_ik | ξ^(1)_i = x, ξ^(1)_j = ξ^(1)_k = 1]

iteratively as follows:

π^(1)_x = π Q_{x1},   γ^(1)_x = γ (Q_{x1} ⊗ Q_{x1}),   (6)

π^(p)_x = π^(p−1)_x P,   γ^(p)_x = γ^(p−1)_x S,   for p = 2, 3, . . .   (7)

3. Calculate the p-step mean, variance, and covariance of the synaptic efficacy as

μ^(p)_x := E[W^(p)_ij | ξ^(1)_i = x, ξ^(1)_j = 1] = π^(p)_x w^T,   (8)

(σ^(p)_x)^2 := Var(W^(p)_ij | ξ^(1)_i = x, ξ^(1)_j = 1) = w Diag(π^(p)_x) w^T − (μ^(p)_x)^2,   (9)

ρ^(p)_x := Cov(W^(p)_ij, W^(p)_ik | ξ^(1)_i = x, ξ^(1)_j = ξ^(1)_k = 1) = γ^(p)_x (w ⊗ w)^T − (μ^(p)_x)^2.   (10)

Here Diag(π^(p)_x) is the diagonal matrix with π^(p)_x on the diagonal.

4. Finally,

M^(p)_x = n(μ^(p)_x − η),   (R^(p)_x)^2 = n(σ^(p)_x)^2 + n(n − 1)ρ^(p)_x.   (11)

See Appendix A.2 for justification.

Since π^(p)_x → π and γ^(p)_x → γ, the stationary mean, variance, and covariance of the synaptic efficacy are

μ := μ^(∞)_x = π w^T,   (12)

σ^2 := (σ^(∞)_x)^2 = w Diag(π) w^T − μ^2,   (13)

ρ := ρ^(∞)_x = γ (w ⊗ w)^T − μ^2,   (14)

and the stationary mean and variance of the fields are

M^(∞)_1 = M^(∞)_0 = n(μ − η),   R^2 := (R^(∞)_x)^2 = nσ^2 + n(n − 1)ρ.
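For concreteness, here is a minimal numerical sketch of steps 1-4 above, assuming numpy. The function names (stationary, field_moments) are ours, and the stationary distributions are obtained by an eigenvector solve, which is our choice of method and not prescribed by the text.

import numpy as np

def stationary(T):
    """Left stationary distribution of a row-stochastic matrix T (solves vT = v)."""
    vals, vecs = np.linalg.eig(T.T)
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return v / v.sum()

def field_moments(Q11, Q01, Q10, Q00, w, f, p, n, eta):
    """Mean and SD of the field at age p (steps 1-4 of Section 2.2)."""
    # Step 1: averaged transition matrices and their stationary distributions.
    P1 = f * Q11 + (1 - f) * Q10
    P0 = f * Q01 + (1 - f) * Q00
    P = f * P1 + (1 - f) * P0                               # Eq. (4)
    S = f * np.kron(P1, P1) + (1 - f) * np.kron(P0, P0)     # Eq. (5)
    pi, gamma = stationary(P), stationary(S)
    out = {}
    for x, Qx1 in ((0, Q01), (1, Q11)):
        # Step 2: p-step distributions, Eqs. (6)-(7).
        pi_p = pi @ Qx1
        ga_p = gamma @ np.kron(Qx1, Qx1)
        for _ in range(p - 1):
            pi_p, ga_p = pi_p @ P, ga_p @ S
        # Step 3: moments of the synaptic efficacy, Eqs. (8)-(10).
        mu = pi_p @ w
        sig2 = pi_p @ (w ** 2) - mu ** 2
        rho = ga_p @ np.kron(w, w) - mu ** 2
        # Step 4: mean and SD of the field, Eq. (11).
        M = n * (mu - eta)
        R = np.sqrt(n * sig2 + n * (n - 1) * rho)
        out[x] = (M, R)
    return out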

Though our capacity prediction is not based on the signal-to-noise ratio (SNR), the SNR does provide insight into the capacity of models. The signal is the difference between the mean field of selective neurons and the stationary mean field, M^(p)_1 − M^(∞)_x = n(μ^(p)_1 − μ). The SNR is the ratio of the squared signal to the stationary variance of the field R^2:

SNR = n^2 (μ^(p)_1 − μ)^2 / (nσ^2 + n(n − 1)ρ) ≈ n(μ^(p)_1 − μ)^2 / σ^2.   (15)

The magnitudes of the covariances ρ and ρ^(p)_x can be shown to be on the order of the coding level of the stimuli (see Supplementary Material), and hence they are often ignored in the SNR analysis (e.g. Amit and Fusi 1994; Fusi and Abbott 2007; Fusi et al. 2005; Leibold and Kempter 2008). In the retrieval probability framework this leads to capacity overestimation, especially when the coding level is large.

The exact p-step distribution of the field is derived in the Supplementary Material, but involves (M − 1)-dimensional integration, where M is the number of synaptic states. This is computationally cumbersome, especially for large M. Instead we approximate the distribution of the field h^(p)_i by a normal distribution with mean M^(p)_x and variance (R^(p)_x)^2, conditional on the selectivity x of neuron i and the number of selective neurons Σ_j ξ^(1)_j = n of the first pattern. Simulations show this approximation does not lead to loss in accuracy in predicting the retrieval probabilities. Similar approximations were used in Leibold and Kempter (2008), but ρ and ρ^(p)_x were not included.

2.3 Inhibition and threshold selection

Successful retrieval of a pattern in network dynamics depends on the threshold. If the threshold perfectly separates the fields of the selective and non-selective neurons, the firing pattern will clearly sustain itself. Without inhibition (i.e., η = 0), from Eq. (11) one can see that the fields of both selective and non-selective neurons grow with the initial pattern size n = Σ_j ξ^(1)_j. The separating threshold must grow with n as well, or memory retrieval will fail. Clearly the firing threshold cannot be expected to depend on the size of the pattern being retrieved. Consequently, only patterns with size in a very limited range can be retrieved. Inhibition is thus introduced. We assume the inhibitory neurons reduce the field of the excitatory neurons by an amount proportional to the number of active excitatory neurons. For simplicity, the learning of excitatory-inhibitory and inhibitory-inhibitory synapses is not considered.

The primary consideration in setting the threshold is minimizing false positives, namely the probability of a non-selective neuron firing. The pool of non-selective neurons is much larger, and if even a small proportion fires the original pattern will be swamped. Consequently the threshold must be several standard deviations (SD) above the mean field of non-selective neurons. The asymptotic SD of the field conditional on the pattern size is R = √(nσ^2 + n(n − 1)ρ). Thus the threshold will be of the form

θ = n(μ − η) + CR = n[μ − η + C√(ρ + (σ^2 − ρ)/n)].   (16)


This, however, grows linearly with n. As in Amit and Huang (2010), by choosing

η = μ + C√ρ,   (17)

the threshold

θ_n = nC(√(ρ + (σ^2 − ρ)/n) − √ρ) = C(σ^2 − ρ) / (√(ρ + (σ^2 − ρ)/n) + √ρ)

increases as √n and is bounded above by C(σ^2 − ρ)/(2√ρ). Although the threshold still increases with n, it can be shown that the number of false positives is insensitive to the choice of n and hence is still under control (see Supplementary Material). Since the average pattern size is n = Nf, we set the threshold to be

θ = θ_{Nf} = C(σ^2 − ρ) / (√(ρ + (σ^2 − ρ)/(Nf)) + √ρ).   (18)

It remains to determine the constant C. Assume the fields of the non-selective neurons are normal with mean M^(p)_0 ≈ n(μ − η) (since μ^(p)_0 ≈ μ) and variance (R^(p)_0)^2 ≈ R^2. The expected number of non-selective neurons above the threshold is then approximately

N(1 − f)(1 − Φ((θ − n(μ − η))/R)) = N(1 − f)(1 − Φ(C)),

where Φ is the distribution function of the standard normal distribution. Since we want to keep the expected number of non-selective neurons above threshold at a fraction δ of the number of selective neurons, and the average number of selective neurons is Nf, we require N(1 − f)(1 − Φ(C)) ≤ δNf. Thus C can be chosen to be C_{δ,f}, the (1 − δf/(1 − f))-quantile of the standard normal. Though the actual distribution of the field is not normal, this approximation works well as long as Nf is not too small. A rule of thumb is Nf > 30. Below are values of C_{δ,f} for some f and δ:

f           0.005   0.01   0.02   0.05   0.1
δ = 0.005   4.05    3.89   3.71   3.47   3.26
δ = 0.01    3.89    3.72   3.53   3.28   3.06

In Amit and Huang (2010) we used C_{δ,N}, the (1 − δ)^{1/N}-quantile of the standard normal, which is too conservative, as it is unnecessary to keep the fields of all non-selective neurons below the threshold. The modified value leads to higher capacity levels.
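As a check on the numbers above, the quantile C_{δ,f} and the resulting η and θ of Eqs. (17)-(18) can be computed directly. This is a minimal sketch assuming scipy and numpy; the function name is our own.

import numpy as np
from scipy.stats import norm

def threshold_and_inhibition(mu, sigma2, rho, N, f, delta):
    """C_{delta,f}, inhibition eta (Eq. (17)) and threshold theta (Eq. (18))
    from the stationary synaptic moments mu, sigma2, rho."""
    C = norm.ppf(1 - delta * f / (1 - f))
    eta = mu + C * np.sqrt(rho)
    theta = C * (sigma2 - rho) / (np.sqrt(rho + (sigma2 - rho) / (N * f)) + np.sqrt(rho))
    return C, eta, theta

# Reproducing one entry of the table above (f = 0.01, delta = 0.01):
print(round(norm.ppf(1 - 0.01 * 0.01 / 0.99), 2))   # approximately 3.72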

2.4 Retrieval probability and network capacity

Before calculating the retrieval probabilities, a quantitative description of retrieval is required. Suppose ξ is a pattern with n selective neurons. When the network is initialized with ξ, we say ξ is successfully retrieved if, after the dynamics has stabilized, the number of selective neurons that are on stays larger than (1 − ε)n.¹

Since it is difficult to analyze the behavior of the dynamics, we approximate the retrieval by an event that depends only on properties of the field of the initial input. Given Σ_j ξ^(1)_j = n, the field induced by (1 − ε)n of the selective neurons needs to keep the field of at least (1 − ε)n selective neurons above threshold. If this is true, then the number of active selective neurons is likely to stay equal to or above (1 − ε)n throughout the dynamics.

The field induced by n′ = (1 − ε)n selective neurons has mean n′(μ^(p)_1 − η) and variance n′(σ^(p)_1)^2 + n′(n′ − 1)ρ^(p)_1. By the normal approximation, the probability that this field is above θ is

Ψ^(p)_{n,ε} = 1 − Φ( (θ − n′(μ^(p)_1 − η)) / √(n′(σ^(p)_1)^2 + n′(n′ − 1)ρ^(p)_1) ).   (19)

This approximation works well when n′ = n(1 − ε) is sufficiently large.

The probability that the fields of at least (1 − ε)n of the selective neurons are above the threshold is approximated as

P^(p)_{n,ε} ≈ Σ_{k ≥ n(1−ε)} (n choose k) (Ψ^(p)_{n,ε})^k (1 − Ψ^(p)_{n,ε})^{n−k}.   (20)

Here we have also assumed independence of the fields of different neurons. Strictly speaking these fields are not independent, but their dependence is weak when f is small. As the patterns are of coding level f, the probability that ξ^(1) is of size n is (N choose n) f^n (1 − f)^{N−n}. The probability of retrieving ξ^(1) after learning ξ^(1), . . . , ξ^(p) is

P^(p)_ε = Σ_{n=0}^{N} P^(p)_{n,ε} (N choose n) f^n (1 − f)^{N−n}.   (21)

If we use the expected number of retrievable patterns as the quantitative measure of network capacity, the capacity is simply the sum of the retrieval probabilities over all ages p:

Capacity = Σ_{p=1}^{∞} P^(p)_ε.   (22)

¹The fraction of non-selective neurons above the threshold is not specified in the criterion since it has been controlled in the threshold selection. Moreover, because of the strong inhibition, there cannot be many non-selective neurons above threshold throughout the dynamics.
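Equations (19)-(22) translate directly into a short numerical routine. The sketch below assumes scipy and numpy; the argument `moments` stands for a user-supplied function returning the p-step selective-synapse moments (μ^(p)_1, (σ^(p)_1)^2, ρ^(p)_1), for instance obtained from the step-by-step computation of Section 2.2, and all names are ours.

import numpy as np
from scipy.stats import norm, binom

def retrieval_prob(p, N, f, eps, eta, theta, moments):
    """P_eps^(p) of Eq. (21): Eq. (20) averaged over the pattern size n."""
    mu1, sig2_1, rho1 = moments(p)          # p-step moments for selective pairs
    P = 0.0
    # Restrict to pattern sizes with non-negligible binomial weight.
    for n in range(max(2, int(N * f / 2)), int(2 * N * f) + 1):
        m = int(np.ceil((1 - eps) * n))     # n' = (1 - eps) n active selective neurons
        mean = m * (mu1 - eta)
        sd = np.sqrt(m * sig2_1 + m * (m - 1) * rho1)
        psi = 1.0 - norm.cdf((theta - mean) / sd)   # Eq. (19)
        P_n = binom.sf(m - 1, n, psi)               # Eq. (20): at least m of n fields above theta
        P += P_n * binom.pmf(n, N, f)               # weight by the size distribution of xi^(1)
    return P

def capacity(N, f, eps, eta, theta, moments, tol=1e-4, p_max=10**6):
    """Eq. (22): accumulate P_eps^(p) over ages p until it becomes negligible."""
    total, p = 0.0, 1
    while p <= p_max:
        Pp = retrieval_prob(p, N, f, eps, eta, theta, moments)
        total += Pp
        if Pp < tol and p > 1:
            break
        p += 1
    return total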


Note that if the synaptic efficacies of all states are transformed as w → aw + b, the probabilities P^(p)_ε and the capacity remain the same. As the synaptic efficacy changes scale, the inhibition factor η and the firing threshold θ change accordingly, and the retrieval probability is unaffected. This is not surprising, as the capacity of a model should not change with the units used to measure the synaptic efficacy.

In Section 4 we present retrieval probability predictions for a number of different synaptic modification models and illustrate the accuracy of these predictions with respect to simulations.

As an initial example, in Fig. 1 we show the accuracy of the retrieval probabilities for a network of 80,000 neurons with binary synapses, where the parameters have been set in such a way that retrieval capacity is around 94,000. In red are the predicted probabilities; in black is the optimal monotone regression on probabilities estimated from ten simulations of this network.

Fig. 1 Predicted retrieval probability for a large network N = 80,000 as a function of pattern age based on Eq. (21) (red) vs. retrieval probabilities estimated from ten simulated networks. x-axis: age of patterns (p), y-axis: retrieval probability P^(p)_ε. Top: error ε = .01, Bottom: error ε = .05. Black shows the best monotone fit to simulation probabilities. In each run, 120,000 patterns are learned and retrieval is assessed using asynchronous dynamics. f = .002, τ = 1.141, η = .514, θ = .00024. Estimated and predicted capacity is 94,000

Note the very close agreement between prediction and simulation for the two values of ε, the allowable retrieval error. In the binary synapse case, using bitwise coding of synaptic and neural states it is possible to simulate such a large network rather efficiently. Learning 120,000 patterns and testing their retrieval takes a few hours on a 12-core PC.

2.5 Comparison of the SNR analysis and the retrieval probability approach

The behavior of multi-state synaptic models is much richer than that of two-state models, but they are often difficult to analyze algebraically. The retrieval probability approach described above provides a computationally efficient method to numerically predict the capacity with high precision. Given the parameters, the capacity can be computed in seconds. This is particularly helpful for networks that are too large to simulate in computers.

Retrieval probability and SNR analyses do not always agree; however, a large SNR is a necessary condition for retrieval. From Eqs. (15) and (16), observe that SNR < C_{δ,f}^2 implies

n(μ^(p)_1 − η) < θ_n ≤ θ_{Nf} = θ

if the pattern size n ≤ Nf (note ρ is always non-negative). Hence from Eq. (19), Ψ^(p)_{n,ε} < 0.5. Note that P^(p)_{n,ε} ≈ 0 as long as Ψ^(p)_{n,ε} < 0.9. Since the coding level is fixed, the pattern size will not be far from Nf. For n slightly larger than Nf, Ψ^(p)_{n,ε} is still less than 0.9 and hence P^(p)_{n,ε} ≈ 0. Thus a small SNR (< C_{δ,f}^2) implies little or no retrieval capacity.

Although the p-step SNR is often intractable, the algebraic form of the initial SNR is sometimes available. From the argument above, initial SNR < C_{δ,f}^2 implies little or no capacity. In some examples in Section 3, we will use the initial SNR to motivate some parameter choices and explain some results obtained via the retrieval probability approach.

3 Examples

In this section we analyze in detail a number of synaptic modification models. All capacity results reported are given in terms of the expected number of retrievable patterns (Eq. (22)) with δ = 0.01, ε = 0.05. Also, the value of τ is optimized with respect to this measure of capacity. In some of the models a signal-to-noise analysis helps motivate some of the parameter choices.


3.1 Two-state synapses

Two-state synapse models have been analyzed in detail in Amit and Fusi (1994), Amit and Huang (2010) and Romani et al. (2008). We briefly summarize the results to compare with other models. In this model, a synapse is either potentiated or depressed, with efficacies W− or W+, i.e. w = (W−, W+). As the memory capacity is scale and shift invariant, we simply assume W− = 0, W+ = 1. A depressed synapse has probability q+ to be potentiated when both the pre- and postsynaptic neurons are active, and a potentiated synapse has probability q− to be depressed when the presynaptic neuron is active and the postsynaptic neuron is silent. Otherwise the efficacy is unchanged. The four Q-matrices are thus

Q11 = ( 1 − q+  q+ ; 0  1 ),   Q01 = ( 1  0 ; q−  1 − q− ),   Q10 = Q00 = I_2,

where I_n is the n × n identity matrix (rows separated by semicolons). Suppose q− = τ f q+/(1 − f). Then π = (π_0, π_1) = (τ/(1 + τ), 1/(1 + τ)), μ = π_1, σ^2 = π_0 π_1, and

μ^(p)_1 = π_1 + λ^{p−1} π_0 q+,   μ^(p)_0 = π_1 − λ^{p−1} π_1 q−,

where λ = 1 − f^2 q+ − f(1 − f) q− = 1 − (1 + τ) f^2 q+. In this model, the algebraic form of the SNR is available:

SNR ≈ n(μ^(p)_1 − μ)^2 / σ^2 = n λ^{2p−2} π_0^2 q+^2 / (π_1 π_0) = n τ q+^2 λ^{2p−2}.

As the stimuli coding level is f, the average pattern size is Nf, and hence SNR ≈ Nf λ^{2p−2} τ q+^2. Requiring SNR ≥ C, for some constant C, gives a rough relationship between the capacity and the model parameters,

p < ln(Nf τ q+^2 / C) / (−2 ln λ) ≈ ln(Nf τ q+^2 / C) / (2 f^2 q+ (1 + τ)),   (23)

since −ln λ = −ln(1 − f^2 q+(1 + τ)) ≈ f^2 q+(1 + τ) when f is small.
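The rough bound in Eq. (23) is easy to evaluate numerically. Below is a minimal sketch assuming numpy; the function name is ours, and the SNR level C required for retrieval is treated as an input since the text leaves the constant unspecified.

import numpy as np

def two_state_rough_capacity(N, f, q_plus, tau, C):
    """Rough SNR-based age bound of Eq. (23) for the two-state model."""
    lam = 1 - (1 + tau) * f**2 * q_plus      # decay factor per learned pattern
    snr0 = N * f * tau * q_plus**2           # initial SNR (p = 1)
    if snr0 <= C:
        return 0.0
    return np.log(snr0 / C) / (-2 * np.log(lam))

# e.g. two_state_rough_capacity(N=10_000, f=0.01, q_plus=1.0, tau=1.0, C=10.0)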

A few facts can be observed from Eq. (23). First, for fixed q+ and τ, the capacity is O(f^{−2}). Second, the argument of the logarithm in Eq. (23) must be greater than 1, i.e., the SNR at the first step, given by Nf τ q+^2, must be greater than C; otherwise the capacity is 0. Usually this is viewed as a constraint on the sparseness of the stimuli, especially for slow learning (q+ small). However, this constraint can be removed by properly adjusting τ to keep the initial SNR above C. The problem is that it is unclear what the constant C should be. With retrieval capacity (Eq. (22)) all constants are determined by retrieval criteria and it is possible to optimize τ for fixed values of f and q+. For fixed τ, the capacity initially increases as f decreases, peaks at a certain f and then drops to 0 (Fig. 2(a)). When τ is optimized, sparser stimuli can be used, and the capacity can be further increased. In Fig. 2(b) τ is optimized for retrieval capacity, which is considerably improved.

Another limit on sparseness that arises when using the retrieval probability analysis is the normal approximation (Eq. (19)). This works well only when n ≈ Nf is not too small. Thus we restrict Nf ≥ 30. Moreover, τ cannot be raised arbitrarily, since for two-state synapses the distribution of the mean field is a mixture of Binomial distributions (Amit and Huang 2010). For the normal approximation to work, nπ_1 ≈ Nf/(1 + τ) cannot be too small. We restrict τ to the region Nf/(1 + τ) ≥ 5 (Fig. 2(c)).

Fig. 2 Capacity of the two-state synapse model for N = 10,000. (a) When τ is fixed at 1, the capacity initially increases as f decreases, peaks at a certain f and then drops to 0. The peak capacity decreases rapidly with q+, and the optimal f increases as q+ decreases. For q+ ≤ 0.3, the capacity is 0 for all f. (b) As τ is optimized for each f and q+, the capacity for each f and the peak capacity are increased considerably. Even for q+ = 0.3, the capacity can reach 1,000. (c) The corresponding optimal τ of (b). Note we restrict τ outside the region nπ_1 = Nf/(1 + τ) < 5 (gray area)


Intuitively, smaller τ implies a lower probability of depression (q−), or slow forgetting of learned patterns, and hence the capacity increases. However, decreasing τ will raise the field of the background neurons (recall π_1 = 1/(1 + τ)), reduce the contrast between the selective and background neurons, and hence hinder memory retrieval. Properly obliterating old patterns and keeping the proportion of potentiated synapses at a reasonable level is necessary for the network to distinguish the selective neurons from the non-selective neurons.

3.2 Sequential models

In a sequential model, the synaptic efficacies w take m + 1 discrete values 0, 1/m, 2/m, . . . , m/m, equispaced between 0 and 1. Whenever a synapse with efficacy w is potentiated or depressed, its synaptic efficacy increases or decreases by 1/m, with probability q+ κ+(w) or q− κ−(w) respectively. Here q+ and q− are two scalar parameters between 0 and 1. We assume a synapse is potentiated only when both pre- and postsynaptic neurons are active, and depressed only when the presynaptic neuron is active and the postsynaptic neuron is silent. In the notation of Section 2.2, the state space is {0, 1, . . . , m} with efficacy vector w = (0, 1/m, 2/m, . . . , m/m). The Q-matrices are Q00 = Q10 = I_{m+1}, Q11 = I_{m+1} + q+ D+, and Q01 = I_{m+1} + q− D−, in which I_k denotes a k × k identity matrix and D+, D− are the (m + 1) × (m + 1) bidiagonal matrices with entries

(D+)_{i,i} = −κ+(i/m), (D+)_{i,i+1} = κ+(i/m) for i = 0, . . . , m − 1, and zeros in the last row;

(D−)_{i,i} = −κ−(i/m), (D−)_{i,i−1} = κ−(i/m) for i = 1, . . . , m, and zeros in the first row.
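The following sketch builds D+ and D− numerically for a given pair of functions κ+, κ−. It assumes numpy, the function name is ours, and the hard-bound and soft-bound choices of κ± introduced in the next two subsections are shown as usage examples.

import numpy as np

def sequential_D_matrices(m, kappa_plus, kappa_minus):
    """Bidiagonal D+ and D- for a sequential model with m+1 efficacy levels
    0, 1/m, ..., 1; kappa_plus/kappa_minus are functions of the efficacy w."""
    w = np.arange(m + 1) / m
    Dp = np.zeros((m + 1, m + 1))
    Dm = np.zeros((m + 1, m + 1))
    for i in range(m):                      # potentiation: level i -> i + 1
        Dp[i, i], Dp[i, i + 1] = -kappa_plus(w[i]), kappa_plus(w[i])
    for i in range(1, m + 1):               # depression: level i -> i - 1
        Dm[i, i], Dm[i, i - 1] = -kappa_minus(w[i]), kappa_minus(w[i])
    return Dp, Dm

# Hard bound (Section 3.2.1): kappa_plus = kappa_minus = 1 away from the barriers.
Dp_hard, Dm_hard = sequential_D_matrices(5, lambda w: 1.0, lambda w: 1.0)
# Soft bound (Section 3.2.2): kappa_plus(w) = 1 - w, kappa_minus(w) = w.
Dp_soft, Dm_soft = sequential_D_matrices(5, lambda w: 1 - w, lambda w: w)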

We now explore how different plasticity modifications κ+(w) and κ−(w) affect the memory performance.

3.2.1 Hard-bound models

In a hard-bound model, κ+(w) and κ−(w) are independent of the efficacy w:

κ+(w) = κ−(w) = 1

for 0 < w < 1, and at the boundaries w = 0 and 1 any transitions that would move the synaptic efficacy outside the range [0, 1] are truncated, i.e. κ−(0) = 0, κ+(1) = 0. This model is equivalent to the one-dimensional random walk on the integers with two barriers, at 0 and m. Suppose q− = τ f q+/(1 − f). The case τ = 1 corresponds to the symmetric random walk, since the marginal probabilities of potentiation, f^2 q+, and depression, f(1 − f) q−, are equal. The stationary distribution of the symmetric random walk is uniform. For τ ≠ 1, the stationary distribution is truncated geometric. The mean, variance, and initial SNR can be easily derived and are listed in Table 2 in the Appendix.

For τ = 1, the initial SNR ≈ 12 Nf q+^2/(m + 1)^2 decreases quadratically with m. As m increases, the SNR soon falls below C. Hence the capacity is zero unless m is small. These observations are confirmed in terms of retrieval capacity in Fig. 3(a). For τ < 1, the initial SNR is also too small to maintain any memory.

From Fig. 3(b), for τ > 1, one can see that the capacity is nearly constant in m as m gets large. Actually, as m gets large, the model approaches the asymmetric simple random walk on the non-negative integers 0, 1, 2, . . ., with a single barrier at 0. The stationary state of the model is the geometric distribution

π = (1 − τ^{−1})(1, τ^{−1}, τ^{−2}, . . .).

Most synapses stay at the lowest few levels. Higher levels are rarely visited and have little effect on learning. Further increasing the number of synaptic levels does not change the model very much.
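This geometric limit is easy to verify numerically. The self-contained snippet below, with our own illustrative parameters (m = 20, τ = 5), builds the averaged one-pattern chain P = f P1 + (1 − f) P0 for the hard-bound model and compares its stationary distribution with (1 − τ^{−1}) τ^{−i}.

import numpy as np

# Numeric check: for tau > 1 and large m, the stationary distribution of the
# hard-bound chain is close to the geometric law (1 - 1/tau) * tau**(-i).
m, f, q_plus, tau = 20, 0.01, 1.0, 5.0
q_minus = tau * f * q_plus / (1 - f)
P = np.eye(m + 1)
up = f * f * q_plus                  # potentiation probability per learned pattern
down = f * (1 - f) * q_minus         # depression probability per learned pattern
for i in range(m):
    P[i, i] -= up; P[i, i + 1] += up
for i in range(1, m + 1):
    P[i, i] -= down; P[i, i - 1] += down
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1))]); pi /= pi.sum()
geo = (1 - 1 / tau) * tau ** (-np.arange(m + 1.0))
print(np.round(pi[:5], 4))
print(np.round(geo[:5], 4))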

Like the two-state model, we find the optimal τ that maximizes the capacity for each given m and f. Figure 3(c and d) shows the result for N = 10,000, m = 1, 2, 3, 10, and .003 ≤ f ≤ .1, in which Fig. 3(c) is the optimal capacity and Fig. 3(d) is the corresponding optimal τ. The case m = 1 reduces to the two-state model. For low coding levels, the 3-level model (m = 2) is the best, at approximately 3-25% higher capacity than the optimal two-state model. More synaptic levels do not further increase capacity. For larger coding levels, the optimal hard-bound model is no better than the optimal two-state model.

We can also use the capacity formula to numerically analyze the asymptotic capacity as N increases. In Fig. 3(e) we show the square root of the predicted capacity as a function of N for the hard-bound model with m = 5, where f = 4N/log N and τ = 5. We see a nearly perfect linear fit with a slope of .008.


Fig. 3 Hard-bound model, N = 10,000: (a) For τ = 1, the capacity decreases rapidly with m and is nonzero only when m is small. (b) When τ < 1, the capacity quickly drops to 0 as m increases. When τ > 1, the capacity is nearly constant in m when m is large. (c) The optimal capacity of the hard-bound model for N = 10,000, m = 1, 2, 3, 10, and .003 ≤ f ≤ .1, optimized over τ. (d) The corresponding optimal τ. (e) Square root of capacity as a function of N for the hard-bound model. m = 5, f = 4N/log N and τ = 5, for N = 10, 20, 40, 80, . . . , 640 × 10^3. The slope is .008

Fig. 4 (a) The capacity of the soft-bound model as a function of m, for a number of values of τ, with N = 10,000 and f = 0.01. (b) The capacity of the soft-bound model for m = 1 to 30 when τ = νm is kept proportional to m, with N = 10,000, f = 0.01. The capacity hardly changes with m after m > 10, except for ν = 0.5. (c) The optimal capacity of the soft-bound model for N = 10,000, m = 1, 2, 5, 10, 15, and .003 ≤ f ≤ .1, optimized over ν = τ/m


3.2.2 Soft-bound model

Instead of truncating the efficacy at the boundaries, a soft-bound model allows κ+(w) and κ−(w) to vary with w, gradually vanishing at the boundaries:

κ+(w) = 1 − w, κ−(w) = w.

Assuming q− = τ f q+/(1 − f), the stationary distribution is Binomial(m, 1/(1 + τ)). The mean, variance, and initial SNR are listed in Table 2 in the Appendix. Since the initial SNR ≈ n τ q+^2/m decreases with m, for fixed τ the memory capacity must decrease to 0 for large m. Figure 4(a) confirms this observation. Moreover, this plot also shows that for fixed τ, the model capacity first increases and then decreases with m, and the optimal m seems to be proportional to τ.

Indeed, for fixed f, the model capacity is mostly affected by the ratio ν = τ/m, at least when m is large. In Fig. 4(b), the capacity plateaus as m increases while keeping the ratio ν constant. This is not surprising, since the binomial distribution is well approximated by the Poisson distribution with mean 1/ν when τ = νm for m large,

π_i ≈ e^{−1/ν} / (ν^i i!).

Moreover, the probability of depression from state i to i − 1 depends on ν only:

f(1 − f) q− × (i/m) = i f^2 q+ ν.

When m is large relative to i, the probability of potentiation from state i to i + 1 is scarcely affected by m:

f^2 q+ × (1 − i/m) ≈ f^2 q+.

For ν > 1, most of the mass of Poisson(1/ν) concentrates on the lowest few synaptic levels. Higher levels are rarely visited and hence not influential.
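As a quick numerical illustration of this Poisson limit (our own check, assuming scipy), the Binomial(m, 1/(1 + τ)) stationary law with τ = νm is already close to Poisson(1/ν) for moderate m:

import numpy as np
from scipy.stats import binom, poisson

m, nu = 30, 1.5
tau = nu * m
i = np.arange(6)
print(np.round(binom.pmf(i, m, 1 / (1 + tau)), 4))   # Binomial(m, 1/(1+tau)) probabilities
print(np.round(poisson.pmf(i, 1 / nu), 4))           # Poisson(1/nu) probabilities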

Figure 4(c) shows the optimal capacity for each coding level f and m = 1, 2, 5, 10, and 15, where ν is chosen to maximize the capacity. Note the optimal ν might be slightly different for each m. The case m = 1 reduces to the two-state model. For small f, the capacity of the optimal soft-bound model with ten levels is 10–25% better than that of the optimal two-state model; for larger f, the optimal soft-bound models are no better than the optimal two-state model.

In this model, the lowest level is most efficient in learning. Since κ+(0) = 1, all synapses at this level will be potentiated once the conditions for potentiation are met. After the lowest level, the second lowest level is the level where depression is least likely to happen, and hence the trace of learned patterns could be preserved longer. In an optimal soft-bound model, most synapses stay at these two levels for effective learning and slow forgetting. This explains the resemblance of the optimal soft-bound models and the two-state models.

3.3 Hidden-state models

In a hidden-state model, there may be multiple states corresponding to each level of efficacy. We consider the case of two levels of efficacy, W− and W+, only. Recall the capacity is scale invariant (Section 2.4). We can assume W− = 0, W+ = 1. Figure 5 depicts the structure of the four hidden-state models we consider. The black/gray arrows indicate the allowable transitions when LTP/LTD happens, and the numbers next to the arrows indicate the transition probabilities.

Fig. 5 Structure of the four hidden-state models: (A) Cascade, (B) Modified Cascade, (C) Serial, (D) Modified Serial. In all four models there are only two efficacy levels, w = 0 (depressed, gray circles) and w = 1 (potentiated, white circles), each with a number of hidden states. The black/gray arrows indicate the allowable transitions for LTP/LTD, and the numbers next to arrows indicate transition probabilities


3.3.1 Cascade model

The cascade model is proposed in Ben Dayan Rubin and Fusi (2007). There are m hidden states for both levels of efficacy, denoted as −1, −2, . . ., −m and 1, 2, . . ., m. If a synapse in the high/low level of efficacy is further potentiated/depressed, it does not change efficacy but becomes more resistant to efficacy change. The transition probabilities q_1 > q_2 > . . . > q_m and q_{−1} > q_{−2} > . . . > q_{−m} are usually an exponentially decreasing sequence, reflecting biochemical processes operating on multiple timescales.

In our notation, the efficacy vector is w = (0, 0, . . . , 0, 1, 1, . . . , 1). The transition matrices for potentiation and depression can respectively be written as I_{2m} + D+ and I_{2m} + D−, where, in block form (depressed states first, potentiated states second),

D+ = ( B−_m  C−_m ; 0  A+_m ),   D− = ( A−_m  0 ; C+_m  B+_m ).

Here A±_m is the m × m bidiagonal matrix with −p_{±1}, . . . , −p_{±(m−1)}, 0 on the diagonal and p_{±1}, . . . , p_{±(m−1)} on the superdiagonal; C±_m is the m × m matrix with (q_{±1}, . . . , q_{±m})^T in the first column and 0 elsewhere; and B±_m is the diagonal matrix with (−q_{±1}, . . . , −q_{±m}) on the diagonal.

Correlations among synapses, though negligible for sparsely coded patterns, could considerably inflate the noise level in retrieving densely coded patterns. A special learning rule to remove synaptic correlation is proposed in Ben Dayan Rubin and Fusi (2007): a synapse is potentiated when both the pre- and postsynaptic neurons are active or both are silent, with probability q+ and q_0 = f^2 q+/(1 − f)^2 respectively; otherwise, it is depressed with probability q− = τ f q+/(1 − f). In our notation, Q11 = I_{2m} + q+ D+, Q01 = Q10 = I_{2m} + q− D−, Q00 = I_{2m} + q_0 D+. Using this learning rule, the stationary synaptic covariance ρ defined in Eq. (14) will be 0 (see Appendix A.3 for a proof). Note that the non-stationary covariance ρ^(p) is still nonzero, but its size is reduced considerably.
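A minimal numerical construction of these cascade matrices, assuming numpy; the function name and the explicit state ordering are our own choices, following the block layout described above.

import numpy as np

def cascade_D_matrices(m, p_plus, q_plus, p_minus, q_minus):
    """D+ and D- of the cascade model (Fig. 5(A)), with state order
    (-1, ..., -m, 1, ..., m); p_*, q_* are length-m probability vectors
    (the last entry of p_* is unused, matching the zero last row of A)."""
    def A(p):     # bidiagonal block: cascade deepening within one efficacy level
        M = np.zeros((m, m))
        for i in range(m - 1):
            M[i, i], M[i, i + 1] = -p[i], p[i]
        return M
    def B(q):     # diagonal block: probability mass leaving each state
        return -np.diag(q)
    def C(q):     # jump to the shallowest state of the other efficacy level
        M = np.zeros((m, m))
        M[:, 0] = q
        return M
    Dp = np.block([[B(q_minus), C(q_minus)], [np.zeros((m, m)), A(p_plus)]])
    Dm = np.block([[A(p_minus), np.zeros((m, m))], [C(q_plus), B(q_plus)]])
    return Dp, Dm

# Parameters used in Fig. 6: q_i = p_i = 2**(-i) for i < m and q_m = 2**(1 - m).
m = 4
q = np.array([2.0 ** -(i + 1) for i in range(m - 1)] + [2.0 ** (1 - m)])
Dp, Dm = cascade_D_matrices(m, q, q, q, q)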

Figure 6 gives the capacity of the cascade model for q_{±i} = p_{±i} = 2^{−i}, for i = 1, . . . , m − 1, and q_{±m} = 2^{−m+1}, as in Ben Dayan Rubin and Fusi (2007). The capacity decreases rapidly as the number of hidden levels increases, and complex models have non-zero capacity only when the coding level is large (Fig. 6(a)). This phenomenon is still present even if the model is optimized over τ (Fig. 6(b)). As pointed out in Leibold and Kempter (2008), this is because the initial SNR of the cascade model decreases with the complexity of the synapses. However, as the low initial signal

μ^(1) − μ = Σ_{i=1}^{m} π_{−i} q_{−i}

is only due to the low potentiation probabilities q_{−i} of some depressed levels, this problem can be resolved by reducing the number of depressed states. The structure of the modified cascade model is depicted in Fig. 5(b): there is only one depressed state but m potentiated states. This modification preserves the multiple-timescale character of the cascade model.

Fig. 6 Memory capacity of the cascade and modified cascade models for N = 10,000, q_{±i} = p_{±i} = 2^{−i} for i = 1, . . . , m − 1, and q_{±m} = 2^{−m+1}. x-axis: coding level f; y-axis: memory capacity. (a) The original cascade model: when τ = 1, the capacity drops rapidly as the number of hidden states m increases. (b) Even when the model is optimized over τ, the capacity still decreases as m increases. (c) When the number of depressed states is reduced to 1, the capacity is greatly improved, but still decreases with the number of hidden states. (d) When optimized over τ, the optimal modified cascade model is as good as the optimal two-state model


In this case, w = (0, 1, 1, . . . , 1), Q11 = I_{m+1} + q+ D+, Q01 = Q10 = I_{m+1} + q− D−, Q00 = I_{m+1} + q_0 D+, where

D+ = ( −1  e_1 ; 0^T  A+_m ),   D− = ( 0  0 ; c_m  B+_m ).

Here e_1 = (1, 0, . . . , 0), c_m = (q_1, . . . , q_m)^T, and 0 = (0, . . . , 0). The capacity is greatly improved, but still decreases with the number of hidden levels (Fig. 6(c)). This is because when τ = 1, the stationary distribution is uniform over all states (Table 2), so μ = m/(m + 1), which is high when m is large. As in the two-state model, when the fraction of potentiated synapses is high, the contrast between the selective and the background neurons must be low and hence retrieval is difficult. This problem is resolved when τ is optimized (Fig. 6(d)). However, synapses of all complexities are as good as the two-state synapses.

3.3.2 Serial-state model

Another hidden-state model, studied in Leibold and Kempter (2008), has a simpler structure where all synaptic states are connected serially and all transition probabilities equal one (Fig. 5(c)). As in the cascade case, we also study a modified serial-state model with only one depressed hidden state (Fig. 5(d)). Using the decorrelating learning rule, the Q matrices are of the same form as in the cascade model, but D+ and D− are replaced by the D+ and D− of the hard-bound model. A summary of the two models is given in Table 2 in the Appendix.

For small f, the serial-state model also suffers from the problem of a small initial SNR as m gets large (Table 2). The retrieval capacity soon drops to zero as m increases, even when optimized over τ (Fig. 7(ab)). On the other hand, the serial-state model does improve capacity beyond the two-state model for large f (Fig. 7(ab)).

Fig. 7 Memory capacity of the serial-state, modified serial-state, and non-decorrelated modified serial-state models for N = 10,000. x-axis: coding level f; y-axis: memory capacity. (ab) Serial-state model: whether τ = 1 or τ is optimized, the minimum coding level f at which the model has nonzero capacity increases with m, and complex models do improve the capacity when f is large. (cd) Modified serial-state model: when the number of depressed states is reduced to 1 and τ is optimized, the limitation on sparseness due to small initial SNR is removed. The capacity for m > 1 is uniformly better than the two-state model over all coding levels. (ef) When a non-decorrelating learning rule is used, the capacity decreases as m increases for τ = 1. The improvement in capacity for large coding levels disappears


[Fig. 8 panels: (a) Hard bound, τ = 1; (b) Hard bound, τ optimized; (c) Soft bound, τ = 1; (d) Soft bound, τ optimized. Curves for m = 1 up to m = 20.]

Fig. 8 Memory capacity of the hard-bound and soft-bound models for N = 10,000, using the synapse decorrelation learning rule. x-axis: coding level f; y-axis: memory capacity

On the other hand, the serial-state model does improve capacity beyond the two-state model for large f (Fig. 7(a, b)).

For the modified serial-state model, when optimized over τ (Fig. 7(c, d)), the limitation on sparseness due to the small initial SNR is completely removed. Moreover, the modified serial-state model with m > 1 is uniformly better than the two-state model over all coding levels (Fig. 7(d)), especially for large f, though not by as much as the original serial-state model, because the decay rate of the modified serial-state model is larger than that of the original one. On the other hand, the capacity saturates at a certain level as m increases: the optimal capacity for m = 10 is nearly the same as that for m = 5 (Fig. 7(d)).

The improvement in capacity for large f is primarily due to the synapse decorrelation learning rule of Ben Dayan Rubin and Fusi (2007). If synapses are potentiated only when both pre- and postsynaptic neurons are active, and depressed only when the presynaptic neuron is active and the postsynaptic neuron is not, as in the sequential model, the capacity of the modified serial-state model is still worse than that of the two-state model (Fig. 7(e, f)).

When the synaptic decorrelation learning rule is applied to the hard-bound and soft-bound models, both exhibit improvement beyond two-state models for large coding levels (Fig. 8), as do the serial-state and modified serial-state models. This is because the small initial SNR due to large m is compensated by a large f; the slow decay rate at large m then starts to take effect and improves the capacity. Without decorrelation, enlarging f does not enlarge the SNR, because the synaptic correlation increases linearly with f and inflates the noise level. On the other hand, although multi-state models and synaptic decorrelation do increase capacity for large coding levels, the capacity still decreases as f increases even when optimized over both τ and m. The capacity of a network of 10,000 neurons is less than 300 for f = 0.1, which is disappointingly low. Moreover, synaptic decorrelation depends precisely on the values of q+, q− and q0; any slight perturbation will ruin the result. Furthermore, synaptic decorrelation does not work for multi-level coding.

4 Simulations

To show the accuracy of the approximate retrieval probability in Section 2.4, we performed four simulations with fully connected networks of 5,000 neurons, for the hard-bounds model and the modified serial model with six synaptic states (m = 5), each at two different coding levels (Fig. 9).

[Fig. 9 panels: (A) Hard bound, m = 5, f = 0.05, τ = 4; (B) Hard bound, m = 5, f = 0.01, τ = 3; (C) Modified Serial, m = 5, f = 0.1, τ = 1.27; (D) Modified Serial, m = 5, f = 0.02, τ = 2.]

Fig. 9 x-axis: age of patterns; y-axis: probability of retrieval. Four simulations with networks of 5,000 neurons for the hard-bounds model (a, b) and the modified serial model (c, d) with six synaptic states (m = 5). The values of f and τ are marked on the figures. Jagged grey line: average number of successful retrievals in 50 runs for each age. Solid black line: the non-increasing line that best fits the jagged grey line (Robertson et al. 1988). Red dashed line: the approximate retrieval probability


The coding levels and τ are marked in Fig. 9. In each simulation the network is first trained with a sequence of random stimuli of the specified coding level. The dynamics are as described in Eq. (2). The criterion for retrieval is described in Section 2.4, with ε = 0.05.

The jagged gray lines in Fig. 9 show the average number of successful retrievals over 50 runs for each age of the patterns (age = 1 is the most recent). The dashed lines are the approximate retrieval probabilities of Eq. (21), which match the simulations quite well except in (A), where the approximation slightly overestimates the true retrieval probabilities. Recall that the approximation assumes independence of the fields; when the coding level gets larger, the correlation between neurons increases and the approximation becomes less accurate.

In the simulations the threshold is set with δ = 0.01; that is, we expect the number of non-selective neurons above the threshold to be about 1% of the number of selective neurons. The observed numbers of false positives, relative to the total number of selective neurons, are

Simulation        A       B       C        D
False positives   1–5%    3–9%    <0.2%    0.2–1.2%
Skewness          0.151   0.290   −0.044   0.001

higher than the preset value δ = 1% in simulations A and B. Note that the normal approximation to the distribution of the field is accurate when n is large. For the hard-bounds model (simulations A and B), the distribution of the field is right-skewed; the right tail is thus fatter than that of the normal distribution, and the simulated false-positive rate is higher than the preset value. For simulation C the field is left-skewed, so the simulated false-positive rate is less than the preset value. Here the skewness is calculated as follows, ignoring the synaptic correlation:

$$\frac{\left(\sum_i \pi_i w_i^3\right) - 3\mu\sigma^2 - \mu^3}{\sigma^3\sqrt{Nf}}.$$
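For reference, this correction is easy to evaluate numerically. The sketch below assumes the stationary distribution π, the efficacy vector w and the moments μ, σ are available as numpy objects; the function name and interface are ours, not from the paper:

```python
import numpy as np

def field_skewness(pi, w, mu, sigma, N, f):
    """Approximate skewness of a non-selective neuron's field, ignoring
    synaptic correlations: third central moment of one synaptic efficacy,
    normalized by sigma^3 and by the sqrt(N*f) summands of the field."""
    third_central = np.dot(pi, np.asarray(w) ** 3) - 3.0 * mu * sigma**2 - mu**3
    return third_central / (sigma**3 * np.sqrt(N * f))
```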

5 Extensions

5.1 Partially connected network

In reality, networks are never fully connected; rather, it is estimated that neurons receive on the order of several thousand synaptic connections onto their dendrites. For a partially connected network, the capacity depends heavily on the synaptic connectivity configuration, which is beyond the scope of this paper. Here we only study a special case, the randomly connected network, where any pair of neurons is connected at random with probability c. The field of a randomly connected network is

$$h^{(p)}_i = \sum_{j \neq i} W^{(p)}_{ij}\, C_{ij}\, \xi^{(1)}_j - \eta_c \sum_j \xi^{(1)}_j.$$

Here the $C_{ij}$ are independent Bernoulli(c) variables denoting the presence of the synapse from neuron j to neuron i. We assume the $C_{ij}$ remain fixed throughout learning and pattern retrieval. Given $\sum_j \xi^{(1)}_j = n$ and $\xi^{(1)}_i = x$, the conditional mean and variance of the field (Eq. (11)) become

$$M^{(p)}_x = n\left(c\,\mu^{(p)}_x - \eta_c\right),$$
$$\left(R^{(p)}_x\right)^2 = n\left[c\,(\sigma^{(p)}_x)^2 + c(1-c)\,(\mu^{(p)}_x)^2\right] + n(n-1)\,c^2 \rho^{(p)}_x.$$

The inhibition factor η in Eq. (17) is scaled to give $\eta_c = c\eta$, and the new threshold becomes

$$\theta_c = \frac{C\left(\sigma^2 + (1-c)\mu^2 - c\rho\right)}{\sqrt{\rho + \dfrac{\sigma^2 + (1-c)\mu^2 - c\rho}{Nfc}} + \sqrt{\rho}}. \qquad (24)$$
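As an illustration, the corrected moments and the threshold of Eq. (24) can be evaluated as below. This is only a sketch under our own naming, with C taken as the (1 − δf/(1 − f))-quantile of the standard normal, as in the fully connected case:

```python
import numpy as np
from scipy.stats import norm

def field_moments_partial(n, c, mu_x, sigma2_x, rho_x, eta_c):
    """Conditional mean and variance of the field at connectivity c."""
    M = n * (c * mu_x - eta_c)
    R2 = n * (c * sigma2_x + c * (1 - c) * mu_x**2) + n * (n - 1) * c**2 * rho_x
    return M, R2

def threshold_partial(delta, f, N, c, mu, sigma2, rho):
    """Threshold theta_c of Eq. (24)."""
    C = norm.ppf(1 - delta * f / (1 - f))
    s = sigma2 + (1 - c) * mu**2 - c * rho
    return C * s / (np.sqrt(rho + s / (N * f * c)) + np.sqrt(rho))
```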

Other parts of the analysis remain unchanged. To show the accuracy of the approximation above, a simulation was run with a network of N = 6,000 neurons and connectivity c = 0.4 (Fig. 10(f)). The simulation matches the approximation quite well.

A randomly connected network of size N with connectivity c is not equivalent to a fully connected network of size Nc. Comparing the SNR of the two cases,

$$\text{former:}\quad \frac{(\mu^{(p)}_1 - \mu)^2}{\rho + \left[\sigma^2 - \rho + (1-c)(\mu^2 + \rho)\right]/(nc)},$$
$$\text{latter:}\quad \frac{(\mu^{(p)}_1 - \mu)^2}{\rho + (\sigma^2 - \rho)/(nc)}.$$
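The two expressions are easy to compare numerically; a sketch with our own variable names (mu1p stands for $\mu^{(p)}_1$, and n = N·f):

```python
def snr_random(mu1p, mu, sigma2, rho, n, c):
    """SNR of a randomly connected network of size N, connectivity c."""
    return (mu1p - mu)**2 / (rho + (sigma2 - rho + (1 - c) * (mu**2 + rho)) / (n * c))

def snr_full(mu1p, mu, sigma2, rho, n, c):
    """SNR of a fully connected network of size N*c (about n*c selective inputs)."""
    return (mu1p - mu)**2 / (rho + (sigma2 - rho) / (n * c))
```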

The difference can be considerable when c is small. The additional variability in the field is induced by the random connectivity: the number of synapses feeding into a neuron is not always precisely Nc. When c decreases, the capacity also decreases, even when Nc is kept constant. Figure 10 compares the optimal capacity of a fully connected network of size 10,000 with that of a randomly connected network of size 100,000 and connectivity c = 0.1, for the five models of Section 3. The optimal capacity of the randomly connected network is about 40% less than that of the fully connected one. Note that the optimal τ for a randomly connected network is greater than that of a fully connected network.


[Fig. 10 panels: (a) Two-state; (b) Hard bounds; (c) Soft bounds; (d) Modified Cascade; (e) Modified Serial; (f) Simulation (hard bounds, m = 5, f = 0.02, c = 0.4, τ = 4). Panels (a)–(e) compare fully-connected and randomly-connected curves for several values of q+ or m.]

Fig. 10 (a–e) Comparison of the optimal capacity of a fully-connected network of size 10,000 and a randomly-connected network of size 100,000 with connectivity c = 0.1, for each f and m, for the five models considered in Section 3, where the capacity is optimized over τ. x-axis: coding level f; y-axis: memory capacity. The optimal capacity of a randomly-connected network is about 40% less than that of a fully connected network. Note that the optimal τ for a randomly-connected network is greater than that of a fully-connected network. Other behaviors of the two networks are similar. (f) Randomly connected network of N = 6,000 neurons for the hard bound model, m = 5, f = 0.02, c = 0.4, τ = 4. Jagged grey line: average number of successful retrievals in 50 runs for each age. Dashed line: approximate retrieval probabilities. x-axis: age of patterns; y-axis: probability of retrieval

Although a reduction in capacity is observed for randomly connected networks, our conclusions about the five models remain valid: the limitation on sparseness of multi-state models can be removed by adjusting τ, only a few synaptic states are relevant to improving capacity, and no dramatic improvement beyond two-state models is observed.

5.2 Multiple-level coding stimuli

The analysis above assumes all stimuli come with the same coding level f. This assumption does not appear realistic as a model for real-world inputs; it is reasonable to assume that different objects activate different numbers of features, i.e. neurons. Amit and Huang (2010) extended the analysis of the one-level coding setting to multi-level coding in the case of two-state synaptic models. Here we extend the result to multi-state models.

Since there is no specific preference for any of the inputs, we assume that the distribution of $\xi_j$, $j = 1, \ldots, N$, is exchangeable in j. The joint distribution then depends only on the number of selective neurons and not on the specific set of active neurons. It can be summarized in terms of the marginal probabilities

$$p_{m,n} = P(\text{the first } m \text{ neurons are } 1,\ \text{the next } n \text{ neurons are } 0), \qquad (25)$$

where m, n are nonnegative integers with m + n ≤ N. The most general form we will employ for the joint distribution of the features is

$$p_{m,n} = \int_0^1 f^m (1-f)^n\, \nu(df), \qquad (26)$$

where ν is some distribution on the unit interval. For example, if the stimuli come with k different coding levels $f_1, \ldots, f_k$, with weights $r_1, \ldots, r_k$, $\sum_{i=1}^k r_i = 1$, then

$$p_{m,n} = \sum_{i=1}^k r_i\, f_i^m (1 - f_i)^n.$$

The two transition matrices change to

$$P = p_{2,0}\,Q^{11} + p_{1,1}\,(Q^{10} + Q^{01}) + p_{0,2}\,Q^{00}, \qquad (27)$$

$$\begin{aligned} S ={}& p_{3,0}\,Q^{11}\otimes Q^{11} + p_{1,2}\,Q^{10}\otimes Q^{10} + p_{2,1}\,(Q^{11}\otimes Q^{10} + Q^{10}\otimes Q^{11}) \\ &+ p_{2,1}\,Q^{01}\otimes Q^{01} + p_{0,3}\,Q^{00}\otimes Q^{00} + p_{1,2}\,(Q^{01}\otimes Q^{00} + Q^{00}\otimes Q^{01}), \end{aligned} \qquad (28)$$

and the threshold becomes

$$\theta = \frac{C\,(\sigma^2 - \rho)}{\sqrt{\rho + \dfrac{\sigma^2 - \rho}{N p_{1,0}}} + \sqrt{\rho}}, \qquad (29)$$

where $C = C_{\delta, p_{1,0}}$ is the $\left(1 - \delta\, p_{1,0}/p_{0,1}\right)$-quantile of the standard normal. Other parts of the analysis remain unchanged.
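A sketch of how Eqs. (25)–(28) could be assembled for a discrete mixture of coding levels; Q11, Q10, Q01, Q00 stand for the single-synapse transition matrices of Section 2.1, and all names below are ours:

```python
import numpy as np

def p_mn(m, n, levels, weights):
    """p_{m,n} of Eq. (26) for a discrete mixture of coding levels."""
    levels, weights = np.asarray(levels, float), np.asarray(weights, float)
    return float(np.sum(weights * levels**m * (1 - levels)**n))

def transition_matrices_multilevel(Q11, Q10, Q01, Q00, levels, weights):
    """Single-synapse matrix P (Eq. (27)) and synapse-pair matrix S (Eq. (28))."""
    p = lambda m, n: p_mn(m, n, levels, weights)
    P = p(2, 0) * Q11 + p(1, 1) * (Q10 + Q01) + p(0, 2) * Q00
    S = (p(3, 0) * np.kron(Q11, Q11)
         + p(2, 1) * (np.kron(Q11, Q10) + np.kron(Q10, Q11) + np.kron(Q01, Q01))
         + p(1, 2) * (np.kron(Q10, Q10) + np.kron(Q01, Q00) + np.kron(Q00, Q01))
         + p(0, 3) * np.kron(Q00, Q00))
    return P, S
```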

5.3 Associative and feed-forward network

So far the networks we have considered are auto-associative networks (AAN) that store patterns one by one and require the self-sustainability of patterns under the network dynamics. There are also associative networks (AN) that store paired patterns (cue and target) and require activity in the cue pattern to evoke activity in the target pattern. The feed-forward network (FFN) also stores paired patterns, but in two layers, with the cue in one layer and the target in another, whereas in an AN both cue and target are in the same layer. The cue and target patterns may overlap in an AN but not in an FFN. Though the biological interpretations of AN and FFN are quite different, in terms of capacity analysis they are very similar.

For AN and FFN, the framework and methodology of Section 2.1 can be applied directly with slight revision. For an AN, suppose the learning rule of the synapse depends only on whether the presynaptic neuron is selective for the cue pattern (k′ = 0/1) and whether the postsynaptic neuron is selective for the target pattern (k = 0/1). The learning rule can then still be described by the four matrices $Q^{kk'}$, k, k′ = 0/1, of Section 2.1. The overlap of cue and target patterns does not pose any problem. For an FFN, the same rule applies. Assume the cue and target patterns have coding levels $f_c$ and $f_t$ respectively. The size of the AN is N; the sizes of the cue layer and target layer of the FFN are $N_c$ and $N_t$ respectively. Simply replace $P_0$ and $P_1$ in Eq. (3) by

$$P_1 = f_c\, Q^{11} + (1 - f_c)\, Q^{10}, \qquad P_0 = f_c\, Q^{01} + (1 - f_c)\, Q^{00}, \qquad (30)$$

and the two transition matrices P and S by

$$P = f_t\, P_1 + (1 - f_t)\, P_0, \qquad (31)$$
$$S = f_t\, P_1 \otimes P_1 + (1 - f_t)\, P_0 \otimes P_0. \qquad (32)$$

The means, variances, and covariances $\pi$, $\gamma$, $\mu$, $\sigma$, $\rho$, $\pi^{(p)}_x$, $\gamma^{(p)}_x$, $\mu^{(p)}_x$, $(\sigma^{(p)}_x)^2$, $\rho^{(p)}_x$, and the inhibition factor η can be calculated in exactly the same way as in Sections 2.2 and 2.3, except that n is replaced by $n_c = N_c f_c$, the size of the cue pattern, $C_{\delta,f}$ is replaced by $C_{\delta,f_t}$, and x = 0/1 depends on whether the neuron is in the target pattern. The only subtle difference between AN and FFN is the following: in the AN, if a neuron is in the cue pattern, the n in the formulas for $\mu^{(p)}_x$ and $(\sigma^{(p)}_x)^2$ is replaced by $n_c - 1$, not $n_c$. The threshold θ in Eq. (18) is replaced by

$$\theta = \frac{C\,(\sigma^2 - \rho)}{\sqrt{\rho + \dfrac{\sigma^2 - \rho}{N f_c}} + \sqrt{\rho}}.$$

As for the AAN, a quantitative description of the retrieval criterion for AN and FFN is required. Given ε and the target pattern size $n_t$, if the field induced by the cue pattern puts at least $(1 - \varepsilon)\, n_t$ target neurons above the threshold, we consider the retrieval successful.²

This criterion is simpler than that of the AAN, since the stability of the dynamics is not an issue and the probability of retrieval can be calculated directly. Replace Eq. (19) by

$$\Phi^{(p)}_{n_c,\varepsilon} = 1 - \Phi\!\left(\frac{\theta - n_c\,(\mu^{(p)}_1 - \eta)}{\sqrt{n_c\,(\sigma^{(p)}_1)^2 + n_c(n_c - 1)\,\rho^{(p)}_1}}\right), \qquad (33)$$

where Φ(·) is the standard normal distribution function,

Eq. (20) by

$$P^{(p)}_{n_t, n_c, \varepsilon} \approx \sum_{k \ge n_t(1-\varepsilon)} \binom{n_t}{k} \left(\Phi^{(p)}_{n_c,\varepsilon}\right)^k \left(1 - \Phi^{(p)}_{n_c,\varepsilon}\right)^{n_t - k}, \qquad (34)$$

and Eq. (21) by

$$P^{(p)}_{\varepsilon} = \sum_{n_c=0}^{N_c} \sum_{n_t=0}^{N_t} P^{(p)}_{n_t,n_c,\varepsilon}\, \binom{N_t}{n_t} f_t^{n_t} (1-f_t)^{N_t - n_t}\, \binom{N_c}{n_c} f_c^{n_c} (1-f_c)^{N_c - n_c}. \qquad (35)$$

²The number of non-selective neurons allowed above the threshold is not specified in the criterion, since it is already controlled by the threshold selection.


If ε = 0, $P^{(p)}_\varepsilon$ can be simplified to

$$P^{(p)}_{0} = \sum_{n_c=0}^{N_c} \left(f_t\, \Phi^{(p)}_{n_c,0} + 1 - f_t\right)^{N_t} \binom{N_c}{n_c} f_c^{n_c} (1-f_c)^{N_c - n_c}.$$

The capacity can still be defined as the expected number of retrievable paired patterns, as in Eq. (22).
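The sums in Eqs. (33)–(35) can be evaluated directly. The sketch below is a plain, unoptimized implementation with our own naming, assuming the age-dependent quantities θ, η, $\mu^{(p)}_1$, $(\sigma^{(p)}_1)^2$, $\rho^{(p)}_1$ have already been computed (the negligible $n_c = 0$ term is skipped):

```python
import numpy as np
from scipy.stats import binom, norm

def retrieval_prob_ffn(theta, eta, mu1, sigma1_sq, rho1, Nc, Nt, fc, ft, eps):
    """Approximate retrieval probability of a paired pattern, Eqs. (33)-(35)."""
    total = 0.0
    for nc in range(1, Nc + 1):                     # cue-pattern size
        mean = nc * (mu1 - eta)
        std = np.sqrt(nc * sigma1_sq + nc * (nc - 1) * rho1)
        phi = 1.0 - norm.cdf((theta - mean) / std)  # P(target field > theta), Eq. (33)
        w_nc = binom.pmf(nc, Nc, fc)
        for nt in range(Nt + 1):                    # target-pattern size
            k_min = int(np.ceil(nt * (1.0 - eps)))
            p_ok = binom.sf(k_min - 1, nt, phi)     # at least k_min targets retrieved, Eq. (34)
            total += p_ok * binom.pmf(nt, Nt, ft) * w_nc
    return total
```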

6 Discussion

Intuitively, the larger the synaptic state space, the more information the network should be able to store. Indeed, if the readout mechanism in the brain could access these synaptic states directly, capacity would increase with the number of states, as in Fusi and Abbott (2007). However, in terms of the concrete readout mechanism defined in this paper, the expected number of retrievable patterns, the capacity of the seven families of multi-state models considered above does not seem to be significantly better than that of two-state models. We note that the retrieval probability measure yields qualitatively similar results to the simpler SNR analysis, and in the case of feed-forward networks this simply follows from the Gaussian distribution of the fields on the output neurons (as in Leibold and Kempter 2008 and Barret and van Rossum 2008). In the recurrent setting, however, retrieval is a function of the network dynamics, and it is therefore not straightforward to connect the SNR to the retrieval probability. On the other hand, our simulations show that the proposed approximation of the retrieval probability is very accurate, and that it is closely related, although not identical, to the SNR analysis.

Among the multi-state models, those with optimal capacity are very similar to the two-state models in that the majority of the synapses concentrate at two synaptic states. The other states are rarely visited and have little impact on the learning process. Increasing the state space further, the network capacity either plateaus or decreases, and the improvement in retrieval capacity beyond the two-state model is small (20–30% at most). The main reason for this phenomenon is that each training pattern is presented only once, so that large capacity can be achieved only through fast learning. Slow-learning models will have little or no capacity since they cannot form memories from one-shot presentations. As the number of synaptic levels increases, either the change in efficacy decreases or fewer synapses are potentiated; learning becomes slower, leading to a loss in capacity. If the patterns are allowed to be presented repeatedly, slow-learning models can have larger capacity. In upcoming work we hope to extend the retrieval probability analysis to allow repetition in the presentation of the training patterns.

One of the motivations for the work in Ben Dayan Rubin and Fusi (2007) was the observation that certain brain regions exhibit relatively high coding levels. The implications for memory capacity with the original two-state models of Amit and Fusi (1994) were not encouraging: capacity is expected to be very low at high coding levels. Moreover, both Ben Dayan Rubin and Fusi (2007) and Leibold and Kempter (2008) remarked that the sparseness of stimuli must be restricted as the synapses grow more complex. However, these constraints are not as stringent as they appear. Our results show that when the ratio τ between the rates of depression and potentiation is properly adjusted, complex synapse models can have large capacity even when the stimuli are sparse (f < 0.01), and in fact capacity grows with sparseness for optimal τ. For a fully connected network of size 10,000, the capacity can reach 3,000 when f = 0.01, 8,000 when f = 0.005, and 12,000 when f = 0.003. In the seven models we considered, the capacity is around 30 for dense coding (f ≥ 0.1). For hard-bound and soft-bound models, increasing the number of synaptic states makes things worse when f is large. Only when the synapse decorrelation learning rule is used is the capacity improved to about 100, which is still very low. The conclusion is the same for partially connected networks, though the capacity is reduced.

Regarding the supposed problem of experimentally observed high coding levels, we note that in the hippocampus coding levels appear to be sparse (f = 0.01–0.04) for both granular (Barnes et al. 1990) and pyramidal cells (Jung and McNaughton 1993), and f = 0.03 in the medial temporal lobe for visual stimuli (Quiroga et al. 2005). In inferotemporal cortex, larger coding levels (0.2–0.3) are reported in response to visual stimuli (Rolls and Tovee 1995; Sato et al. 2007). However, since only neurons responding twice as much to faces as to non-faces are included in Rolls and Tovee (1995), i.e., only neurons likely to be selective to the visual stimuli, the true sparseness could be much lower. Moreover, the measure of sparseness used in Rolls and Tovee (1995) and Sato et al. (2007) creates a significant upwards bias, especially if the underlying coding levels are very low; this is detailed in Appendix A.4. Finally, from the perspective of energy consumption, Attwell and Laughlin (2001) and Lennie (2003) estimated that at any given moment only 2% of the population of cortical neurons can afford to be significantly active. This imposes another upper bound on the coding level of the mammalian cortex. See Olshausen and Field (2004) for a review of other advantages of sparse coding.


Appendix

A.1 Kronecker product

If A is an m × n matrix and B is a p × q matrix, then the Kronecker product A ⊗ B is the mp × nq block matrix

$$A \otimes B = \begin{bmatrix} a_{11} B & \cdots & a_{1n} B \\ \vdots & \ddots & \vdots \\ a_{m1} B & \cdots & a_{mn} B \end{bmatrix}.$$

The Kronecker product is bilinear and associative

– A ⊗ (B + C) = A ⊗ B + A ⊗ C,

– (A + B) ⊗ C = A ⊗ C + B ⊗ C,

– (kA) ⊗ B = A ⊗ (kB) = k(A ⊗ B),

– (A ⊗ B) ⊗ C = A ⊗ (B ⊗ C),

where A, B and C are matrices and k is a scalar. If A, B, C and D are matrices of such sizes that one can form the matrix products AC and BD, then

(A ⊗ B)(C ⊗ D) = AC ⊗ BD.

This is called the mixed-product property because it mixes the ordinary matrix product and the Kronecker product.
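The mixed-product property is easy to check numerically; a small illustration with numpy:

```python
import numpy as np

rng = np.random.default_rng(0)
A, B = rng.random((2, 3)), rng.random((4, 5))
C, D = rng.random((3, 2)), rng.random((5, 4))

lhs = np.kron(A, B) @ np.kron(C, D)   # (A x B)(C x D)
rhs = np.kron(A @ C, B @ D)           # AC x BD
assert np.allclose(lhs, rhs)
```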

A.2 Transition matrices and covariances

As the synaptic modification rule is local, the evolution of the state of a synapse is a Markov chain. Summing over the possible firing states of the pre- and postsynaptic neurons, the probability that a synapse transits from $\alpha_m$ to $\alpha_{m'}$ is

$$f^2 q^{11}_{mm'} + f(1-f)\, q^{10}_{mm'} + (1-f)f\, q^{01}_{mm'} + (1-f)^2 q^{00}_{mm'},$$

so the transition matrix can be written as

$$P = f^2 Q^{11} + f(1-f)\, Q^{10} + f(1-f)\, Q^{01} + (1-f)^2 Q^{00}.$$

Similarly, a pair of synapses with a common postsynaptic neuron, $(J^{(p)}_{ij}, J^{(p)}_{ik})$, is also a Markov chain, whose state space is the cross product $\{\alpha_1, \alpha_2, \ldots, \alpha_M\} \times \{\alpha_1, \alpha_2, \ldots, \alpha_M\}$. Again summing over the firing states of the three neurons i, j, k, the transition probability from $(\alpha_l, \alpha_m)$ to $(\alpha_{l'}, \alpha_{m'})$ is

$$f\,[f q^{11}_{ll'} + (1-f) q^{10}_{ll'}]\,[f q^{11}_{mm'} + (1-f) q^{10}_{mm'}] + (1-f)\,[f q^{01}_{ll'} + (1-f) q^{00}_{ll'}]\,[f q^{01}_{mm'} + (1-f) q^{00}_{mm'}],$$

so the transition matrix is

$$S = f\,[f Q^{11} + (1-f) Q^{10}] \otimes [f Q^{11} + (1-f) Q^{10}] + (1-f)\,[f Q^{01} + (1-f) Q^{00}] \otimes [f Q^{01} + (1-f) Q^{00}].$$
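In code, the two constructions amount to a few lines; the sketch below assumes the four Q matrices are given as M × M numpy arrays (function names are ours):

```python
import numpy as np

def single_synapse_P(Q11, Q10, Q01, Q00, f):
    """Transition matrix P of a single synapse at coding level f."""
    return f**2 * Q11 + f * (1 - f) * (Q10 + Q01) + (1 - f)**2 * Q00

def synapse_pair_S(Q11, Q10, Q01, Q00, f):
    """Transition matrix S of a synapse pair sharing a postsynaptic neuron."""
    P1 = f * Q11 + (1 - f) * Q10   # postsynaptic neuron active
    P0 = f * Q01 + (1 - f) * Q00   # postsynaptic neuron inactive
    return f * np.kron(P1, P1) + (1 - f) * np.kron(P0, P0)
```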

Suppose the network is initialized at its stationary state: $\pi^{(0)}_x = \pi$ and $\gamma^{(0)}_x = \gamma$. To obtain the mean and variance after the first step of learning (Eq. (6)), given $\xi^{(1)}_i = x$: since the presynaptic neurons are on, $\xi^{(1)}_j = \xi^{(1)}_k = 1$, the transition probability for $J^{(p)}_{ij}$ from $\alpha_m$ to $\alpha_{m'}$ is $q^{x1}_{mm'}$, and the transition probability for $(J^{(p)}_{ij}, J^{(p)}_{ik})$ from $(\alpha_l, \alpha_m)$ to $(\alpha_{l'}, \alpha_{m'})$ is $q^{x1}_{ll'} q^{x1}_{mm'}$. This explains Eq. (6).

Since $J^{(p)}_{ij}$ is a vector of indicators, $E[J^{(p)}_{ij} \mid \xi^{(1)}_i = x,\ \xi^{(1)}_j = 1]$ and $E[J^{(p)}_{ij} \otimes J^{(p)}_{ik} \mid \xi^{(1)}_i = x,\ \xi^{(1)}_j = \xi^{(1)}_k = 1]$ are exactly the p-step distributions of the Markov chains, $\pi^{(p)}_x$ and $\gamma^{(p)}_x$, and Eq. (7) follows from the Kolmogorov equations. Equation (8) is straightforward since $W^{(p)}_{ij} = J^{(p)}_{ij} w^T$.

One can verify that $\mathrm{Var}(J^{(p)}_{ij} \mid \xi^{(1)}_i = x,\ \xi^{(1)}_j = 1) = \mathrm{Diag}(\pi^{(p)}_x) - \pi^{(p)}_x (\pi^{(p)}_x)^T$ and deduce Eq. (9), where $\mathrm{Diag}(\pi^{(p)}_x)$ is the diagonal matrix with $\pi^{(p)}_x$ on the diagonal. To obtain Eq. (10) we use

$$\mathrm{Cov}(J_{ij} w^T, J_{ik} w^T) = E[(J_{ij} w^T)^T J_{ik} w^T] - \mu^2 = w\, E[J_{ij}^T J_{ik}]\, w^T - \mu^2 = \gamma\, (w \otimes w)^T - \mu^2,$$

where the last equality comes from the mixed-product property of the Kronecker product (see Appendix A.1).

A.3 Decorrelating the synapses

If the four Q matrices are of the form $Q^{11} = I_{2m} + q_+ D^+$, $Q^{01} = Q^{10} = I_{2m} + q_- D^-$, $Q^{00} = I_{2m} + q_0 D^+$, where $q_0 = \dfrac{f^2 q_+}{(1-f)^2}$ and $q_- = \dfrac{\tau f q_+}{1-f}$, then

$$P_1 = I_{2m} + f q_+ (D^+ + \tau D^-),$$
$$P_0 = I_{2m} + \frac{f^2 q_+}{1-f}\,(D^+ + \tau D^-),$$
$$P = I + 2 f^2 q_+ (D^+ + \tau D^-).$$

If π is the stationary distribution of P, i.e. $\pi P = \pi$, then $\pi(D^+ + \tau D^-) = 0$, which implies $\pi P_1 = \pi P_0 = \pi$. Hence $\gamma = \pi \otimes \pi$ is the stationary distribution of S, since

$$(\pi \otimes \pi) S = (\pi \otimes \pi)\,[f\, P_1 \otimes P_1 + (1-f)\, P_0 \otimes P_0] = f\,(\pi P_1 \otimes \pi P_1) + (1-f)\,(\pi P_0 \otimes \pi P_0) = f\,(\pi \otimes \pi) + (1-f)\,(\pi \otimes \pi) = \pi \otimes \pi.$$

Thus the stationary synaptic covariance ρ defined in Eq. (14) must be 0.

This learning rule works only for single-level coded stimuli, not for multi-level coded stimuli.
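The cancellation can be checked numerically. Below is a minimal sketch for a binary synapse, where the generators D+ = [[-1, 1], [0, 0]] and D− = [[0, 0], [1, -1]] are our own illustrative choice (not taken from the paper); the check itself holds for any generators, since πP = π forces π(D+ + τD−) = 0:

```python
import numpy as np

q_plus, tau, f = 0.6, 2.0, 0.05
D_plus  = np.array([[-1.0, 1.0], [0.0, 0.0]])   # illustrative potentiation generator
D_minus = np.array([[0.0, 0.0], [1.0, -1.0]])   # illustrative depression generator

q0 = f**2 * q_plus / (1 - f)**2
qm = tau * f * q_plus / (1 - f)

I = np.eye(2)
Q11, Q00 = I + q_plus * D_plus, I + q0 * D_plus
Q10 = Q01 = I + qm * D_minus

P1 = f * Q11 + (1 - f) * Q10
P0 = f * Q01 + (1 - f) * Q00
P = f * P1 + (1 - f) * P0

# stationary distribution of P (left eigenvector for eigenvalue 1)
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()

# pi is stationary for P1 and P0 as well, so gamma = pi (x) pi and rho = 0
assert np.allclose(pi @ P1, pi) and np.allclose(pi @ P0, pi)
```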


Table 2 Summary of properties of different multi-state models

Hard bound. States $\{0, 1, 2, \ldots, m\}$, $w = \left(0, \tfrac{1}{m}, \ldots, \tfrac{m}{m}\right)$.
τ = 1: $\pi = \tfrac{1}{m+1}(1, \ldots, 1)$; $\mu = 0.5$; $\sigma^2 = \tfrac{1}{12} + \tfrac{1}{6m}$; $\mu^{(1)}_1 - \mu$ (a) $= \tfrac{q_+}{m+1}$; initial SNR (b) $\approx \tfrac{12\, n q_+^2}{(m+1)^2}$.
τ > 1: $\pi = \tfrac{\tau - 1}{\tau - \tau^{-m}}\left(1, \tfrac{1}{\tau}, \ldots, \tfrac{1}{\tau^m}\right)$; $\mu = \tfrac{1}{m}\left(\tfrac{1}{\tau-1} - \tfrac{m+1}{\tau^{m+1}-1}\right)$; $\sigma^2 \approx \tfrac{\tau}{m^2(\tau-1)^2}$; $\mu^{(1)}_1 - \mu \approx \tfrac{q_+}{m}$; initial SNR $\approx \tfrac{n q_+^2 (\tau-1)^2}{\tau}$.
τ < 1: $\mu^{(1)}_1 - \mu \approx \tfrac{q_+ \tau}{m}$; initial SNR $\approx n q_+^2\, \tau (\tau-1)^2$.

Soft bound. States $\{0, 1, 2, \ldots, m\}$, $w = \left(0, \tfrac{1}{m}, \ldots, \tfrac{m}{m}\right)$. $\pi = \mathrm{Binomial}\!\left(m, \tfrac{1}{1+\tau}\right)$, i.e. $\pi_i = \binom{m}{i}\left(\tfrac{1}{1+\tau}\right)^i\left(\tfrac{\tau}{1+\tau}\right)^{m-i}$, $0 \le i \le m$; $\mu = \tfrac{1}{1+\tau}$; $\sigma^2 = \tfrac{\tau}{m(1+\tau)^2}$; $\mu^{(1)}_1 - \mu \approx \tfrac{\tau q_+}{m(1+\tau)}$; initial SNR $\approx \tfrac{n \tau q_+^2}{m}$.

Cascade. States $\{-1, -2, \ldots, -m, 1, 2, \ldots, m\}$, $w = (0, 0, \ldots, 0, 1, 1, \ldots, 1)$.
τ = 1: $\pi = \tfrac{1}{2m}(1, \ldots, 1)$; $\mu = 0.5$; $\sigma^2 = \mu(1-\mu)$; $\mu^{(1)}_1 - \mu = \tfrac{q_+}{2m}$; initial SNR $= n q_+^2/m^2$.
τ > 1: $\pi = (\pi_{-1}, \ldots, \pi_{-m}, \pi_1, \ldots, \pi_m)$ (*); $\mu^{(1)}_1 - \mu = \tfrac{\pi_1 q_+ (1+\tau)}{2}$; initial SNR $\approx \tfrac{n q_+^2 (\tau^2 - 1)\,\pi_1}{4}$.

Modified cascade. States $\{-1, 1, 2, \ldots, m\}$, $w = (0, 1, 1, \ldots, 1)$.
τ = 1: $\pi = \tfrac{1}{m+1}(1, \ldots, 1)$; $\mu = \tfrac{m}{1+m}$; $\sigma^2 = \mu(1-\mu)$; $\mu^{(1)}_1 - \mu = \tfrac{q_+}{m+1}$; initial SNR $= n q_+^2/m$.
τ > 1: $\pi = (\pi_0, \pi_1, \ldots, \pi_m)$ (**); $\mu = 1 - \pi_0$; $\mu^{(1)}_1 - \mu = q_+ \pi_0$; initial SNR $\approx \tfrac{n q_+^2(\tau - 1)}{2}$.
τ < 1: initial SNR $\approx n q_+^2\, \pi_0$.

Serial. States $\{-1, -2, \ldots, -m, 1, 2, \ldots, m\}$, $w = (0, 0, \ldots, 0, 1, 1, \ldots, 1)$.
τ = 1: $\pi = \tfrac{1}{2m}(1, \ldots, 1)$; $\mu = 0.5$; $\sigma^2 = \mu(1-\mu)$; $\mu^{(1)}_1 - \mu = \tfrac{q_+}{2m}$; initial SNR $= n q_+^2/m^2$.
τ ≠ 1: $\pi = \tfrac{\tau-1}{1-\tau^{-2m}}\left(\tau^{-m}, \ldots, \tau^{-1}, \tau^{-m-1}, \ldots, \tau^{-2m}\right)$; $\mu = \tfrac{1}{1+\tau^m}$; $\mu^{(1)}_1 - \mu = \tfrac{q_+(\tau-1)\,\tau^{-m}}{1-\tau^{-2m}}$; initial SNR $= \tfrac{n q_+^2 (\tau-1)^2\, \tau^{-m}}{(1-\tau^{-m})^2}$.

Modified serial. States $\{-1, 1, 2, \ldots, m\}$, $w = (0, 1, 1, \ldots, 1)$.
τ = 1: $\pi = \tfrac{1}{m+1}(1, \ldots, 1)$; $\mu = \tfrac{m}{1+m}$; $\sigma^2 = \mu(1-\mu)$; $\mu^{(1)}_1 - \mu = \tfrac{q_+}{m+1}$; initial SNR $= n q_+^2/m$.
τ ≠ 1: $\pi = \tfrac{\tau-1}{\tau - \tau^{-m}}\left(1, \tfrac{1}{\tau}, \ldots, \tfrac{1}{\tau^m}\right)$; $\mu = \tfrac{\tau^m - 1}{\tau^{m+1} - 1}$; $\mu^{(1)}_1 - \mu = q_+(1 - \mu)$; initial SNR $= \tfrac{n q_+^2(\tau-1)}{1 - \tau^{-m}}$.

(a) $\mu^{(1)}_1 = \pi Q^{11} w^T$. (b) The initial SNR $= n\,(\mu^{(1)}_1 - \mu)^2/\sigma^2$, where the synaptic covariance is ignored. (*) For the cascade model: $\pi_1 = \pi_{-1} = \tfrac{\tau-1}{\tau+1}\left(\tau r_2^{m-2} - r_1^{m-2}/\tau\right)^{-1}$, $\pi_i = \pi_1 r_1^{i-1}$ and $\pi_{-i} = \pi_1 r_2^{i-1}$ for $1 \le i < m$, $\pi_m = \pi_1 r_1^{m-2}/\tau$, $\pi_{-m} = \pi_1 r_2^{m-2}$, and $\mu = (1 - r_1^{m-2})\left(\tau r_2^{m-2} - r_1^{m-2}/\tau\right)^{-1}$. Here $r_1 = 2/(1+\tau)$, $r_2 = 2\tau/(1+\tau)$. (**) For the modified cascade model: $\pi_0 = \tfrac{\tau-1}{\tau+1}\left(1 - r^{m-1}/\tau\right)^{-1}$, $\pi_i = \pi_0 r^i$ for $1 \le i < m$, $\pi_m = \pi_0 r^{m-1}/\tau$. Here $r = 2/(1+\tau)$.


[Fig. 11: curves of a versus f for L = 10, L = 30, and L = ∞.]

Fig. 11 The sparseness measure a (Eq. (36)) versus the actual coding level f for L = 10 and 30

A.4 Bias in sparseness measurement

In Rolls and Tovee (1995) and Sato et al. (2007) the coding level is defined as

$$a = \frac{\left(\sum_{i=1}^{n} r_i / n\right)^2}{\sum_{i=1}^{n} r_i^2 / n}, \qquad (36)$$

where $r_i$ is the firing rate of the neuron to the i-th stimulus in a set of n stimuli. Say the baseline firing rate of the neuron is r, that the firing rate is Lr (L > 1) if the neuron is selective to the stimulus, and that the true sparseness is f. Then on average

$$\sum_i r_i / n \approx f L r + (1-f)\, r, \qquad \sum_i r_i^2 / n \approx f L^2 r^2 + (1-f)\, r^2,$$

$$a \approx \frac{(f L + 1 - f)^2}{f L^2 + 1 - f} > f.$$

Though a → f as L → ∞, when f → 0 we have a ≈ 1 − f. The relationship between a and f is not monotone, and when L = 30, a is always above 0.1 regardless of f (Fig. 11). If the baseline firing rate is subtracted from each $r_i$, then L becomes larger and a moves closer to f.
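A few lines of code reproduce the bias under the two-level rate model of the text (a sketch; the function name is ours):

```python
def sparseness_measure(f, L, r=1.0):
    """Rolls-Tovee measure a (Eq. (36)) when a fraction f of stimuli drive
    the neuron at rate L*r and the rest at the baseline rate r."""
    mean_r = f * L * r + (1 - f) * r
    mean_r2 = f * (L * r)**2 + (1 - f) * r**2
    return mean_r**2 / mean_r2

for f in (0.003, 0.01, 0.05, 0.1, 0.3):
    print(f, round(sparseness_measure(f, 10), 3), round(sparseness_measure(f, 30), 3))
```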

References

Abraham, W. C., & Bear, M. F. (1996). Metaplasticity: The plasticity of synaptic plasticity. Trends in Neurosciences, 19(4), 126–130. doi:10.1016/S0166-2236(96)80018-X.

Amit, D. J., & Brunel, N. (1997a). Dynamics of recurrent network of spiking neurons before and following learning. Network, 8, 373–404.

Amit, D. J., & Brunel, N. (1997b). Model of global spontaneous activity and local structured activity during delay periods in the cerebral cortex. Cerebral Cortex, 7, 237–252.

Amit, D. J., & Fusi, S. (1994). Learning in neural networks with material synapses. Neural Computation, 6, 957–982.

Amit, D. J., & Mongillo, G. (2003). Selective delay activity in the cortex: Phenomena and interpretation. Cerebral Cortex, 13, 1139–1150.

Amit, Y., & Huang, Y. (2010). Precise capacity analysis in binary networks with multiple coding level inputs. Neural Computation, 22(3), 660–688. doi:10.1162/neco.2009.02-09-967.

Attwell, D., & Laughlin, S. B. (2001). An energy budget for signaling in the grey matter of the brain. Journal of Cerebral Blood Flow and Metabolism, 21(10), 1133–1145.

Barnes, C. A., McNaughton, B. L., Mizumori, S. J., Leonard, B. W., & Lin, L. H. (1990). Comparison of spatial and temporal characteristics of neuronal activity in sequential stages of hippocampal processing. Progress in Brain Research, 83, 287–300.

Barret, A. B., & van Rossum, M. C. (2008). Optimal learning rules for discrete synapses. PLoS Computational Biology, 4, 1–7.

Ben Dayan Rubin, D. D., & Fusi, S. (2007). Long memory lifetimes require complex synapses and limited sparseness. Frontiers in Computational Neuroscience, 1, 1–14.

Brunel, N. (2003). Dynamics and plasticity of stimulus-selective persistent activity in cortical network models. Cerebral Cortex, 13, 1151–1161.

Curti, E., Mongillo, G., La Camera, G., & Amit, D. J. (2004). Mean-field and capacity in realistic networks of spiking neurons storing sparsely coded random memories. Neural Computation, 16, 2597–2637.

Del Giudice, P., Fusi, S., & Mattia, M. (2003). Modelling the formation of working memory with networks of integrate-and-fire neurons connected by plastic synapses. Journal of Physiology Paris, 97(4–6), 659–681. doi:10.1016/j.jphysparis.2004.01.021.

Fusi, S., & Abbott, L. F. (2007). Limits on the memory storage capacity of bounded synapses. Nature Neuroscience, 10, 485–493.

Fusi, S., Drew, P. J., & Abbott, L. (2005). Cascade models of synaptically stored memories. Neuron, 45(4), 599–611. doi:10.1016/j.neuron.2005.02.001.

Fuster, J. (1995). Memory in the cerebral cortex: An empirical approach to neural networks in the human and nonhuman primate. Cambridge, MA: MIT Press.

Jung, M. W., & McNaughton, B. L. (1993). Spatial selectivity of unit activity in the hippocampal granular layer. Hippocampus, 3(0), 165–182.

Leibold, C., & Kempter, R. (2008). Sparseness constrains the prolongation of memory lifetime via synaptic metaplasticity. Cerebral Cortex, 18, 67–77.

Lennie, P. (2003). The cost of cortical computation. Current Biology, 13(6), 493–497. doi:10.1016/S0960-9822(03)00135-0.

Miyashita, Y., & Hayashi, T. (2000). Neural representation of visual objects: Encoding and top-down activation. Current Opinion in Neurobiology, 10(2), 187–194.

Olshausen, B. A., & Field, D. J. (2004). Sparse coding of sensory inputs. Current Opinion in Neurobiology, 14(4), 481–487. doi:10.1016/j.conb.2004.07.007.

Quiroga, R. Q., Reddy, L., Kreiman, G., Koch, C., & Fried, I. (2005). Invariant visual representation by single neurons in the human brain. Nature, 435, 1102–1107.

Robertson, T., Wright, F. T., & Dykstra, R. L. (1988). Order restricted statistical inference. Wiley.

Rolls, E. T., & Tovee, M. J. (1995). Sparseness of the neuronal representation of stimuli in the primate temporal visual cortex. Journal of Physiology, 73(2), 713–726.

Romani, S., Amit, D., & Amit, Y. (2008). Optimizing one-shot learning with binary synapses. Neural Computation, 20, 1928–1950.

Sato, T., Uchida, G., & Tanifuji, M. (2007). The nature of neuronal clustering in inferotemporal cortex of macaque monkey revealed by optical imaging and extracellular recording. In 34th Ann. meet. of soc. for neuroscience. San Diego, USA.

Wang, X. J. (2001). Synaptic reverberation underlying mnemonic persistent activity. Trends in Neurosciences, 24(8), 455–463. doi:10.1016/S0166-2236(00)01868-3.

Willshaw, D., Buneman, O. P., & Longuet-Higgins, H. (1969). Non-holographic associative memory. Nature (London), 222, 960–962.

