CS229 FINAL REPORT, FALL 2015 1
Neural Memory Networks
David Biggs and Andrew Nuttall
I. INTRODUCTION
CONTEXT is vital in formulating intelligent classifications and responses, especially under uncertainty. In a standard feed-forward neural network (FFNN), context comes in the form of information encoded in the input vector and trained into the weight parameters. However, useful information can also be present in the temporal nature of the input vectors, or in past internal states of a network. Future outputs can achieve better accuracy by observing transient trends in the input data, or by utilizing key memories from distant inputs which could be crucial to formulating a correct output. By providing a neural network with an architecture for storing and maintaining memories, this additional context can be effectively leveraged.
A simple implementation of memory in a neural network would be to write inputs to external memory and use this memory to supply additional, concatenated inputs to the network. For noisy analog inputs, memory reads drawn with Gaussian weights can act to preprocess and filter the data. Fig. 1 shows a schematic and the memory weight distribution of an FFNN with external memory augmentation.
Fig. 1: Architecture of an FFNN with a simple memory implementation. (a) Network schematic. (b) Weights of the memory matrix (weight vs. memory location).
This architecture was implemented to predict the next value in a noisy sinusoidal signal. Ten previous inputs were stored in memory, from which five additional inputs were drawn using Gaussian weights over the memory. The network was trained with a single sinusoid and then tested with a sum of three sinusoids of varying frequencies, with errors around 10% for training and 15% for testing. The convergence and input/output waveforms are shown in Fig. 2.
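A minimal sketch of this augmentation in Python/NumPy (the paper's experiments used MatLab; all names and the width of the Gaussian read weights here are illustrative): past inputs sit in a shift-register memory, and five extra inputs are produced as Gaussian-weighted averages over the memory locations.

```python
import numpy as np

MEM_SIZE = 10  # number of previous inputs retained
N_READS = 5    # extra inputs drawn from memory

# Hypothetical fixed Gaussian read weights: each read head is centered on a
# different memory location, so it averages nearby past inputs (a noise filter).
centers = np.linspace(0, MEM_SIZE - 1, N_READS)
locs = np.arange(MEM_SIZE)
read_w = np.exp(-0.5 * ((locs[None, :] - centers[:, None]) / 1.5) ** 2)
read_w /= read_w.sum(axis=1, keepdims=True)  # normalize each head's weights

memory = np.zeros(MEM_SIZE)

def augmented_input(x_t):
    """Shift x_t into memory and return [x_t, N_READS Gaussian memory reads]."""
    global memory
    memory = np.roll(memory, 1)  # age existing memories by one slot
    memory[0] = x_t              # newest input occupies slot 0
    reads = read_w @ memory      # each read: weighted sum over memory locations
    return np.concatenate(([x_t], reads))
```

Feeding a slowly varying signal, each read converges to a smoothed local average of recent inputs, which is the filtering effect described above.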
Network architectures with delayed inputs or additional inputs drawn from memory can be useful tools, but are limited in functionality since the memory architecture is predetermined. A more powerful memory architecture would store memory inside the network, allowing the network to learn how to utilize memory via read and write commands.
II. RELATED WORK
Neural networks were first introduced over 60 years ago as one of the first learning algorithms. Recurrent neural networks (RNNs) were then introduced in the 1980s to better process sequential inputs by maintaining an internal state between inputs. Long Short-Term Memory (LSTM) improved upon RNNs in the late 1990s by adding logic gates to read, write, and forget internal memory states. Recently, Weston et al. [1] defined a framework for neural networks to interact with external memory in order to read and write long-term memory. In their paper, Weston implements an RNN with memory for textual story processing, in which several actors move between rooms while carrying and depositing objects. Their results showed that, with memory, their network outperformed a similar RNN and an LSTM without memory access at answering questions about a story. Around the same time, work was published by Graves et al. [2] implementing an algorithm they call the Neural Turing Machine (NTM). The NTM has memory structures called memory heads that can be accessed in order to read, write, and erase data. They ran several tests on their NTM, one of which was teaching the network to store and then copy sequences of numbers. They trained their network on sequences of up to twenty 8-bit numbers and tested on sequences of up to 120 numbers with good results. This project differs from related work by implementing internal neural memory, assigning a set of memory nodes (a memory bank) to each standard neuron, as discussed in the following section. With this approach a richer memory can be achieved by not only maintaining key memories, but also retaining memory histories with associated temporal information.
III. METHODOLOGY
The neural memory network (NMN) architecture is comprised of an FFNN with a dedicated internal memory bank associated with each standard neuron in each of the hidden layers. The input and output layers are composed of only standard nodes, while the hidden layers are composed of standard nodes with their respective memory nodes. In this formulation, the output of each node is given by a sigmoid function of an affine combination of its inputs. The output of a node is used as a linear switch to activate the memories dedicated to that node. During forward propagation the neural network can learn to call upon these memories as needed to provide additional information in making classifications. Figure 3 depicts the general network layout for a simple memory neural network.
Training and testing of the algorithm is comprised of three key steps:
Fig. 2: Signal prediction results from the memory-augmented FFNN. (a) Error convergence (training and test error [%] vs. training points). (b) Input and output waveforms for the training and testing data.
Fig. 3: A neural memory network has input and output layers composed of standard sigmoid nodes and hidden layers composed of both standard and memory nodes. Each standard node in a hidden layer has a memory bank associated with it, and each memory bank is composed of multiple memory nodes. The outputs of standard nodes are used as switches to activate memories when called upon.
• Forward propagation of the input across each network layer to form the output
• Back propagation of the errors across each layer, starting at the output, to perform gradient descent updates on network parameters
• Memory storage, in each layer, of nearby neuron outputs as an orthogonal set in a higher dimensional space
A. Forward Propagation
Forward propagation in an NMN functions in a similar manner to that of an FFNN, with an augmented output vector from each hidden layer. The output of each layer i is given by a two-step equation containing outputs from both standard neurons and memory neurons,
$$\mathbf{x}^{(i)} = f\!\left(\mathbf{W}^{(i-1)}\mathbf{x}^{(i-1)} + \mathbf{b}^{(i-1)}\right) \qquad (1)$$

$$\mathbf{x}^{(i)} := \left[x_1^{(i)},\ x_1^{(i)}\mathbf{M}_1^{(i)\top},\ \ldots,\ x_k^{(i)},\ x_k^{(i)}\mathbf{M}_k^{(i)\top}\right]^{\top} \qquad (2)$$
The output state vector for each layer's standard nodes is augmented with the outputs of the layer's memory nodes. The output of each memory node is the product of the content of that memory slot and the output of the standard node it is attached to. With this approach the standard node acts as a linear switch to turn memories on or off as desired. During forward propagation in the hidden layers, the outputs of a hidden layer go only to standard nodes; memory nodes receive input only from their respective standard node in the same layer.
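Eqs. (1)–(2) can be sketched as follows in Python/NumPy (a toy illustration; the layer sizes and the memory matrix M are invented for the example):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nmn_forward_layer(x_prev, W, b, M):
    """One hidden-layer pass of an NMN, per Eqs. (1)-(2).

    x_prev -- (augmented) output vector of the previous layer
    W, b   -- weights and biases into this layer's k standard nodes
    M      -- (k, m) matrix; row j is the memory bank of standard node j
    Returns (standard outputs, augmented output vector).
    """
    x = sigmoid(W @ x_prev + b)   # Eq. (1): standard node outputs
    parts = []
    for j, xj in enumerate(x):
        parts.append([xj])        # the standard node's own output
        parts.append(xj * M[j])   # Eq. (2): the node gates its memory bank
    return x, np.concatenate(parts)
```

For k standard nodes with m memories each, the augmented vector has length k(1 + m), matching the interleaved form of Eq. (2).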
B. Backward Propagation and Gradient Descent
The network is trained by updating the parameters W and b using gradient descent with the back-propagation technique. In the training process, an input is forward propagated through the neural network to produce a hypothesis h(x) that is then compared to a known solution y. The error is calculated as
$$\varepsilon = \frac{1}{2}\left\|h\!\left(\mathbf{x}^{(\mathrm{output})}\right) - \mathbf{y}\right\|_2^2 \qquad (3)$$
Performing gradient descent on an NMN requires different considerations to be given to the set of first and last layers and the set of hidden layers, due to the network's asymmetry. To perform gradient descent on the weights, the partial derivatives with respect to $W_{jk}^{(i)}$ are taken as
$$\frac{\partial \varepsilon}{\partial W_{jk}^{(i)}} = \frac{\partial \varepsilon}{\partial h(x_j^{(i)})}\,\frac{\partial h(x_j^{(i)})}{\partial x_j^{(i)}}\,\frac{\partial x_j^{(i)}}{\partial W_{jk}^{(i)}} \qquad (4)$$
Evaluating this expression for the final layer I of weights yields

$$\frac{\partial \varepsilon}{\partial W_{jk}^{(I)}} = \left(h(x_j^{(I)}) - y\right) h(x_j^{(I)})\left(1 - h(x_j^{(I)})\right)\mathbf{x}^{(I-1)} \qquad (5)$$
Back propagation is implemented in the standard way: gradients are stored in the variable δ and passed to previous layers. This process starts at the output layer as
$$\boldsymbol{\delta}^{(\mathrm{output})} = \left(\mathbf{x}^{(\mathrm{output})} - \mathbf{y}\right)\mathbf{x}^{(\mathrm{output})}\left(1 - \mathbf{x}^{(\mathrm{output})}\right) \qquad (6)$$
The errors of the subsequent layers are weighted by the parameters

$$\boldsymbol{\delta}^{(i)} = \mathbf{W}^{(i+1)}\boldsymbol{\delta}^{(i+1)}\mathbf{x}^{(i+1)}\left(1 - \mathbf{x}^{(i+1)}\right) \qquad (7)$$
There are two paths back to the primary neurons, so the errors of each hidden layer must be summed across memory nodes

$$\delta_j^{(i)} = \delta_j^{(i)} + \sum_k \delta_{j+k}^{(i)} M_{j+k}^{(i)} \qquad (8)$$
The updates on W and b are then

$$\mathbf{W}^{(i)} := \mathbf{W}^{(i)} - \alpha\,\boldsymbol{\delta}^{(i)}\mathbf{x}^{(i+1)} \qquad (9)$$

$$\mathbf{b}^{(i)} := \mathbf{b}^{(i)} - \alpha\,\boldsymbol{\delta}^{(i)} \qquad (10)$$
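As a hedged illustration, the output-layer step of Eqs. (6), (9), and (10) might look like the following in NumPy. The outer product between δ and the previous layer's activations is the standard realization of the weight gradient; the paper's index convention in Eq. (9) may differ, and all names here are illustrative.

```python
import numpy as np

def output_layer_step(W, b, x_out, x_prev, y, alpha=0.1):
    """One gradient-descent update of the output layer (Eqs. (6), (9), (10)).

    x_out  -- sigmoid outputs of the output layer
    x_prev -- (augmented) outputs of the preceding layer
    y      -- target vector; alpha is the learning rate
    """
    delta = (x_out - y) * x_out * (1.0 - x_out)  # Eq. (6)
    W = W - alpha * np.outer(delta, x_prev)      # Eq. (9), outer-product form
    b = b - alpha * delta                        # Eq. (10)
    return W, b, delta
```

When the hypothesis already matches the target, δ vanishes and the parameters are left unchanged, as expected from Eq. (6).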
C. Memory Architecture
Every node in the hidden layers contains a static memory bank. Each unique memory in a node is orthogonal to all other memories in that node, such that no information is lost when the memories are added together. A collection of memories is then represented as a single vector in a higher dimensional space called a composite memory. The direction of the vector dictates what memories are present, and the magnitude of the vector when projected onto each memory's axis determines the order in which the memories occurred. When a node fires, the current composite vector stored in memory is scaled according to some scheduled decay, and then the new memory is added. The new memory is created by an orthogonal mapping of the outputs of the node and its τ nearest neighbors. Nearest neighbors can be defined in a spatial sense to be nodes in the same layer with close indices, or can also be defined to include nodes in previous and future layers. The memory bank update equation for each node is given by
$$\mathbf{M}_j^{(i)} := \gamma^{\lfloor x_j^{(i)} \rceil}\,\mathbf{M}_j^{(i)} + g\!\left(\left[x_{j-\tau/2}^{(i)},\ \ldots,\ x_j^{(i)},\ \ldots,\ x_{j+\tau/2}^{(i)}\right]\right) \qquad (11)$$

where $g(\mathbf{x}) : \mathbb{R}^{\tau+1} \to \mathbb{R}^{2^{\tau+1}}$ is a function mapping the outputs of the node and its τ nearest neighbors to $2^{\tau+1}$-dimensional space, and γ is the given decay schedule, which is turned on and off by the rounded output of the standard node. When a node fires, $\gamma^{\lfloor x_j^{(i)} \rceil}$ evaluates to γ; if it does not fire, it evaluates to 1.

During the training phase, memories are allowed to decay to zero quickly to provide the network the plasticity to converge to its desired orientation. During use, memories can be retained indefinitely, or according to any desired decay schedule. With an asymptotic memory decay schedule, transient information can be retained for the more recent memories; any remaining memories which have reached the asymptotic magnitude are still retained, but their time ordering can no longer be compared to that of other memories which have also reached the asymptotic decay magnitude.
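A sketch of the update in Eq. (11), with an assumed one-hot g(): the rounded activation pattern of the node and its neighbors indexes one of the orthogonal axes of the bank. Any orthogonal mapping would do; the one-hot choice, the names, and the decay value here are illustrative.

```python
import numpy as np

def update_memory_bank(M_j, x_window, gamma=0.9):
    """Composite-memory update for one node, per Eq. (11).

    M_j      -- composite memory vector, length 2**(tau + 1)
    x_window -- outputs of node j and its tau nearest neighbors
                (length tau + 1), with node j's own output at the center
    """
    x_j = x_window[len(x_window) // 2]
    decay = gamma ** int(np.round(x_j))    # gamma if the node fires, else 1
    bits = np.round(x_window).astype(int)  # activation pattern of the window
    slot = int("".join(map(str, bits)), 2) # pattern -> orthogonal axis index
    g = np.zeros(2 ** len(x_window))       # hypothetical one-hot g(.)
    g[slot] = 1.0
    return decay * M_j + g
```

With τ = 2 neighbors, the window has length 3 and the bank has 2³ = 8 orthogonal axes, which is the exponential scaling discussed in the Future Work section.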
IV. EXPERIMENTS/RESULTS
The NMN algorithm was implemented and then tested with several test data sets:
• Sorted Classification
• Classification and Recall
• Memory Recall
• Extended Recall

Fig. 4: Each standard node stores a composite memory in its dedicated memory bank. A composite memory is the sum of all of the orthogonal memories associated with that node, such that the original memories are retained by projecting the composite memory onto different dimensions. In the nearest neighbor approach, the activation permutations of a cell's nearest neighbors are mapped to a higher dimensional orthogonal memory and then added to the cell's composite memory.

Each of the test sets is meant to evaluate a different aspect of the algorithm's functionality. For comparison, a standard FFNN (an NMN with memory turned off) and an RNN available in the MatLab Neural Network Toolbox (layrecnet()) were also trained and evaluated with the test data sets. layrecnet() has one hidden layer with feedback and one output layer; a block diagram is shown in Fig. 6.
A. Sorted Classification
A data set was composed of points sampled from 10 randomly generated multivariate Gaussian distributions, with the goal being to classify which distribution a point was sampled from. Samples were drawn from the distributions in a stationary order. Due to overlap in the distributions, a standard feed-forward neural net was only able to achieve 4.5% classification error. By implementing memory, the neural network was able to identify that there was a pattern to the input points, reducing classification error to 0.7%, a more than sixfold improvement. An RNN with a single feedback delay was also tested on the data, with a classification error of 1.8%. Convergence results are shown in Fig. 7.
B. Classification and Recall
The task of this data set was to classify an input and also state the previous two classifications made, in the order they were made. A memory-less neural network can perform no better on this type of task than converging to the most frequent output. A memory-based neural network with the same parameters was able to achieve 7% classification error on the same data set. An RNN with a single feedback delay was also tested on the data, with a classification error of < 0.1%. Convergence results are shown in Fig. 8.
C. Memory Recall
The networks were given sequences of 5-bit numbers to write into memory, followed by empty input vectors of zeros
Fig. 5: Every time a standard node is activated, the memories associated with that node have their magnitudes decremented according to some decay schedule to allow for the retention of temporal information. Memories can either decay out of existence, or can be given some asymptotic decay. (Left: composite memory magnitude vs. node firings, with and without memory permanence. Right: magnitude of a single memory component after the memory occurs.)
Fig. 6: Block diagram of layrecnet(), a layered recurrent neural network function available in the MatLab Neural Network Toolbox.
Fig. 7: A comparison of error convergence between an FFNN, NMN, and RNN on a classification task demonstrates the ability of the NMN to leverage temporal patterns in the input data.
of the same series length to signify to the networks to read and output the stored sequences. The NMN algorithm was set with one internal memory layer to process the data. Both memory writing procedures were tested: external writing, which recorded the input layer upon a standard neuron firing, and nearest neighbor, which recorded nearby neuron outputs as outlined above. Convergence results for sequences of length two are shown in Fig. 9.
Fig. 8: A convergence plot for the classification and recall task, where outputs are given by the current classification and the previous two classifications.
Fig. 9: Convergence plot of the memory recall task for the various networks: RNN, NMN with an input write function, and NMN with nearest neighbor write function.
Fig. 10: Convergence plot for the extended recall task. In this data set, some classifications depend only on the current inputs, but some depend on inputs from 25 time units ago. A standard RNN is unable to perform those long-term classifications better than randomly guessing, while the NMN is able to achieve near perfect long-term recall.
D. Extended Recall
In the extended recall data set, the task was to perform classifications, with some classifications depending only upon the current input, and some depending upon information from 25 inputs prior. Both the RNN and the NMN were able to achieve perfect testing performance on the classification task where the outputs depended only upon the current input. However, in the case of outputs depending upon data from 25 inputs ago, the RNN was only able to achieve random-guess (66% error) levels of performance, while the NMN was able to achieve near perfect long-term recall. The NMN in this setup was composed of 2 hidden layers. Convergence results for all four error rates are shown in Fig. 10.
V. CONCLUSION
We have developed and demonstrated a feed-forward neural network algorithm that stores memory associated with standard hidden layer neurons. Neural memory is shown to give context to the algorithm when computing pattern recognition and time series data sets. Memories are stored with a higher dimensional mapping to ensure orthogonality, and memories are decremented upon recall to maintain a time history. The neural memory network was evaluated on several classification and recall data sets, against a similar feed-forward neural network without memory and a commercial recurrent neural network. The results show that although the neural memory network indeed learns to utilize memory in order to solve problems requiring context, in the current implementation it does not outperform the commercial recurrent neural network on most tasks. On the other hand, neural memory allows the network to excel at long term memory storage and recall, a functionality that escapes the standard recurrent neural network.
VI. FUTURE WORK
The algorithms described in this paper are only a first approach towards the NMN architecture. Future developments will depend upon un-linking the read and write operators, memory scaling issues, training schedules, and the general optimization techniques used to perform gradient descent.
In this implementation the NMN learns to recall information from memory as needed, but it is not explicitly trained on how and when to write information. To combat this, in our current implementation we have tied together the write and read commands, so that any potentially useful information is saved to memory in the appropriate memory slots as determined by the read command. However, to achieve more robust performance the read and write commands need to be learned separately and independently.
In the current nearest neighbor approach, the size of the memory banks scales according to $2^\tau$, which can cause major performance issues if τ becomes too large. This constraint is introduced to maintain the orthogonal nature of the memories, to avoid the problem of having to learn to reference specific memories. With orthogonal memories, a composite memory is referenced instead of a single component memory. Instead of this approach, the permutations of the nearest neighbors can be used to address a specific memory location in a node's memory bank through a binary-to-linear mapping, thus eliminating the orthogonality constraint and the $2^\tau$ scaling law. An example output of a memory node with 2 nearest neighbors with this approach could be given by
$$\begin{aligned} M_1 = {} & (1 - x_3)(1 - x_2)\,x_1 M_{1,1} \\ & + (1 - x_3)\,x_2\,x_1 M_{1,2} \\ & + x_3(1 - x_2)\,x_1 M_{1,3} \\ & + x_3\,x_2\,x_1 M_{1,4} \end{aligned} \qquad (12)$$
which is a differentiable expression for selecting a memory from a learned location. In this expression, $x_2$ and $x_3$ are the outputs of the two nearest neighbors. This expression could be further processed with some type of focusing technique, but this approach allows for linear memory scaling and does not impose an orthogonality constraint.
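Eq. (12) can be written out directly. A small NumPy sketch (the bank contents and names are illustrative): because no rounding is applied, the four addressing weights vary smoothly with the neighbor outputs, which is what makes the selection differentiable.

```python
import numpy as np

def read_addressed_memory(x1, x2, x3, bank):
    """Differentiable memory selection per Eq. (12).

    x1     -- output of the node itself (acts as the on/off switch)
    x2, x3 -- outputs of its two nearest neighbors, in [0, 1]
    bank   -- [M11, M12, M13, M14], the node's four memory slots
    """
    weights = np.array([
        (1 - x3) * (1 - x2),  # addresses M_{1,1}
        (1 - x3) * x2,        # addresses M_{1,2}
        x3 * (1 - x2),        # addresses M_{1,3}
        x3 * x2,              # addresses M_{1,4}
    ])
    return x1 * (weights @ np.asarray(bank, dtype=float))
```

With saturated neighbor outputs the read is exact (e.g. $x_2 = 0$, $x_3 = 1$ selects $M_{1,3}$); intermediate outputs blend the slots, which is where a focusing technique could sharpen the address.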
To allow for more accurate comparisons between the NMN and other off-the-shelf algorithms, the optimization techniques used to perform gradient descent on the NMN can be improved upon to allow for better convergence. The implementation of a dynamic step size achieved via a line search could prove to be very effective.
REFERENCES
[1] J. Weston, S. Chopra, and A. Bordes, "Memory Networks," arXiv preprint arXiv:1410.3916, 2014.
[2] A. Graves, G. Wayne, and I. Danihelka, "Neural Turing Machines," arXiv preprint arXiv:1410.5401, 2014.