Mapping of Neural Networks onto the Memory-Processor Integrated Architecture*
Youngsik Kim
Mi-Jung Noh
Tack-Don Han
Shin-Dug Kim†
† Corresponding Author
Dept. of Computer Science, Yonsei University
134, Shinchon-Dong, Seodaemun-Ku, Seoul 120-749, Korea
Tel.: +82-2-361-2715, Fax: +82-2-365-2579
Submitted to Neural Networks in March 1997
Revised in May 1998
* A preliminary version of this paper appeared in Proc. Int'l Conf. Neural Networks '97.
* This study was supported by the academic research fund of the Ministry of Education, Republic of Korea, through the Inter-University Semiconductor Research Center (ISRC 97-E-2022) at Seoul National University.
ABSTRACT
In this paper an effective memory-processor integrated architecture, called the memory-based processor array for artificial neural networks (MPAA), is proposed. The MPAA can be easily integrated into any host system via a memory interface. Specifically, the MPAA system provides an efficient mechanism for local memory accesses on either a row or a column basis, using hybrid row and column decoding, which suits the computation model of ANNs, i.e., the accessing and alignment patterns of matrix-by-vector operations. Mapping algorithms that implement the multilayer perceptron with backpropagation learning on the MPAA system are also provided. The proposed algorithms support both neuron and layer level parallelisms, which allows the MPAA system to operate the learning phase as well as the recall phase in a pipelined fashion. A performance evaluation is provided by a detailed comparison in terms of two metrics: the cost and the number of computation steps. The results show that the performance of the proposed architecture and algorithms is superior to that of previous approaches, such as one-dimensional single instruction multiple data (SIMD) arrays, two-dimensional SIMD arrays, systolic ring structures, and hypercube machines.

Key words: parallel processing, memory-processor integration, multilayer perceptron, backpropagation learning, algorithmic mapping.
1 Introduction
Artificial neural networks (ANNs) have been widely used in various applications such
as pattern classification, speech recognition, machine vision, optimization, matching, image
restoration, and so forth. Many algorithmic mapping techniques that implement ANNs on available parallel architectures, exploiting the inherent parallelism of ANNs, have been reported (El-Amawy & Kulasinghe, 1997; Ghosh & Hwang, 1989; Kumar, Shekhar, & Amin, 1994b; Kung & Hwang, 1989; Lin, Prasanna, & Przytula, 1991; Malluhi, Bayoumi, & Rao, 1995; Nordstrom & Svensson, 1992; Singer, 1990; Svensson & Nordstrom, 1990; Wah & Chu, 1990).
A number of algorithms mapped onto various architectures were surveyed in (Nordstrom & Svensson, 1992). Examples of algorithmic mapping schemes are the implementation of ANNs on two-dimensional single instruction multiple data (SIMD) arrays (Lin, Prasanna, & Przytula, 1991; Singer, 1990), one-dimensional SIMD arrays (Svensson & Nordstrom, 1990), cascaded systolic ring arrays (Kung & Hwang, 1989), hypercube architectures (Kumar, Shekhar, & Amin, 1994b; Malluhi, Bayoumi, & Rao, 1995), multicomputers (Ghosh & Hwang, 1989; Wah & Chu, 1990), and multiple bus systems (El-Amawy & Kulasinghe, 1997). The mapping algorithms proposed in (Lin, Prasanna, & Przytula, 1991) are efficient for a network topology as long as the interconnections among neurons are sparse; however, this scheme needs a large number of processors ($O(N^2)$, where $N$ is the number of neurons at the largest layer). In order to improve the inefficient inter-processor communications of the one-dimensional SIMD array, adder tree hardware was proposed in (Svensson & Nordstrom, 1990). A mapping technique (Malluhi, Bayoumi, & Rao, 1995) on hypercube architectures can take the optimal number of computation steps ($O(\log_2 N)$, where $N$ is the number of neurons at the largest layer), but at the expense of $4N^2$ processors. In (Kumar, Shekhar, & Amin, 1994b), a mapping technique called checkerboarding on hypercube and related architectures was proposed; checkerboarding can avoid the all-to-all broadcast operation. A mapping scheme (El-Amawy & Kulasinghe, 1997) implemented on multiple bus systems has a relative merit over the checkerboarding scheme. However, the processors in the multiple bus systems must support the complex
communication of the dynamic interconnection structure. An analytical model assessing the performance of ANNs implemented on linear arrays was presented in (Naylor & Jones, 1994). This paper provides a further consideration of four typical mapping schemes, on two-dimensional SIMD arrays (Singer, 1990), one-dimensional SIMD arrays (Svensson & Nordstrom, 1990), cascaded systolic ring arrays (Kung & Hwang, 1989), and hypercube architectures (Malluhi, Bayoumi, & Rao, 1995), in Section 4.1.
Because current memory technology can support gigabit DRAMs, a single memory chip will be able to cover the memory volume needed by future computer systems. A number of studies (Aimoto et al., 1996; Elliott, Snelgrove, & Stumm, 1992; Gokhale, Holmes, & Iobst, 1995; Inoue, Nakamura, & Kawai, 1995; Kogge, 1994; Shimizu et al., 1996; Yamashita et al., 1994) on memory-logic integration have exploited both the high internal memory bandwidth and the available chip density. For computer graphics, a large amount of DRAM and a small number of logic circuits are integrated into a 3-D DRAM chip (Inoue, Nakamura, & Kawai, 1995). A processor-memory integration on a chip (Shimizu et al., 1996) and multiple instruction stream multiple data stream (MIMD) multiprocessors with on-chip local memories (Kogge, 1994) were proposed in order to overcome the low bandwidth to local memory. Also, memory-processor integrated arrays that integrate SIMD processors and their local memories within a chip have been proposed, such as computational RAM (C-RAM) (Elliott, Snelgrove, & Stumm, 1992), the integrated memory array processor (IMAP) (Yamashita et al., 1994), processing in memory (PIM) (Gokhale, Holmes, & Iobst, 1995), and the parallel image processing RAM (PIP-RAM) (Aimoto et al., 1996). However, an algorithmic study of ANNs has not been applied to the aforementioned memory-processor integrated architectures.
In this paper, a memory-processor integrated architecture efficiently supporting the computation model of ANNs, called the memory-based processor array for ANNs (MPAA), is proposed. Parallel algorithms for the multilayer perceptron with backpropagation learning are also mapped onto the MPAA system. The proposed architecture and its associated algorithms show several advantages. First, previous architectures and algorithms providing synaptic weight level parallelism, e.g., two-dimensional SIMD arrays (Lin, Prasanna, & Przytula, 1991; Singer,
1990) and the hypercube MIMD architecture (Malluhi, Bayoumi, & Rao, 1995), often need a large number of processors, whereas the proposed architecture and algorithms, practically supporting neuron and layer level parallelisms, need a moderate number of processors ($O(N)$, where $N$ is the number of neurons at the largest layer). Second, in the MPAA system, any interaction involving programs and data between the host processor and the MPAA system can be resolved by means of simple memory reads and writes. Also, in any given execution cycle, each processing unit (PU) of the MPAA system can execute a single instruction with an operand fetch in an overlapped fashion, making effective use of the high bandwidth provided by the local memory.
Third, the MPAA system provides an efficient mechanism for memory accesses on a row or column basis, which is suitable for the computation model of ANNs. Because the basic computation of ANNs can be represented by a series of matrix-by-vector and transposed matrix-by-vector multiplications, where the matrices contain the synaptic weights and the vectors contain activation or error values, architectures for ANNs have to support the accessing and alignment patterns of matrix-by-vector operations (Nordstrom & Svensson, 1992). The memory-processor integrated arrays in (Aimoto et al., 1996; Elliott, Snelgrove, & Stumm, 1992; Gokhale, Holmes, & Iobst, 1995; Yamashita et al., 1994) can access their memories only on a row basis, but the MPAA system can access its memory on a row and/or column basis by using hybrid row and column decoding. This capability replaces certain patterns of inter-PU communication with simple memory reads and writes. Therefore, the MPAA system efficiently supports matrix-by-vector multiplications without any inter-PU communication and provides a new computation method for various linear algebra applications as well as ANN computations.
Fourth, the proposed algorithms adopt both neuron and layer level parallelisms by using the architectural features of the MPAA. In the MPAA, pattern level pipelining can be applied to the learning phase as well as the recall phase. Thus, the number of computation steps for ANNs on the MPAA system is small compared with those of other approaches, except for the hypercube architectures. In general, the number of computation steps alone does not provide meaningful and fair information for comparing various architectures with their corresponding
[Figure 1: The MPAA system architecture. (a) The conceptual memory structure: the host processor (HP), its memory module (HM), and the MPAA share the system bus within a single address space; the MPAA consists of an interface unit (IU, containing the interface logic IL, the interface processor IP, and the IP memory IM) and an array of processing units (PUs) with their PU memories (PMs), which together with the IM form the shared memory (SM). (b) The interface unit: the IL passes addresses, data, and control between the system bus, the IP/IM, and the PUs via the PU address bus (PUAB) and the PU command bus (PUCB).]
algorithms. In this paper, in order to perform a fair comparison, a cost function, defined as the product of the number of computation steps and the number of processors, is used. Thus, the performance of the ANN algorithms mapped onto the MPAA is compared in detail with that of four typical schemes (Kung & Hwang, 1989; Malluhi, Bayoumi, & Rao, 1995; Singer, 1990; Svensson & Nordstrom, 1990) in terms of the cost as well as the number of computation steps. The MPAA system can reduce the cost of the other architectures with their corresponding algorithms (Kung & Hwang, 1989; Malluhi, Bayoumi, & Rao, 1995; Singer, 1990; Svensson & Nordstrom, 1990) by about 24.81%-98.49%.
In the following section, the MPAA system is described. In Section 3, the algorithms
of the multilayer perceptron with backpropagation learning mapped onto the MPAA system
are proposed. In Section 4, the MPAA system with the proposed algorithms is compared with
previous architectures and algorithms. Finally, Section 5 provides a conclusion.
2 The MPAA System Architecture
This section describes the architectural features of the MPAA system. The design issues in constructing the MPAA system are also presented. An effective interfacing mechanism with a host system is designed as the basic building block of a complete system construction. Also, the structure of the memory decoding logic configured over the PUs is designed.
2.1 Overview of the MPAA System Architecture
The MPAA system is designed to overcome some inefficient mechanisms of conventional SIMD machines in performing ANN applications. Specifically, the design objectives of the MPAA system are: 1) it should be easily integrated into any host system, from small personal computers to multiprocessors; 2) it should impose minimum interaction overhead between the MPAA and the host; 3) it should present a transparent structure to programmers within the conventional programming model; 4) it should allow multiple programs to be run in a time-multiplexed fashion without any program or data reloading; and 5) it should be constructed to utilize the inherent bandwidth given by the memory structure, eventually as a memory-based processor array suitable for the computation model of ANNs.
The overall system structure consists of a host processor (HP), the HP memory module (HM), a system bus, and an MPAA system, as shown in Figure 1 (a). Thus, the MPAA system can be interfaced to any host system via its system bus. The MPAA system is constructed as an interface unit (IU) and an array of processing units (PUs) with their associated PU memory modules (PMs), as in Figure 1 (a). The IU consists of interface logic (IL), an interface processor (IP), and an IP memory module (IM) as its associated memory, as shown in Figure 1 (b). In this system approach, the system memory is physically divided into two modules, i.e., the HM and the shared memory (SM), as shown in Figure 1 (a). Here, the HM is the main memory dedicated to the HP, and the SM is the set of PMs and the IM shared between the HP and the PUs. In other words, the SM can be accessed by either the HP or the PUs exclusively; thus, the SM is constructed as a dual-ported memory structure. From the viewpoint of the HP, the SM forms a portion of the HP's single linear address space. However, from the viewpoint of each PU, the SM is divided into the independent PMs associated with each PU and the IM associated with the IP. Thus, each PU can access its own PM and the IP can access its own IM.

The IP, as the control unit of the MPAA system, controls the operation of every PU and interacts with the HP. The IL coordinates the SM accesses between the HP and the PUs by using an enable signal. Thus, the MPAA system can be configured in two different operational modes, i.e., simply as memory or as a SIMD array. First, the MPAA system can be configured
as a portion of the HP's memory. The HP inputs and outputs data to and from the MPAA system in the form of memory reads and writes under the arbitration of the IL; therefore, the MPAA system is considered a part of the contiguous host memory address space. Second, the MPAA system performs any data parallel operation as a SIMD array. Every PU in the MPAA system can access its associated PM and thus can perform a broadcast SIMD instruction on its own data. Data parallel code blocks, including any data required, are stored in the IM and PMs (i.e., the SM) at program loading time.
2.2 System Operation
In this system approach, there exist three different types of processors: the HP, the IP, and the PUs. Application programs can be classified into two major kinds of code blocks, i.e., sequential code blocks performed by the HP and data parallel code blocks processed by the PUs. As the interaction mechanism, control transfer between the HP and the MPAA system is performed via the conventional subroutine calling mechanism, making the MPAA system transparent to programmers. This type of control transfer is called an MPAA subroutine call, to differentiate it from a conventional subroutine call. The overall execution flow of the MPAA system can be described by the following steps.

First, the HP compiles an application program and stores it on secondary storage. When this program is executed, the HP loads it into its memory, i.e., the HM and the SM. When the program is loaded into memory, the sequential code blocks are loaded into the HM, and the parallel code and data blocks are loaded into the IM and PMs of the SM, respectively. The single address space viewed by the HP can be mapped onto the set of PMs by locating each word linearly across the PMs. The parallel code blocks are formed as a set of MPAA subroutines. The HP then starts executing the sequential code blocks. When the HP encounters a calling instruction that initiates the MPAA, control is transferred to the IP, and the HP suspends its operation until the MPAA completes the execution of that subroutine. When an MPAA subroutine call is invoked, the branch target address is the memory address in the IM corresponding to that subroutine, and this address is transferred to the IP. The IP then starts executing instructions in
the IM. Here, the IP sequentially broadcasts parallel instructions to the PUs as needed, and the PUs execute the broadcast instructions on their own data in the PMs. When the IP completes the MPAA subroutine called by the HP, control is transferred back to the HP. In the following subsection, the internal structure of the MPAA system is introduced.
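To make this control-transfer flow concrete, the following Python sketch models the MPAA subroutine call described above. It is a toy model, not the paper's hardware interface: the class names, the address 0x1000, and the instruction list are all hypothetical stand-ins.

```python
# A toy model (hypothetical names) of the MPAA subroutine call: the HP resolves
# a call to an IM address, hands control to the IP, and blocks until completion.

class InterfaceProcessor:
    def __init__(self):
        self.im = {}                    # IM: parallel code blocks keyed by address

    def load(self, address, code_block):
        self.im[address] = code_block   # done at program loading time

    def run(self, address, pms):
        # The IP broadcasts each parallel instruction; every PU applies it to
        # its own PM data (SIMD execution).
        for instruction in self.im[address]:
            pms[:] = [instruction(word) for word in pms]

class HostProcessor:
    def __init__(self, ip):
        self.ip = ip

    def mpaa_call(self, address, pms):
        # Control transfers to the IP; the HP suspends until the subroutine ends.
        self.ip.run(address, pms)       # returning means control is back at the HP

ip = InterfaceProcessor()
ip.load(0x1000, [lambda w: w * 2, lambda w: w + 1])   # a two-instruction block
hp = HostProcessor(ip)
pm_data = [1.0, 2.0, 3.0, 4.0]          # one word per PM, shared via the SM
hp.mpaa_call(0x1000, pm_data)
print(pm_data)                          # [3.0, 5.0, 7.0, 9.0]
```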
2.3 Structure of the MPAA System
The memory-processor integrated arrays in (Aimoto et al., 1996; Elliott, Snelgrove, & Stumm, 1992; Gokhale, Holmes, & Iobst, 1995; Yamashita et al., 1994) can access their local memories only through row-by-row decoding in the SIMD execution mode: for any selected row, every processor located at each memory column performs the same operation in parallel. In contrast, the memory structure of the MPAA system is constructed as two-dimensional memory blocks divided by the number of PUs, as shown in Figure 2 (a), where each memory block consists of a certain number of memory cells. For any given memory block row or column address, every PU attached to a memory block column (row) can access, in parallel, the memory location specified by the memory block row (column) address. Thus, every PU can access the memory selectively by either the memory block row or the memory block column address by using the multiplexer and the demultiplexer. This type of memory accessing pattern is supported by constructing the decoding logic for row- and column-basis accesses as shown in Figure 2 (a).
In the MPAA system, a group of PUs with their associated memories can be integrated into a single chip; this group is called a memory-based processor array block (MPAB). Figure 2 (b) shows an example of MPABs, each consisting of four PUs for simplicity. Each PU is constructed as an ALU including a multiplier and an adder, a set of registers, and two inter-layer connection ports connecting to PUs in the neighboring MPABs, as shown in Figure 2 (b). An MPAB can perform the processing of a single layer of a multilayer perceptron with backpropagation learning, and each PU can perform the operations assigned to one neuron for neuron level parallelism. Thus, the MPAA system can be constructed by using
[Figure 2: The MPAA system. (a) A memory-based processor array block (MPAB): an R x C array of memory blocks of r x c cells, each block with a sense amplifier and a mux/demux, addressed through a top row decoder and a top column decoder, with one PU per memory block column. (b) A configuration of the multi-MPAB system: MPAB[l-1], MPAB[l], and MPAB[l+1] connected through their PUs; large boxes denote memory blocks, small dashed boxes memory cells.]
the same number of MPABs as the number of layers required for a given problem. Each PU in an MPAB is connected to PUs in the neighboring MPABs by a bi-directional communication path as shown in Figure 2 (b).
For a given multilayer perceptron algorithm, the following variables are defined to explain the configuration of the multi-MPAB system.

- $L$: the number of layers of a given artificial neural network. The input layer is labeled 0 and is not counted in $L$; the output layer is labeled $L$, and layers 1 to $L-1$ are the hidden layers.
- $N_l$: the number of neurons at the $l$-th layer ($0 \le l \le L$). The neurons of adjacent layers are assumed to be fully connected.
- $R \times C$: the number of memory blocks of MPAB[$l$] at the $l$-th layer, where $R$ and $C$ are the numbers of memory block rows and columns, respectively. The MPAB of Figure 2 (b) has $4 \times 4$ memory blocks, represented by the large boxes.
- $r \times c$: the number of memory cells of a memory block, where $r$ and $c$ are the numbers of memory cell rows and columns, respectively. A memory block of Figure 2 (b) has $2 \times 2$ memory cells, represented by the small dashed boxes.

[Figure 3: Address format: a one-bit field S followed by the fields A_R (log2 R bits), A_C (log2 C bits), and A_r (log2 r bits).]

By the above definitions, the minimum number of MPABs required is $L$; the minimum number of PUs required in MPAB[$l$] is $\max(N_{l-1}, N_l)$; MPAB[$l$] at the $l$-th layer must have $rR \times cC$ memory cells in total; the number of memory cell columns of a memory block, $c$, should match the width of a PU's data path; and $R \times C$ should be at least $N_l \times N_{l-1}$.
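These sizing rules can be checked mechanically. The following Python sketch (the helper name mpab_requirements and the 8 x 8 block grid are assumptions for illustration) evaluates the minimum number of MPABs, the minimum PU count per MPAB, and the block-count condition for the network of Figure 4.

```python
# A sketch (hypothetical helper) that checks the MPAB configuration rules
# stated above for the 3-4-3-2 network of Figure 4 on an assumed 8x8 grid.

def mpab_requirements(neurons, R, C):
    """neurons[l] = N_l for l = 0..L; (R, C) is each MPAB's memory-block grid."""
    L = len(neurons) - 1                 # the input layer 0 is not counted in L
    report = []
    for l in range(1, L + 1):
        report.append({
            "layer": l,
            "min_PUs": max(neurons[l - 1], neurons[l]),          # PUs in MPAB[l]
            "blocks_ok": R * C >= neurons[l] * neurons[l - 1],   # R*C vs N_l*N_{l-1}
        })
    return L, report                     # L is also the minimum number of MPABs

L_min, report = mpab_requirements([3, 4, 3, 2], R=8, C=8)
print("minimum MPABs:", L_min)
for row in report:
    print(row)
```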
2.4 Addressing Mechanism
When the MPAA system is used simply as memory, both the row decoder and the column decoder in Figure 2 (a) operate as a conventional memory decoder. However, when the MPAA system is used as a SIMD array, every PU can access the data specified by a memory block row or a memory block column and can process a SIMD operation. For the SIMD mode, an MPAA address is generated by the address format consisting of the four fields $S$, $A_R$ (the address of a memory block row), $A_C$ (the address of a memory block column), and $A_r$ (the address of a memory cell row within a given memory block), as shown in Figure 3. $S$ is a one-bit field which selects either a memory block row or a memory block column: if $S$ is zero, the memory block column addressed by $A_C$ is selected and $A_R$ is ignored; otherwise, the memory block row addressed by $A_R$ is selected and $A_C$ is ignored. The fourth field, $A_r$, selects a memory cell row within each memory block addressed by one of the two fields $A_R$ and $A_C$.
The actual MPAA addresses, $A_{MPAA}$, in row major order for the SIMD mode are obtained as
$$A_{MPAA} = S\left(A_R \cdot rcC + c \cdot i\right) + (1 - S)\left(A_C \cdot c + rcC \cdot j\right) + A_r \cdot cC + k, \qquad (1)$$
where $i = 0, 1, \ldots, C-1$, $j = 0, 1, \ldots, R-1$, and $k = 0, 1, \ldots, c-1$.
[Figure 4: Multilayer perceptron with backpropagation learning: a three-layer example with N_0 = 3, N_1 = 4, N_2 = 3, N_3 = 2. Layer l holds the activations x_i[l], and the weight matrices W[1] (4 x 3), W[2] (3 x 4), and W[3] (2 x 3) connect adjacent layers; e.g., x_3[2] = f(sum_j w_3j[2] x_j[1]).]
For the example of Figure 2 (b), if the four fields of the address format in the SIMD mode are (1, 2, x, 0), each PU accesses the memory cells located at the first memory cell row within each memory block of the third memory block row. Similarly, if the four fields are (0, x, 1, 1), each PU accesses the memory cells located at the second memory cell row within each memory block of the second memory block column.
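A minimal sketch of Equation (1), assuming the geometry of Figure 2 (b) ($R = C = 4$, $r = c = 2$), reproduces the two examples above; the function name a_mpaa is a hypothetical label.

```python
# A sketch of Equation (1) for the MPAB of Figure 2 (b): R = C = 4 memory
# blocks of r = c = 2 cells each, laid out in row major order.
R, C, r, c = 4, 4, 2, 2

def a_mpaa(S, A_R, A_C, A_r, i=0, j=0, k=0):
    # S = 1 selects block row A_R (A_C ignored); S = 0 selects block column
    # A_C (A_R ignored). i, j identify the PU; k is the cell column.
    return (S * (A_R * r * c * C + c * i)
            + (1 - S) * (A_C * c + r * c * C * j)
            + A_r * c * C + k)

# Fields (1, 2, x, 0): every PU i reads cell row 0 of block row 2 in parallel.
print([a_mpaa(1, 2, 0, 0, i=i) for i in range(C)])   # [32, 34, 36, 38]
# Fields (0, x, 1, 1): every PU j reads cell row 1 of block column 1.
print([a_mpaa(0, 0, 1, 1, j=j) for j in range(R)])   # [10, 26, 42, 58]
```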
3 Mapping Algorithms to MPAA
In this section, the general ANN model is presented and classified into two major phases. Effective mapping algorithms are then designed and applied to the MPAA system. These algorithms are based on both neuron and layer level parallelisms and exploit layer level pipelined operations.
3.1 ANN Model

The ANN computation can be classified into two phases: the recall phase and the learning phase. The recall phase updates the activation values of the neurons at each layer according to the network topology; this is referred to as the forward procedure. An example of a three-layer perceptron with backpropagation learning is shown in Figure 4. Each neuron $i$ at layer $l$ has an activation value $x_i[l]$. The activation value vector $X[l]$ for layer $l$ consists of the elements $x_i[l]$ for $1 \le i \le N_l$. Associated with each connection from neuron $j$ at layer $l-1$ to neuron $i$ at layer $l$ is a synaptic weight $w_{ij}[l]$. The weight matrix $W[l]$ for layer $l$ consists of the elements $w_{ij}[l]$ for $1 \le i \le N_l$ and $1 \le j \le N_{l-1}$. The recall phase can be formally described as
$$x_i[l] = f(h_i[l]) = f\!\left(\sum_{j=1}^{N_{l-1}} w_{ij}[l]\, x_j[l-1]\right), \qquad (2)$$
where $l = 1, 2, \ldots, L$ and $1 \le i \le N_l$; $X[0]$ stands for an input pattern; and $f$ is an activation function, usually the nonlinear sigmoid $f(x) = 1/(1 + e^{-x})$ with derivative $f' = f(1-f)$.
The learning phase establishes the values of the synaptic weights. The two basic procedures of the learning phase are the forward procedure, identical to the recall phase, and the backward procedure, in which the produced output $x_i[L]$ is compared to the target output $t_i$ and an error value $\delta_i[L]$ is propagated backward to update the weight values. The backward procedure is given by Equations (3) and (4):
$$\delta_j[l-1] = f'(h_j[l-1])\, d_j[l] = f'(h_j[l-1]) \sum_{i=1}^{N_l} w_{ij}[l]\, \delta_i[l], \qquad (3)$$
where $l = L, L-1, \ldots, 2$, $1 \le j \le N_{l-1}$, and $\delta_i[L] = f'(h_i[L])(t_i - x_i[L])$;
$$w_{ij}[l] = w_{ij}[l] + \Delta w_{ij}[l] = w_{ij}[l] + \eta\, \delta_i[l]\, x_j[l-1], \qquad (4)$$
where $l = L, L-1, \ldots, 1$, $1 \le i \le N_l$, $1 \le j \le N_{l-1}$, and $\eta$ is the learning rate.
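For reference, the following NumPy sketch evaluates Equations (2)-(4) directly for a single training pattern. It is a plain sequential model of the mathematics, not the MPAA mapping; the layer sizes (taken from Figure 4), the learning rate $\eta = 0.5$, and the random initial weights are assumptions.

```python
# A plain NumPy sketch of Equations (2)-(4) for one pattern (stochastic
# gradient); layer sizes follow Figure 4, eta and the weights are assumed.
import numpy as np

sizes = [3, 4, 3, 2]                    # N_0 .. N_L for the network of Figure 4
L, eta = len(sizes) - 1, 0.5
rng = np.random.default_rng(0)
W = [None] + [rng.standard_normal((sizes[l], sizes[l - 1])) for l in range(1, L + 1)]

f = lambda h: 1.0 / (1.0 + np.exp(-h))  # sigmoid; f' = f(1 - f)

def learn_one(x0, t):
    x = [x0]
    for l in range(1, L + 1):           # forward procedure, Equation (2)
        x.append(f(W[l] @ x[l - 1]))
    fp = lambda l: x[l] * (1.0 - x[l])  # f'(h[l]) expressed through x[l]
    delta = [None] * (L + 1)
    delta[L] = fp(L) * (t - x[L])       # output-layer error
    for l in range(L, 0, -1):           # backward procedure, Equations (3)-(4)
        if l > 1:                       # transposed matrix-by-vector product
            delta[l - 1] = fp(l - 1) * (W[l].T @ delta[l])
        W[l] += eta * np.outer(delta[l], x[l - 1])   # weight update, Eq. (4)
    return x[L]

print(learn_one(np.array([0.1, 0.5, 0.9]), np.array([1.0, 0.0])))
```

Note how the backward pass uses the transposed matrix-by-vector product $W[l]^T \delta[l]$; avoiding an explicit transposition of $W$ is precisely what the MPAA's column-basis access provides.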
3.2 Mapping Algorithms
ConsiderL layer perceptron with backpropagation learning consisting ofNl neurons at
thel–th layer(0� l � L). For the mapping processes, some assumptions for the MPAA system
and the ANN applications are described.
First, each memory block coordinated by(i; j) at MPAB[l] loadswij[l]. Second, each
PUj[l] (1� j � Nl�1) at the MPAB[l] uses the registers to storexj[l�1] anddj [l]. Each PUi[l]
(1 � i � Nl) at the MPAB[l] assigns the registers to store�i[l] andhi[l]. Third, each PU can
Mapping of Neural Networks onto the Memory-Processor Integrated Architecture 14
perform an operand fetch and computation in a single cycle. Fourth, two different strategies are
in common use for updating the weights in the network. In the first approach, the weights are
updated every cycle for the entire set of training patterns presented. In the second approach, the
network weights are updated continuously after each training pattern is presented. This method
might become trapped for a few atypical training patterns, but the advantage is that it does not
need to accumulate the error value over many patterns presented and allows a network to learn
a given task more quickly, if there is a lot of redundant information in the training patterns. A
disadvantage is that it requires more steps to update weights. The first approach is calledtrue
gradient method and the second approach is calledstochastic gradient method (Petrowski et
al., 1989). The second approach is chosen in this work. Finally,N is assumed to be the number
of neurons at the largest layer.
To perform the recall phase on the MPAA system, the forward procedure FW-MPAA($l$), represented in pseudocode, is called iteratively for $l = 1, 2, \ldots, L$ at lines (1-3) of Algorithm 1 in Figure 5. To process FW-MPAA($l$) in MPAB[$l$], every PU$_j[l]$ for all $1 \le j \le N_{l-1}$ computes the weight-by-activation products in parallel by accessing the memory block row iteratively $N_l$ times, as represented at lines (5-12); these steps are illustrated in Figure 6 (a),(b).

Then every PU$_i[l]$ for all $1 \le i \le N_l$ performs the sum-of-products in parallel by accessing the memory block column iteratively $N_{l-1}$ times, as shown by lines (13-19) and illustrated in Figure 6 (c). Finally, each PU$_i[l]$ for all $1 \le i \le N_l$ finds its new activation value, sends it to PU$_i[l+1]$, and, if the learning phase is assumed, also sends $h_i[l]$ to PU$_i[l+1]$ in parallel, as shown by lines (20-26).

Through these steps, the MPAA system requires no inter-PU communication such as broadcast or shift operations to perform the sum-of-products, which in turn eliminates the transposition of the weight matrix. The same efficient mechanism is applicable to the learning phase. Therefore, the number of computation steps required to recall a single input pattern on the MPAA system can be obtained as
Algorithm 1. RECALL PHASE ON MPAA

{ call procedure FW-MPAA iteratively }
1   for l = 1 to L do
2       FW-MPAA(l);
3   endfor

{ forward procedure on the MPAA }
4   procedure FW-MPAA(l)
5       for i = 1 to N_l do                          { weight-by-activation products }
6           for all 1 ≤ j ≤ N_{l-1} do
7               parbegin
8                   PU_j[l] reads w_ij[l] by the row,
9                       PU_j[l] computes w_ij[l] x_j[l-1];
10                  PU_j[l] writes w_ij[l] x_j[l-1] by the row;
11              parend
12          endfor
13      for j = 1 to N_{l-1} do                      { sum-of-products }
14          for all 1 ≤ i ≤ N_l do
15              parbegin
16                  PU_i[l] reads w_ij[l] x_j[l-1] by the column;
17                      PU_i[l] computes h_i[l] += w_ij[l] x_j[l-1];
18              parend
19          endfor
20      for all 1 ≤ i ≤ N_l do                       { new activation values }
21          parbegin
22              PU_i[l] computes x_i[l] = f(h_i[l]);
23              PU_i[l] sends x_i[l] to PU_i[l+1] in MPAB[l+1];
24              if learning phase then
25                  PU_i[l] sends h_i[l] to PU_i[l+1] in MPAB[l+1];
26          parend
27  endprocedure

Algorithm 2. LEARNING PHASE ON MPAA

{ call procedure FW-MPAA iteratively }
1   for l = 1 to L do
2       FW-MPAA(l);
3   endfor

{ calculate the error vector δ[L] at the output layer }
4   for all 1 ≤ i ≤ N_L do                           { for BW }
5       parbegin
6           PU_i[L] computes (t_i - x_i[L]);
7           PU_i[L] computes f'(h_i[L]);
8           PU_i[L] computes δ_i[L] = f'(h_i[L])(t_i - x_i[L]);
9       parend

{ call procedure BW-MPAA iteratively }
10  for l = L to 1 do
11      BW-MPAA(l);
12  endfor

{ backward procedure on the MPAA }
13  procedure BW-MPAA(l)
14      for j = 1 to N_{l-1} do                      { weight-by-error products }
15          for all 1 ≤ i ≤ N_l do
16              parbegin
17                  PU_i[l] reads w_ij[l] by the column,            { for BW }
18                      PU_i[l] computes w_ij[l] δ_i[l];            { for BW }
19                  PU_i[l] writes w_ij[l] δ_i[l] by the column;    { for BW }
20                  PU_i[l] writes δ_i[l];                          { for updates }
21              parend
22          endfor
23      for i = 1 to N_l do
24          for all 1 ≤ j ≤ N_{l-1} do
25              parbegin
26                  PU_j[l] reads w_ij[l] δ_i[l] by the row,        { for BW }
27                      PU_j[l] computes d_j[l] += w_ij[l] δ_i[l];
28                  PU_j[l] reads δ_i[l];                           { for updates }
29                  PU_j[l] computes η δ_i[l];                      { for updates }
30                  PU_j[l] computes Δw_ij[l] = η δ_i[l] x_j[l-1];
31                  PU_j[l] reads w_ij[l] by the row,               { for updates }
32                      PU_j[l] computes w_ij[l] += Δw_ij[l];
33                  PU_j[l] writes w_ij[l] by the row;              { for updates }
34              parend
35          endfor
36      for all 1 ≤ j ≤ N_{l-1} do                   { error vector at layer l-1 }
37          parbegin
38              PU_j[l] computes f'(h_j[l-1]);
39              PU_j[l] computes δ_j[l-1] = f'(h_j[l-1]) d_j[l];
40              PU_j[l] sends δ_j[l-1] to PU_j[l-1] in MPAB[l-1];
41          parend
42  endprocedure

Figure 5: Algorithms for the recall and the learning phases on the MPAA system.
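The access pattern of FW-MPAA($l$) can be mimicked in software. The sketch below models the memory blocks of MPAB[$l$] as an $N_l \times N_{l-1}$ array of product slots and replaces the parallel "for all" constructs with sequential loops; it is an illustrative model under these assumptions, not the hardware itself, but it shows that the sum-of-products completes using row-basis writes and column-basis reads only, with no inter-PU broadcast or weight matrix transposition.

```python
# A software model (assumptions: a NumPy array stands in for the memory
# blocks, sequential loops stand in for the parallel PUs) of FW-MPAA(l).
import numpy as np

def fw_mpaa(W_l, x_prev, f):
    n_l, n_prev = W_l.shape
    prod = np.zeros_like(W_l)          # per-block product slots in the PMs
    for i in range(n_l):               # lines (5-12): row-basis accesses
        for j in range(n_prev):        # "for all j" is parallel on real PUs
            prod[i, j] = W_l[i, j] * x_prev[j]   # PU_j reads/writes by the row
    h = np.zeros(n_l)
    for j in range(n_prev):            # lines (13-19): column-basis accesses
        for i in range(n_l):           # "for all i" is parallel on real PUs
            h[i] += prod[i, j]         # PU_i reads the products by the column
    return f(h)                        # lines (20-26): new activation values

sigmoid = lambda h: 1.0 / (1.0 + np.exp(-h))
W2 = np.arange(12, dtype=float).reshape(3, 4) / 10.0   # W[2] of Figure 4 (3 x 4)
x1 = np.array([1.0, 0.5, 0.25, 0.125])                 # X[1]
print(fw_mpaa(W2, x1, sigmoid))        # equals sigmoid(W2 @ x1)
```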
[Figure 6: Computation steps in MPAB[2] for the simple network of Figure 4, showing the data layout in the memory blocks and the actions of PU_1[2]..PU_4[2]: (a) reads w_1j[2] and computes w_1j[2] x_j[1]; (b) writes w_1j[2] x_j[1]; (c) reads w_i1[2] x_1[1] and computes h_i[2] += w_i1[2] x_1[1]; (d) reads w_i1[2] and computes w_i1[2] δ_i[2]; (e) writes w_i1[2] δ_i[2]; (f) writes δ_i[2]; (g) reads w_1j[2] δ_1[2] and computes d_j[2] += w_1j[2] δ_1[2]; (h) reads δ_1[2] and computes Δw_1j[2] = η x_j[1] δ_1[2]; (i) reads w_1j[2] and computes w_1j[2] += Δw_1j[2]; (j) writes w_1j[2].]
$$\sum_{l=1}^{L}\left\{ N_l\left(\overbrace{1}^{\text{multiply}} + \overbrace{1}^{\text{memory}}\right) + \overbrace{N_{l-1}}^{\text{add}} + \overbrace{1}^{\text{sigmoid}} + \overbrace{1}^{\text{comm.}}\right\} \le L\left(3\max_{0 \le l \le L}(N_l) + 2\right) = L(3N + 2) = O(N). \qquad (5)$$
Algorithm 2 of Figure 5, the learning phase on the MPAA system, consists of three major operations: calling the forward procedure FW-MPAA($l$), as shown in Algorithm 1 of Figure 5, for $l = 1, 2, \ldots, L$ (lines 1-3); finding the error values at the output layer (lines 4-9); and calling the backward procedure BW-MPAA($l$) for $l = L, L-1, \ldots, 1$ (lines 10-12). To process BW-MPAA($l$) in MPAB[$l$], every PU$_i[l]$ for all $1 \le i \le N_l$ computes the weight-by-error products and writes the error values for the weight update process in parallel by accessing the memory block column iteratively $N_{l-1}$ times (lines 14-22); these steps are illustrated in Figure 6 (d-f). Then every PU$_j[l]$ for all $1 \le j \le N_{l-1}$ performs the sum-of-products and updates the weight values in parallel by accessing the memory block row iteratively $N_l$ times, as illustrated in Figure 6 (g-j). Finally, every PU$_j[l]$ for all $1 \le j \le N_{l-1}$ finds the error values of the lower layer and sends them to PU$_j[l-1]$ in parallel (lines 36-41).
According to Algorithm 2 of Figure 5, the number of computation steps required to learn
a single pattern on the MPAA system can be obtained as
forward procedurez }| {LXl=1
(2Nl +Nl�1 + 3)+
calculates �i[L]z}|{3
+
backward procedurez }| {LXl=1
8<:Nl�1
0@multiplyz}|{
1 +
memoryz}|{2
1A +Nl
0@
addz}|{2 +
multiplyz}|{2 +
memoryz}|{2
1A+
sigmoidz}|{1 +
multiplyz}|{1 +
comm:z}|{1
9=;
� 12LN + 6L+ 3 = O(N ): (6)
The MPAA system can support layer level parallelism and can perform both the recall and the learning phases in a pipelined fashion. Owing to the layer level pipelining, the number of pipeline stages for the recall phase is $L$. To perform the pipelined recall phase on the MPAA
Algorithm 3. PIPELINED RECALL PHASE ON MPAA

{ call procedure FW-MPAA in parallel }
for all 1 ≤ l ≤ L do
    parbegin
        FW-MPAA(l);
    parend

Algorithm 4. PIPELINED LEARNING PHASE ON MPAA

{ calculate the error vector δ[L] at the output layer }
for all 1 ≤ i ≤ N_L do                      { for BW }
    parbegin
        PU_i[L] computes (t_i - x_i[L]);
        PU_i[L] computes f'(h_i[L]);
        PU_i[L] computes δ_i[L] = f'(h_i[L])(t_i - x_i[L]);
    parend

{ call procedure PIPELINED-FW-AND-BW-MPAA in parallel }
for all 1 ≤ l ≤ L do
    parbegin
        PIPELINED-FW-AND-BW-MPAA(l);
    parend

{ PIPELINED-FW-AND-BW-MPAA procedure on the MPAA }
procedure PIPELINED-FW-AND-BW-MPAA(l)
    BW-MPAA(l);
    FW-MPAA(l);
endprocedure

Figure 7: Pipelined algorithms for the recall and the learning phases on the MPAA system.
system, as shown in Algorithm 3 of Figure 7, the forward procedure FW-MPAA($l$) is simply called in parallel for all $1 \le l \le L$. However, $(L-1)$ stages of the pipeline must initially be filled. According to Algorithm 3 of Figure 7, the number of computation steps required to recall $p$ patterns can be obtained as
$$\left(\overbrace{L - 1}^{\text{fill the pipeline}} + \overbrace{p}^{p\text{ patterns}}\right)\left(\overbrace{3N + 2}^{\text{steps at each layer}}\right) = O(pN). \qquad (7)$$
Because the learning phase consists of the forward and the backward procedures, the number of pipeline stages for the learning phase of the MPAA system supporting layer level parallelism is $2L$. Algorithm 4 of Figure 7, the pipelined learning phase algorithm on the MPAA system, performs the three major operations of the learning phase in parallel at the layer level. However, $(2L-1)$ stages of the pipeline must initially be filled. According to Algorithm 4 of Figure 7, the number of computation steps required to learn $p$ patterns can be obtained as
$$\left(\overbrace{2L - 1}^{\text{fill the pipeline}} + \overbrace{p}^{p\text{ patterns}}\right)\left(\overbrace{12N + 9}^{\text{steps at each layer}}\right) = O(pN). \qquad (8)$$
4 Comparison with Various Architectures and Mapping Algorithms

In this section, mapping algorithms applied to various parallel machines, including one-dimensional SIMD arrays, two-dimensional SIMD arrays, systolic ring structures, and hypercube systems, are provided and compared in terms of the number of computation steps and the cost. In order to compare the performance of the proposed schemes fairly with those of previous works, several algorithms proposed in (Kung & Hwang, 1989; Singer, 1990; Svensson & Nordstrom, 1990) are rewritten in a manner similar to the algorithms on the MPAA system.
4.1 Previous Works
Singer (1990) presented five algorithms showing weight level parallelism through the
Connection Machine. The first method calledgrid–based implementationis considered in this
work. The leftmost column of PUs contains the activation value for each of the input units and
the topmost row of PUs contains the activation values for the hidden layer. The weight matrix
is distributed over the rest of PUs. Therefore the number of total PUs required is(N + 1)2.
The forward procedure requires horizontal broadcast and vertical summation, whereas the
backward procedure requires vertical broadcast and horizontal summation. This paper rewrites
the mapping algorithms of the forward and the backward procedures on the two–dimensional
SIMD array (Singer, 1990) in similar manner to algorithms on the MPAA system as shown in
Figure 8. Each PU in Algorithm 5 and 6 is labeled from PU0;0 to PUN;N .
An algorithm called communication by broadcast (Svensson & Nordstrom, 1990), supporting neuron level parallelism and mapped onto a one-dimensional SIMD array with $N$ PUs, is considered in this work. In the forward procedure, each PU broadcasts its own activation value to all other PUs, which then multiply the broadcast value by a weight value in their local memories; in the backward procedure, each PU multiplies its own error value by a weight value, and the result is added to a running sum maintained across the PUs instead of within each PU. In order to improve the inefficient inter-PU communications of the backward procedure, adder tree hardware was proposed. Algorithms 7 and 8 of Figure 9 show the rewritten forward and backward procedures on the one-dimensional SIMD array.
Algorithm 5. FORWARD PROCEDURE ON 2-D SIMD

{ forward procedure on the 2-D SIMD }
procedure FW-2D-SIMD(l)
    if l = odd then
    begin
        for i = 1 to N_l do
            for all 1 ≤ j ≤ N_{l-1} do
                parbegin
                    PU_{i-1,j} sends down x_j[l-1] to PU_{i,j};
                parend
            endfor
        for all 1 ≤ i ≤ N_l, 1 ≤ j ≤ N_{l-1} do
            parbegin
                PU_{i,j} reads w_ij[l];
                PU_{i,j} computes w_ij[l] x_j[l-1];
                if learning phase then
                    PU_{i,j} writes x_j[l-1];
            parend
        for k = 1 to N_{l-1} do
            for all 1 ≤ i ≤ N_l, 1 ≤ j ≤ N_{l-1} do
                parbegin
                    if j ≤ (N_{l-1} - k + 1) then
                        PU_{i,j} sends left w_ij[l] x_j[l-1] to PU_{i,j-1};
                    PU_{i,0} computes h_i[l] += w_ik[l] x_k[l-1];
                parend
            endfor
        for all 1 ≤ i ≤ N_l, 1 ≤ j ≤ N_{l-1} do
            parbegin
                PU_{i,0} computes x_i[l] = f(h_i[l]);
                if learning phase then
                begin
                    PU_{i,0} writes h_i[l];
                    PU_{0,j} writes x_j[l-1];
                endif
            parend
    endif
    else { l = even }
        OMITTED because of code similar to the case l = odd
endprocedure

Algorithm 6. BACKWARD PROCEDURE ON 2-D SIMD

{ backward procedure on the 2-D SIMD }
procedure BW-2D-SIMD(l)
    if l = odd then
    begin
        for j = 1 to N_{l-1} do                      { for BW }
            for all 1 ≤ i ≤ N_l do
                parbegin
                    PU_{i,j-1} sends right δ_i[l] to PU_{i,j};
                parend
            endfor
        for all 1 ≤ i ≤ N_l, 1 ≤ j ≤ N_{l-1} do      { for BW }
            parbegin
                PU_{i,j} reads w_ij[l];
                PU_{i,j} computes w_ij[l] δ_i[l];
            parend
        for k = 1 to N_l do                          { for BW }
            for all 1 ≤ i ≤ N_l, 1 ≤ j ≤ N_{l-1} do
                parbegin
                    if i ≤ (N_l - k + 1) then
                        PU_{i,j} sends up w_ij[l] δ_i[l] to PU_{i-1,j};
                    PU_{0,j} computes d_j[l] += w_kj[l] δ_k[l];
                parend
            endfor
        for all 1 ≤ i ≤ N_l, 1 ≤ j ≤ N_{l-1} do      { for updates }
            parbegin
                PU_{i,j} computes η δ_i[l];
                PU_{i,j} reads x_j[l-1];
                PU_{i,j} computes Δw_ij[l] = η δ_i[l] x_j[l-1];
                PU_{i,j} reads w_ij[l];
                PU_{i,j} computes w_ij[l] += Δw_ij[l];
                PU_{i,j} writes w_ij[l];
            parend
        for all 1 ≤ j ≤ N_{l-1} do                   { for BW }
            parbegin
                PU_{0,j} reads h_j[l-1];
                PU_{0,j} computes f'(h_j[l-1]);
                PU_{0,j} computes δ_j[l-1] = f'(h_j[l-1]) d_j[l];
            parend
    endif
    else { l = even }
        OMITTED because of code similar to the case l = odd
endprocedure

Figure 8: The forward and the backward procedures on the 2-D SIMD system (Singer, 1990).
Algorithm 7. FORWARD PROCEDURE ON 1-D SIMD

{ forward procedure on the 1-D SIMD }
procedure FW-1D-SIMD(l)
    for j = 1 to N_{l-1} do
        PU_j sends x_j[l-1] to the Control Unit;
        Control Unit broadcasts x_j[l-1] to PU_i (1 ≤ i ≤ N_l);
        for all 1 ≤ i ≤ N_l do
            parbegin
                PU_i reads w_ij[l];
                PU_i computes w_ij[l] x_j[l-1];
                PU_i computes h_i[l] += w_ij[l] x_j[l-1];
            parend
    endfor
    for all 1 ≤ i ≤ N_l, 1 ≤ j ≤ N_{l-1} do
        parbegin
            PU_i computes x_i[l] = f(h_i[l]);
            if learning phase then
            begin
                PU_i writes h_i[l];
                PU_j writes x_j[l-1];
            endif
        parend
endprocedure

Algorithm 8. BACKWARD PROCEDURE ON 1-D SIMD

{ backward procedure on the 1-D SIMD }
procedure BW-1D-SIMD(l)
    for j = 1 to N_{l-1} do                  { for BW }
        for all 1 ≤ i ≤ N_l do
            parbegin
                PU_i reads w_ij[l];
                PU_i computes w_ij[l] δ_i[l];
                the adder tree hardware computes d_j[l] += w_ij[l] δ_i[l] in log2 N_l steps;
            parend
    endfor
    for i = 1 to N_l do                      { for updates }
        PU_i sends δ_i[l] to the Control Unit;
        Control Unit broadcasts δ_i[l] to PU_j (1 ≤ j ≤ N_{l-1});
        for all 1 ≤ j ≤ N_{l-1} do
            parbegin
                PU_j computes η δ_i[l];
                PU_j reads x_j[l-1];
                PU_j computes Δw_ij[l] = η δ_i[l] x_j[l-1];
                PU_j reads w_ij[l];
                PU_j computes w_ij[l] += Δw_ij[l];
                PU_j writes w_ij[l];
            parend
    endfor
    for all 1 ≤ j ≤ N_{l-1} do               { for BW }
        parbegin
            PU_j reads h_j[l-1];
            PU_j computes f'(h_j[l-1]);
            PU_j computes δ_j[l-1] = f'(h_j[l-1]) d_j[l];
        parend
endprocedure

Figure 9: The forward and the backward procedures on the 1-D SIMD system (Svensson & Nordstrom, 1990).
Kung and Hwang (1989) proposed an algorithm considering both neuron and layer level parallelisms, mapped onto the cascaded systolic ring structure, which is constructed from as many systolic rings of $N$ PUs as there are layers; the total number of PUs required is therefore $LN$. Layer level pipelined operation over patterns is possible for the recall phase on the cascaded systolic ring structure, but not for the learning phase. The forward procedure requires left shifting of each PU's activation value, and the backward procedure requires left shifting of the accumulated sum. Algorithms 9 and 10 of Figure 10 show in detail the forward and backward procedures on the cascaded systolic ring structure, written in a manner similar to the algorithms for the MPAA system.

Malluhi, Bayoumi, and Rao (1995) proposed an algorithmic mapping technique to
Algorithm 9. FORWARD PROCEDURE ON SYSTOLIC RING

{ define a macro for the index }
#define m(x) (((x + k - 1) mod N_{l-1}) + 1)

{ forward procedure on the cascaded systolic ring }
procedure FW-SYSTOLIC-RING(l)
    for k = 0 to N_{l-1} - 1 do
        for all 1 ≤ i ≤ N_l do
            parbegin
                PU_i[l] reads w_{i,m(i)}[l];
                PU_i[l] computes x_{m(i)}[l-1] w_{i,m(i)}[l];
                PU_i[l] computes h_i[l] += w_{i,m(i)}[l] x_{m(i)}[l-1];
                PU_j[l] sends left x_{m(j)}[l-1] to PU_{((j-2) mod N_{l-1})+1}[l];
            parend
        endfor
    for all 1 ≤ i ≤ N_l, 1 ≤ j ≤ N_{l-1} do
        parbegin
            PU_i[l] computes x_i[l] = f(h_i[l]);
            PU_i[l] sends x_i[l] to PU_i[l+1];
            if learning phase then
                PU_i[l] sends h_i[l] to PU_i[l+1];
        parend
endprocedure

Algorithm 10. BACKWARD PROCEDURE ON SYSTOLIC RING

{ define a macro for the index }
#define m(x) (((x + k - 1) mod N_{l-1}) + 1)

{ backward procedure on the cascaded systolic ring }
procedure BW-SYSTOLIC-RING(l)
    for k = 0 to N_{l-1} - 1 do              { for BW }
        for all 1 ≤ i ≤ N_l, 1 ≤ j ≤ N_{l-1} do
            parbegin
                PU_i[l] reads w_{i,m(i)}[l];
                PU_i[l] computes δ_i[l] w_{i,m(i)}[l];
                PU_j[l] computes d_{m(i)}[l] += δ_i[l] w_{i,m(i)}[l];
                PU_j[l] sends left d_{m(j)}[l] to PU_{((j-2) mod N_{l-1})+1}[l];
            parend
        endfor
    for k = 0 to N_{l-1} - 1 do              { for updates }
        for all 1 ≤ i ≤ N_l, 1 ≤ j ≤ N_{l-1} do
            parbegin
                PU_i[l] computes η δ_i[l];
                PU_i[l] computes Δw_{i,m(i)}[l] = η δ_i[l] x_{m(i)}[l-1];
                PU_j[l] sends left x_{m(j)}[l-1] to PU_{((j-2) mod N_{l-1})+1}[l];
                PU_i[l] reads w_{i,m(i)}[l];
                PU_i[l] computes w_{i,m(i)}[l] += Δw_{i,m(i)}[l];
                PU_i[l] writes w_{i,m(i)}[l];
            parend
        endfor
    for all 1 ≤ j ≤ N_{l-1} do               { for BW }
        parbegin
            PU_j[l] computes f'(h_j[l-1]);
            PU_j[l] computes δ_j[l-1] = f'(h_j[l-1]) d_j[l];
            PU_j[l] sends δ_j[l-1] to PU_j[l-1];
        parend
endprocedure

Figure 10: The forward and the backward procedures on the cascaded systolic ring structure (Kung & Hwang, 1989).
Table 1: Comparison with other mapping schemes.

| Architecture & algorithms | No. of processors | Level of parallelism | Comp. steps, one-pattern recall | Comp. steps, p-pattern recall (pipelining?) | Comp. steps, one-pattern learning | Comp. steps, p-pattern learning (pipelining?) |
|---|---|---|---|---|---|---|
| 2-D SIMD | (N+1)^2 = O(N^2) | weight | L(3N+3) = O(N) | NO: pL(3N+3) = O(pN) | 6LN + 17L + 3 = O(N) | NO: p(6LN + 17L + 3) = O(pN) |
| 1-D SIMD | N = O(N) | neuron | L(5N+1) = O(N) | NO: pL(5N+1) = O(pN) | LN log2 N + 15LN + 6L + 3 = O(N log2 N) | NO: p(LN log2 N + 15LN + 6L + 3) = O(pN log2 N) |
| Systolic ring | LN = O(N) | neuron, layer | L(4N+2) = O(N) | YES: p(4N+2) + (L-1)(4N+2) = O(pN) | 14LN + 6L + 3 = O(N) | NO: p(14LN + 6L + 3) = O(pN) |
| Hypercube MIMD | 4N^2 = O(N^2) | weight, layer | 2L(log2 N + 2) = O(log2 N) | YES: (2L(log2 N + 2) + 1 + 2 log2 N) floor(p / (2 log2 N + 2)) + 2L(log2 N + 2) + (p mod (2 log2 N + 2)) - 1 = O(p log2 N) | 4L(log2 N + 3) = O(log2 N) | NO: 4pL(log2 N + 3) = O(p log2 N) |
| MPAA | LN = O(N) | neuron, layer | L(3N+2) = O(N) | YES: p(3N+2) + (L-1)(3N+2) = O(pN) | 12LN + 6L + 3 = O(N) | YES: p(12N+9) + (2L-1)(12N+9) = O(pN) |
implement the multilayer perceptron with backpropagation learning and the Hopfield ANN model on the mesh-appendixed tree (MAT) structure, which is then embedded into the hypercube MIMD architecture, considering both weight and layer level parallelisms. It can take the optimal number of computation steps, $O(\log_2 N)$, for both the recall and the learning phases. However, those algorithmic steps are obtained at the expense of $4N^2$ MIMD processors. Because the maximum number of patterns that can be concurrently placed in the pipeline is $2\log_2 N + 2$, it requires pooled updates in the pipelined fashion for consecutive patterns, and it supports pattern pipelining only for the recall phase.

The numbers of computation steps of the above architectures with their corresponding mapping algorithms can be obtained in a manner similar to that of the mapping algorithms on the MPAA system, as shown in Table 1.
4.2 Performance Comparison

Various architectures for ANNs with their corresponding algorithms, including the MPAA system, are compared in terms of the cost as well as the number of computation steps. The cost of solving an ANN problem on a specific parallel machine is defined as the product of the number of computation steps and the number of processors used (Kumar et al., 1994a). The cost reflects the sum of the numbers of computation steps over all processors, where each number stands for
the computation steps that each processor spends solving the problem. Table 1 shows the number of computation steps of the recall phase processing a single pattern, called the one-pattern recall phase, and likewise of the p-pattern recall phase, the one-pattern learning phase, and the p-pattern learning phase for the various architectures and algorithms explained in Sections 3.2 and 4.1. The parameters are given the values N = 64, L = 3, and t = 1000, where t is the number of training patterns; varying these parameters changes the number of processors, the number of layers, and the number of training patterns. Figure 11 and Figure 12 compare the number of computation steps and the cost, respectively, of the various architectures and algorithms according to Table 1.
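As a concrete reading of Table 1, the following Python sketch (hypothetical helper names) evaluates the p-pattern learning entries and the resulting cost, defined above as computation steps times processors, for N = 64, L = 3, and p = t = 1000.

```python
# A sketch (not from the paper) that evaluates Table 1's p-pattern learning
# formulas and the cost = steps * processors for N = 64, L = 3, p = 1000.
from math import log2

N, L, p = 64, 3, 1000

# scheme -> (number of processors, computation steps to learn p patterns)
schemes = {
    "2-D SIMD":  ((N + 1) ** 2, p * (6 * L * N + 17 * L + 3)),
    "1-D SIMD":  (N,            p * (L * N * log2(N) + 15 * L * N + 6 * L + 3)),
    "Systolic":  (L * N,        p * (14 * L * N + 6 * L + 3)),
    "Hypercube": (4 * N ** 2,   4 * p * L * (log2(N) + 3)),
    "MPAA":      (L * N,        (p + 2 * L - 1) * (12 * N + 9)),
}

for name, (procs, steps) in schemes.items():
    print(f"{name:9s}  PUs={procs:6d}  steps={steps:12.0f}  cost={procs * steps:14.0f}")
```

With these values, the hypercube has by far the fewest steps but, owing to its $4N^2$ processors, the largest learning-phase cost among these entries, while the MPAA yields the smallest cost, consistent with the ordering seen in Figure 12.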
As shown in Figure 11, the performance of the hypercube (Malluhi, Bayoumi, & Rao, 1995) is outstanding compared with the other schemes in terms of the number of computation steps. However, the same is not true of the cost, as shown in Figure 12; the hypercube scheme is therefore not cost-effective. The number of computation steps of the one-dimensional SIMD array (Svensson & Nordstrom, 1990) is larger than those of the other schemes. The MPAA system is the best in terms of the number of computation steps among the SIMD configurations, and in terms of cost it is the best among all the system types illustrated. Figure 11 (b) and (e) show that the number of computation steps of the MPAA system hardly increases as the number of layers increases, because the MPAA system supports layer level pipelined parallelism for both the recall and the learning phases. Because the mapping method for the cascaded systolic ring structure (Kung & Hwang, 1989) supports layer level pattern pipelining only for the recall phase, it provides good performance only for the recall phase, as shown in Figure 11 (b).
As shown in Figure 12, although the number of computation steps of the one-dimensional SIMD array (Svensson & Nordstrom, 1990) is larger than those of the other schemes, its cost is the third best for the recall phase and the second best for the learning phase; the one-dimensional SIMD array is thus more cost-effective than the two-dimensional SIMD array (Singer, 1990) and the hypercube (Malluhi, Bayoumi, & Rao, 1995). In contrast, the two-dimensional SIMD array (Singer, 1990) is inferior to the other schemes in terms of cost-effectiveness. Also, though the cost of the MPAA system for
the recall phase is similar to those of the other schemes (Malluhi, Bayoumi, & Rao, 1995; Kung & Hwang, 1989; Svensson & Nordstrom, 1990), excluding the two-dimensional SIMD array (Singer, 1990), the cost of the MPAA system for the learning phase is much smaller than those of the others, because the other schemes cannot support any layer level pipelined parallelism for the learning phase.

Consequently, the cost of the MPAA is superior to those of the others owing to the novel architecture based on memory and processor integration, which eliminates inter-PU communications and matrix transpositions. Further gains are provided by the efficient algorithms supporting layer level pipelining for both the recall and the learning phases. The MPAA system can reduce the cost obtained by the other architectures with their corresponding algorithms by 24.81%-98.49% for one thousand training patterns, for an example ANN consisting of a three-layer perceptron with 64 neurons at the largest layer.
5 Conclusions

An effective architecture, the MPAA system, and its associated algorithms for artificial neural networks are presented in this work. The proposed MPAA system provides an efficient mechanism for matrix-by-vector operations without inter-PU communications and matrix transpositions. The proposed algorithms exploit both neuron and layer level parallelisms and also allow layer level pipelining for both the recall and the learning phases. The asymptotic time complexities of the proposed algorithms are evaluated to verify the effectiveness of the MPAA system. The performance of the hypercube scheme is shown to be outstanding, but it is not cost-effective. The cost curves of the MPAA show it to be superior in each phase, even though the margin is not always dramatic. Consequently, it is verified that the proposed scheme achieves a relative performance improvement over four typical parallel architectures with their corresponding algorithms in terms of the cost.
REFERENCES
Aimoto, Y., et al. (1996). A 7.68GIPS 3.84GB/s 1W parallel image processing RAM integrating a 16Mb DRAM and 128 processors. Dig. Tech. Papers, 1996 IEEE Int'l Solid-State Circuits Conf., pp. 372-373.

El-Amawy, A., & Kulasinghe, P. (1997). Algorithmic mapping of feedforward neural networks onto multiple bus systems. IEEE Trans. Parallel and Distributed Systems, 8(2), 130-136.

Elliott, D., Snelgrove, M., & Stumm, M. (1992). Computational RAM: a memory-SIMD hybrid and its application to DSP. IEEE 1992 Custom Integrated Circuits Conf., pp. 30.6.1-30.6.4.

Ghosh, J., & Hwang, K. (1989). Mapping neural networks onto message-passing multicomputers. J. Parallel and Distributed Computing, 6, 291-330.

Gokhale, M., Holmes, B., & Iobst, K. (1995). Processing in memory: the Terasys massively parallel PIM array. IEEE Computer, 28(4), 23-31.

Inoue, K., Nakamura, H., & Kawai, H. (1995). A 10Mb frame buffer memory with Z-compare and A-blend units. IEEE J. of Solid-State Circuits, 30(12), 1563-1568.

Kogge, P.M. (1994). Execube - a new architecture for scalable MPPs. Proc. Int'l Conf. Parallel Processing, vol. I, pp. 77-84.

Kumar, V., Grama, A., Gupta, A., & Karypis, G. (1994a). Introduction to Parallel Computing: Design and Analysis of Algorithms. The Benjamin/Cummings Publishing Company, Inc.

Kumar, V., Shekhar, S., & Amin, M.B. (1994b). A scalable parallel formulation of the backpropagation algorithm for hypercubes and related architectures. IEEE Trans. Parallel and Distributed Systems, 5(10), 1073-1090.

Kung, S.Y., & Hwang, J.N. (1989). A unified systolic architecture for artificial neural networks. J. Parallel and Distributed Computing, 6, 358-387.

Lin, W., Prasanna, V.K., & Przytula, K.W. (1991). Algorithmic mapping of neural network models onto parallel SIMD machines. IEEE Trans. Computers, 40(12), 1390-1401.

Malluhi, Q.M., Bayoumi, M.A., & Rao, T.R.N. (1995). Efficient mapping of ANNs on hypercube massively parallel machines. IEEE Trans. Computers, 44(6), 769-779.

Naylor, S., & Jones, S. (1994). A performance model for multilayer neural networks in linear arrays. IEEE Trans. Parallel and Distributed Systems, 5(12), 1322-1328.

Nordstrom, T., & Svensson, B. (1992). Using and designing massively parallel computers for artificial neural networks. J. Parallel and Distributed Computing, 14(3), 260-285.

Petrowski, A., Personnaz, L., Dreyfus, G., & Girault, C. (1989). Parallel implementations of neural network simulations. In Hypercube and Distributed Computers, North-Holland, New York, pp. 205-218.

Shimizu, T., et al. (1996). A multimedia 32b RISC microprocessor with 16Mb DRAM. Dig. Tech. Papers, 1996 IEEE Int'l Solid-State Circuits Conf., pp. 216-217.

Singer, A. (1990). Implementations of artificial neural networks on the Connection Machine. Parallel Computing, 14(3), 305-316.

Svensson, B., & Nordstrom, T. (1990). Execution of neural network algorithms on an array of bit-serial processors. 10th Int'l Conf. Pattern Recognition, Comp. Arch. for Vision and Pattern Recognition, vol. II, Atlantic City, NJ, pp. 501-505.

Wah, B.W., & Chu, L. (1990). Efficient mapping of neural networks on multicomputers. Proc. Int'l Conf. Parallel Processing, vol. I, pp. 234-241.

Yamashita, N., et al. (1994). A 3.84 GIPS integrated memory array processor with 64 processing elements and a 2-Mb SRAM. IEEE J. of Solid-State Circuits, 29(11), 1336-1342.
[Figure 11: Comparison with other mapping schemes in terms of the number of computation steps (log scale), for the schemes 2D SIMD, 1D SIMD, SYSTOLIC, HYPERCUBE, and MPAA. Panels: (a) recall phase (L = 3, t = 1000) versus the number of neurons at the largest layer N; (b) recall phase (N = 64, t = 1000) versus the number of layers L; (c) recall phase (N = 64, L = 3) versus the number of patterns p; (d)-(f) the same parameter sweeps for the learning phase.]
[Figure 12: Comparison with other mapping schemes in terms of the cost (log scale), for the schemes 2D SIMD, 1D SIMD, SYSTOLIC, HYPERCUBE, and MPAA. Panels: (a) recall phase (L = 3, t = 1000) versus N; (b) recall phase (N = 64, t = 1000) versus L; (c) recall phase (N = 64, L = 3) versus p; (d)-(f) the same parameter sweeps for the learning phase.]