An Eight-Processor Chip
for a Massively Parallel Machine
David Elliot Shaw Theodore M. Sabety
Department of Computer Science Columbia University
New York, NY 10027 212-280-8100
CXS-13J-34
Abstract
This paper describes a VLSI chip that serves as the basis for a massively parallel
tree machine called NON-VON 3. The chip, which is implemented in 3-micron
nMOS technology, contains eight 8-bit processing elements (PE's), each embodying
64 bytes of static RAM. Significant features of the design include: an unusually
high processor density; a novel I/O switch that allows the machine to dynamically
reconfigure to realize several logical communication topologies; logic supporting the
pipelining of instructions, both within and among the individual PE's; a shared
partial instruction decoder that reduces pinout and area; and a parallel self-testing,
dynamically reconfigurable, fault-tolerant RAM that significantly increases both yield
and reliability. The design and operation of the chip are discussed, along with its
speed, area, and power dissipation characteristics.
I. Introduction
Decreasing integrated circuit device dimensions, coupled with the introduction of
simplified, highly structured VLSI design methods, have made feasible the
implementation of a single chip containing a number of simple processing elements
(PE's). Such chips may be used to construct massively parallel machines containing
very large numbers of PE's, and offering considerable computational power in
certain applications. Several existing machines are based on such an approach.
The ICL DAP [1], for example, contains 4K 1-bit processing elements, while
Goodyear's MPP [2] comprises 16K single-bit PE's. Machines containing as many
as 10^5 to 10^6 processing elements [3], [4], [5] have recently been proposed for
implementation using currently available MOS VLSI technology.
Technical Report, Department of Computer Science, Columbia University, August 1984
In this paper, we describe the organization and operation of an nMOS chip that
was designed at Columbia University as the basis for a massively parallel tree
structured machine called NON-VON 3. A simplified version of the NON-VON 3
chip, incorporating most, but not all, of the enhancements to be incorporated in the
final design, has recently been fabricated in VLSI.
Among the novel features of the chip are:
1. An unusually high processor density. Eight 8-bit PE's are incorporated in the
experimental version that has just been fabricated, each of which embodies 64
bytes of RAM. A production version of the design should offer the advantages of
an "intelligent" RAM with only a modest increase in area over the cost of an
equivalent amount of ordinary static RAM.
2. Logic supporting dynamic reconfiguration of the machine to implement several
inter-PE communication topologies. In particular, the design provides efficient
support for tree-structured, linear neighbor, and global communication. In
addition, special features are provided for associative multiple-match resolution,
and for the sequential enumeration of matching records.
3. Facilities supporting both intra- and inter-PE pipelining of instruction
execution. Specifically, the design allows nearly all instructions to be pipelined
through successive levels of the tree, yielding an execution speed that is both
reasonably fast in absolute terms and independent of the size of the tree. The
design also provides for the overlapped fetching and execution of instructions
within each PE.
4. A single, shared "pre-decoder" on each chip that partially decodes instructions
as they enter the chip. This allows instructions to be encoded in a highly vertical
manner for transmission between chips and then translated to a more horizontal
instruction format within the chips. The number of pins required for off-chip
communication is thus minimized, while the area cost of a substantial portion of
the decoding logic is amortized over all eight PE's.
5. A parallel self-testing, dynamically reconfigurable, fault-tolerant RAM, capable
of surviving both initial fabrication defects and subsequently appearing failures.
This facility, which would not be practical in the absence of massively parallel
testing at "power-up" time, should greatly increase both yield (because of the
proportion of silicon area expended on local RAM) and reliability (because of the
ability to recover from errors appearing in the course of operation).
In order to provide a motivating context for these features, Section II provides a
brief overview of the organization of that portion of the NON-VON machine with
which we will be concerned in this paper. Section III describes the structure of the
chip and the manner in which it is connected to other chips of the same type,
while Section IV examines the organization of its constituent PE's. Section V
describes the operation of the PPS chip, with particular emphasis on functions
supported by novel aspects of the chip architecture. In Section VI, we describe the
manner in which the chip was designed and fabricated, and offer a brief discussion
of its speed, area, and power dissipation characteristics. The essential import of our
investigations is briefly summarized in Section VII.
II. Organization of the NON-VON Primary Processing Subsystem
While the complete NON-VON architecture includes several other modules, this
paper is concerned only with that portion which is known as the Primary
Processing Subsystem (PPS). Although physically structured as a binary tree, the
NON-VON PPS can be dynamically reconfigured to efficiently support
communication patterns characteristic of two other topologies, as we shall see in
Section V.
The processing elements do not store and independently execute their own
programs, but rather simultaneously execute instructions that are broadcast to them
by a single control processor (CP), which is attached to the root of the PPS tree.
Details of the mechanisms for instruction broadcast will be presented in Section V.
While the complete NON-VON architecture supports a number of independent and
asynchronous instruction streams, NON-VON 3 executes in a strictly synchronous,
single instruction stream, multiple data stream (SIMD) [7] manner.
III. Organization and Interconnection of the PPS Chips
The tree-structured NON-VON 3 PPS is implemented using a single type of custom
nMOS chip, called the PPS chip, which contains eight PE's and a single, shared
instruction pre-decoder. Our use of a single chip type is made possible by the
adoption of a tree-partitioning scheme first suggested by Leiserson [8]. This
approach embeds on each chip both a complete subtree (containing 2^c - 1 constituent
PE's, for some c depending on device dimensions) and a single internal tree node,
which we will call the solitary node. Four nine-bit buses (eight bits for data, and
one for control) enter the chip. One, called the T connection, leads to the root of
the chip's subtree, while the other three, called the F, L and R connections, attach
the single solitary node to its father, left child, and right child, respectively, within
the tree.
A simple recursive procedure allows the construction of a complete binary tree of
arbitrary size using only chips of this type. This construction is illustrated for the
case of two chips in Figure 1. Note that the resulting circuit consists of a larger
complete binary subtree (in this case rooted by the solitary node of the chip on the
left side of Figure 1) together with a single unconnected solitary node (the solitary
node of the chip on the right). This circuit has the same four external connections
-- T, F, L and R -- as did a single chip. This procedure may be repeated
recursively to construct a PPS comprising 2^b - 1 PE's, for arbitrarily large b, and
leaves only one solitary PE unused, independent of the value of b.
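
The recursive construction can be sketched in a few lines of Python. This is only an illustrative model, not part of the chip design: the names `Circuit`, `single_chip`, `combine` and `build` are hypothetical, and the value c = 3 used in the example is inferred from the eight PE's per chip (a seven-PE subtree plus one solitary node).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Circuit:
    """One or more PPS chips exposing the same T, F, L and R connections."""
    chips: int         # physical chips used
    subtree_pes: int   # PE's in the complete subtree reached through T
    solitary_pes: int  # unconnected solitary nodes left over


def single_chip(c: int) -> Circuit:
    # One chip holds a complete subtree of 2^c - 1 PE's plus one solitary node.
    return Circuit(chips=1, subtree_pes=2 ** c - 1, solitary_pes=1)


def combine(left: Circuit, right: Circuit) -> Circuit:
    # The solitary node of `left` becomes the new root: its L and R
    # connections attach to the T connections of the two circuits.
    # The solitary node of `right` remains unconnected.
    if left.subtree_pes != right.subtree_pes:
        raise ValueError("both circuits must hold subtrees of equal size")
    return Circuit(
        chips=left.chips + right.chips,
        subtree_pes=1 + left.subtree_pes + right.subtree_pes,
        solitary_pes=1,
    )


def build(c: int, doublings: int) -> Circuit:
    # Apply the two-circuit construction repeatedly, doubling the chip count.
    circuit = single_chip(c)
    for _ in range(doublings):
        circuit = combine(circuit, circuit)
    return circuit


if __name__ == "__main__":
    for k in range(4):
        print(build(3, k))
```

In this model, every doubling adds one level to the complete subtree while the number of unused solitary nodes stays at one, which is the property the construction relies on.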
Insert Figure 1 (Interconnection of Two Leiserson Chips) here.
An annotated microphotograph of the actual PPS chip is shown in Figure 2. Since
all PE's have the same top-level floor plan (differing only in certain lower-level
details, as will be discussed shortly), the internal PE structure has been shown only
for the solitary PE.
Insert Figure 2 (Microphotograph of the NON-VON 3 PPS Chip) here.
The shared instruction pre-decoder translates the 8-bit vertical format used to
transmit instructions between chips into a somewhat more horizontal 11-bit format
which is used within the chip. Because this function need be performed only once
per chip, and not once per PE, the technique of pre-decoding results in a reduction
in the amount of area required for local PLA's without an increase in pinout.
Pre-decoding is performed, however, only for the PE's of the on-chip subtree, and
not for the solitary PE, since NON-VON's tree pipelining scheme in general
causes a different instruction to be executed at each level of the tree at any given
point in time. Since the root of the subtree is always at a different level than the
solitary PE, the instructions to be executed by these two PE's at any given point
in time may differ, and must thus be separately decoded.
Instructions cannot be fully decoded by the shared pre-decoder, since the
control lines that are excited to execute a given instruction depend on whether the
PE is a leaf or an internal node of the tree, and on whether it is a left or right
child of its parent. By "factoring out" that part of the decoding function that does
not depend on the position of the PE within the tree, however, pre-decoding results
in a reduction of both pinout and area.
IV. Organization of the Processing Element
The NON-VON PE includes a 64 word X 9-bit random access memory; a set of
five 8-bit byte registers, called A8, B8, C8, IO8, and MAR; a set of five 1-bit flag
registers, called A1, B1, C1, IO1, and EN1; an eight-bit arithmetic logical unit
(ALU); a programmable logic array (PLA) used for instruction decoding and other
control functions; and two special combinational networks, called the I/O Switch and
the RESOLVE Circuit. A top-level block diagram of the PE is presented in Figure
3.
Insert Figure 3 (Block Diagram of the Processing Element) here.
With the exception of MAR and EN1, all of the byte and flag registers may be
used as general-purpose registers. In addition, however, all ten registers have
special functions (in some cases supported by special hardware), which will become
clear in the context of our description of the instruction set. It will sometimes be
convenient to refer to the pair of registers A8 and A1 as if they were one
compound register, to which we will apply the unsuffixed name A; the register pairs
B, C, and IO are defined similarly.
EN1 is distinguished as the enable flag. This flag is used to activate and
deactivate individual PE's within the PPS. In general terms, only those PE's whose
enable flags are asserted will respond to instructions broadcast by the CP. If EN1
is set to 0 in a particular PE, all instructions except one (the ENABLE instruction,
discussed below) will be ignored. A number of tricky issues arise in considering the
behavior of enabled and disabled PE's, particularly in the case of inter-PE
communication operations. The precise semantics of communication between
enabled and disabled PE's will be described in Section V.
Two internal buses, called the A bus and the IO bus, are used to transfer data
within the PE. Both are capable of transferring either one- or eight-bit data,
depending on the instruction being executed. The PE's dual-bus organization is
required to support NON-VON's inter-PE communication instructions, which employ
one bus to send data to a "neighboring" (in a sense that will soon be described
more precisely) PE, and another to receive data concurrently from a different
neighbor. The A bus connects all of the registers, the RAM, the ALU, and the
I/O Switch. The IO bus is also connected to the ALU and I/O Switch, but is
connected to only a subset of the register pairs -- specifically, A and IO -- which
are dual-ported to allow transfers across both buses.
The I/O Switch is a matrix of pass transistors that routes data between the two
internal buses and the three inter-PE communication buses (parent, left child, and
right child) in the course of executing inter-PE communication instructions.
Depending on the particular instruction, these switches may be configured in such a
way as to support one of several different logical communication topologies, as
discussed in Section V. The RESOLVE Circuit is another combinational network that
is used to realize a particular instruction that selects a single PE out of a set of
"marked" PE's, which in practice have typically been identified through a process
of associative matching. The semantics of the RESOLVE instruction will also be
described in Section V.
Finally, the local RAM comprises a 64 X 9-bit array of 6-transistor static RAM
cells, together with control and decoding circuitry that allows access to one 8- or
1-bit location per instruction cycle. The instruction set supports up to 256 8-bit and
256 1-bit RAM locations. A production version of the chip would in fact include a
256-byte RAM. In order to minimize experimental fabrication costs by reducing die
size, however, only 64 9-bit locations, implemented as two arrays of 32 locations
each, were included in the current prototype chip.
In a production version containing 256 RAM locations, our measurements (presented
in Section VI) indicate that just over half of the die area would be used for RAM
cells. Using a simple Poisson model to estimate the incidence of defects, it was
determined that the incorporation of fault-tolerant circuitry within the local RAM of
each PE would very significantly increase the yield of working PPS chips.
Moreover, the massively parallel structure of the NON-VON PPS offered a unique
opportunity to recover not only from fabrication defects, but also from failures due
to such phenomena as metal migration, which frequently occur in the course of
circuit operation.
Our approach is based on the simultaneous testing and dynamic reconfiguration of
the local RAM's of all PE's each time the machine is "powered up" or a memory
error is detected. In addition to the advantages of dynamic failure recovery, this
technique is implemented using a standard nMOS process, obviating the need for
laser customization, the incorporation of fusible links, or other special processing of
the sort used to manufacture most commercially-available fault-tolerant RAMs.
As in conventional fault-tolerant RAM's, each of the two 32-word RAM arrays is
provided with a "spare" column, which may be substituted for any single column
which is found to be faulty. A maximum of 16 defective RAM columns may thus
be tolerated within each PPS chip (providing, of course, that the faults are
distributed perfectly across the RAM arrays -- an unlikely event in practice). If
any of the RAM arrays embedded in a given chip contains more than one faulty
column, however, the chip cannot be reconfigured for correct operation, and must
be discarded. In contrast with conventional RAM's, the column to be replaced is
selected not by physically reconfiguring the hardware, but by setting a particular
flag according to the results of a parallel test program that concurrently tests a
common 8-bit location in all PE's. (No fault-tolerance is provided for the 1-bit
RAM array.) This process is repeated 64 times to test all locations.
It is worth noting that the practicality of the dynamic RAM configuration algorithm
is critically dependent on the extensive, fine-grained parallelism offered by the
NON-VON machine, and would not be practical in a sequential system embodying a
comparable amount of RAM. By way of illustration, we consider the case of a
production NON-VON machine containing a million PE's, each embodying 256 bytes
of local RAM. We assume that 100 NON-VON instructions are required to test
each byte of RAM (including tests of its functionality while different values are
stored in neighboring cells). At 250 nanoseconds per instruction, approximately 6.4
milliseconds would be required to test the entire machine in parallel, independent of
the number of PE's. By way of comparison, a sequential machine having an
instruction cycle time of 250 nanoseconds would require nearly two hours to
perform the equivalent tests on the same amount of RAM.
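
The two test-time estimates follow directly from the stated assumptions. The short Python calculation below reproduces them; the constant and function names are illustrative only, and the figures are the ones given in the text above.

```python
INSTRUCTION_TIME_S = 250e-9      # 250 nanoseconds per instruction
INSTRUCTIONS_PER_BYTE = 100      # assumed test cost per byte of RAM
BYTES_PER_PE = 256               # local RAM in a production PE
NUM_PES = 1_000_000              # PE's in the hypothetical production machine


def parallel_test_time_s() -> float:
    # All PE's test their own RAM at once, so the time depends only on
    # the amount of RAM per PE.
    return BYTES_PER_PE * INSTRUCTIONS_PER_BYTE * INSTRUCTION_TIME_S


def sequential_test_time_s() -> float:
    # A sequential machine must repeat the same work for every PE's RAM.
    return NUM_PES * parallel_test_time_s()


if __name__ == "__main__":
    print(f"parallel:   {parallel_test_time_s() * 1e3:.1f} ms")
    print(f"sequential: {sequential_test_time_s() / 3600:.2f} hours")
```

The parallel figure comes to 6.4 milliseconds and the sequential figure to roughly 1.8 hours, in line with the "nearly two hours" quoted above.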
V. Operation of the PPS Chip
The operation of the NON-VON 3 PPS chip is best understood by reviewing the
PE instruction set. The semantics of each instruction are described below, along
with the set of permissible operands, where appropriate.

INSTRUCTION                              SEMANTICS
MOV8 <byte reg 1> <byte reg 2>           <byte reg 2> <- <byte reg 1>
    <byte reg> = {A8, B8, C8, MAR, IO8}
MOV1 <bit reg 1> <bit reg 2>             <bit reg 2> <- <bit reg 1>
    <bit reg> = {A1, B1, C1, EN1, IO1}

The MOV8 and MOV1 instructions are used to transfer data between byte registers
and between bit registers, respectively, within the PE.

READRAM8 <byte reg>                      <byte reg> <- RAM8 (MAR)
WRITERAM8 <byte reg>                     RAM8 (MAR) <- <byte reg>
    <byte reg> = {A8, B8, C8 or IO8}
READRAM1 <bit reg>                       <bit reg> <- RAM1 (MAR)
WRITERAM1 <bit reg>                      RAM1 (MAR) <- <bit reg>
    <bit reg> = {A1, B1, C1 or IO1}
INCREMENT                                MAR <- MAR + 1

The READRAM and WRITERAM instructions are used to transfer data between a
register and the RAM location whose address is stored in the "incrementing memory
address register", IMAR. The INCREMENT instruction adds one to the address
stored in the IMAR. (The IMAR may also be "auto-incremented" by executing a
string loading or string matching instruction, both of which will be discussed later
in this section.)

ADD <byte reg>                           C8 <- (<byte reg> + A8 + C1); C1 <- carry
SUB <byte reg>                           C8 <- (A8 - <byte reg> - C1); C1 <- borrow
COMPARE <byte reg>                       if <byte reg> = A8 then A1 <- 1 else A1 <- 0;
                                         if <byte reg> > A8 then B1 <- 1 else B1 <- 0
    <byte reg> = {B8, IO8, MAR, or RAM}

The ADD, SUB and COMPARE instructions may be used to perform arithmetic
and comparison operations on two 8-bit operands. The carry bit must be cleared
before these instructions are initiated. The results of a COMPARE are stored in
the A1 and B1 flags.
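
The role of the C1 flag can be illustrated with a small Python model of the ADD and SUB semantics listed above. The functions `add`, `sub` and `add16` are hypothetical names; in particular, chaining two ADDs to form a 16-bit sum is an assumed usage pattern, shown only to indicate why the carry bit should be cleared before the first operation.

```python
def add(reg: int, a8: int, c1: int):
    """ADD: C8 <- reg + A8 + C1; C1 <- carry. Returns (C8, C1)."""
    total = reg + a8 + c1
    return total & 0xFF, total >> 8


def sub(reg: int, a8: int, c1: int):
    """SUB: C8 <- A8 - reg - C1; C1 <- borrow. Returns (C8, C1)."""
    diff = a8 - reg - c1
    return diff & 0xFF, int(diff < 0)


def add16(x: int, y: int):
    """Add two 16-bit values one byte at a time (assumed usage)."""
    low, carry = add(y & 0xFF, x & 0xFF, 0)    # carry cleared before starting
    high, carry = add(y >> 8, x >> 8, carry)   # carry ripples into the high byte
    return (high << 8) | low, carry


if __name__ == "__main__":
    print(add(200, 100, 0))
    print(hex(add16(0x01FF, 0x0001)[0]))
```

If C1 were left set from an earlier operation, the first ADD in such a chain would be off by one, which is presumably the reason for the requirement stated above.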

ROTRB                                    Rotate B right 1 bit
ROTLB                                    Rotate B left 1 bit

The B8 and B1 registers contain logic enabling them to function together as a 9-bit
circular shift register. Specifically, ROTRB shifts all but the low-order bit of B8
into the next lowest bit position within B8; the low-order bit of B8 is moved into
B1, and the value previously stored in B1 is moved into the high-order bit of B8.
ROTLB performs an analogous left circular shift. The B register can thus be used
to transfer information between the 8- and 1-bit data paths.
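
A minimal Python sketch of the 9-bit rotation follows. ROTRB is modeled directly from the description above; the exact bit movement of ROTLB is not spelled out in the text, so the version here simply assumes it is the inverse of ROTRB. The function names are illustrative.

```python
from typing import Tuple


def rotr_b(b8: int, b1: int) -> Tuple[int, int]:
    """ROTRB: rotate the 9-bit (B1, B8) ring right by one bit."""
    new_b1 = b8 & 1                   # low-order bit of B8 moves into B1
    new_b8 = (b8 >> 1) | (b1 << 7)    # old B1 enters the high-order bit of B8
    return new_b8, new_b1


def rotl_b(b8: int, b1: int) -> Tuple[int, int]:
    """ROTLB (assumed to be the inverse of ROTRB)."""
    new_b1 = (b8 >> 7) & 1            # high-order bit of B8 moves into B1
    new_b8 = ((b8 << 1) & 0xFF) | b1  # old B1 enters the low-order bit of B8
    return new_b8, new_b1


if __name__ == "__main__":
    b8, b1 = 0b10110001, 0
    for _ in range(9):                # nine rotations restore the original value
        b8, b1 = rotr_b(b8, b1)
    print(bin(b8), b1)
```

Eight successive rotations interleaved with 1-bit operations on B1 would let a byte be processed one bit at a time, which is one way the B register can bridge the two data paths.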

LOGICAL8 <operation>                     C8 <- (A8 <operation> B8)
LOGICAL1 <operation>                     C1 <- (A1 <operation> B1)

where <operation> is a four-bit code specifying one of the sixteen possible boolean
functions of two variables. LOGICAL8 applies the specified operation in a bitwise
fashion to all eight bits of its operands. Special cases of the LOGICAL1 instruction
include SET, CLEAR, NEGATE, AND, OR, XOR, EQU, NAND, NOR and NOP.
LOGICAL1 may be used to combine the results of a COMPARE instruction to test
all six possible arithmetic relational predicates (EQ, NE, GT, LT, GE and LE) on
two 8-bit operands.

SEND8 <PE> <byte reg>                    IO8 (<PE>) <- <byte reg>
SEND1 <PE> <bit reg>                     IO1 (<PE>) <- <bit reg>
RECV8 <PE> <byte reg>                    <byte reg> <- IO8 (<PE>)
RECV1 <PE> <bit reg>                     <bit reg> <- IO1 (<PE>)
    <byte reg> = {A8, B8, C8, MAR}
    <bit reg> = {A1, B1, C1, EN1}
    <PE> = {LC, RC, LN, RN} for SEND instruction,
           {LC, RC, LN, RN, PR} for RECV instruction.

The SEND and RECV instructions are used to transfer data in parallel not only
between PE's that are physically adjacent within the PPS, but also between two
PE's that are adjacent in a particular linear sequence defined by an inorder
traversal [9] of the nodes of the PPS tree. In both cases, data is transferred
between some register in the PE in which the instruction is executed and the IO
register within some neighboring PE. It is always possible to RECV data from a
PE, regardless of whether it is enabled, but an attempt to SEND data to a
disabled PE will not result in a transfer of data.
In the case of tree communication, the neighboring PE is chosen to be either the
left child (LC), right child (RC), or parent (PR). (PR is disallowed as the operand
of a SEND instruction, however, since the semantics of this operation would be
ill-defined in the case where both children simultaneously attempted to transfer
information to their common parent.) In the case of linear neighbor
communication, the neighboring PE is chosen to be either the left linear neighbor
(LN) or right linear neighbor (RN), which respectively designate the PE's occurring
immediately before and immediately after (in inorder traversal order) the PE
executing the instruction.
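
The linear neighbor relation can be made concrete with a short Python sketch. The numbering of PE's used here (root 1, children 2i and 2i + 1) is an assumption adopted purely for illustration and does not appear in the text; the function names are likewise hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def inorder(node: int, size: int) -> List[int]:
    # Inorder traversal of a complete binary tree stored in heap order.
    if node > size:
        return []
    return inorder(2 * node, size) + [node] + inorder(2 * node + 1, size)


def linear_neighbors(size: int) -> Dict[int, Tuple[Optional[int], Optional[int]]]:
    """Map each PE to its (LN, RN) pair; None marks the ends of the sequence."""
    order = inorder(1, size)
    neighbors = {}
    for pos, pe in enumerate(order):
        ln = order[pos - 1] if pos > 0 else None
        rn = order[pos + 1] if pos + 1 < len(order) else None
        neighbors[pe] = (ln, rn)
    return neighbors


if __name__ == "__main__":
    print(inorder(1, 7))
    print(linear_neighbors(7)[1])
```

For a seven-PE tree, the root's linear neighbors are the rightmost leaf of its left subtree and the leftmost leaf of its right subtree, neither of which is physically adjacent to it in the tree.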

BROADCAST8 <byte reg> <byte>             <byte reg> <- <byte>
BROADCAST1 <bit reg> <bit>               <bit reg> <- <bit>
REPORT8 <byte reg>                       logical reg. GC8 (in CP) <- <byte reg>
REPORT1 <bit reg>                        logical reg. GC1 (in CP) <- <bit reg>
    <byte reg> = {A8, B8, C8, MAR, IO8}
    <bit reg> = {A1, B1, C1, EN1, IO1}

The BROADCAST instructions are used to transfer a single data value from the
control processor into a specified destination register within all enabled PE's
simultaneously. The REPORT instructions, on the other hand, are defined only
when exactly one PE is enabled, and result in the transfer of data from the
specified register within that PE to a particular "logical register" within the control
processor, which is called GC.

RESOLVE                                  A1 <- 0 in all PE's except "first" PE where A1 = 1;
                                         if no PE has A1 = 1 then logical register R1
                                         (in CP) <- 0 else R1 <- 1

In many NON-VON programs, a subset of the PE's, often identified by associative
matching, are "marked" by storing a 1 in one of the flag registers or one-bit RAM
locations. The RESOLVE instruction can be used to identify exactly one of the
members of a set of PE's that have previously been marked using the A1 flag.
After execution of a RESOLVE, the A1 register is reset to zero in all PE's except
the one that occurs first in inorder traversal order. The RESOLVE instruction is
frequently used in conjunction with REPORT to read data into the control
processor from each one of a set of PE's in turn.
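
The enumeration idiom mentioned above can be sketched in Python as follows. The model is deliberately abstract: PE's are represented by list positions in inorder-traversal order, and the loop that enables only the winning PE, reports its value, and then unmarks it is an assumed instruction sequence, not one given in the text.

```python
from typing import List, Tuple


def resolve(a1: List[int]) -> Tuple[List[int], int]:
    """Model RESOLVE on A1 flags listed in inorder-traversal order.

    Returns the new A1 flags and the value of R1 in the control processor.
    """
    new_a1 = [0] * len(a1)
    for position, flag in enumerate(a1):
        if flag:
            new_a1[position] = 1   # only the first marked PE keeps A1 = 1
            return new_a1, 1
    return new_a1, 0               # no PE was marked


def enumerate_marked(values: List[int], marked: List[int]) -> List[int]:
    """Read one value from each marked PE in turn (assumed idiom)."""
    remaining = list(marked)
    reported = []
    while True:
        a1, r1 = resolve(remaining)
        if r1 == 0:                          # R1 = 0: nothing left to report
            break
        winner = a1.index(1)
        reported.append(values[winner])      # REPORT from the single winner
        remaining[winner] = 0                # unmark it before the next pass
    return reported


if __name__ == "__main__":
    print(enumerate_marked([10, 20, 30, 40, 50], [0, 1, 0, 1, 1]))
```

Each pass costs one RESOLVE and one REPORT, so the time to read out a set of matches grows with the number of matches rather than with the number of PE's.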

ENABLE                                   EN1 <- 1 in all PE's, including those previously disabled

ENABLE is the only instruction that is executed by all PE's, whether or not they
are already enabled. It is used to set all of the EN1 registers to 1, thus
"awakening" all PE's in the PPS after some subset have been disabled.

STRINGBROADCAST <length> <string>
STRINGREPORT <length>
STRINGCOMPARE <length> <string>
    The semantics of these three operations are described below.

The three string operations use the autoincrement capability of the MAR to
perform highly efficient loading, unloading, and matching operations on successive
locations in RAM. The appropriate opcode is broadcast to the PPS only once per
string. Simultaneously, the control processor raises a special control wire, called
ISYNC, which causes all enabled PE's to enter a special state in which they are
prepared to accept a string of data bytes without intervening opcodes, incrementing
their respective MAR's after each byte. While the last byte is being broadcast, the
control processor lowers ISYNC to cause all PE's to exit this special state.
The STRINGBROADCAST instruction transfers a common string from the control
processor into the local RAM's of all enabled PE's, starting at the location specified
by their respective MAR's. (In the case of applications involving records having
variable-length fields, these starting addresses will not in general be the same in all
PE's.) STRINGREPORT functions in a similar manner, but is used to transfer a
string into the control processor from a single enabled PE. STRINGCOMPARE
compares a string broadcast by the control processor in parallel against those stored
in all enabled PE's. If a data byte is broadcast that fails to match the
corresponding byte stored within some PE, the EN1 flag is set to 0 in that PE,
disabling it for the remainder of the matching process. At the end of the
STRINGCOMPARE instruction, only those PE's containing a matching string are
left enabled.
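
The behavior of STRINGCOMPARE can be illustrated with a small Python simulation. The `PE` class and `string_compare` function are hypothetical, and the inner loop over PE's stands in for what the hardware does simultaneously. Whether a PE still increments its MAR on the byte that disables it is not stated in the text; the sketch assumes that it does.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PE:
    ram: bytes      # local RAM contents
    mar: int = 0    # memory address register (string start address)
    en1: int = 1    # enable flag


def string_compare(pes: List[PE], pattern: bytes) -> None:
    for byte in pattern:              # the CP broadcasts one byte per step
        for pe in pes:                # all PE's act at once in hardware
            if not pe.en1:
                continue              # disabled PE's ignore the broadcast
            if pe.ram[pe.mar] != byte:
                pe.en1 = 0            # mismatch: drop out of the match
            pe.mar += 1               # auto-increment after each byte


if __name__ == "__main__":
    pes = [PE(b"cat"), PE(b"car"), PE(b"xxcat", mar=2)]
    string_compare(pes, b"cat")
    print([pe.en1 for pe in pes])
```

Note that the third PE starts its comparison at a different address from the others, mirroring the remark above about records with variable-length fields.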
Most of the instructions in the NON-VON instruction set are expected to require
approximately 250 nanoseconds for execution, independent of the size of the PPS
tree. This rate is attained by pipelining the broadcast of all but a few of the
instructions, latching each instruction at every level within the tree. Data broadcast
to all PE's from the control processor is also pipelined down the tree. The linear
neighbor communication instructions, however, along with RESOLVE and REPORT,
make global use of the PPS, and are not pipelined. The pipeline must be
"flushed" before these instructions are executed, requiring a longer instruction
period. The exact length of this period will depend logarithmically on the number
of PE's embodied in the particular NON-VON configuration at hand. In addition
to NON-VON's level-by-level tree pipelining scheme, the design provides for the
overlapped fetching and execution of instructions within each PE.
VI. Design and Fabrication
The datapath in each PE is identical, and was laid out manually using a graphical
editing system to achieve the highest possible packing density. Four different
variants of the control path were necessary, however, to accommodate the somewhat
different behavior of leaves and internal nodes, and of left and right children. In
order to produce these variants without introducing human error, and in order to
shorten the development cycle to allow extensive experimentation with alternative
instruction sets, a semi-automatic layout program called PLATO [10] was developed.
PLATO accepts as input a set of instruction opcodes, together with associated
control information, and produces as output a functionally correct, highly area
efficient set of PLA's. Among the novel aspects of the program are the use of a
channel routing algorithm to generate a Weinberger Array layout for the PLA, and
the generation, from a single input description, of different PLA variants
corresponding to processing elements serving different functions within the machine.
The chip was implemented using 3 micron nMOS design rules, and was fabricated
through the MOSIS "silicon brokerage" system, which is operated by the
Information Sciences Institute under contract with the Defense Advanced Research
Projects Agency. The chip contains about 65,000 transistors, and has a functional
pinout of 53. The die measures 8.1 X 7.4 mm, of which approximately 19% is
occupied by RAM. A production version of the chip containing 256 bytes of local
RAM per PE would thus require only about 2.1 times as much area as would be
occupied by an equivalent amount of ordinary static RAM. The chip dissipates
approximately 34 watts of power, requiring the use of finned packages and forced
air cooling.
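
One plausible way to arrive at the factor of about 2.1 is sketched below in Python. The text does not give the calculation, so this is a reconstruction under stated assumptions: RAM area is taken to scale linearly with capacity, the non-RAM area of the die is taken to stay fixed, and ordinary static RAM is assumed to need the same area per byte as the on-chip RAM. All names are illustrative.

```python
RAM_FRACTION_PROTOTYPE = 0.19   # share of the prototype die occupied by RAM
PROTOTYPE_BYTES_PER_PE = 64
PRODUCTION_BYTES_PER_PE = 256


def area_ratio() -> float:
    """Production die area relative to the area of its RAM alone."""
    scale = PRODUCTION_BYTES_PER_PE / PROTOTYPE_BYTES_PER_PE
    ram_area = RAM_FRACTION_PROTOTYPE * scale   # assumed linear scaling
    other_area = 1.0 - RAM_FRACTION_PROTOTYPE   # assumed unchanged
    return (ram_area + other_area) / ram_area


if __name__ == "__main__":
    print(f"area ratio: {area_ratio():.2f}")
```

Under these assumptions the ratio rounds to 2.1, consistent with the figure quoted above.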
VII. Conclusion
The NON-VON 3 PPS chip achieves a very high processor density while still
attaining high execution speeds on those operations that appear to be most
important in massively parallel SIMD tree algorithms. A production version of the
chip would form the basis for an "intelligent memory" costing only about twice as
much as an equivalent amount of ordinary static RAM, but incorporating an
eight-bit processing element along with every 256 bytes of storage. A number of novel
architectural features are incorporated in the design, many of which should prove
applicable to other massively parallel machines.
Acknowledgement
Many members of Columbia's NON-VON Project contributed to the work
described in this paper. The authors would like to give particular
acknowledgement, however, to the major contributions of John K. Lai, Brian
Mathies, and Robert Montay, who had principal responsibility for the NON-VON 3
PPS chip layout, and for the development of some of the software tools used in its
generation.
This research was supported in part by the Defense Advanced Research Projects
Agency under contract N00039-80-G-0132, and in part by the New York State
Center for Advanced Technology in Computers and Information Systems at
Columbia University.
References
[1] P. M. Flanders, D. J. Hunt, S. F. Reddaway, and D. Parkinson, "Efficient high
speed computing with the Distributed Array Processor", in High Speed Computer
and Algorithm Organization. New York: Academic, 1977, pp. 113-128.
[2] J. L. Potter, "Image processing on the Massively Parallel Processor", Computer,
vol. 16, no. 1, Jan. 1983.
[3] D. E. Shaw, "The NON-VON Supercomputer", Technical Report, Columbia
Computer Science Department, August, 1982.
[4] W. D. Hillis, "The Connection Machine", Technical Memo, Artificial Intelligence
Laboratory, Massachusetts Institute of Technology, September, 1981.
[5] T. Kondo, T. Nakashima, M. Aoki and T. Sudo, "An LSI adaptive array
processor", IEEE Journal of Solid-State Circuits, vol. SC-18, no. 2, pp. 147-156,
Aug. 1983.
[6] D. E. Shaw, "SIMD and MSIMD variants of the NON-VON supercomputer",
Proc. COMPCON '84, San Francisco, California, Feb. 1984.
[7] M. Flynn, "Some Computer Organizations and their Effectiveness", in IEEE
Transactions on Computers, vol. C-21, pp. 948-960, September, 1972.
[8] C. E. Leiserson, Area-Efficient VLSI Computation, Ph.D. Thesis, Dept. of
Computer Science, Carnegie-Mellon University, October 1981.
[9] D. E. Knuth, The Art of Computer Programming, vol. 1: Fundamental
Algorithms. Addison-Wesley, 1969.
[10] T. M. Sabety, B. Mathies and D. E. Shaw, "The Semi-Automatic Generation of
Processing Element Control Paths for Highly Parallel Machines", Proc. 21st
ACM/IEEE Design Automation Conference, Albuquerque, New Mexico, June, 1984.
Figure 1. Interconnection of Two Leiserson Chips
PLEASE NOTE
Our microphotograph of the eight-PE chip was not received in time to be
submitted along with the review copy of this paper. In order to allow the paper to
be reviewed without delay, we have substituted a microphotograph of the previous,
four-PE version of the chip for now, and will replace it with the eight-PE
microphotograph, which should arrive any day, if the paper is accepted for
publication.
We have attached an overlay that should be printed in white over the
microphotograph itself to identify different portions of the chip. Although the two
have not yet been superimposed, it should not be difficult for the referees to
discern the structure of the chip using this separate overlay. We apologize for any
inconvenience resulting from the incomplete state of this figure.
D. E. Shaw and T. M. Sabety
Figure 2. Microphotograph of the NON-VON 3 PPS Chip (Overlay)
Overlay labels: LEFT CHILD PE, ROOT PE, SOLITARY PE, PLA, CONTROL LINE BUFFERS,
DATA PATH, LOCAL RAM, INSTRUCTION REGISTER, RESOLVE CIRCUIT, I/O SWITCH,
RIGHT CHILD PE