
An Eight-Processor Chip for a Massively Parallel Machine

David Elliot Shaw    Theodore M. Sabety

Department of Computer Science, Columbia University

New York, NY 10027    212-280-8100

CXS-13J-34

Abstract

This paper describes a VLSI chip that serves as the basis for a massively parallel tree machine called NON-VON 3. The chip, which is implemented in 3-micron nMOS technology, contains eight 8-bit processing elements (PE's), each embodying 64 bytes of static RAM. Significant features of the design include: an unusually high processor density; a novel I/O switch that allows the machine to dynamically reconfigure to realize several logical communication topologies; logic supporting the pipelining of instructions, both within and among the individual PE's; a shared partial instruction decoder that reduces pinout and area; and a parallel self-testing, dynamically reconfigurable, fault-tolerant RAM that significantly increases both yield and reliability. The design and operation of the chip are discussed, along with its speed, area, and power dissipation characteristics.

I. Introduction

Decreasing integrated circuit device dimensions, coupled with the introduction of simplified, highly structured VLSI design methods, have made feasible the implementation of a single chip containing a number of simple processing elements (PE's). Such chips may be used to construct massively parallel machines containing very large numbers of PE's, and offering considerable computational power in certain applications. Several existing machines are based on such an approach. The ICL DAP [1], for example, contains 4K 1-bit processing elements, while Goodyear's MPP [2] comprises 16K single-bit PE's. Machines containing as many as 10^5 to 10^6 processing elements [3], [4], [5] have recently been proposed for implementation using currently available MOS VLSI technology.

Technical Report, Department of Computer Science, Columbia University, August 1984

In this paper, we describe the organization and operation of an nMOS chip that was designed at Columbia University as the basis for a massively parallel tree-structured machine called NON-VON 3. A simplified version of the NON-VON 3 chip, incorporating most, but not all, of the enhancements to be incorporated in the final design, has recently been fabricated in VLSI.

Among the novel features of the chip are:

1. An unusually high processor density. Eight 8-bit PE's are incorporated in the experimental version that has just been fabricated, each of which embodies 64 bytes of RAM. A production version of the design should offer the advantages of an "intelligent" RAM with only a modest increase in area over the cost of an equivalent amount of ordinary static RAM.

2. Logic supporting dynamic reconfiguration of the machine to implement several inter-PE communication topologies. In particular, the design provides efficient support for tree-structured, linear neighbor, and global communication. In addition, special features are provided for associative multiple-match resolution, and for the sequential enumeration of matching records.

3. Facilities supporting both intra- and inter-PE pipelining of instruction execution. Specifically, the design allows nearly all instructions to be pipelined through successive levels of the tree, yielding an execution speed that is both reasonably fast in absolute terms and independent of the size of the tree. The design also provides for the overlapped fetching and execution of instructions within each PE.

4. A single, shared "pre-decoder" on each chip that partially decodes instructions as they enter the chip. This allows instructions to be encoded in a highly vertical manner for transmission between chips and then translated to a more horizontal instruction format within the chips. The number of pins required for off-chip communication is thus minimized, while the area cost of a substantial portion of the decoding logic is amortized over all eight PE's.

5. A parallel self-testing, dynamically reconfigurable, fault-tolerant RAM, capable of surviving both initial fabrication defects and subsequently appearing failures. This facility, which would not be practical in the absence of massively parallel testing at "power-up" time, should greatly increase both yield (because of the proportion of silicon area expended on local RAM) and reliability (because of the ability to recover from errors appearing in the course of operation).

In order to provide a motivating context for these features, Section II provides a brief overview of the organization of that portion of the NON-VON machine with which we will be concerned in this paper. Section III describes the structure of the chip and the manner in which it is connected to other chips of the same type, while Section IV examines the organization of its constituent PE's. Section V describes the operation of the PPS chip, with particular emphasis on functions supported by novel aspects of the chip architecture. In Section VI, we describe the manner in which the chip was designed and fabricated, and offer a brief discussion of its speed, area, and power dissipation characteristics. The essential import of our investigations is briefly summarized in Section VII.

II. Organization of the NON-VON Primary Processing Subsystem

While the complete NON-VON architecture includes several other modules, this paper is concerned only with that portion which is known as the Primary Processing Subsystem (PPS). Although physically structured as a binary tree, the NON-VON PPS can be dynamically reconfigured to efficiently support communication patterns characteristic of two other topologies, as we shall see in Section V.

The processing elements do not store and independently execute their own programs, but rather simultaneously execute instructions that are broadcast to them by a single control processor (CP), which is attached to the root of the PPS tree. Details of the mechanisms for instruction broadcast will be presented in Section V. While the complete NON-VON architecture supports a number of independent and asynchronous instruction streams, NON-VON 3 executes in a strictly synchronous, single instruction stream, multiple data stream (SIMD) [7] manner.

III. Organization and Interconnection of the PPS Chips

The tree-structured NON-VON 3 PPS is implemented using a single type of custom nMOS chip, called the PPS chip, which contains eight PE's and a single, shared instruction pre-decoder. Our use of a single chip type is made possible by the adoption of a tree-partitioning scheme first suggested by Leiserson [8]. This approach embeds on each chip both a complete subtree (containing 2^c - 1 constituent PE's, for some c depending on device dimensions) and a single internal tree node, which we will call the solitary node. Four nine-bit busses (eight bits for data, and one for control) enter the chip. One, called the T connection, leads to the root of the chip's subtree, while the other three, called the F, L and R connections, attach the single solitary node to its father, left child, and right child, respectively, within the tree.

A simple recursive procedure allows the construction of a complete binary tree of arbitrary size using only chips of this type. This construction is illustrated for the case of two chips in Figure 1. Note that the resulting circuit consists of a larger complete binary subtree (in this case rooted by the solitary node of the chip on the left side of Figure 1) together with a single unconnected solitary node (the solitary node of the chip on the right). This circuit has the same four external connections (T, F, L and R) as did a single chip. This procedure may be repeated recursively to construct a PPS comprising 2^b - 1 PE's, for arbitrarily large b, and leaves only one solitary PE unused, independent of the value of b.
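
The recursive construction can be checked with a small simulation. The following is a minimal sketch in Python, not part of the chip design itself; the names Node, make_chip, build and the default c = 3 (a seven-PE subtree plus one solitary node, matching the eight-PE chip) are illustrative assumptions.

```python
class Node:
    """One PE; children stay None until they are wired up."""
    def __init__(self, left=None, right=None):
        self.left = left
        self.right = right

def complete_tree(depth):
    """Build a complete binary tree containing 2**depth - 1 nodes."""
    if depth == 0:
        return None
    return Node(complete_tree(depth - 1), complete_tree(depth - 1))

def make_chip(c=3):
    """One chip: (root of the on-chip subtree, unconnected solitary node)."""
    return complete_tree(c), Node()

def build(levels, c=3):
    """Combine 2**levels chips; return (subtree root, leftover solitary node)."""
    if levels == 0:
        return make_chip(c)
    left_root, left_solitary = build(levels - 1, c)
    right_root, right_solitary = build(levels - 1, c)
    # The left circuit's solitary node becomes the new root: its L and R
    # connections attach to the T connections of the two sub-circuits.
    left_solitary.left = left_root
    left_solitary.right = right_root
    return left_solitary, right_solitary

def size(node):
    """Number of nodes reachable from node."""
    return 0 if node is None else 1 + size(node.left) + size(node.right)

def depth_if_complete(node):
    """Return the depth of a complete tree, or raise if it is not complete."""
    if node is None:
        return 0
    left, right = depth_if_complete(node.left), depth_if_complete(node.right)
    if left != right:
        raise ValueError("tree is not complete")
    return left + 1
```

Under these assumptions, build(1) corresponds to the two-chip case of Figure 1: a complete subtree of 15 PE's plus one unconnected solitary node.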

Insert Figure 1 (Interconnection of Two Leiserson Chips) here.

An annotated microphotograph of the actual PPS chip is shown in Figure 2. Since all PE's have the same top-level floor plan (differing only in certain lower-level details, as will be discussed shortly), the internal PE structure has been shown only for the solitary PE.

Insert Figure 2 (Microphotograph of the NON-VON 3 PPS Chip) here.

The shared instruction pre-decoder translates the 8-bit vertical format used to transmit instructions between chips into a somewhat more horizontal 11-bit format which is used within the chip. Because this function need be performed only once per chip, and not once per PE, the technique of pre-decoding results in a reduction in the amount of area required for local PLA's without an increase in pinout. Pre-decoding is performed, however, only for the PE's of the on-chip subtree, and not for the solitary PE, since NON-VON's tree pipelining scheme in general causes a different instruction to be executed at each level of the tree at any given point in time. Since the root of the subtree is always at a different level than the solitary PE, the instructions to be executed by these two PE's at any given point in time may differ, and must thus be separately decoded.

Instructions cannot be fully decoded by the shared pre-decoder, since the control lines that are excited to execute a given instruction depend on whether the PE is a leaf or an internal node of the tree, and on whether it is a left or right child of its parent. By "factoring out" that part of the decoding function that does not depend on the position of the PE within the tree, however, pre-decoding results in a reduction of both pinout and area.

IV. Organization of the Processing Element

The NON-VON PE includes a 64 word x 9-bit random access memory; a set of five 8-bit byte registers, called A8, B8, C8, IO8, and MAR; a set of five 1-bit flag registers, called A1, B1, C1, IO1, and EN1; an eight-bit arithmetic logical unit (ALU); a programmable logic array (PLA) used for instruction decoding and other control functions; and two special combinational networks, called the I/O Switch and the RESOLVE circuit. A top-level block diagram of the PE is presented in Figure 3.

Insert Figure 3 (Block Diagram of the Processing Element) here.

With the exception of MAR and EN1, all of the byte and flag registers may be used as general-purpose registers. In addition, however, all ten registers have special functions (in some cases supported by special hardware), which will become clear in the context of our description of the instruction set. It will sometimes be convenient to refer to the pair of registers A8 and A1 as if they were one compound register, to which we will apply the unsuffixed name A; the register pairs B, C, and IO are defined similarly.

EN1 is distinguished as the enable flag. This flag is used to activate and deactivate individual PE's within the PPS. In general terms, only those PE's whose enable flags are asserted will respond to instructions broadcast by the CP. If EN1 is set to 0 in a particular PE, all instructions except one (the ENABLE instruction, discussed below) will be ignored. A number of tricky issues arise in considering the behavior of enabled and disabled PE's, particularly in the case of inter-PE communication operations. The precise semantics of communication between enabled and disabled PE's will be described in Section V.

Two internal buses, called the A bus and the IO bus, are used to transfer data within the PE. Both are capable of transferring either one- or eight-bit data, depending on the instruction being executed. The PE's dual-bus organization is required to support NON-VON's inter-PE communication instructions, which employ one bus to send data to a "neighboring" (in a sense that will soon be described more precisely) PE, and another to receive data concurrently from a different neighbor. The A bus connects all of the registers, the RAM, the ALU, and the I/O Switch. The IO bus is also connected to the ALU and I/O Switch, but is connected to only a subset of the register pairs (specifically, A and IO), which are dual-ported to allow transfers across both buses.

The I/O Switch is a matrix of pass transistors that routes data between the two internal buses and the three inter-PE communication buses (parent, left child, and right child) in the course of executing inter-PE communication instructions. Depending on the particular instruction, these switches may be configured in such a way as to support one of several different logical communication topologies, as discussed in Section V. The RESOLVE circuit is another combinational network that is used to realize a particular instruction that selects a single PE out of a set of "marked" PE's, which in practice have typically been identified through a process of associative matching. The semantics of the RESOLVE instruction will also be described in Section V.

Finally, the local RAM comprises a 64 x 9-bit array of 6-transistor static RAM cells, together with control and decoding circuitry that allows access to one 8- or 1-bit location per instruction cycle. The instruction set supports up to 256 8-bit and 256 1-bit RAM locations. A production version of the chip would in fact include a 256-byte RAM. In order to minimize experimental fabrication costs by reducing die size, however, only 64 9-bit locations, implemented as two arrays of 32 locations each, were included in the current prototype chip.

In a production version containing 256 RAM locations, our measurements (presented in Section VI) indicate that just over half of the die area would be used for RAM cells. Using a simple Poisson model to estimate the incidence of defects, it was determined that the incorporation of fault-tolerant circuitry within the local RAM of each PE would very significantly increase the yield of working PPS chips. Moreover, the massively parallel structure of the NON-VON PPS offered a unique opportunity to recover not only from fabrication defects, but also from failures due to such phenomena as metal migration, which frequently occur in the course of circuit operation.
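
The paper does not spell out its Poisson model, so the following Python sketch is only one plausible form of such an estimate. It assumes defects fall uniformly over the die with a given mean count per chip, that any defect in non-RAM logic kills the chip, and that each RAM array survives zero or one defect (one defect being repaired by its spare column). The function names and all parameter values are hypothetical.

```python
import math

def yield_no_redundancy(mean_defects):
    """Poisson yield: probability that a chip has zero defects."""
    return math.exp(-mean_defects)

def yield_with_spare_columns(mean_defects, ram_fraction, n_arrays):
    """Yield when each RAM array can absorb at most one defect.

    mean_defects: expected defects per chip (assumed value)
    ram_fraction: fraction of die area occupied by RAM (assumed value)
    n_arrays:     number of independently repairable RAM arrays per chip
    """
    # Non-RAM area must be entirely defect-free.
    logic_ok = math.exp(-mean_defects * (1.0 - ram_fraction))
    # Expected defects landing in one RAM array.
    mu = mean_defects * ram_fraction / n_arrays
    # P(0 defects) + P(exactly 1 defect) for a Poisson variable with mean mu.
    array_ok = math.exp(-mu) * (1.0 + mu)
    return logic_ok * array_ok ** n_arrays
```

Treating every defect as a single faulty column is a simplification; it is meant only to show why yield improves as the repairable RAM share of the die grows.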

Our approach is based on the simultaneous testing and dynamic reconfiguration of the local RAM's of all PE's each time the machine is "powered up" or a memory error is detected. In addition to the advantages of dynamic failure recovery, this technique is implemented using a standard nMOS process, obviating the need for laser customization, the incorporation of fusible links, or other special processing of the sort used to manufacture most commercially-available fault-tolerant RAMs.

As in conventional fault-tolerant RAM's, each of the two 32-word RAM arrays is provided with a "spare" column, which may be substituted for any single column which is found to be faulty. A maximum of 16 defective RAM columns may thus be tolerated within each PPS chip (providing, of course, that the faults are distributed perfectly across the RAM arrays, an unlikely event in practice). If any of the RAM arrays embedded in a given chip contains more than one faulty column, however, the chip cannot be reconfigured for correct operation, and must be discarded. In contrast with conventional RAM's, the column to be replaced is selected not by physically reconfiguring the hardware, but by setting a particular flag according to the results of a parallel test program that concurrently tests a common 8-bit location in all PE's. (No fault-tolerance is provided for the 1-bit RAM array.) This process is repeated 64 times to test all locations.
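
The decision made for one RAM array after testing can be sketched as follows. This Python fragment is illustrative only: the actual test patterns and flag encoding are not given in the text, and find_faulty_columns, choose_replacement and WORD_BITS are hypothetical names.

```python
WORD_BITS = 8  # data columns per array (assumed); the spare column is extra

def find_faulty_columns(observed, expected):
    """Return the set of bit positions that ever read back incorrectly.

    observed, expected: equal-length sequences of byte values for the
    locations of one RAM array, as read back and as written.
    """
    faulty = set()
    for wrote, read in zip(expected, observed):
        diff = wrote ^ read
        for bit in range(WORD_BITS):
            if (diff >> bit) & 1:
                faulty.add(bit)
    return faulty

def choose_replacement(faulty):
    """Pick the column to map onto the spare, or None if all columns work.

    Raises ValueError when more than one column is faulty, the case in
    which the text says the chip must be discarded.
    """
    if len(faulty) > 1:
        raise ValueError("more than one faulty column: array cannot be repaired")
    return next(iter(faulty), None)
```

In the machine itself this test runs in every PE at once; the sketch shows the per-array logic only.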

It is worth noting that the dynamic RAM configuration algorithm is critically dependent on the extensive, fine-grained parallelism offered by the NON-VON machine, and would not be practical in a sequential system embodying a comparable amount of RAM. By way of illustration, we consider the case of a production NON-VON machine containing a million PE's, each embodying 256 bytes of local RAM. We assume that 100 NON-VON instructions are required to test each byte of RAM (including tests of its functionality while different values are stored in neighboring cells). At 250 nanoseconds per instruction, approximately 6.4 milliseconds would be required to test the entire machine in parallel, independent of the number of PE's. By way of comparison, a sequential machine having an instruction cycle time of 250 nanoseconds would require nearly two hours to perform the equivalent tests on the same amount of RAM.
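
The two figures follow directly from the stated assumptions, as this short Python calculation shows (the variable names are ours):

```python
BYTES_PER_PE = 256           # local RAM per PE in the production machine
INSTRUCTIONS_PER_BYTE = 100  # assumed test cost per byte
INSTRUCTION_TIME_S = 250e-9  # 250 nanoseconds per instruction
NUM_PES = 1_000_000          # a million PE's

# All PE's test their own RAM at once, so the PE count does not matter.
parallel_s = BYTES_PER_PE * INSTRUCTIONS_PER_BYTE * INSTRUCTION_TIME_S

# A sequential machine must test every byte of every PE in turn.
sequential_s = parallel_s * NUM_PES

print(f"parallel:   {parallel_s * 1e3:.1f} ms")
print(f"sequential: {sequential_s / 3600:.2f} hours")
```

The parallel time is 6.4 milliseconds; the sequential time is 6400 seconds, or about 1.8 hours.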

V. Operation of the PPS Chip

The operation of the NON-VON 3 PPS chip is best understood by reviewing the PE instruction set. The semantics of each instruction are described below, along with the set of permissible operands, where appropriate.

INSTRUCTION                           SEMANTICS

MOV8 <byte reg 1> <byte reg 2>        <byte reg 2> <- <byte reg 1>
    <byte reg> = {A8, B8, C8, MAR, IO8}

MOV1 <bit reg 1> <bit reg 2>          <bit reg 2> <- <bit reg 1>
    <bit reg> = {A1, B1, C1, EN1, IO1}

The MOV8 and MOV1 instructions are used to transfer data between byte registers and between bit registers, respectively, within the PE.

READRAM8 <byte reg>                   <byte reg> <- RAM8 (MAR)
WRITERAM8 <byte reg>                  RAM8 (MAR) <- <byte reg>
    <byte reg> = {A8, B8, C8 or IO8}

READRAM1 <bit reg>                    <bit reg> <- RAM1 (MAR)
WRITERAM1 <bit reg>                   RAM1 (MAR) <- <bit reg>
    <bit reg> = {A1, B1, C1 or IO1}

INCREMENT                             MAR <- MAR + 1

The READRAM and WRITERAM instructions are used to transfer data between a register and the RAM location whose address is stored in the "incrementing memory address register", MAR. The INCREMENT instruction adds one to the address stored in the MAR. (The MAR may also be "auto-incremented" by executing a string loading or string matching instruction, both of which will be discussed later in this section.)

ADD <byte reg>                        C8 <- (<byte reg> + A8 + C1); C1 <- carry
SUB <byte reg>                        C8 <- (A8 - <byte reg> - C1); C1 <- borrow
COMPARE <byte reg>                    if <byte reg> = A8 then A1 <- 1 else A1 <- 0;
                                      if <byte reg> > A8 then B1 <- 1 else B1 <- 0
    <byte reg> = {B8, IO8, MAR, or RAM}

The ADD, SUB and COMPARE instructions may be used to perform arithmetic and comparison operations on two 8-bit operands. The carry bit must be cleared before these instructions are initiated. The results of a COMPARE are stored in the A1 and B1 flags.
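
The register-transfer semantics above can be mirrored in a few lines of Python. This is a behavioral sketch only; it assumes unsigned 8-bit operands, and the function names are ours rather than part of the instruction set.

```python
def add(a8, operand, c1):
    """ADD: return (C8, C1) where C8 = operand + A8 + C1 and C1 = carry out."""
    total = operand + a8 + c1
    return total & 0xFF, total >> 8

def sub(a8, operand, c1):
    """SUB: return (C8, C1) where C8 = A8 - operand - C1 and C1 = borrow."""
    diff = a8 - operand - c1
    return diff & 0xFF, 1 if diff < 0 else 0

def compare(a8, operand):
    """COMPARE: return (A1, B1); A1 = (operand == A8), B1 = (operand > A8)."""
    return (1 if operand == a8 else 0, 1 if operand > a8 else 0)
```

Because C1 feeds back into the next ADD, multi-byte addition can chain byte by byte: adding 0x0001 to 0x01FF gives add(0xFF, 0x01, 0) for the low byte, and the resulting carry is passed to add(0x01, 0x00, carry) for the high byte. This is also why the carry bit must be cleared before the first operation.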

ROTRB                                 Rotate B right 1 bit
ROTLB                                 Rotate B left 1 bit

The B8 and B1 registers contain logic enabling them to function together as a 9-bit circular shift register. Specifically, ROTRB shifts all but the low-order bit of B8 into the next lowest bit position within B8, the low-order bit of B8 is moved into B1, and the value previously stored in B1 is moved into the high-order bit of B8. ROTLB performs an analogous left circular shift. The B register can thus be used to transfer information between the 8- and 1-bit data paths.
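
A short Python model makes the 9-bit rotation concrete. ROTRB follows the description above; the ROTLB behavior is our reading of "analogous" (high-order bit of B8 into B1, old B1 into the low-order bit) and should be taken as an assumption.

```python
def rotrb(b8, b1):
    """Rotate the 9-bit B register right: returns the new (B8, B1)."""
    new_b8 = (b8 >> 1) | (b1 << 7)  # old B1 enters the high-order bit
    new_b1 = b8 & 1                 # old low-order bit of B8 moves to B1
    return new_b8, new_b1

def rotlb(b8, b1):
    """Rotate the 9-bit B register left (assumed mirror image of ROTRB)."""
    new_b8 = ((b8 << 1) & 0xFF) | b1  # old B1 enters the low-order bit
    new_b1 = (b8 >> 7) & 1            # old high-order bit of B8 moves to B1
    return new_b8, new_b1
```

Nine successive rotations in either direction restore the original contents, and eight ROTRB steps interleaved with reads of B1 serialize a byte onto the 1-bit data path.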

LOGICAL8 <operation>                  C8 <- (A8 <operation> B8)
LOGICAL1 <operation>                  C1 <- (A1 <operation> B1)

where <operation> is a four-bit code specifying one of the sixteen possible boolean functions of two variables. LOGICAL8 applies the specified operation in a bitwise fashion to all eight bits of its operands. Special cases of the LOGICAL1 instruction include SET, CLEAR, NEGATE, AND, OR, XOR, EQU, NAND, NOR and NOP. LOGICAL1 may be used to combine the results of a COMPARE instruction to test all six possible arithmetic relational predicates (EQ, NE, GT, LT, GE and LE) on two 8-bit operands.
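
One natural way to realize sixteen boolean functions with a four-bit code is to treat the code as a truth table. The Python sketch below uses that encoding as an assumption (the actual NON-VON opcode bit assignment is not given in the text) and shows how the COMPARE flags combine into relational predicates.

```python
def logical1(op, a, b):
    """Apply 4-bit truth table op to bits a and b.

    Assumed encoding: bit (2*a + b) of op holds the result for inputs (a, b).
    """
    return (op >> (2 * a + b)) & 1

def logical8(op, a8, b8):
    """Apply the same truth table independently to all eight bit positions."""
    result = 0
    for i in range(8):
        result |= logical1(op, (a8 >> i) & 1, (b8 >> i) & 1) << i
    return result

# Truth-table codes under the assumed encoding.
AND, OR, XOR, NOR = 0b1000, 0b1110, 0b0110, 0b0001

def less_than(a1, b1):
    """operand < A8 holds when COMPARE set neither A1 (equal) nor B1 (greater)."""
    return logical1(NOR, a1, b1)

def greater_or_equal(a1, b1):
    """operand >= A8 holds when COMPARE set A1 (equal) or B1 (greater)."""
    return logical1(OR, a1, b1)
```

EQ and GT are read directly from A1 and B1, and NE and LE are their negations, which accounts for all six predicates.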

SEND8 <PE> <byte reg>                 IO8 (<PE>) <- <byte reg>
SEND1 <PE> <bit reg>                  IO1 (<PE>) <- <bit reg>
RECV8 <PE> <byte reg>                 <byte reg> <- IO8 (<PE>)
RECV1 <PE> <bit reg>                  <bit reg> <- IO1 (<PE>)
    <byte reg> = {A8, B8, C8, MAR}
    <bit reg> = {A1, B1, C1, EN1}
    <PE> = {LC, RC, LN, RN} for SEND instructions,
           {LC, RC, LN, RN, PR} for RECV instructions.

The SEND and RECV instructions are used to transfer data in parallel not only between PE's that are physically adjacent within the PPS, but also between two PE's that are adjacent in a particular linear sequence defined by an inorder traversal [9] of the nodes of the PPS tree. In both cases, data is transferred between some register in the PE in which the instruction is executed and the IO register within some neighboring PE. It is always possible to RECV data from a PE, regardless of whether it is enabled, but an attempt to SEND data to a disabled PE will not result in a transfer of data.

In the case of tree communication, the neighboring PE is chosen to be either the left child (LC), right child (RC), or parent (PR). (PR is disallowed as the operand of a SEND instruction, however, since the semantics of this operation would be ill-defined in the case where both children simultaneously attempted to transfer information to their common parent.) In the case of linear neighbor communication, the neighboring PE is chosen to be either the left linear neighbor (LN) or right linear neighbor (RN), which respectively designate the PE's occurring immediately before and immediately after (in inorder traversal order) the PE executing the instruction.
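
The linear-neighbor relation can be illustrated on a small complete binary tree. In this Python sketch the PE's are numbered in heap order (root = 1, children of k are 2k and 2k + 1); that numbering and the function names are illustrative assumptions, not part of the design.

```python
def inorder(node, n):
    """Yield heap indices of an n-node complete binary tree in inorder."""
    if node > n:
        return
    yield from inorder(2 * node, n)      # left subtree first
    yield node                           # then the node itself
    yield from inorder(2 * node + 1, n)  # then the right subtree

def linear_neighbors(n):
    """Return (order, LN, RN): the inorder sequence and neighbor maps."""
    order = list(inorder(1, n))
    ln = {pe: (order[i - 1] if i > 0 else None) for i, pe in enumerate(order)}
    rn = {pe: (order[i + 1] if i + 1 < len(order) else None)
          for i, pe in enumerate(order)}
    return order, ln, rn
```

For a seven-PE tree the sequence is 4, 2, 5, 1, 6, 3, 7, so the root's left linear neighbor is PE 5 and its right linear neighbor is PE 6, neither of which is physically adjacent to it in the tree.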

BROADCAST8 <byte reg> <byte>          <byte reg> <- <byte>
BROADCAST1 <bit reg> <bit>            <bit reg> <- <bit>
REPORT8 <byte reg>                    logical reg. GC8 (in CP) <- <byte reg>
REPORT1 <bit reg>                     logical reg. GC1 (in CP) <- <bit reg>
    <byte reg> = {A8, B8, C8, MAR, IO8}
    <bit reg> = {A1, B1, C1, EN1, IO1}

The BROADCAST instructions are used to transfer a single data value from the control processor into a specified destination register within all enabled PE's simultaneously. The REPORT instructions, on the other hand, are defined only when exactly one PE is enabled, and result in the transfer of data from the specified register within that PE to a particular "logical register" within the control processor, which is called GC.

RESOLVE                               A1 <- 0 in all PE's except "first" PE where A1 = 1;
                                      if no PE has A1 = 1 then logical register R1 (in CP) <- 0 else R1 <- 1

In many NON-VON programs, a subset of the PE's, often identified by associative matching, is "marked" by storing a 1 in one of the flag registers or one-bit RAM locations. The RESOLVE instruction can be used to identify exactly one of the members of a set of PE's that have previously been marked using the A1 flag. After execution of a RESOLVE, the A1 register is reset to zero in all PE's except the one that occurs first in inorder traversal order. The RESOLVE instruction is frequently used in conjunction with REPORT to read data into the control processor from each one of a set of PE's in turn.
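
The RESOLVE/REPORT idiom can be simulated on the control-processor side as follows. This Python sketch represents the PE's as lists already arranged in inorder sequence; the loop structure and function names are assumptions about how such a program might be organized, not a transcription of actual NON-VON code.

```python
def resolve(a1_flags):
    """Model RESOLVE: keep only the first set A1 flag; return (flags, R1)."""
    new_flags = [0] * len(a1_flags)
    if 1 in a1_flags:
        new_flags[a1_flags.index(1)] = 1
        return new_flags, 1
    return new_flags, 0

def enumerate_marked(marks, data):
    """Read the data byte of each marked PE in turn, in inorder sequence."""
    remaining = list(marks)
    reported = []
    while True:
        a1, r1 = resolve(remaining)
        if r1 == 0:                    # no marked PE's are left
            break
        winner = a1.index(1)           # the single PE selected by RESOLVE
        reported.append(data[winner])  # REPORT8 from that one enabled PE
        remaining[winner] = 0          # clear its mark before the next pass
    return reported
```

Each pass costs a fixed number of instructions, so enumerating k matching records takes time proportional to k rather than to the number of PE's.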

ENABLE                                EN1 <- 1 in all PE's, including those previously disabled

ENABLE is the only instruction that is executed by all PE's, whether or not they are already enabled. It is used to set all of the EN1 registers to 1, thus "awakening" all PE's in the PPS after some subset have been disabled.

STRINGBROADCAST <length> <string>
STRINGREPORT <length>
STRINGCOMPARE <length> <string>

The semantics of these three operations are described below.

The three string operations use the auto-increment capability of the MAR to perform highly efficient loading, unloading, and matching operations on successive locations in RAM. The appropriate opcode is broadcast to the PPS only once per string. Simultaneously, the control processor raises a special control wire, called ISYNC, which causes all enabled PE's to enter a special state in which they are prepared to accept a string of data bytes without intervening opcodes, incrementing their respective MAR's after each byte. While the last byte is being broadcast, the control processor lowers ISYNC to cause all PE's to exit this special state.

The STRINGBROADCAST instruction transfers a common string from the control processor into the local RAM's of all enabled PE's, starting at the location specified by their respective MAR's. (In the case of applications involving records having variable-length fields, these starting addresses will not in general be the same in all PE's.) STRINGREPORT functions in a similar manner, but is used to transfer a string into the control processor from a single enabled PE. STRINGCOMPARE compares a string broadcast by the control processor in parallel against those stored in all enabled PE's. If a data byte is broadcast that fails to match the corresponding byte stored within some PE, the EN1 flag is set to 0 in that PE, disabling it for the remainder of the matching process. At the end of the STRINGCOMPARE instruction, only those PE's containing a matching string are left enabled.
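
The effect of STRINGCOMPARE can be modeled with a short Python simulation. The dictionary layout for a PE and the choice to advance the MAR even on the mismatching byte are assumptions made for illustration; the inner loop stands in for what the hardware does in all PE's simultaneously.

```python
def string_compare(pes, pattern):
    """Disable every enabled PE whose RAM does not match pattern at its MAR.

    pes: list of dicts with keys 'ram' (bytes), 'mar' (int), 'en1' (0 or 1).
    pattern: bytes broadcast by the control processor, one per step.
    """
    for byte in pattern:              # one broadcast byte per step
        for pe in pes:                # conceptually parallel across PE's
            if not pe['en1']:
                continue              # disabled PE's ignore the broadcast
            if pe['ram'][pe['mar']] != byte:
                pe['en1'] = 0         # mismatch: drop out of the match
            pe['mar'] += 1            # auto-increment after each byte
    return pes
```

Each PE may start from a different MAR value, which is how records with variable-length fields can be matched against the same broadcast string.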

Most of the instructions in the NON-VON instruction set are expected to require approximately 250 nanoseconds for execution, independent of the size of the PPS tree. This rate is attained by pipelining the broadcast of all but a few of the instructions, latching each instruction at every level within the tree. Data broadcast to all PE's from the control processor is also pipelined down the tree. The linear neighbor communication instructions, however, along with RESOLVE and REPORT, make global use of the PPS, and are not pipelined. The pipeline must be "flushed" before these instructions are executed, requiring a longer instruction period. The exact length of this period will depend logarithmically on the number of PE's embodied in the particular NON-VON configuration at hand. In addition to NON-VON's level-by-level tree pipelining scheme, the design provides for the overlapped fetching and execution of instructions within each PE.

VI. Design and Fabrication

The datapath in each PE is identical, and was laid out manually using a graphical editing system to achieve the highest possible packing density. Four different variants of the control path were necessary, however, to accommodate the somewhat different behavior of leaves and internal nodes, and of left and right children. In order to produce these variants without introducing human error, and in order to shorten the development cycle to allow extensive experimentation with alternative instruction sets, a semi-automatic layout program called PLATO [10] was developed. PLATO accepts as input a set of instruction opcodes, together with associated control information, and produces as output a functionally correct, highly area-efficient set of PLA's. Among the novel aspects of the program are the use of a channel routing algorithm to generate a Weinberger Array layout for the PLA, and the generation, from a single input description, of different PLA variants corresponding to processing elements serving different functions within the machine.

The chip was implemented using 3 micron nMOS design rules, and was fabricated through the MOSIS "silicon brokerage" system, which is operated by the Information Sciences Institute under contract with the Defense Advanced Research Projects Agency. The chip contains about 65,000 transistors, and has a functional pinout of 53. The die measures 8.1 x 7.4 mm, of which approximately 19% is occupied by RAM. A production version of the chip containing 256 bytes of local RAM per PE would thus require only about 2.1 times as much area as would be occupied by an equivalent amount of ordinary static RAM. The chip dissipates approximately 34 watts of power, requiring the use of finned packages and forced air cooling.
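
One way to arrive at a ratio close to the quoted 2.1 is sketched below in Python. It assumes that RAM area scales linearly with capacity and that the non-RAM area of the die is unchanged in the production version; both are our simplifying assumptions rather than statements from the measurements.

```python
PROTOTYPE_RAM_FRACTION = 0.19  # share of the prototype die occupied by RAM
PROTOTYPE_BYTES_PER_PE = 64
PRODUCTION_BYTES_PER_PE = 256

# Work in units of the prototype die area.
scale = PRODUCTION_BYTES_PER_PE / PROTOTYPE_BYTES_PER_PE  # 4x the RAM
ram_area = PROTOTYPE_RAM_FRACTION * scale                 # assumed linear growth
other_area = 1.0 - PROTOTYPE_RAM_FRACTION                 # assumed unchanged

production_area = ram_area + other_area
ratio_to_plain_ram = production_area / ram_area

print(f"production die / equivalent RAM area = {ratio_to_plain_ram:.2f}")
```

Under these assumptions the ratio is about 2.07, in line with the figure of roughly 2.1 given above.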

VII. Conclusion

The NON-VON 3 PPS chip achieves a very high processor density while still attaining high execution speeds on those operations that appear to be most important in massively parallel SIMD tree algorithms. A production version of the chip would form the basis for an "intelligent memory" costing only about twice as much as an equivalent amount of ordinary static RAM, but incorporating an eight-bit processing element along with every 256 bytes of storage. A number of novel architectural features are incorporated in the design, many of which should prove applicable to other massively parallel machines.

Acknowledgement

Many members of Columbia's NON-VON Project contributed to the work described in this paper. The authors would like to give particular acknowledgement, however, to the major contributions of John K. Lai, Brian Mathies, and Robert Montay, who had principal responsibility for the NON-VON 3 PPS chip layout, and for the development of some of the software tools used in its generation.

This research was supported in part by the Defense Advanced Research Projects Agency under contract N00039-80-G-0132, and in part by the New York State Center for Advanced Technology in Computers and Information Systems at Columbia University.

References

[1] P. M. Flanders, D. J. Hunt, S. F. Reddaway, and D. Parkinson, "Efficient high speed computing with the Distributed Array Processor", in High Speed Computer and Algorithm Organization. New York: Academic, 1977, pp. 113-128.

[2] J. L. Potter, "Image processing on the Massively Parallel Processor", Computer, vol. 16, no. 1, Jan. 1983.

[3] D. E. Shaw, "The NON-VON Supercomputer", Technical Report, Columbia Computer Science Department, August, 1982.

[4] W. D. Hillis, "The Connection Machine", Technical Memo, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, September, 1981.

[5] T. Kondo, T. Nakashima, M. Aoki and T. Sudo, "An LSI adaptive array processor", IEEE Journal of Solid-State Circuits, vol. SC-18, no. 2, pp. 147-156, Aug. 1983.

[6] D. E. Shaw, "SIMD and MSIMD variants of the NON-VON supercomputer", Proc. COMPCON '84, San Francisco, California, Feb. 1984.

[7] M. Flynn, "Some Computer Organizations and their Effectiveness", in IEEE Transactions on Computers, vol. C-21, pp. 948-960, September, 1972.

[8] C. E. Leiserson, Area-Efficient VLSI Computation, Ph.D. Thesis, Dept. of Computer Science, Carnegie-Mellon University, October 1981.

[9] D. E. Knuth, The Art of Computer Programming, vol. 1: Fundamental Algorithms. Addison-Wesley, 1969.

[10] T. M. Sabety, B. Mathies and D. E. Shaw, "The Semi-Automatic Generation of Processing Element Control Paths for Highly Parallel Machines", Proc. 21st ACM/IEEE Design Automation Conference, Albuquerque, New Mexico, June, 1984.

Figure 1. Interconnection of Two Leiserson Chips

PLEASE NOTE

Our microphotograph of the eight-PE chip was not received in time to be submitted along with the review copy of this paper. In order to allow the paper to be reviewed without delay, we have substituted a microphotograph of the previous, four-PE version of the chip for now, and will replace it with the eight-PE microphotograph, which should arrive any day, if the paper is accepted for publication.

We have attached an overlay that should be printed in white over the microphotograph itself to identify different portions of the chip. Although the two have not yet been superimposed, it should not be difficult for the referees to discern the structure of the chip using this separate overlay. We apologize for any inconvenience resulting from the incomplete state of this figure.

D. E. Shaw and T. M. Sabety

Figure 2. Microphotograph of the NON-VON 3 PPS Chip (Overlay). Overlay labels: Left Child PE, Root PE, Right Child PE, Solitary PE, PLA, Control Line Buffers, Data Path, Local RAM, Instruction Register, RESOLVE Circuit, I/O Switch.
