[IEEE 2008 International Conference on Application-Specific Systems, Architectures and Processors...

RECONNECT: A NoC for polymorphic ASICs using a Low Overhead Single Cycle Router

Nimmy Joseph, Ramesh Reddy C, Keshavan Varadarajan, Mythri Alle, Alexander Fell, S K Nandy
CAD Lab, SERC, Indian Institute of Science, Bangalore

{jnimmy, crreddy, keshavan, mythri, alefel, nandy}@cadl.iisc.ernet.in

Ranjani Narayan
Morphing Machines, Bangalore, India

[email protected]

Abstract

A polymorphic ASIC is a runtime reconfigurable hardware substrate comprising compute and communication elements. It is a "future proof" custom hardware solution for multiple applications and their derivatives in a domain. Interoperability between application derivatives at runtime is achieved through hardware reconfiguration. In this paper we present the design of a single cycle Network on Chip (NoC) router that is responsible for effecting runtime reconfiguration of the hardware substrate. The router design is optimized to avoid FIFO buffers at the input port and loop back at the output crossbar. It provides virtual channels to emulate a non-blocking network and supports a simple X-Y relative addressing scheme to limit the control overhead to 9 bits per packet. The 8×8 honeycomb NoC (RECONNECT), implemented in the 130nm UMC CMOS standard cell library, operates at 500MHz and has a bisection bandwidth of 28.5GBps. The network is characterized for random, self-similar and application specific traffic patterns that model the execution of multimedia and DSP kernels with varying network loads and virtual channels. Our implementation with 4 virtual channels has an average network latency of 24 clock cycles and a throughput of 62.5% of the network capacity for random traffic. For application specific traffic the latency is 6 clock cycles and the throughput is 87% of the network capacity.

1 Introduction

Emerging embedded applications in domains such as multimedia, networking, security, bioinformatics, etc. are implemented on ASICs and bound to that specific hardware platform. Frequent exchange or later upgrade of these applications and their derivatives calls for flexible hardware solutions. While ASICs offer significant power and performance advantages over programmable solutions, the NRE costs of ASICs can only be amortized by high production volumes. A flexible and reprogrammable ASIC would be "future proof" and cost effective, since a single ASIC can then support multiple applications and their derivatives. We refer to such a flexible ASIC as a polymorphic ASIC. In FPGAs, switching applications at runtime cannot be achieved easily because of the latency of the configuration reload required at every application switch. A polymorphic ASIC (an abstract model is shown in Figure 1), on the other hand, is a programmable multiprocessor on a chip, where both the data path and the control path are "redefined" during execution to meet the performance requirements of applications. A polymorphic ASIC is a hardware substrate comprising a regular array of tiles. A tile consists of a Compute Element (CE) and a network router. These tiles are connected through a high performance Network on Chip (NoC) which can be reconfigured at runtime.

Figure 1. Abstract model of polymorphic ASIC (Metadata Orchestration Support Logic alongside an Execution Fabric of Compute Elements joined by a Programmable Interconnect)

An application specified in a high level language is fragmented into substructures called HyperOps [7]. A HyperOp is a subgraph of the application data flow graph. At runtime the CEs are programmed according to the operations represented by the nodes of the subgraph. Since several CEs of the polymorphic ASIC are aggregated at runtime to map the HyperOp and its operations, the aggregate is equivalent to customized hardware (an ASIC) for the HyperOp.

1-4244-1898-5/08/$20.00 ©2008 IEEE

The polymorphic ASIC can execute several scheduled HyperOps simultaneously, if sufficient CEs are available. The Metadata Orchestration Logic is responsible for selecting HyperOps that are ready to execute, and for loading them onto a set of CEs. The set of CEs assigned to a HyperOp follows static dataflow semantics, while the Metadata Orchestration Logic supports dynamic dataflow semantics among multiple HyperOps. Therefore a polymorphic ASIC is akin to a dynamic dataflow machine.

The data paths of an ASIC are predefined and static, whereas the paths of a polymorphic ASIC are realized through the NoC. This gives the flexibility to change paths with every newly loaded HyperOp, bought at the cost of increased latency. Since the edges between operations of a HyperOp represent communication between CEs, the physical mapping of a HyperOp onto the CEs in the execution fabric must meet specific communication criteria such as minimum distances. This ensures a low hop count and a lower probability of congestion for each packet routed through the NoC.

The rest of the paper is organized as follows. Section 2 describes the NoC features and design considerations with respect to topology, interconnects, router and its flow control mechanism. Simulation and synthesis results are presented in Section 3. Section 4 concludes the paper.

Figure 2. Proposed Network on Chip (8×8 array of tiles, Tile = CE + Router; the HyperOp Launcher receives 48 × 22 = 1056 bits, Payload = 48 bits, from the HyperOp Store and drives the Express Lanes, labelled with bundle widths of 57, 114 (57 × 2) and 399 (57 × 7) wires, where 57 = Header + 48)

2 RECONNECT: The NoC

The NoC is a regular interconnection of tiles. Each tile is an aggregation of elementary hardware resources comprising compute and communication elements (routers). NoC efficiency depends on the topology and the router performance, while the design trade-offs are silicon area, power and delay. In this paper we explain the proposed NoC architecture and the router for communication among tiles. Over-design of the router is avoided since polymorphic ASICs are designed for specific application domains with known bounds on latency and throughput.

To exploit the instruction level parallelism exhibited by the HyperOps, multiple operations are performed by different CEs simultaneously. Depending on the instruction level parallelism, a HyperOp is split into several subsets called p-HyperOps [7], where each p-HyperOp is a set of instructions assigned to a CE. This allows parallel execution of multiple instructions, which in turn necessitates parallel transfer of instructions to the polymorphic ASIC.

Our implementation consists of 64 tiles arranged in an 8×8 regular structure connected through a honeycomb topology as shown in Figure 2. The CE of a tile interfaces to the router through a Network Interface (NI) unit. The router in a tile has 4 ports, of which one is connected to the NI of the tile and the remaining three are connected to the adjacent routers (except for the routers at the periphery). Routers at the periphery of the honeycomb have at least one port free; we term these ports launching ports. There are 22 launching ports in an 8×8 honeycomb topology through which instructions can be loaded in parallel to the CEs. The HyperOps are stored in the HyperOp store, which is an on-chip memory. The HyperOp launcher acts as an interface between the launching ports and the HyperOp store. It has 22 NI units for parallel transfer of 22 packets. Topology, NoC interconnects and router are explained in the following sections.

2.1 Topology and Interconnects

The honeycomb topology is chosen as the interconnection network on the fabric since it has a lower degree per node than a 2-D mesh [9]. This reduces the complexity and area of the network router. A detailed comparison of the honeycomb and mesh topologies is provided in [10].

The interconnects are divided into two logical sets. The first set of interconnections is called Express Lanes (Figure 2). These facilitate parallel instruction transfer from the HyperOp launcher to boundary tiles and reduce the load on the internal network. If the destination is not a boundary tile, the instructions are routed from the boundary tiles to the destination CEs through the internal network. The second set of wires connects the tiles and is used for inter-CE data transfer in addition to the instruction transfer to the internal CEs from boundary tiles. Thus RECONNECT is better suited to polymorphic ASICs than some of the other NoC architectures proposed [5, 6].

For an instruction width of 57 bits (Figure 3), a total of 1254 (57 × 22) wires are required for routing 22 parallel Express Lanes. Using metal 6 wires of pitch 480 nm in the 65nm CMOS technology node, the total width of the Express Lanes is 0.6019mm [2]. This is permissible for a 64 tile (8×8) fabric where each tile size is 3mm², resulting in a die size of 196mm² (14mm×14mm). Long wires used with repeaters can support up to 2 GHz for a 14mm×14mm fabric in the 65nm CMOS technology node and more than 2 GHz in the 130nm CMOS technology node, as global wire delay does not scale [1]. Therefore routing long wires between the HyperOp store and the launching ports is not the frequency limiting factor for NoCs operating below 2GHz.
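The wire budget quoted above can be checked with a few lines of arithmetic. The sketch below uses only the figures from the text (57-bit packets, 22 launching ports, 480 nm pitch); the function names are illustrative, and it assumes one wire track per bit with all tracks laid side by side on a single metal layer.

```python
# Express Lane wiring budget, using only figures quoted in the text.
PACKET_BITS = 57        # instruction/packet width (Figure 3)
LAUNCHING_PORTS = 22    # one Express Lane per launching port
WIRE_PITCH_NM = 480     # metal 6 pitch quoted for the 65nm node


def express_lane_wires(packet_bits=PACKET_BITS, lanes=LAUNCHING_PORTS):
    """Total wire count, assuming one wire per packet bit per lane."""
    return packet_bits * lanes


def express_lane_width_mm(wires, pitch_nm=WIRE_PITCH_NM):
    """Total routing width in mm, assuming all wires sit side by side."""
    return wires * pitch_nm * 1e-6  # nm -> mm


wires = express_lane_wires()
width_mm = express_lane_width_mm(wires)
print(wires, round(width_mm, 4))
```

Under these assumptions the result reproduces the 1254 wires and roughly 0.60 mm quoted in the paragraph, which is small against the 14 mm die edge.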

2.2 Router

In a typical ASIC, the functionality of CEs and routers is integrated into a monolithic hardware element. In a polymorphic ASIC, the functionality of CEs and routers is decoupled to facilitate reconfiguration. At gigascale levels of integration, an on-chip crossbar interconnect is not a viable solution. Since the router is just a data forwarder and does not contribute to logic, the router overhead should be as low as possible (within the limits of meeting the traffic requirements of a given application).

The different components of the router are shown in Figure 4. It has four input ports and four output ports. Of these, three are connected to the adjacent routers and the fourth is connected to the Network Interface (NI) unit of the CE. The NI is connected to the east port if that port is not connected to an adjacent router; otherwise the NI is connected to the west port. The router consists of input buffers, input and output arbiters, and one to four Mux and Demux units.

Two levels of arbitration are used in the router. The first level is the virtual channel arbitration at each input port, which selects one of the virtual channels. If there are K virtual channels at the input port, a K:1 arbiter is needed. The second level of arbitration is for the output port (Output Arbiter). Since the router has four input ports and loop back is not provided, a 3:1 arbiter is used at the output, reducing the router area.

Flow Control Mechanism: Network throughput can be increased by dividing the buffer storage associated with each network channel into several virtual channels [3]. Each physical channel is associated with several small queues, called virtual channels, rather than a single long queue. The virtual channels associated with one physical channel are allocated independently but compete with each other for physical bandwidth. We use a virtual channel flow control mechanism at each input port to increase throughput and to prevent head of line blocking.

Figure 3. Packet format (57 bits: New Data Indicator, X Relative Address, Y Relative Address and a 48-bit Payload; bit positions marked at 0, 1, 5, 9 and 56)

The packet size (Figure 3) is taken as the flit¹ size. Network bandwidth between two adjacent routers is equal to the packet size. Splitting the packet into multiple flits would make the network inefficient in our case, as the CE would be idle until the whole packet is received. Further, it would require additional logic circuitry for splitting the packets into flits and for their reassembly at the destination.
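To make the single-flit packet concrete, the following sketch packs and unpacks a 57-bit word. The field widths (1 + 4 + 4 + 48 bits) follow from the text and Figure 3, and Bit 0 is treated as the MSB as stated in the flow control description. The exact order of the X and Y fields is an assumption here (X in bits 1 to 4, Y in bits 5 to 8), and the function names are hypothetical.

```python
PAYLOAD_BITS = 48
ADDR_BITS = 4
PACKET_BITS = 1 + 2 * ADDR_BITS + PAYLOAD_BITS  # 57 bits in total


def _signed4(value):
    """Interpret a 4-bit field as a 2's complement number (-8..7)."""
    return value - 16 if value & 0x8 else value


def pack_packet(new_data, x_rel, y_rel, payload):
    """Build a 57-bit packet; Bit 0 of the paper is the MSB here.

    Assumed layout (MSB first): indicator, X address, Y address, payload.
    """
    if not (-8 <= x_rel <= 7 and -8 <= y_rel <= 7):
        raise ValueError("relative addresses must fit in 4-bit 2's complement")
    if not 0 <= payload < (1 << PAYLOAD_BITS):
        raise ValueError("payload must fit in 48 bits")
    return (((new_data & 1) << 56) | ((x_rel & 0xF) << 52)
            | ((y_rel & 0xF) << 48) | payload)


def unpack_packet(word):
    """Split a 57-bit packet back into its four fields."""
    new_data = (word >> 56) & 1
    x_rel = _signed4((word >> 52) & 0xF)
    y_rel = _signed4((word >> 48) & 0xF)
    payload = word & ((1 << PAYLOAD_BITS) - 1)
    return new_data, x_rel, y_rel, payload


word = pack_packet(1, -3, 2, 0xABCDEF)
print(hex(word), unpack_packet(word))
```

The header occupies 9 of the 57 bits, which matches the control overhead quoted in the abstract.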

Each input port has a port_status_out bit (Figure 4) indicating whether one of the virtual channels is free. The synchronization between two adjacent routers is based on this bit. In the router, the received packets are stored in one of the free virtual channels in a round robin fashion. The router outputs a packet only if one of the virtual channels of the adjacent router is free. Otherwise, the packet is blocked at the input of the current router. Thus packets are not dropped in the network. The non-availability of virtual channels in the adjacent router increases the average latency of the overall network. Once a packet is sent to an adjacent router, the corresponding virtual channel status of the current router is updated by the Port Update Logic to indicate that it is free to receive another packet. A toggle in the MSB (Bit 0) of the received packet indicates a new packet. A packet is stored only if its MSB is different from that of the previous packet.
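The handshake described above can be modelled behaviourally. The sketch below is not the RTL; it is a minimal Python model of one input port, assuming the New Data Indicator resets to 0 and that a released channel is marked free immediately. The class and method names are hypothetical.

```python
class InputPort:
    """Behavioural model of one router input port with K virtual channels."""

    def __init__(self, num_vcs=4):
        self.vcs = [None] * num_vcs   # None marks a free virtual channel
        self.last_indicator = 0       # assumed reset value of the toggle bit
        self.next_vc = 0              # round robin pointer for storing packets

    @property
    def port_status_out(self):
        """True while at least one virtual channel is free."""
        return any(vc is None for vc in self.vcs)

    def receive(self, indicator, packet):
        """Store the packet only if the toggle bit changed and a VC is free."""
        if indicator == self.last_indicator or not self.port_status_out:
            return False
        n = len(self.vcs)
        for offset in range(n):
            idx = (self.next_vc + offset) % n
            if self.vcs[idx] is None:
                self.vcs[idx] = packet
                self.next_vc = (idx + 1) % n
                self.last_indicator = indicator
                return True
        return False

    def release(self, idx):
        """Forward the packet held in a VC and mark the VC free again."""
        packet, self.vcs[idx] = self.vcs[idx], None
        return packet


port = InputPort(num_vcs=2)
print(port.receive(1, "a"), port.receive(1, "a"), port.receive(0, "b"))
```

In the design described in the text the sender checks the neighbour's status bit before transmitting, so the "no free channel" branch in `receive` is only a safeguard in this model.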

Routing Algorithm: The routing algorithm is described with respect to Figure 4. Packets are routed along the shortest path to the destination. The honeycomb topology has horizontal links on every alternate node. Therefore the routing algorithm prioritizes horizontal links over vertical ones (Figure 2). The output request signal (output_req_from_vc) generated from the packets in the virtual channels indicates the direction in which the packet has to travel. The K:1 virtual channel arbiter at each input port selects one of the virtual channels for which the requested adjacent router input port is free and sends the corresponding request (req_from_east etc.) to the Request Decoder, which generates decoded_req for the Output Arbiter.

Figure 4. Data and control path scheme of the router with four virtual channels (VC) (each of the inputs north_in, south_in, east_in and west_in [56:0] feeds virtual channels VC1 to VC4 with a VC Arbiter and Port Update Logic; Req Decoders, Output Arbiters and Address Update Logic drive north_out, south_out, east_out and west_out [56:0]; P = port_status_in[3:0]; R1, R2, R3, R4 = output_req_from_VC)

¹ A flit is the smallest flow control unit of the network.

At each router, the output port to which the packet is to be sent is determined based on the relative address. At the source a packet is formed by concatenating the X, Y relative addresses of the destination and the payload. 4-bit 2's complement arithmetic is used for address updating in the router. If a packet has to travel in the X direction when a horizontal link is not available and the Y relative address is zero, then it takes the south (north) direction and the Y relative address is modified accordingly so that after traveling in the X direction the packet moves in the north (south) direction. The relative address updating logic for the four directions is described below.

North: Y relative address (new) = Y relative address + 1
South: Y relative address (new) = Y relative address − 1
West: X relative address (new) = X relative address + 1
East: X relative address (new) = X relative address − 1
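The four update rules can be sketched as a small function. This is an illustrative model only: the wrap-around helper implements the 4-bit 2's complement arithmetic mentioned in the text, and the names `wrap4`, `STEP` and `update_relative_address` are hypothetical.

```python
def wrap4(value):
    """Reduce an integer to 4-bit 2's complement range (-8..7)."""
    value &= 0xF
    return value - 16 if value & 0x8 else value


# Per-hop change applied to (X, Y), taken directly from the rules above.
STEP = {
    "north": (0, +1),
    "south": (0, -1),
    "west": (+1, 0),
    "east": (-1, 0),
}


def update_relative_address(x_rel, y_rel, direction):
    """Return the (X, Y) relative address carried to the next router."""
    dx, dy = STEP[direction]
    return wrap4(x_rel + dx), wrap4(y_rel + dy)


# A packet three hops west of its column moves east one hop at a time.
print(update_relative_address(3, 0, "east"))
```

A packet has arrived when both relative addresses are zero, so each hop in a productive direction moves one of the two fields one step closer to zero.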

The relative address is updated before a packet is routed to an adjacent router, depending on the chosen direction. The Output Arbiter arbitrates among the outputs from the four virtual channel arbiters at level 1. The arbitration at the virtual channels of the input port and at the output port is done using a Round Robin algorithm. To indicate new data to the next router, the New Data Indicator bit (MSB of the packet) is toggled at the output port. When a packet is received, the output port direction is calculated in the same clock cycle, including the arbitration and the routing. Thus if all four adjacent routers are free, four packets can be routed to the adjacent routers in a single clock cycle. When the destination is reached, the packet is sent to the corresponding CE in the next cycle.
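Round robin arbitration, as used at both levels of the router, can be illustrated with a short behavioural model. The paper does not give the arbiter's internal structure, so the rotating-pointer scheme below is one common way to realize it and should be read as an assumption; the class name is hypothetical.

```python
class RoundRobinArbiter:
    """N:1 arbiter that rotates priority past the most recent winner."""

    def __init__(self, num_inputs):
        self.num_inputs = num_inputs
        self.pointer = 0  # index that currently has the highest priority

    def grant(self, requests):
        """Return the index of the granted requester, or None if idle."""
        if len(requests) != self.num_inputs:
            raise ValueError("one request flag per input is required")
        for offset in range(self.num_inputs):
            idx = (self.pointer + offset) % self.num_inputs
            if requests[idx]:
                # The winner gets the lowest priority on the next round.
                self.pointer = (idx + 1) % self.num_inputs
                return idx
        return None


# 3:1 output arbiter with all three inputs requesting continuously.
output_arbiter = RoundRobinArbiter(3)
print([output_arbiter.grant([True, True, True]) for _ in range(4)])
```

The same class with `num_inputs=4` models the K:1 virtual channel arbiter for K = 4; under sustained contention every requester is served once per rotation.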

Router Crossbar: The crossbar is at the second level of arbitration (Output Arbiter) of the router and can route four packets to four different ports in a single cycle. The router crossbar design is optimized by avoiding loop back, which is prevented by the routing algorithm; i.e. an input received on a port is not transmitted back on the same port, resulting in reduced crossbar area.

3 Performance Analysis

We evaluated the proposed NoC for different traffic patterns that are representative of real applications. The NoC is described as RTL in Verilog HDL and simulated using Mentor Graphics ModelSim, driven by a test bench written in Verilog HDL. Synopsys Design Compiler is used for synthesis of the NoC.

The NoC is simulated using four types of traffic, viz. random, self-similar and two application specific traffic patterns that model the execution of multimedia and DSP kernels on the polymorphic ASIC. In random traffic, each tile generates packets with random destinations. Packets are injected into the network from all ports every clock cycle whenever the ports are available.

Self-similar traffic has been observed in the bursty traffic between on-chip modules in typical MPEG-2 video applications [4]. It has been shown that self-similar traffic can be modeled by aggregating a large number of ON-OFF message sources. The length of time each message spends in either the ON or the OFF state should be selected according to a distribution which exhibits long-range dependence. The Pareto distribution ($F(x) = 1 - x^{-\alpha}$, with $1 < \alpha < 2$) has been found to fit this kind of traffic well. A packet train remains in the ON state for $t_{ON} = (1 - r)^{-1/\alpha_{ON}}$ and in the OFF state for $t_{OFF} = (1 - r)^{-1/\alpha_{OFF}}$, where $r$ is a random number uniformly distributed between 0 and 1, $\alpha_{ON} = 1.9$, and $\alpha_{OFF} = 1.25$ [8].
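The ON and OFF durations follow from inverting the Pareto distribution function: setting F(x) = r and solving for x gives x = (1 − r)^(−1/α). A minimal sketch of a single ON-OFF source is given below. The α values are those quoted in the text; the function names, the seed and the use of Python's `random` module are illustrative and not taken from the authors' Verilog test bench.

```python
import random

ALPHA_ON = 1.9    # shape parameter for ON periods, as quoted in the text
ALPHA_OFF = 1.25  # shape parameter for OFF periods, as quoted in the text


def pareto_duration(r, alpha):
    """Inverse-transform sample of F(x) = 1 - x**(-alpha) for r in [0, 1)."""
    return (1.0 - r) ** (-1.0 / alpha)


def on_off_trace(num_periods, seed=0):
    """Generate (t_on, t_off) pairs for one ON-OFF message source."""
    rng = random.Random(seed)  # rng.random() lies in [0, 1), so 1 - r > 0
    trace = []
    for _ in range(num_periods):
        t_on = pareto_duration(rng.random(), ALPHA_ON)
        t_off = pareto_duration(rng.random(), ALPHA_OFF)
        trace.append((t_on, t_off))
    return trace


for t_on, t_off in on_off_trace(3):
    print(f"ON {t_on:.2f}  OFF {t_off:.2f}")
```

Every sampled duration is at least 1, and the smaller α for the OFF state gives it the heavier tail, so a source occasionally stays silent for very long periods. Aggregating many such sources is what produces the bursty, self-similar load described above.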

The two application specific traffic patterns are generated from traffic traces produced by a topology (honeycomb) aware simulator of the polymorphic ASIC at which this NoC is targeted. We simulated these to mimic the realistic scenario of the polymorphic ASIC in which the HyperOp Launcher, realized in Verilog HDL, transfers 22 packets simultaneously onto the fabric from the nearest boundary port. Intermediate results generated by the CEs are transferred over the network at specified intervals as demanded by the application. Communication due to traffic generated by multimedia kernels spans 4×4 tessellations on the execution fabric; four such tessellations execute in parallel, occupying the entire fabric. Communication due to traffic generated by DSP kernels includes multiple 2×2 and 4×4 tessellations on the execution fabric.

3.1 Throughput and Latency

We compare the throughput and latency for various traffic patterns on RECONNECT. Figure 5 shows the variation of throughput with the number of virtual channels for different types of traffic. Throughput is the maximum traffic accepted by the network and it relates to the peak data rates sustainable by the system. The accepted traffic depends on the rate at which data is injected into the network. Ideally, accepted traffic should increase linearly with injection load. However, due to the limitation of routing resources, accepted traffic will saturate at a certain injection load.

Throughput, expressed as a fraction of network capacity, is measured by applying a saturation source to each port that injects a new packet into the network whenever the port is free, which results in insertion of more packets when the router contains more virtual channels. As can be observed in Figure 5, throughput saturates when the number of virtual channels exceeds 4. Higher throughput is achieved for application specific traffic than for random or self-similar traffic patterns, due to more near neighbourhood communication across tiles. The variation of throughput with virtual channels shows a similar trend for all four types of traffic.

Figure 5. Throughput variation with virtual channels (x-axis: Number of Virtual Channels, 1 to 8; y-axis: Throughput as a fraction of capacity, 0 to 1; series: DSP kernel, multimedia kernel, self-similar, random)

Figure 6 shows the variation of latency with the number of virtual channels. Network latency is measured as the average difference between arrival time at the destination and launching time at the source, including the time spent buffered at the source. The average packet latency increases with the number of virtual channels and injection load. In general, low latency is achieved because of the single cycle router implementation (packet size is the same as flit size). For the more realistic application specific traffic, lower latencies are observed due to near neighbourhood traffic.

Figure 6. Latency variation with virtual channels (x-axis: Number of Virtual Channels, 1 to 8; y-axis: Average Latency in cycles, 0 to 45; series: DSP kernel, multimedia kernel, self-similar, random)

Throughput of the network saturates beyond 4 virtual channels, while latency continues to increase. To keep the latency low while maintaining acceptable throughput, the number of virtual channels is set to 4 in the final design of the router, which strikes an appropriate balance between high throughput, low latency and conservation of silicon area. This result is consistent with previous work on the number of virtual channels [3] and validates our simulation approach.

Figure 7 shows average network latency vs injected load (random traffic) for varying numbers of virtual channels. The latency approaches infinity when injected traffic is beyond the saturation throughput. For example, in the case of 4 virtual channels the saturation throughput is 62% of the total network capacity (Figure 5). Beyond this point latency approaches infinity (Figure 7) because of the limitation of routing resources.

Figure 7. Latency variation with injected traffic (x-axis: Injected Load as a fraction of capacity, 0 to 1; y-axis: Average Latency in cycles, 0 to 100; series: 2, 4 and 8 virtual channels)

3.2 Synthesis Results and Comparison

To get an estimate of the size and speed of the routers, we synthesized our NoC design for the 130nm UMC CMOS standard cell library using Synopsys Design Compiler. Latency, area, power and delay numbers are tabulated for honeycomb and mesh routers with four virtual channels.

Table 1. Comparison of routers in different topologies

Topology     Latency (cycles)   Area (mm²)   Power (mW)   Frequency (MHz)
Honeycomb    6                  0.134481     36.33        500
Mesh         4                  0.153767     43.93        500

We compare the latency for traffic traces of the DSP kernel, together with the area and power figures, of the router in the honeycomb topology with those of the router in the mesh topology, both operating at 500 MHz. As observed from Table 1, the router in the honeycomb topology has a 12.5% area advantage and a 17.3% power advantage over the router in the mesh topology, at the cost of a 30% increase in latency.
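The area and power advantages can be recomputed from the Table 1 entries. The sketch below assumes that "advantage" means the saving relative to the mesh router, which is the reading that reproduces the quoted percentages; it also checks the router area against the 3mm² tile size given in Section 2.1. Variable and function names are illustrative.

```python
# Values copied from Table 1 (four virtual channels, 500 MHz).
HONEYCOMB = {"area_mm2": 0.134481, "power_mw": 36.33}
MESH = {"area_mm2": 0.153767, "power_mw": 43.93}
TILE_AREA_MM2 = 3.0  # typical tile size quoted in the text


def saving_percent(honeycomb_value, mesh_value):
    """Saving of the honeycomb router, as a percentage of the mesh value."""
    return 100.0 * (mesh_value - honeycomb_value) / mesh_value


area_saving = saving_percent(HONEYCOMB["area_mm2"], MESH["area_mm2"])
power_saving = saving_percent(HONEYCOMB["power_mw"], MESH["power_mw"])
tile_share = 100.0 * HONEYCOMB["area_mm2"] / TILE_AREA_MM2

print(f"area saving  {area_saving:.1f}%")
print(f"power saving {power_saving:.1f}%")
print(f"router share of tile {tile_share:.2f}%")
```

With these inputs the router also comes out below 5% of the tile area.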

4 Conclusion

In this paper we have presented the architecture of a single cycle NoC router operating at 500MHz as the communication backbone for a polymorphic ASIC. The NoC router in such a polymorphic ASIC facilitates runtime reconfiguration of the hardware substrate to create ASIC-like custom hardware regions for various application substructures. The NoC router design is optimized for area and occupies less than 5% of the typical tile size of 3mm² for NoC based SoCs. Network latency and throughput both increase with a higher number of virtual channels. Using the router with 4 virtual channels, the NoC gives its lowest average latency of 6 clock cycles and 87% throughput for application specific traffic. The considerable improvement in latency and throughput for application specific traffic shows that our RECONNECT NoC is suitable for polymorphic ASICs.

Acknowledgments: This work was supported in part by the Ministry of Communication and Information Technology, Govt. of India. Keshavan Varadarajan is supported by an IBM Ph.D. Fellowship.

References

[1] ITRS 2001. In International Technology Roadmap for Semiconductors.

[2] P. Bai et al. A 65nm Logic Technology featuring 35nm gate lengths, Enhanced channel strain, 8 Cu Interconnect layers, Low-k ILD and 0.57µm² SRAM cell. In IEDM '04, pages 657–660, Dec 2004.

[3] W. J. Dally. Virtual-Channel Flow Control. IEEE Trans. Parallel Distributed Syst., 3(2):194–205, 1992.

[4] G. Varatkar and R. Marculescu. Traffic Analysis for On-Chip Networks Design of Multimedia Applications. In Design Automation Conference, pages 510–217. IEEE, June 2002.

[5] Kees Goossens et al. Æthereal Network on Chip: Concepts, Architectures and Implementation. In IEEE Design and Test of Computers. IEEE Computer Society, 2005.

[6] Mikael Millberg et al. Guaranteed Bandwidth using Looped Containers in Temporally Disjoint Networks within the Nostrum Network on Chip. In Proceedings of the Design, Automation and Test. IEEE Computer Society, 2004.

[7] Mythri Alle et al. Synthesis of Application Accelerators on Runtime Reconfigurable Hardware. In ASAP '08, 2008.

[8] K. Park and W. Willinger. Self-Similar Network Traffic and Performance Evaluation. In John Wiley & Sons, 2002.

[9] A. N. Satrawala et al. Redefine: Architecture of a SOC Fabric for Runtime Composition of Computation Structures. In FPL '07, 2007.

[10] I. Stojmenovic. Honeycomb Networks: Topological Properties and Communication Algorithms. In IEEE '97: Parallel Distributed Systems, volume 8, pages 1036–1042, 1997.
