RIDER: Ring deflection router with buffers

Des Autom Embed Syst
DOI 10.1007/s10617-014-9130-0


Gadi Oxman · Shlomo Weiss

Received: 12 July 2013 / Accepted: 3 February 2014
© Springer Science+Business Media New York 2014

Abstract The network-on-chip is becoming an increasingly important component of chip multiprocessors. Recently bufferless deflection routers were proposed, aiming to reduce hardware cost in comparison to classic virtual channel based routers by eliminating router buffers. We propose RIDER, a low cost deflection router based on an internal rotating ring structure with a minimal number of buffers. We compare RIDER with 16 buffers to a wormhole router with 12 buffers, to a virtual channel buffered router with 64 buffers, to CHIPPER, a bufferless deflection router, and to MinBD, a buffered deflection router with four buffers.

Keywords NoC · Router architecture · Deflection routing

1 Introduction

An efficient, high performance, and low power network-on-chip (NoC) router is highly desirable due to the increasing number of cores integrated on a single silicon die. In this paper we propose RIDER, a new deflection router, and compare it with more common router architectures. To motivate deflection routing, we first describe two widely used router architectures, wormhole routing (WH) and virtual channel routing (VC).

1.1 Reducing power and silicon area

A WH router has B buffers in each input channel, arranged in a first-in-first-out (FIFO) configuration. Once the flit leaves the input buffer, a credit is sent to the previous router to notify it about the newly available free buffer space. A VC router is similar to the WH router, except that each input port has V virtual channels, where each VC has B buffers, for a total of B × V buffers per physical input channel. The router must ensure that the neighbor router has buffer space before allowing a flit to win arbitration. Because of the length of the router pipeline, link traversal, waiting for the neighbor router to send the flit, and the credit processing delay, the number of buffers has to be fairly large in both WH and VC architectures to avoid performance loss. Buffers consume significant silicon area and power.

G. Oxman (B) · S. Weiss
School of Electrical Engineering, Tel Aviv University, 69978 Tel Aviv, Israel

1.2 Preventing deadlock and livelock

In the WH and VC routers, router A must ensure the availability of buffer space in its neighbor router B before sending a flit from A to B. This creates a dependence between routers A and B and care has to be taken to avoid a deadlock situation, which may be caused by a circular chain of dependencies starting from router A and ending back in A. Such deadlock may be avoided by careful design of the routing algorithm.

A bufferless deflection router offers inherent deadlock freedom without additional special care, even with adaptive algorithms. Router A can always send flits to router B without having to check in advance that B is ready to accept them, since B can always accept them. There is no dependence between router A and its neighbor router B and hence no circular dependence is possible between router A and itself. A network deadlock is therefore not possible. On the other hand, since flits are sometimes deflected (mis-routed) to non-minimal paths, there may be a danger of a livelock, in which flits travel the network without ever reaching their destination. How to prevent livelock is described in Sect. 4.

1.3 Deflection routing limitations

Although deflection routing offers several advantages as described earlier, two major problems prevent its adoption. The first problem concerns a limitation on the operating frequency of the router due to a long critical path in the port allocator. Because a bufferless deflection router has no buffers, every incoming flit must exit the router, so the allocator must allocate an output port to each of the incoming flits. The allocation problem is therefore harder than in a VC based router, which can decide to leave some flits in the buffers.

The second problem experienced by deflection routing is low performance and high power consumption under heavy load. Although packets are seldom deflected when the network is lightly utilized, under heavy load the deflection rate increases due to higher port contention. As load increases, performance is reduced, and power is simultaneously increased due to the additional hops each packet travels in the network.

1.4 Contributions

We propose a low cost router architecture, RIDER, that combines deflection and buffers. This paper makes the following contributions:

1. We present a solution to the long port allocator critical path in deflection routing using tree arbiters that are local to each port and compare the suggested solution to the previously proposed serial port allocation and permutation network.

2. We demonstrate using both synthetic traffic and application benchmarks that RIDER has performance and energy efficiency benefits in comparison to previously proposed routers.



1.5 Paper overview

Section 2 describes relevant earlier work. The proposed RIDER architecture is explained in Sect. 3. Deadlock and livelock issues are discussed in Sect. 4. Section 5 on the evaluation methodology is followed by Sect. 6, which reports the results and discusses tradeoffs. The paper ends with conclusions in Sect. 7.

2 Related work

Several approaches were proposed in an effort to reduce the number of buffers and their power consumption in WH and VC routers. In iDEAL [7], the authors propose to use adaptive dual-function links capable of data transmission as well as data storage when required. In ViChaR [15], the authors propose to use a unified buffer structure, which dynamically allocates virtual channels according to network traffic. Ramanujam et al. [18] propose a distributed shared buffer router structure containing two crossbar stages with buffering sandwiched in between. Express virtual channels (EVCs) were proposed [9] to connect distant nodes such that intermediate routers are entirely bypassed. Speculation [14,17] and advanced bundle signals [10] were proposed to shorten the router pipeline length. Michelogiannakis et al. [12] advocate the use of custom SRAM-based buffers and empty buffer bypassing to reduce power consumption.

In an effort to further reduce the power consumption of buffered VC routers, recent work proposed using a bufferless deflection router [11]. Unnecessary deflections increase network power because they increase the number of hops that a flit travels on its way to its destination, but it has been shown [13] that under light network load the number of deflections is quite low and therefore power savings are possible.

Several schemes were proposed to avoid livelock. One scheme relies on statistical arguments that livelock probability is 0 [8]. Another, deterministic scheme uses age (the number of clock cycles the flit is in the network) prioritization [13]. The oldest flit in the network is given priority and always travels along the shortest path to its destination. Thus livelock is eliminated. Recently, a new deterministic livelock prevention scheme proposed using golden flit prioritization [2], in which a single flit is chosen as golden and given priority over all other flits in the network. Periodically a new golden flit is selected.

Regarding the allocation problem, early attempts proposed using serial port allocation [13], in which each flit is considered serially and depends on the output ports allocated to the previous flits. However, serial allocation introduces a long critical path in the allocator logic, reducing the maximum operating frequency of the router. Recent work [2] recommended replacing the serial port allocation with a faster permutation network. Several works proposed adding a small number of buffers to deflection routers [3,6,16], thus reducing deflections. However, it is not clear which router organization makes the most efficient use of the buffers.

3 Router architecture

RIDER is a buffered deflection router. As long as buffer space is available, it attempts to hold flits in buffers rather than deflecting them. Flits may be deflected if the buffers are full. RIDER does not have the crossbar used in some deflection routers. Instead, the port blocks are connected in a ring structure, and flits travel from one port to the next in the clockwise direction, in search of a productive exit port.


Fig. 1 Router architecture. The bottom right shows the overall architecture, while the rest shows the south port architecture in detail

[Figure labels. Overall router architecture: NORTH, EAST, SOUTH, WEST, LOCAL. South port detailed architecture: north input, local input, input mux, routing computation (productive bits), ring mux 1 to 4, south buffer 1 to 4, tree arbiter mux 1 to 3, bypass mux, south output reg, south output, local eject (south). East port (only ring buffers shown): east buffer 1 to 4.]

Figure 1 illustrates the overall router architecture and the detailed architecture of one port. RIDER consists of Np identical ports connected in a ring structure. Each port has B flit buffers. The router is also connected to the local processor. For the sake of brevity we consider a 2D mesh with Np = 4 neighbors (NORTH, EAST, SOUTH, WEST), B = 4 buffers per port, and focus on the SOUTH port. Figure 1 shows the detailed architecture of the south port block. The other ports have the same architecture. The port contains the following elements: Input mux, Routing computation, Ring mux, Buffers, Tree arbiter, Bypass mux, and Output register.

3.1 Input mux

The input multiplexer has two flit inputs: one from a neighbor, and one from the local port. Each port has a preferred neighbor. In the illustrated case for the south port, the neighbor input into the mux is the opposite north input. The opposite input is chosen to allow flits to travel often in a straight line to their destination, without having too many ring rotations. The local input is used to inject new flits from the local processor into the router. The input from the local port is connected to all Np port blocks, but the control logic ensures that it is selected in at most one mux. Flits that are already in the network have priority over injection of new flits: the local input may only select a port that does not have a flit coming in from a neighbor router.
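The injection constraint can be illustrated with a minimal sketch. The function name `select_injection_port` and the fixed scan order are assumptions made for illustration; the text only states that at most one input mux may select the local input and that a port with an incoming neighbor flit may not be chosen.

```python
from typing import Dict, Optional, Sequence

PORTS = ("NORTH", "EAST", "SOUTH", "WEST")


def select_injection_port(neighbor_flit_present: Dict[str, bool],
                          order: Sequence[str] = PORTS) -> Optional[str]:
    """Pick at most one port block whose input mux may take the local flit.

    neighbor_flit_present[p] is True when port block p receives a flit from
    its neighbor router in this cycle. In-network flits have priority, so
    such a port cannot be used for injection.
    Assumption: free ports are scanned in a fixed order; the paper does not
    specify how the port is chosen among the free ones.
    """
    for port in order:
        if not neighbor_flit_present[port]:
            return port
    return None  # every input mux is busy: injection waits this cycle


if __name__ == "__main__":
    busy = {"NORTH": True, "EAST": True, "SOUTH": False, "WEST": False}
    print(select_injection_port(busy))
```

Returning a single port (or none) mirrors the requirement that the local input is selected in at most one mux per cycle.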

3.2 Routing computation

On arrival at the router, each incoming flit passes through a routing computation (RC) block. The RC block looks at the destination of the flit and, based on the router location, computes a vector of four bits, one for each of the four exit ports. A routing bit is set to ‘productive’ if the particular port it represents is productive for the flit and ‘non-productive’ otherwise. Up to two bits of the 4-bit vector may be set to the ‘productive’ state, corresponding to the xy and yx routes to the destination. The routing bits remain attached to the flit and rotate with it along the ring until the flit leaves the router.
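As an illustration of the routing computation, the following sketch derives the productive bits from the router and destination coordinates. The coordinate convention (x grows toward EAST, y grows toward SOUTH) and the name `productive_bits` are assumptions, not details given in the text.

```python
from typing import Dict, Tuple


def productive_bits(cur: Tuple[int, int], dst: Tuple[int, int]) -> Dict[str, bool]:
    """Return one 'productive' bit per exit port for a flit at router `cur`.

    Assumed convention: x grows toward EAST and y grows toward SOUTH.
    A port is productive when leaving through it reduces the distance to
    the destination, so at most one x port and one y port can be set.
    """
    (cx, cy), (dx, dy) = cur, dst
    return {
        "NORTH": dy < cy,
        "EAST": dx > cx,
        "SOUTH": dy > cy,
        "WEST": dx < cx,
    }


if __name__ == "__main__":
    # Destination lies to the east and to the north: two productive ports.
    print(productive_bits((1, 1), (3, 0)))
```

When the flit has already reached its destination router, no bit is set and the flit is ejected locally instead.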

3.3 Buffers, ring mux, and ring rotations

Each port contains B buffers. In front of each buffer, a ring multiplexer (refer to Fig. 1) controls the flit that may be written to that buffer in any particular cycle. The multiplexer has two sources. The right input implements the ring structure. It is connected to the corresponding buffer output of the previous port. The left input of the ring mux is connected to the output of the input mux. It is used to accept new flits either from the neighbor or from the local input.

3.4 Bypass mux and output register

The bypass mux allows a direct connection of the input mux to the output register, bypassing the buffers and the tree arbiter. Its left input is used to reduce router latency when the buffers are empty and it is productive for the flit at the input mux to exit the router from that port. The right input of the mux connects to the tree arbiter and is used when one of the buffered flits is selected to be output from the router. The output of the mux is stored in the output register, which is connected by a link to a neighbor router.

3.5 Tree arbiter

Each port has a local tree arbiter. The arbiter selects the best candidate out of the B local buffers (including empty buffers) to send out through the port. The arbiter is constructed using arbiter blocks in a tree architecture with log2 B levels. Each arbiter block has two inputs for the two candidates and one winner output. In the example case with four buffers, the tree has two levels: the first level selects a winner for each of the two pairs and the second level selects a single winner between the two level-one winners.

Table 1 specifies the 2:1 arbiter block logic table. The arbiter is designed such that a buffer containing a productive flit has priority over an empty buffer, which in turn has priority over a buffer containing a non-productive flit. Therefore, if the flit chosen by the tree arbiter is productive, it is sent to the output link on its shortest path to the destination. If no flits are productive but there are still empty buffer slots, an empty buffer wins arbitration and no flit is output through the port. This allows the non-productive flits to continue circling the ring in search of a productive port instead of being deflected. Only if all the buffers are full with non-productive flits is one of them deflected through the port to make buffer space available for new flits incoming to the router. The two “tie” cases P/P and NP/NP are resolved by one of four tie resolution rules: fixed, random, age or golden. The fixed rule always selects the left flit. The random rule randomly selects either the left or the right flit. The age rule selects the older flit when arbitrating between two productive flits and the younger flit when arbitrating between two non-productive flits, ensuring the older flit rotates to a productive port. Similarly, the golden rule selects the golden flit, if any, when arbitrating between two productive flits and the non-golden flit when arbitrating between two non-productive flits, ensuring that the golden flit rotates to a productive port.


Table 1 Arbiter block winner selection logic, as a function of the left and right flit states

Left   Right   Winner
               Fixed   Random   Age       Golden
P      P       Left    Random   Older     Golden
P      E       Left
P      NP      Left
E      P       Right
E      E       Left
E      NP      Left
NP     P       Right
NP     E       Right
NP     NP      Left    Random   Younger   Non-golden

P buffer contains a productive flit, NP buffer contains a non-productive flit, E buffer is empty. When both flits are either P or NP, tie resolution is performed according to one of the four tie resolution rules: fixed, random, age or golden. Rows with a single entry have the same winner under all four rules.
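The arbiter logic of Table 1 can be sketched in software as follows. This is only a behavioral model under stated assumptions: the `Flit` record, the `rule` parameter, and the restriction to the fixed and age rules are illustrative choices, and B is assumed to be a power of two.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Flit:
    productive: bool  # productive bit of the flit for this port
    age: int          # cycles the flit has spent in the network


Slot = Optional[Flit]  # None models an empty buffer (state E)

# Priority from Table 1: P beats E, and E beats NP.
RANK = {"P": 2, "E": 1, "NP": 0}


def state(slot: Slot) -> str:
    if slot is None:
        return "E"
    return "P" if slot.productive else "NP"


def arbiter_block(left: Slot, right: Slot, rule: str = "fixed") -> Slot:
    """2:1 arbiter block; only the 'fixed' and 'age' tie rules are modeled."""
    ls, rs = state(left), state(right)
    if RANK[ls] != RANK[rs]:
        return left if RANK[ls] > RANK[rs] else right
    if ls == "E" or rule == "fixed":
        return left                      # E/E, or fixed rule: left wins
    if rule != "age":
        raise ValueError("only 'fixed' and 'age' are sketched here")
    if ls == "P":                        # P/P tie: the older flit wins
        return left if left.age >= right.age else right
    return left if left.age <= right.age else right  # NP/NP: younger wins


def tree_arbiter(slots: List[Slot], rule: str = "fixed") -> Slot:
    """Reduce B slots (B a power of two) through log2(B) levels of blocks."""
    level = list(slots)
    while len(level) > 1:
        level = [arbiter_block(level[i], level[i + 1], rule)
                 for i in range(0, len(level), 2)]
    return level[0]


if __name__ == "__main__":
    buffers = [Flit(False, 5), None, Flit(True, 2), Flit(True, 9)]
    print(tree_arbiter(buffers, rule="age"))
```

A result of `None` corresponds to an empty buffer winning arbitration, in which case no flit leaves through the port in that cycle.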

4 Deadlock and livelock freedom

As in the bufferless deflection router, RIDER’s architecture maintains the characteristic that its decisions are local to the router. Each router can decide to send a flit to its neighbor without checking in advance whether that neighbor has buffer space available.

We first show that there is always buffer space available for an incoming flit even without checking that space is available in advance. Without loss of generality, suppose a flit is coming in from the north neighbor. As described earlier, it will be assigned to the south port block, shown in Fig. 1. After passing through the input mux, there are two cases: when the south port is productive and there are no flits in the buffers, the incoming flit exits the router. Otherwise, the flit will be available at the input of the ring multiplexers. We show that there are at most 3 (B − 1 in the general case) flits rotating through the ring from the east port, leaving at least one buffer space into which we can write the incoming flit from the north neighbor. This is because, if we look at the buffers in the previous (east) port, even when all four buffers are full, the tree arbiter as specified in Table 1 selects one flit to output, even if it is deflected. The selected flit exits the router, leaving only three flits to rotate through the ring from the east port to the south port. Therefore, at most three ring multiplexers in the south port choose their right input (taking flits from the ring), leaving one multiplexer free to select the incoming flit.
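The counting argument above can be checked exhaustively for small B. The sketch below is a simplified model, assuming that the only way a flit leaves the upstream port other than by ring rotation is through the tree arbiter winner of Table 1; the function names are hypothetical.

```python
from itertools import product
from typing import Sequence


def flits_forwarded(states: Sequence[str]) -> int:
    """Number of flits that rotate from one port to the next in a cycle.

    `states` holds the state of each of the B upstream buffers:
    'P' (productive flit), 'NP' (non-productive flit) or 'E' (empty).
    """
    occupied = sum(1 for s in states if s != "E")
    if "P" in states:
        exits = 1   # a productive flit wins arbitration and exits
    elif "E" in states:
        exits = 0   # an empty buffer wins: no flit exits
    else:
        exits = 1   # all buffers hold NP flits: one flit is deflected
    return occupied - exits


def max_forwarded(b: int) -> int:
    """Largest number of flits forwarded over all 3**b buffer states."""
    return max(flits_forwarded(s) for s in product(("P", "E", "NP"), repeat=b))


if __name__ == "__main__":
    print(max_forwarded(4))
```

For every buffer state the number of forwarded flits stays at or below B − 1, which is the free slot the incoming flit relies on.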

Having shown that a flit incoming to the router from one of the neighbors is never dropped, we now show that the network is deadlock free. Suppose the network is deadlocked for an indefinite amount of time such that no flit makes progress. If a router port, anywhere in the network, has B flits stored in its buffers, one of them will be selected by the tree arbiter and exit the router, contradicting the network deadlock assumption. We conclude that all ports in the network have B − 1 flits at most. Further, all flits have to be in the NP state, since otherwise the tree arbiter, as specified in Table 1, would select one of the productive flits, because P wins over NP and over E. At the end of the cycle ring rotation is performed. In the worst case, within Np − 1 cycles, the flits would perform a full cycle, having visited every port. Within Np − 1 cycles at least one flit would reach a productive port, be selected by its tree arbiter, and exit the router, contradicting the network deadlock assumption. Therefore network deadlock is not possible.

We now show that livelock is not possible either when the age tie resolution rule is used. Suppose the network is in a livelock, such that flits travel in the network without ever reaching their destination. Consider the oldest flit O in the network and denote the shortest distance between O and its destination by H. We now show that each router sends O through a port that decreases H. Referring to Fig. 1, if O is in the output register, it exits the router, decreasing H by one. If O is at the router input, it either exits the router through the bypass mux if no buffers are used and that port is productive, decreasing H, or it is stored in the buffers. Assume O is in one of the buffers. Its state may be either P or NP. If it is P, it is productive to send O through this port. Because the age resolution rule is used, O wins against other flits in the P state (see Table 1) and exits the router, decreasing H. Now consider the NP case. According to the tree arbiter logic table, NP loses arbitration against P and against E. O loses arbitration, does not exit the router through this port, and continues cycling the ring. However, before performing a full rotation in the ring, O must encounter a productive port. When it does, it wins arbitration and exits the router. The same argument may be repeated until O reaches its destination. We conclude that livelock is not possible.

Table 2 System simulation parameters

Parameter               Value
Topology                2D mesh: 8 × 8 synthetic, 4 × 4 application benchmarks
Technology node         45 nm SOI
Frequency               1 GHz
Flit width              64 bits
Inter-router distance   2 mm
WH router               12 buffers (3 buffers per port, no VC)
VC router               64 buffers (4 VC, 4 buffers in each)
CHIPPER router          0 buffers (bufferless)
MinBD router            4 buffers (side buffer)
RIDER router            16 buffers, 4 per port (tree arbiters)
Synthetic patterns      random, transpose, tornado
Applications            SPLASH-2: barnes, cholesky, fft, lu, ocean, radiosity, radix, raytrace
Processor               16 cores
Core                    x86, 4-way issue, 128 entry ROB
Branch predictor        Pentium M, 17 cycles penalty
L1-I                    8 KB, 4 way, 4 cycle access time
L1-D                    8 KB, 8 way, 4 cycle access time
L2 cache                64 KB per core, 8 way, 8 cycle
L3 cache                256 KB per core, 16 way, 30 cycles
Main memory             65 ns access time, 8 GB/sec per core

5 Evaluation methodology

We compare RIDER to classic WH and VC routers [17], to CHIPPER [2], a bufferless deflection router, and to MinBD [3], a deflection router with four buffers. We compare both performance and power. We evaluate the routers using synthetic traffic, as well as cache coherence NoC traffic generated by the SPLASH-2 [20] application benchmarks. The simulated system parameters are summarized in Table 2.


5.1 NoC simulators

The WH and VC routers are simulated using the Booksim NoC simulator [5]. CHIPPER, MinBD and RIDER are simulated using our own NoC simulator.

5.2 Synthetic benchmarks

For the synthetic benchmarks, we use the NoC simulators in stand-alone open-loop mode, where each router is connected to a traffic generator using an unbounded FIFO. The unbounded FIFO allows us to independently control the offered load to the network. We use three synthetic traffic patterns: uniform random, transpose, and tornado. In the uniform random traffic, the destination router is randomly chosen. In the transpose traffic, each processor at address {x, y} generates flits to the diagonally opposite router at destination {y, x}. In the tornado traffic for an N × N network, each processor at address {x, y} sends flits half-way across the network to address $\{(x + \frac{N}{2} - 1) \bmod N,\ (y + \frac{N}{2} - 1) \bmod N\}$.
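The three destination functions can be written down directly from the definitions above. This is a minimal sketch: the function names and the use of a seeded `random.Random` are assumptions, and whether the uniform random pattern may pick the source router itself is not stated in the text (it is allowed here).

```python
import random
from typing import Tuple

Node = Tuple[int, int]


def uniform_random(src: Node, n: int, rng: random.Random) -> Node:
    """Destination chosen uniformly among the n x n routers.

    Assumption: the source router itself is not excluded.
    """
    return (rng.randrange(n), rng.randrange(n))


def transpose(src: Node, n: int) -> Node:
    """{x, y} sends to the diagonally opposite router {y, x}."""
    x, y = src
    return (y, x)


def tornado(src: Node, n: int) -> Node:
    """{x, y} sends to {(x + n/2 - 1) mod n, (y + n/2 - 1) mod n}."""
    x, y = src
    shift = n // 2 - 1
    return ((x + shift) % n, (y + shift) % n)


if __name__ == "__main__":
    rng = random.Random(0)
    print(uniform_random((0, 0), 8, rng), transpose((2, 5), 8), tornado((6, 7), 8))
```

For the 8 × 8 mesh used in the synthetic experiments the tornado shift is 3 hops in each dimension.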

5.3 Latency

For each flit we measure the total latency from the time of flit generation until the time the flit arrives at its destination. This time includes both the time the flit spends in the input FIFO and the time it spends traversing the network.

5.4 Area and power estimation

We use DSENT [19] to estimate area and power. DSENT contains dual models for photonic and electronic routers; we use its electronic router model in this paper. Power consumption is broken down into three components. Static power is due to leakage and is always dissipated, even if the network is not being used. We split dynamic power into two components. Dynamic router power is dissipated within the router due to buffers, crossbar traversals, and ring rotations. Dynamic link power is dissipated due to flits traversing the links as they move from one router to its neighbor. The NoC simulator tracks the total number of buffer writes, ring rotations, crossbar traversals, and link traversals and passes them into DSENT to generate a dynamic energy estimation. Dynamic power is then calculated by dividing the dynamic energy by the simulation time, and total network power is calculated by adding the static power component.
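The bookkeeping described above amounts to a weighted sum of event counts. The sketch below shows only that arithmetic; the per-event energy values would come from DSENT, and the class names, field names, and example numbers are hypothetical placeholders rather than DSENT's interface or the paper's data.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class EventCounts:
    buffer_writes: int
    ring_rotations: int
    crossbar_traversals: int
    link_traversals: int


@dataclass
class EnergyPerEvent:
    """Joules per event (hypothetical values, normally supplied by DSENT)."""
    buffer_write: float
    ring_rotation: float
    crossbar_traversal: float
    link_traversal: float


def network_power(counts: EventCounts, e: EnergyPerEvent,
                  static_power_w: float, sim_time_s: float) -> Dict[str, float]:
    """Split network power into static, dynamic router and dynamic link."""
    router_energy = (counts.buffer_writes * e.buffer_write
                     + counts.ring_rotations * e.ring_rotation
                     + counts.crossbar_traversals * e.crossbar_traversal)
    link_energy = counts.link_traversals * e.link_traversal
    dyn_router = router_energy / sim_time_s   # dynamic power = energy / time
    dyn_link = link_energy / sim_time_s
    return {
        "static": static_power_w,
        "dynamic_router": dyn_router,
        "dynamic_link": dyn_link,
        "total": static_power_w + dyn_router + dyn_link,
    }


if __name__ == "__main__":
    counts = EventCounts(1000, 2000, 0, 500)
    energy = EnergyPerEvent(1e-12, 0.5e-12, 2e-12, 4e-12)
    print(network_power(counts, energy, static_power_w=0.01, sim_time_s=1e-6))
```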

5.5 Critical path estimation

We model port allocation in detail in the Verilog hardware description language and perform logic synthesis to estimate the length of the critical path. For synthesis we use Cadence RTL compiler version v12.10-p006_1 with the FreePDK 45 nm cell library.

5.6 Application benchmarks

We evaluate the router using eight multi-threaded applications from the SPLASH-2 benchmark [20] with the configuration parameters specified in Table 3. The benchmarks are simulated with the SNIPER multicore simulator [1]. SNIPER warms up the benchmarks before taking measurements over parallel regions-of-interest. For the application benchmarks the NoC simulators run in a closed loop synchronized mode, where NoC traffic is accepted from SNIPER instead of the synthetic traffic generators. The latency of each packet is fed back to SNIPER, such that network congestion effects are modeled and slow down application run time.

Table 3 Application benchmark settings

Benchmark   Setting
Barnes      16,384 particles
Cholesky    tk25.O
fft         256K points
Lu          512 × 512 matrix
Ocean       256 × 256 ocean
Radiosity   -Room -ae 5000.0 -en 0.050 -bf 0.10
Radix       256K integers
Raytrace    Car -m64

Table 4 Routers area (µm2)

WH       VC       CHIPPER   MinBD    RIDER
18,607   52,831   7,991     10,013   16,080

Table 5 Port allocation critical path length in deflection routers (picoseconds)

Serial (BLESS)   Permutation network (CHIPPER, MinBD)   Tree arbiter (RIDER)
736              454                                    346

6 Evaluation and discussion

Table 4 shows area estimation for the five routers. Due to the additional buffers, RIDER with 16 buffers is about double the size of CHIPPER with no buffers, 60 % bigger than MinBD, and about 30 % of the size of the VC router with 64 buffers.
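The area ratios quoted above follow directly from the Table 4 values, as the following short check illustrates. The dictionary and helper names are ours.

```python
# Router areas from Table 4, in square micrometers.
AREA_UM2 = {"WH": 18607, "VC": 52831, "CHIPPER": 7991, "MinBD": 10013, "RIDER": 16080}


def area_ratio(router: str, reference: str) -> float:
    """Area of `router` divided by the area of `reference`."""
    return AREA_UM2[router] / AREA_UM2[reference]


if __name__ == "__main__":
    for ref in ("CHIPPER", "MinBD", "VC"):
        print(f"RIDER / {ref}: {area_ratio('RIDER', ref):.2f}")
```

The three ratios come out at roughly 2.01, 1.61 and 0.30, matching the "about double", "60 % bigger" and "about 30 %" statements.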

As described in the introduction, deflection routers suffer from long critical paths in the port allocator. Table 5 compares the critical path length of three allocators used in deflection routing: a serial port allocator, which is part of the allocator used in BLESS [13], a permutation network allocator used by both CHIPPER [2] and MinBD [3], and RIDER’s tree arbiter. CHIPPER and MinBD’s permutation network does not permit all input-output combinations, resulting in unnecessary deflections that reduce performance and increase power.

RIDER solves the long critical path problem by using buffers and tree arbiters, without hurting the efficiency of the allocation as in the permutation network. The rotation of the flits in the ring buffers allows each port to be handled separately, converting the serial allocation problem into parallel tree arbiters. The allocator critical path of the tree arbiter is about 23.8 % shorter than the permutation network and 53 % shorter than serial allocation.
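The two percentages can be reproduced from the Table 5 critical path lengths. The names below are illustrative.

```python
# Port allocation critical path lengths from Table 5, in picoseconds.
CRITICAL_PATH_PS = {"serial": 736, "permutation": 454, "tree": 346}


def percent_shorter(new_ps: float, old_ps: float) -> float:
    """Relative reduction of the critical path, in percent."""
    return 100.0 * (old_ps - new_ps) / old_ps


if __name__ == "__main__":
    tree = CRITICAL_PATH_PS["tree"]
    print(f"vs permutation network: {percent_shorter(tree, CRITICAL_PATH_PS['permutation']):.1f} %")
    print(f"vs serial allocation:   {percent_shorter(tree, CRITICAL_PATH_PS['serial']):.1f} %")
```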

Figure 2 shows latency curves as a function of the offered load for the three synthetic traffic patterns. RIDER sustains a higher network load and smaller zero-load latency compared to the other routers. Using fewer buffers, RIDER outperforms the VC router. Especially noteworthy are the adversarial tornado and transpose traffic patterns. WH or VC routers need additional buffers to compensate for the credit traversal time and processing delay. This is shown by the reduced performance of the WH router with only 12 buffers. On the other hand RIDER uses deflection and routing decisions are made locally. Figure 3 demonstrates RIDER’s reduced deflection rate compared to CHIPPER and MinBD under uniform random traffic. Although MinBD reduces the average number of deflections per flit compared to CHIPPER, RIDER reduces it further and is able to sustain a higher network load.

Fig. 2 Latency under synthetic traffic [three panels: random, transpose, tornado; x-axis: offered load (flits/cycle/node); y-axis: latency (cycles); curves: VC, WH, CHIPPER, MinBD, RIDER]

Fig. 3 Average deflections per flit under uniform random traffic [x-axis: offered load (flits/cycle/node); y-axis: deflections/flit; curves: VC, WH, CHIPPER, MinBD, RIDER]

The ring configuration in RIDER and the parallel tree arbiters offer more choices in selecting productive port allocations. In MinBD the permutation network only considers four candidates in each cycle and as a result its performance does not improve when the number of buffers is increased, as shown in Fig. 4. In contrast, RIDER’s performance ramps up quickly when the number of buffers is increased. With 16 buffers, RIDER’s performance is about 97 % of its performance with 64 buffers. RIDER maintains the highest performance when the network size increases, as shown by Fig. 5 for mesh dimensions varying from 4 × 4 with 16 routers up to 16 × 16 with 256 routers.

Fig. 4 Throughput as a function of the number of buffers for an 8 × 8 mesh under uniform traffic [x-axis: number of router buffers, 0 to 64; y-axis: throughput (flits/cycle/node); curves: VC, WH, CHIPPER, MinBD, RIDER]

Fig. 5 Throughput as a function of 2D mesh topology dimension under uniform traffic [x-axis: mesh dimension, 4 to 16; y-axis: throughput (flits/cycle/node); curves: VC, WH, CHIPPER, MinBD, RIDER]

Figure 6 shows a breakdown of the network power under random traffic into three components: static, dynamic router, and dynamic link. The power is shown at three network load levels: light traffic (0.1), medium traffic (0.25), and heavy traffic (0.4). When the network is lightly utilized, power is dominated by static leakage and the bufferless CHIPPER router demonstrates the lowest power. At medium load dynamic power is increased: both the dynamic router component dissipated within the router and, more significantly, the dynamic link component. At that point RIDER has the lowest power, while MinBD is slightly higher. CHIPPER dynamic power is substantially higher due to the dynamic link component. At heavy load RIDER still has lower power than the VC router. WH, CHIPPER and MinBD results are not available for heavy load since, as shown in Fig. 2, they are unable to sustain that load.

Figure 7 shows that under light load the per-bit energy is high for all routers, because static power is dissipated anyway and the traffic consists of only a small number of bits. As the load gets higher the network is used more efficiently. The efficiency of the WH and VC routers keeps improving until the cutoff point is reached. For the deflection routers the energy graph behaves differently: the increase of the per-bit energy correlates with the rise of the deflection rate at heavy loads as shown in Fig. 3. Comparing the routers, CHIPPER is slightly more efficient under low loads, but under medium and especially under heavy loads the buffered ring structure of RIDER is able to eliminate many of the deflections seen in both CHIPPER and MinBD and achieves lower per-bit energy.

Fig. 6 Network power breakdown to components under synthetic random traffic for light load (0.1 flits/cycle/node), medium load (0.25) and heavy load (0.4). WH, CHIPPER and MinBD are not shown for heavy load since they cannot sustain that load [y-axis: network power (W); bars: WH, VC, CHIPPER, MinBD, RIDER at each load level; components: static router, dynamic router, dynamic link]

Fig. 7 Network energy per delivered bit under synthetic random traffic [x-axis: offered load (flits/cycle/node); y-axis: energy (pJ/bit); curves: VC, WH, CHIPPER, MinBD, RIDER]

The normalized speedup of WH, CHIPPER, MinBD and RIDER relative to the VC router for each of the SPLASH-2 benchmarks is shown in Fig. 8. Two of the benchmarks (barnes, lu) are not sensitive to the router choice, while on the other six the router choice has an effect on the benchmark run time. On the eight benchmarks, on average RIDER outperforms WH by 31.33 %, outperforms VC by 5.82 %, outperforms CHIPPER by 7.96 %, and outperforms MinBD by 3.05 %. Figure 9 shows the network power dissipated by the routers while running the SPLASH-2 benchmarks. On average, RIDER consumes 66.37 % less power than VC, 5.95 % less power than WH, 8.63 % more power than CHIPPER, and 13.18 % more power than MinBD.

On the SPLASH-2 benchmarks, RIDER improves performance by 7.96 % relative to CHIPPER, at a cost of 8.6 % increased network power. Note however that a multicore chip consists of cores, caches, memory controllers, and network, and the network power by itself is only a fraction of the total power spent by the multicore chip. For example in Intel’s 48-core SCC the NoC consumes 10 % of the total chip’s power [4]. Assuming the NoC uses 10 % of the multicore chip’s power as in Intel’s SCC, the 8.6 % additional network power corresponds to a 0.86 % increase in the total multicore power. Taking into account that RIDER shortens the execution time of the applications by 7.96 %, in spite of the slight increase of the power, RIDER reduces the total energy required to run the SPLASH-2 benchmarks. Hence we get both benefits, higher performance and reduced energy.

Fig. 8 Application speedup of WH, CHIPPER, MinBD and RIDER relative to the VC router [x-axis: barnes, cholesky, fft, lu, ocean, radiosity, radix, raytrace, avg; y-axis: speedup vs VC]

Fig. 9 Application network power [x-axis: barnes, cholesky, fft, lu, ocean, radiosity, radix, raytrace, avg; y-axis: network power (W); bars: VC, WH, CHIPPER, MinBD, RIDER]
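The energy argument can be made concrete with a short calculation. The sketch assumes, as the text does, that the NoC accounts for 10 % of chip power, that the power of the rest of the chip is unchanged, and that execution time drops by 7.96 %; the function name is ours.

```python
def relative_chip_energy(net_power_increase: float, noc_share: float,
                         time_reduction: float) -> float:
    """Chip energy relative to the baseline (1.0 means unchanged).

    net_power_increase: fractional increase of network power (0.086).
    noc_share: fraction of chip power consumed by the NoC (0.10).
    time_reduction: fractional reduction of execution time (0.0796).
    Assumption: the power of the non-network components is unchanged.
    """
    chip_power = 1.0 + noc_share * net_power_increase  # +0.86 % chip power
    run_time = 1.0 - time_reduction
    return chip_power * run_time                       # energy = power x time


if __name__ == "__main__":
    ratio = relative_chip_energy(0.086, 0.10, 0.0796)
    print(f"relative energy: {ratio:.4f}")
```

Under these assumptions the relative energy is about 0.93, that is, roughly 7 % less total chip energy than with CHIPPER despite the higher network power.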

A lightly loaded bufferless router such as CHIPPER does not experience many deflections and operates at higher energy efficiency while maintaining the same performance. For heavier traffic CHIPPER exhibits an increased deflection rate. On the other hand, a buffered deflection router reduces the number of deflections and operates with higher performance and better energy efficiency. The average network load does not present the whole picture. Rather, variation in the instantaneous load must be considered because applications often alternate between periods of low and heavy network activity, as demonstrated by Fig. 10 for the FFT benchmark. During high network activity periods, RIDER sustains higher throughput and shortens the overall execution time of the multicore processor.


Fig. 10 Network achieved throughput as a function of time during the FFT benchmark on 16 processors (plot not reproduced; y-axis: throughput (flits/cycle/node), 0 to 0.45; x-axis: simulation time (normalized), 0 to 1; series: CHIPPER, MinBD, RIDER)

7 Conclusions

We presented a new router architecture that combines deflections and buffers. The buffers are arranged in a rotating ring structure such that each section of the ring is assigned to a single port. Under light load, bypass is used to quickly route most of the incoming packets to a productive destination port. As load increases, many of the incoming packets are stored in buffers and rotate in the internal ring on their way to a productive port. The remaining packets for which a buffer slot is not available are deflected. The proposed parallel tree arbiters solve the critical path problem in the port allocator of deflection routers.
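The three outcomes summarized above (bypass, buffering in the ring, deflection) can be sketched as a per-flit decision. This is a deliberately simplified model, not the RIDER port allocator: it ignores ring rotation, flit priorities, and the parallel tree arbiters, and the names `Action` and `route_flit` as well as the tie-breaking rule are assumptions made for illustration.

```python
from enum import Enum


class Action(Enum):
    BYPASS = "bypass"
    BUFFER = "buffer"
    DEFLECT = "deflect"


def route_flit(productive_ports, free_output_ports, ring_slot_free):
    """Decide what happens to one incoming flit in this simplified model.

    productive_ports: ports that move the flit closer to its destination.
    free_output_ports: output ports not yet claimed in this cycle.
    ring_slot_free: True if the ring section of the input port has a free slot.
    Returns (Action, chosen port or None).
    """
    usable = sorted(productive_ports & free_output_ports)
    if usable:
        # A productive port is free: route directly, skipping the buffers.
        return Action.BYPASS, usable[0]
    if ring_slot_free:
        # Wait in the ring until a productive port becomes available.
        return Action.BUFFER, None
    if not free_output_ports:
        raise ValueError("a deflection router needs a free output port")
    # No productive port and no buffer slot: send the flit out anyway.
    return Action.DEFLECT, min(free_output_ports)


print(route_flit({"N", "E"}, {"E", "S"}, ring_slot_free=True))
print(route_flit({"N"}, {"S", "W"}, ring_slot_free=True))
print(route_flit({"N"}, {"S", "W"}, ring_slot_free=False))
```

Picking the alphabetically first port is only a deterministic stand-in for whatever arbitration the real hardware applies.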

We compared RIDER to four previously proposed routers: WH, VC, CHIPPER, and MinBD. The performance evaluation shows that RIDER outperforms the other routers under all tested conditions. Evaluation with synthetic traffic patterns shows that RIDER’s network power consumption is lower than that of the other routers under medium and high load, but slightly higher than that of CHIPPER and MinBD under light load. Evaluation with the SPLASH-2 application benchmarks, taking into account the run time of the applications, network power, and a typical 10 % ratio of network power to total chip power, shows that RIDER reduces the total energy required to run the benchmarks compared to the other routers.

References

1. Carlson TE, Heirman W, Eeckhout L (2011) SNIPER: exploring the level of abstraction for scalable and accurate parallel multi-core simulation. In: International conference for high performance computing, networking, storage and analysis, p 52

2. Fallin C, Craik C, Mutlu O (2011) CHIPPER: a low-complexity bufferless deflection router. In: 17th international symposium on high performance computer architecture, pp 144–155

3. Fallin C, Nazario G, Yu X, Chang K, Ausavarungnirun R, Mutlu O (2012) MinBD: minimally-buffered deflection routing for energy-efficient interconnect. In: Sixth international symposium on networks-on-chip, pp 1–10

4. Howard J, Dighe S, Vangal SR, Ruhl G, Borkar N, Jain S, Erraguntla V, Konow M, Riepen M, Gries M et al (2011) A 48-core IA-32 processor in 45 nm CMOS using on-die message-passing and DVFS for performance and power scaling. IEEE J Solid State Circuits 46(1):173–183

5. Jiang N, Becker DU, Michelogiannakis G, Balfour J, Towles B, Shaw D, Kim J, Dally W (2013) A detailed and flexible cycle-accurate network-on-chip simulator. In: IEEE international symposium on performance analysis of systems and software (ISPASS), pp 86–96

6. Jose J, Nayak B, Kumar K, Mutyam M (2013) DeBAR: deflection based adaptive router with minimal buffering. In: Conference on design, automation and test in Europe, pp 1583–1588

7. Kodi AK, Sarathy A, Louri A (2008) iDEAL: inter-router dual-function energy and area-efficient links for network-on-chip (NoC) architectures. ACM SIGARCH Comput Archit News 36:241–250

8. Konstantinidou S, Snyder L (1994) The chaos router. IEEE Trans Comput 43(12):1386–1397

9. Kumar A, Peh LS, Kundu P, Jha NK (2007) Express virtual channels: towards the ideal interconnection fabric. ACM SIGARCH Comput Archit News 35:150–161

10. Kumar A, Kundu P, Singh A, Peh LS, Jha N (2007) A 4.6 Tbits/s 3.6 GHz single-cycle NoC router with a novel switch allocator in 65 nm CMOS. In: 25th international conference on computer design, pp 63–70

11. Lu Z, Zhong M, Jantsch A (2006) Evaluation of on-chip networks using deflection routing. In: 16th ACM great lakes symposium on VLSI, pp 296–301

12. Michelogiannakis G, Sanchez D, Dally WJ, Kozyrakis C (2010) Evaluating bufferless flow control for on-chip networks. In: Fourth international symposium on networks-on-chip, pp 9–16

13. Moscibroda T, Mutlu O (2009) A case for bufferless routing in on-chip networks. ACM SIGARCH Comput Archit News 37(3):196–207

14. Mullins R, West A, Moore S (2004) Low-latency virtual-channel routers for on-chip networks. ACM SIGARCH Comput Archit News 32(2):188

15. Nicopoulos CA, Park D, Kim J, Vijaykrishnan N, Yousif MS, Das CR (2006) ViChaR: a dynamic virtual channel regulator for network-on-chip routers. In: 39th annual international symposium on microarchitecture, pp 333–346

16. Oxman G, Weiss S, Birk YT (2012) Buffered deflection routing for networks-on-chip. In: Interconnection network architecture: on-chip, multi-chip workshop, pp 9–12

17. Peh LS, Dally WJ (2001) A delay model for router microarchitectures. IEEE Micro 21(1):26–34

18. Ramanujam RS, Soteriou V, Lin B, Peh LS (2010) Design of a high-throughput distributed shared-buffer NoC router. In: Fourth international symposium on networks-on-chip, pp 69–78

19. Sun C, Chen CHO, Kurian G, Wei L, Miller J, Agarwal A, Peh LS, Stojanovic V (2012) DSENT - a tool connecting emerging photonics with electronics for opto-electronic networks-on-chip modeling. In: Sixth international symposium on networks-on-chip, pp 201–210

20. Woo SC, Ohara M, Torrie E, Singh JP, Gupta A (1995) The SPLASH-2 programs: characterization and methodological considerations. ACM SIGARCH Comput Archit News 23(2):24–36
