Express Virtual Channels: Towards the Ideal Interconnection Fabric

Amit Kumar†, Li-Shiuan Peh†, Partha Kundu‡ and Niraj K. Jha†
†Dept. of Electrical Engineering, Princeton University, Princeton, NJ 08544
‡Microprocessor Technology Labs, Intel Corp., Santa Clara, CA 95052
†{amitk, peh, jha}@princeton.edu, ‡partha.kundu@intel.com

ABSTRACT

Due to wire delay scalability and bandwidth limitations inherent in shared buses and dedicated links, packet-switched on-chip interconnection networks are fast emerging as the pervasive communication fabric to connect different processing elements in many-core chips. However, current state-of-the-art packet-switched networks rely on complex routers, which increases the communication overhead and energy consumption as compared to the ideal interconnection fabric.

In this paper, we try to close the gap between the state-of-the-art packet-switched network and the ideal interconnect by proposing express virtual channels (EVCs), a novel flow control mechanism which allows packets to virtually bypass intermediate routers along their path in a completely non-speculative fashion, thereby lowering the energy/delay towards that of a dedicated wire while simultaneously approaching ideal throughput with a practical design suitable for on-chip networks.

Our evaluation results using a detailed cycle-accurate simulator on a range of synthetic traffic and SPLASH benchmark traces show up to 84% reduction in packet latency and up to 23% improvement in throughput while reducing the average router energy consumption by up to 38% over an existing state-of-the-art packet-switched design. When compared to the ideal interconnect, EVCs add just two cycles to the no-load latency, and are within 14% of the ideal throughput. Moreover, we show that the proposed design incurs a minimal hardware overhead while exhibiting excellent scalability with increasing network sizes.

Categories and Subject Descriptors: C.2.1 [Computer Systems Organization]: Network Architecture and Design - Packet-switching
General Terms: Design, Management, Performance
Keywords: Flow control, Packet-switching, Router design

1. INTRODUCTION

Driven by technology limitations to wire scaling and increasing bandwidth demands [1,2], packet-switched on-chip networks are fast replacing shared buses and dedicated wires as the de facto interconnection fabric in general-purpose chip multi-processors (CMPs) [3,4] and application-specific systems-on-a-chip (SoCs) [5–8]. While there has been significant work on interconnection networks for multiprocessors, the design of on-chip networks, which face unique design constraints, is a relatively new research area. Ultra-low latency and scalable, high-bandwidth communication is critical in on-chip communication fabrics in order to support a wide range of applications with diverse traffic characteristics. Moreover, these fabrics need to adhere to tight area and power budgets with tractable hardware complexity.

Modern state-of-the-art on-chip network designs use a modular packet-switched fabric in which network channels are shared over multiple packet flows. Even though this enables high bandwidth, it comes with a significant delay, energy and area overhead due to the need for complex routers. Packets need to compete for resources on a hop-by-hop basis while going through a complex router pipeline before traversing the output link at each intermediate node along their path. Thus, the packet energy/delay in such networks is dominated largely by contention at intermediate routers, resulting in a high router-to-link energy/delay ratio. In other words, the gap between current state-of-the-art networks and the ideal interconnect, in which all nodes are connected by pair-wise dedicated wires, is quite large.

In this work, we propose express virtual channels (EVCs), a novel flow control and router microarchitecture design which tries to close the performance and energy gaps between the state-of-the-art packetized on-chip network and the ideal interconnection fabric. EVC-based flow control approaches the delay and energy of a dedicated link by allowing packets to virtually bypass intermediate routers along pre-defined virtual express paths between pairs of nodes. Thus, EVCs allow packets to skip the entire router pipeline at intermediate nodes and approach the energy/delay of a dedicated wire interconnect. Intuitively, this is achieved by statically designating a set of EVCs at each router that always connect nodes A and B that are k hops away, and prioritizing EVCs over normal virtual channels (NVCs) [9] at the intermediate nodes. For instance, Fig. 1(a) shows a 7×7 2D mesh network with three-hop EVCs (k = 3), with the dotted lines depicting EVCs; EVCs are not additional physical channels, but virtual channels (VCs) [9] that share existing physical links. Traveling from node 00 to node 03 can then be done through an EVC which virtually skips the router pipelines at nodes 01 and 02. Fig. 1(b) shows an example of a typical packet route using EVCs, where a packet traveling from node 01 to node 56 skips the router pipeline at nodes 04, 05, 16 and 26. Moving further, this work also proposes dynamic EVCs, which use EVCs of varying lengths. With dynamic EVCs, packets at any node are allowed to choose among a range of EVCs and hop onto the one which is most suitable along their route towards their destination, thereby maximizing the use of EVCs.

In addition to lowering latency to that of a dedicated link, EVCs also reduce the amount of buffering and average router switching activity, which makes them energy- and area-efficient. Moreover, as EVCs skip through arbitration at intermediate nodes, they reduce contention and push throughput. Hence, EVCs lead to savings in latency, throughput as well as energy, unlike prior work which tends to trade off one for the other (see Section 6 for a detailed discussion). Circuit switching [10] allows communications to approach the latency of dedicated wires (if the costly setup delay can be amortized) but, by dedicating physical bandwidth to a message flow, suffers in throughput. Similarly, express cubes [11], in using physical express channels, trade off throughput when the express channels are not highly utilized. While recent work has proposed speculation [12,13] to cut down the router critical path delay by parallelizing multiple pipeline stages, such techniques show diminishing returns with increasing network traffic when the speculation failure rate becomes high. EVCs, by virtually bypassing nodes in a non-speculative fashion, overcome these problems and are able to simultaneously approach the energy/delay/throughput of the ideal interconnection fabric.

In this paper, we first motivate the need for EVCs by highlighting the existing gap between current state-of-the-art packetized networks and the ideal network, both in terms of performance and energy consumption (Section 2). We then explain the details of EVCs and their dynamic variant in Section 3, before presenting the detailed microarchitecture and circuit schematics of the EVC router in Section 4. We evaluated EVCs using a cycle-accurate network simulator considering both synthetic traffic and traffic traces gathered from the execution of the SPLASH-2 benchmark suite [14]. Our results in Section 5 show up to 84% reduction in packet latency and up to 23% improvement in throughput while reducing the average router energy consumption by up to 38% as compared to an existing state-of-the-art packet-switched design. Section 6 contrasts EVCs with prior related work, while Section 7 concludes the paper.

2. MOTIVATION

In this section, we present a motivating case study that highlights the latency-throughput performance and energy gap between the ideal interconnection fabric and an existing baseline design that incorporates several state-of-the-art router microarchitectural features that were recently proposed to tackle network latency, throughput and energy.

2.1 Baseline state-of-the-art router

Fig. 2(a) shows the microarchitecture of our baseline state-of-the-art VC router. For simplicity, we assume a two-dimensional mesh topology throughout this paper, though the router microarchitectures presented readily extend to other topologies. Thus, the router has five input and output ports corresponding to the four neighboring directions and the local processing element (PE) port. The major components which constitute the router are the input buffers, route computation logic, VC allocator, switch allocator and crossbar switch.

Fig. 3(a) shows the base router pipeline on which we will progressively add state-of-the-art router microarchitectural features. Since on-chip designs need to adhere to tight area budgets and low router footprints, we assume flit-level¹ buffering and credit-based VC flow control [9] at every router, as opposed to packet-level buffering. A head flit, on arriving at an input port, first gets decoded and buffered according to its input VC in the buffer write (BW) pipeline stage. In the next stage, the routing logic performs route computation (RC) to figure out the output port for the packet. The header then arbitrates for a VC corresponding to its output port in the VC allocation (VA) stage. Upon successful allocation of a VC, it proceeds to the switch allocation (SA) stage, where it arbitrates for the switch input and output ports. On winning the output port, the flit then proceeds to the switch traversal (ST) stage, where it traverses the crossbar. This is followed by link traversal (LT) to travel to the next node. Body and tail flits follow a similar pipeline, except that they do not go through the RC and VA stages, instead inheriting the VC allocated by the head flit. The tail flit, on leaving the router, deallocates the VC reserved by the header.
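As a compact illustration of the stage sequencing just described, the following Python sketch walks flits through the baseline logical pipeline; the names and the one-cycle-per-stage assumption are ours, not from the paper.

    HEAD_STAGES = ["BW", "RC", "VA", "SA", "ST", "LT"]
    BODY_TAIL_STAGES = ["BW", "SA", "ST", "LT"]  # body/tail flits inherit the head flit's VC

    def stages_for(flit_type):
        """Logical pipeline stages a flit passes through at one hop of the
        base router of Fig. 3(a)."""
        return HEAD_STAGES if flit_type == "head" else BODY_TAIL_STAGES

    # With one cycle per stage, a head flit spends six cycles per hop and
    # body/tail flits four (they skip RC and VA).
    assert len(stages_for("head")) == 6 and len(stages_for("body")) == 4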

To remove the serialization delay due to routing, prior work has proposed lookahead routing (LA) [15], where the route of a packet is determined one hop in advance, thereby enabling flits to compete for VCs immediately after the BW stage. Fig. 3(b) shows the router pipeline with lookahead routing.

¹A flit is part of a packet, and the smallest unit of flow control. A packet consists of a head flit, followed by body flits, and ends with a tail flit.

Pipeline bypassing (BY) is another technique that is commonly used to further shorten the router critical path by allowing a flit to speculatively enter the ST stage if there are no flits ahead of it in the input buffer queue. Fig. 3(c) shows the pipeline where a flit goes through a single stage of switch setup, during which the crossbar is set up for flit traversal in the next cycle while simultaneously allocating a free VC corresponding to the desired output port, followed by ST and LT. The allocation is aborted upon a port conflict. When the router ports are busy, thereby disallowing pipeline bypassing, aggressive speculation (SP) can be used to cut down the critical path [12,13,16]. Fig. 3(d) shows the router pipeline where VA and SA take place in parallel in the same cycle. If the speculation succeeds, the flit directly enters the ST pipestage. However, when speculation fails, the flit needs to go through these pipestages again, depending on where the speculation failed.

The baseline router used in this study incorporates all the above-mentioned state-of-the-art techniques.

Router microarchitectural components: We next detail the microarchitectures we assumed for the different components within the baseline router. In order to make the design area- and energy-efficient, we assume single-ported buffers and a single shared port into the crossbar from each input. Separable VC and switch allocators modeled closely after the designs in [13] are assumed, as they are fast and of low complexity while still providing reasonable throughput, making them suitable for the high clock frequencies and tight area budgets of on-chip networks. We also incorporate router microarchitectural optimizations for energy into our baseline design. First, write-through input buffers [17] are used, which save buffer read energy whenever a flit is able to directly bypass to the ST stage. Second, we adopt the cut-through crossbar design [17], which sacrifices the full connectivity provided by a matrix crossbar to reduce the area and energy overhead. In this work, we assume dimension-ordered routing, for which a cut-through design shows no performance degradation due to reduced connectivity.

2.2 Ideal interconnection fabric

We next discuss the characteristics of the ideal interconnection fabric.

Ideal latency: A network with ideal latency is one in which data travel on dedicated pipelined wires directly connecting their source and destination. The packet latency Tideal in such a network is governed only by the average wire length D (the Manhattan distance) between the source and destination, the packet size L, the channel bandwidth b, and the propagation velocity v:

Tideal = D/v + L/b (1)

The first term corresponds to the time of flight, which is the time spent traversing the interconnect, and the second to the serialization latency, which is the time taken by a packet of length L to cross a channel with bandwidth b.

In a packet-switched network, the sharing and multiplexing of links between multiple source-destination flows result in an increased packet transmission latency T, which is defined as the time elapsed between the first flit of the packet being injected at the source node and the last flit being ejected at the destination:

T = D/v + L/b + H · Trouter + Tc (2)

where H is the average hop count, Trouter the delay through a single router, and Tc the delay due to contention [10]. The third term in Equation (2) corresponds to the time spent in the router coordinating the multiplexing of packets, while the fourth term is the contention delay spent waiting for resources. While a packet-switched network adds router pipeline and contention delay, the ideal network, in assuming that every tile is interconnected with every other tile, requires an enormous amount of global interconnect that has a detrimental effect on the overall chip dimension, and thus on D. In such a scenario, the wiring along the network bisection plane, which has the maximum number of wire tracks, forms the limiting factor. The chip edge length required to accommodate the total bisection wiring can be calculated as

[Figure 1: EVC network on a 7×7 mesh (solid lines are NVCs, dotted ones are EVCs). (a) An EVC network. (b) VCs acquired from nodes 01 to 56.]

[Figure 2: Baseline router microarchitecture, design and layout. (a) Baseline router microarchitecture: per-port input buffers (VC 1 through VC n), route computation logic, VC allocator, switch allocator and crossbar switch connecting inputs 0-4 to outputs 0-4. (b) Five-port router layout with 128-bit links to the North, East, South, West and Local ports: input buffers are housed within the respective input ports, while VC, SA and control logic are grouped within Control.]

Ledge = 2 · (N/2) · (N/2) · Wpitch · cwidth (3)

where Ledge is the chip edge length, N the number of tiles, Wpitch the wire pitch and cwidth the channel width.

We use the data from [18] to calculate the maximum wire length which can be driven in a cycle, assuming delay-optimal insertion of repeaters and flip-flops. Assuming uniform placement of tiles, the average wire length D can then be derived as:

D = H · Ledge/√N (4)

It should be pointed out that such an ideal interconnection fabric, where each node has a dedicated interconnect to every other node, is not feasible in practice: a 7×7 network will require a chip size of 4760 mm² (Ledge = 69 mm), way beyond the ITRS projection [1] of a chip size of 310 mm² for high-performance chips.

Ideal energy: The energy consumption Eideal of a packet in an ideal network is given by

Eideal = L/b · D · Pwire (5)

where D is the Manhattan distance between source and destination and Pwire the interconnect transmission power per unit length.

Again, multiplexing of packets in networks leads to additional energy consumption, with the energy E required to transmit a packet given by

E = L/b · (D · Pwire + H · Prouter) (6)

where Prouter is the average router power. Prouter is composed of buffer read/write power, power spent in arbitrating for VCs and switch ports, and the crossbar traversal power [19].

Ideal throughput: Network throughput, which is defined as the data rate in bits per second that the network can accept per input port before saturation, is largely determined by the topology, flow control and routing mechanism. Given a particular topology, the ideal-throughput network is one which employs perfect flow control and routing to balance the network traffic over alternative paths while leaving no idle cycles on the bottleneck channels.
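Equations (1), (2), (5) and (6) can be evaluated directly; the following sketch transcribes them into Python (function and parameter names are ours) so that the router and contention terms separating the two models are explicit.

    def ideal_latency(D, v, L, b):
        """Eq. (1): time of flight (D/v) plus serialization latency (L/b)."""
        return D / v + L / b

    def packet_latency(D, v, L, b, H, t_router, t_contention):
        """Eq. (2): Eq. (1) plus per-hop router delay and contention delay."""
        return ideal_latency(D, v, L, b) + H * t_router + t_contention

    def ideal_energy(D, L, b, p_wire):
        """Eq. (5): wire transmission energy only."""
        return (L / b) * D * p_wire

    def packet_energy(D, L, b, p_wire, H, p_router):
        """Eq. (6): adds the average router power drawn at each of H hops."""
        return (L / b) * (D * p_wire + H * p_router)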

To study the ideal throughput, we simulated a network with unconstrained buffer capacity, unlimited VCs², and perfect switch allocation. Since this paper targets microarchitectural optimizations, we keep the topology and routing algorithm consistent between the baseline, ideal and proposed designs.

2.3 Existing gap

Here, we compare the state-of-the-art baseline design and the ideal network in terms of latency, energy and throughput, noting the significant gap that still exists and motivating the need for further router microarchitectural innovations to bridge this gap.

Simulation methodology: Table 1 in Section 5.1 shows the network parameters assumed for this study. Microarchitectural parameters, such as the number of VCs and buffers, were experimentally obtained by ensuring that they led to the best performance for the specific traffic pattern. Details of the simulation infrastructure are discussed in Section 5.

Design and pipeline sizing methodology: Fig. 2(b) shows the router layout we assumed throughout this paper. The buffers are laid out on the two sides of the router with the crossbar/switch in the center. In this layout, the height of the router is determined by the buffer height, while the width of the router is determined by the wire pitch.

For each microarchitectural component, we design the circuit schematics to estimate the pipeline delay of each logical pipeline stage and size the router pipeline appropriately. The clock rate of the router (a 3 GHz frequency, corresponding to a cycle time of 18 fanout-of-four (FO4) gate delays, excluding clock setup) was chosen based on being able to switch through the crossbar in a single cycle. The BW stage is used to write the flit into the buffers as well as to set up all control signals in preparation for the next pipe stage (VA). It is also in the BW stage that we pre-compute the route (RC). The VA stage presents the longest critical path in our design, and our modeling ascertains that the entire VA stage can be accommodated within a single 18 FO4 cycle. The physical pipeline thus corresponds to that shown in Fig. 3.

Results: Fig. 4(a) compares the latency of the ideal network with that of the baseline network as a function of increasing network traffic.

²In actual simulations, this is mimicked by having very large numbers of buffers and VCs and verifying that a further increase in both resources does not lead to better performance.

[Figure 3: Router pipeline (BW: Buffer Write, RC: Route Computation, VA: Virtual Channel Allocation, SA: Switch Allocation, ST: Switch Traversal, LT: Link Traversal). (a) Base router pipeline (BASE). (b) Lookahead routing (BASE+LA). (c) Bypass pipeline (BASE+LA+BY). (d) Speculative pipeline (BASE+LA+BY+SP).]
The latency of the ideal network, which uses dedicated links between any pair of tiles, remains constant irrespective of the network load. For the baseline, we show the impact on network latency as we progressively incorporate lookahead routing, pipeline bypassing and aggressive speculation. Under very low network load (< 15% capacity), when most router ports are free, bypassing is able to significantly reduce packet latencies. Using speculation along with bypassing lowers the latency further with increasing load (< 30% capacity). However, as traffic increases, contention for network resources becomes higher, which leads to a higher probability of failed speculations and results in increasing latency until the network reaches the saturation point. Hence, it can be seen that router overhead leads to a significant latency gap between packet-switched networks and the ideal network.

Fig. 4(a) also highlights the significant throughput gap between the baseline and the ideal-throughput network, with the baseline's separable allocators only achieving 70% of the capacity of the ideal-throughput fabric.

Fig. 4(b) shows the gap between the energy consumption of the ideal network and that of the baseline network. Despite the baseline incorporating energy-efficient microarchitectural features, there still exists a substantial energy gap due to the additional buffering, switching and arbitration energy consumed in a router, which increases with network load until saturation is reached.

3. EXPRESS VIRTUAL CHANNELS: TOWARDS THE IDEAL INTERCONNECTION FABRIC

As explained in Section 2, the sharing and multiplexing of data on links in interconnection networks come at the cost of complex routers, which contribute additional overhead in terms of packet latency and energy due to the router pipeline and resource contention, while degrading throughput as a result of imperfect allocation of the limited bandwidth.

By virtually bypassing routers, EVCs, as proposed in this work, remove Trouter and Erouter at bypass hops, thus lowering energy/delay. As EVCs do not participate in arbitrations, they also lower Tc and improve allocation efficiency, thereby pushing throughput. Moreover, since the express paths created by EVCs are virtual as opposed to physical links, numerous EVCs of varying lengths can be used to connect nodes without a proportionate wiring area overhead. This allows packets to use EVCs which are tailored to their specific route and, hence, facilitates dynamic adaptation to different traffic patterns, which further improves network performance and energy.

In the rest of this section, we explain the details of EVCs, while Section 4 delves into the detailed microarchitectural design of the EVC-based router.

3.1 Static EVCs

Here, we first present the details of EVCs using a static design which uses express paths of uniform length. All nodes in a static EVC network are distinguished as either an EVC source/sink node or a bypass node. A node is an EVC source/sink along a specific dimension if an EVC along that dimension originates/terminates at that node. Bypass nodes, on the other hand, do not act as sources and sinks of EVCs and are the ones which are virtually bypassed by packets traveling on EVCs. For example, in Fig. 1(a), node 00 is an EVC source/sink for both the X and Y dimensions, while node 13 (04) is an EVC source/sink node along the X (Y) dimension. Nodes 01 and 02 (10 and 20) are examples of bypass nodes along the X (Y) dimension.

The entire set of VCs is divided into two types:

• NVCs: VCs which are allocated just like in traditional VC flow control [9] and are responsible for carrying a packet through a single hop.

• k-hop EVCs: VCs which carry the packet through k consecutive hops (where k is the fixed length of the EVC and is uniform throughout the network).

Bypass nodes only support the allocation of NVCs, not EVCs, with EVCs bypassing their router pipelines. Therefore, packets can acquire EVCs along a particular dimension only at EVC source/sink nodes. When a packet traveling on an EVC reaches a bypass node, it bypasses the entire router pipeline, skipping VC allocation as it continues on the same EVC it currently holds. It does not need to go through switch allocation either, as EVCs are prioritized over NVCs and are thus able to gain automatic passage through the switch without any contention. In other words, a packet traveling on a k-hop EVC traverses the next k − 1 nodes without having to go through the router pipeline. Thus, a packet tries to traverse as many EVCs as possible along its route from the source to the destination. NVCs are only used to reach an EVC source/sink in order to hop onto an EVC, or when the hop count in a dimension is less than k, the EVC length. Fig. 1(b) shown earlier depicts the VCs acquired by a packet traveling from node 01 to node 56 using deterministic XY routing. Here, the packet travels on NVCs from node 01 to 03, EVCs from node 03 to 06 and then 06 to 36, and finally NVCs from node 36 to 56. While connecting nodes using the virtual express lanes provided by EVCs, it should be noted that EVCs are restricted to connecting nodes only along a single dimension and cannot be allowed to turn. Thus, packets are required to go through the router pipeline and change VCs when turning to a different dimension. This restriction is required to avoid conflicts between multiple EVC paths.

Impact on latency: Fig. 5(a) shows the non-express router pipeline for head, body and tail flits. Flits go through this pipeline at an EVC source/sink node or when they arrive at a bypass node on an NVC. The number of logical stages in this pipeline is identical to that in the baseline router (BASE+LA+BY+SP) described in Section 2. Fig. 5(b) shows the express router pipeline. A flit goes through this pipeline whenever it bypasses a node virtually, which happens when it arrives at a bypass node on an EVC. As a flit traveling on an EVC will continue on that EVC, there is no need to go through VA. It can also skip SA, as EVCs are granted higher priority over NVCs and are automatically granted switch passage. Since no allocation is needed, an EVC flit can bypass BW and head directly to ST, followed by LT, to the next node, at the end of which the flit gets latched. This pipeline, however, requires the switch to be set up a priori. We do this by sending a lookahead signal over a single-bit wire, which goes one cycle ahead to set up the switch at every intermediate hop which is bypassed, before the actual flit starts traveling on an EVC.
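The pipeline choice a static-EVC node applies to an arriving flit can be summarized in a few lines; this is our own minimal sketch using the stage labels of Fig. 5, not code from the paper.

    def pipeline_at_node(flit_on_evc, node_is_bypass, aggressive=False):
        """Stages a flit sees at one node of a static EVC network."""
        if flit_on_evc and node_is_bypass:
            # Express pipeline of Fig. 5(b); the aggressive variant of
            # Fig. 5(c) skips the crossbar too, leaving only link traversal.
            return ["LT"] if aggressive else ["ST", "LT"]
        # Non-express pipeline of Fig. 5(a): identical to the speculative
        # baseline, with VA and SA attempted in parallel.
        return ["BW", "VA/SA", "ST", "LT"]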

[Figure 4: Existing gap between the state-of-the-art baseline and the ideal interconnection fabric. (a) Latency-throughput gap: latency (cycles) versus injected load (fraction of capacity) for the ideal latency, ideal throughput, BASE+LA, BASE+LA+BY and BASE+LA+BY+SP configurations. (b) Energy gap: network energy (mJ) versus injected load for the baseline and ideal networks.]

Aggressive tailoring of the switch for EVCs can further shorten the express pipeline by removing the ST stage and allowing EVC flits to bypass the crossbar as well. Fig. 5(c) shows this pipeline. As can be seen, the pipeline is now reduced to just link traversal, approaching that of the ideal interconnect. Note that EVCs enable bypassing of the router pipeline at all levels of network loading, unlike prior techniques like bypassing or speculation, which are effective only under low network load.

Impact on energy: Unlike speculation-based techniques, EVCs also lead to a significant reduction in network energy consumption. They do so by targeting the per-hop router energy Erouter, which is given as:

Erouter = Ebuffer_write + Ebuffer_read + Evc_arb + Esw_arb + Exb (7)

where Ebuffer_write and Ebuffer_read are the buffer write and read energy, Evc_arb is the VC arbitration energy, Esw_arb is the switch arbitration energy, and Exb is the energy required to traverse the crossbar switch.

While traveling on an EVC, a packet is able to bypass the router pipeline of intermediate nodes, without the need to get buffered or to arbitrate for a VC or the switch port. This in effect saves Ebuffer_write, Ebuffer_read, Evc_arb and Esw_arb, thereby significantly reducing Erouter and approaching the ideal energy. Moreover, since bypass nodes support only NVCs and do not buffer EVC flits, the total amount of buffering is reduced, which leads to a corresponding reduction in energy consumption and area. The aggressive express pipeline removes Exb as well, though the wire energy Ewire increases slightly because of the higher load (see Section 4).

Impact on throughput: Given a particular topology and routing strategy, network throughput is largely determined by the flow control mechanism. A perfect flow control is one which makes efficient use of network resources, leaving no idle cycles on the bottleneck channels. Using virtual express lanes, which effectively act as dedicated wires between pairs of nodes, EVC-based flow control is able to create partial communication flows in the network, thereby improving resource utilization and reducing the contention Tc at individual routers. Thus, packets spend less time waiting for resources at each router, which lowers the average queuing delay, allowing the network to push through more packets before saturation and hence approach the ideal throughput.
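The energy accounting of Eq. (7), and what a virtual bypass saves, can be captured as follows; the term names are ours, and the per-term values would come from an energy model such as Orion [19].

    # Per-hop router energy terms of Eq. (7).
    ROUTER_ENERGY_TERMS = ("buffer_write", "buffer_read", "vc_arb", "sw_arb", "xb")

    def e_router(e, bypassed=False, aggressive=False):
        """Energy a flit spends at one hop, given a dict of Eq. (7) terms.

        A virtual bypass avoids buffering and both arbitrations, leaving
        only crossbar traversal; the aggressive express pipeline removes
        even that (at the cost of slightly higher wire energy)."""
        if bypassed:
            return 0.0 if aggressive else e["xb"]
        return sum(e[t] for t in ROUTER_ENERGY_TERMS)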

3.2 Dynamic EVCs

Using static EVCs of a fixed uniform length to connect source/sink nodes throughout the network, as discussed in the previous section, results in a constrained and asymmetric design. First, the classification of all nodes as bypass and source/sink leads to an asymmetry in the design. Packets originating at bypass nodes are forced to first travel on NVCs before they can acquire an EVC at a source/sink node. Moreover, static EVCs lead to non-optimal usage of express paths when packet hop counts do not match the static EVC length k. In this case, packets end up bypassing fewer nodes along their route. In other words, static EVCs are biased towards traffic originating at source/sink nodes and with a hop count equal to or a multiple of the EVC length. Dynamic EVCs overcome these problems by (a) making every node in the network a source/sink, and (b) allowing EVCs of varying lengths to originate from a node.

By making all nodes identical, dynamic EVCs lead to a symmetric design, ensuring fairness among nodes. Moreover, instead of having a fixed length, EVC lengths are allowed to vary between two hops and a maximum of lmax hops. This improves adaptivity by allowing packets to pick EVCs of appropriate lengths to best match their route. Fig. 6(a) shows dynamic EVCs along a particular dimension with lmax = 3. As can be seen, each node acts as a source/sink of EVCs of lengths two and three hops. Fig. 6(b) shows the VCs acquired by a packet traveling from node 01 to node 56 using XY routing in a network with lmax = 3. Since all nodes are source/sink nodes, the packet can acquire an EVC at its source node 01, taking the longest possible EVC (three-hop) to reach node 04. It then takes a two-hop EVC to reach node 06, followed by traversing a three-hop EVC in the Y dimension to reach node 36. Finally, the packet takes a two-hop EVC to reach its destination node 56. It can be seen that, as compared to static EVCs (Fig. 1(b)), dynamic EVCs can adapt to the exact route of the packet, thereby allowing it to bypass more nodes and, hence, resulting in better performance and energy characteristics.
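The longest-EVC-first policy in this example amounts to a greedy decomposition of the hops remaining in a dimension; the sketch below is our own illustration under that assumption, not code from the paper.

    def evc_segments(hops, lmax=3):
        """Greedily split the hops left in one dimension into EVC/NVC legs.

        EVCs span two to lmax hops; a leftover single hop can only use an
        NVC. For the 01 -> 56 route (five hops in X, then five in Y) this
        yields [("EVC", 3), ("EVC", 2)] per dimension, matching Fig. 6(b)."""
        segments = []
        while hops > 0:
            if hops == 1:
                segments.append(("NVC", 1))
                hops -= 1
            else:
                k = min(lmax, hops)
                segments.append(("EVC", k))
                hops -= k
        return segments

    assert evc_segments(5) == [("EVC", 3), ("EVC", 2)]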

Similar to static EVCs, dynamic EVCs are restricted to lie along a single dimension to prevent conflicts. It should be noted that overlapping of multiple EVC paths along the same dimension is allowed, since all overlapping EVCs use network links in a sequentially ordered fashion. From a node's perspective, it always prioritizes any EVC flits it sees over locally-buffered flits.

Implementation of dynamic EVCs requires partitioning the entire set of VCs at any router port between NVCs and EVCs of lengths from two through lmax hops. Thus, unlike static EVCs, which partition all VCs between two bins of NVCs and uniform-length EVCs, dynamic EVCs divide the VCs into a total of lmax bins. A simple scheme is to uniformly partition all VCs by allocating an equal number of them to each bin.

Routing flexibility using dynamic EVCs: The performance of dynamic EVCs can be further improved by allowing for flexibility in EVC traversals. While, normally, a packet tries to acquire the longest possible EVC along its path in order to bypass the maximum number of nodes, this scheme can be relaxed under certain scenarios when contention for EVCs of a particular length is high. For instance, consider a scenario where a packet has to travel p hops in a particular dimension, where p ≤ lmax. In this case, if a p-hop EVC is not available but a smaller-length EVC is free, the packet can choose to switch to the smaller-length EVC. Effectively, this implies that longer EVC traversals can be broken down into a combination of shorter EVC/NVC traversals in order to spread the traffic among all virtual paths and, hence, reduce contention.

In order to make effective use of the routing flexibility in EVCs, the VC pool partitioning should be made non-uniform, with more VCs allocated to virtual paths of smaller lengths. This is because longer EVCs can only be used by packets with larger hop counts and will remain unutilized if the traffic pattern has shorter distances to travel. On the other hand, by allocating more VCs to shorter paths, long-distance traffic always has the option of switching to a combination of shorter EVCs if contention for longer EVCs is high.

[Figure 5: EVC router pipelines. (a) Non-express pipeline. (b) Express pipeline. (c) Aggressive express pipeline.]

[Figure 6: Dynamic EVC network. (a) Dynamic EVCs along the X dimension (assuming lmax = 3). (b) VCs acquired from nodes 01 to 56.]

3.3 EVC buffer management

Buffered flow control techniques require a mechanism to manage buffers and communicate their availability between routers. Since a k-hop EVC creates a virtual lane between nodes which are k hops apart, it requires buffer availability information to be communicated across k hops so that flits are ensured of a free buffer at the downstream EVC sink node. In this section, we present two schemes for buffer management in an EVC-based network.

Statically-managed buffers: This scheme partitions the buffer pool at a given router port between all VCs, statically reserving a set of buffers for each VC. Credit-based flow control [10] is used, in which the upstream router maintains a count of the number of free buffers available downstream. This count is decremented each time a flit is sent out, thereby consuming a downstream buffer. On the other hand, when a flit leaves the downstream node and frees its associated buffer, a credit is sent back upstream and the corresponding free-buffer count is incremented.

The number of buffers reserved for each VC is based on the corresponding credit round-trip delay. Fig. 7(a) shows the timeline of buffer management for NVCs. Using the timeline, the NVC credit round-trip delay Tcrt,n can be calculated as:

Tcrt,n = 1 + p + 1 + r = p + r + 2 (8)

Equation (8) assumes that it takes one cycle for the credit to travel upstream, p cycles for flit processing at the upstream node, and one cycle for the flit to travel downstream, while r is the number of non-express router pipeline stages.

Fig. 7(b) shows a similar timeline for EVCs. Each k-hop EVC is allocated a minimum of Tcrt,e buffers:

Tcrt,e = c + p + 1 + 1 + 2(k − 1) + r = c + p + 2k + r (9)

assuming it takes c cycles for the credit to propagate upstream to the EVC source, p cycles for flit processing, one cycle for the lookahead signal, one cycle for the flit to traverse the link to reach the first node which is to be bypassed, 2(k − 1) cycles for the flit to propagate to the downstream EVC sink by bypassing k − 1 intermediate nodes (assuming the express router pipeline), and r cycles to traverse the non-express router pipeline at the downstream EVC sink. As Tcrt,e is larger than Tcrt,n, EVCs require deeper buffers than NVCs to cover the credit round-trip delay. Note that Tcrt,e can be reduced by using upper metal layers for credit signaling, thereby making c < k. In this paper, however, to minimize overhead and to be consistent with the baseline, we kept credit signals on the same metal layer, thus c = k. Moreover, Tcrt,e also decreases when using the aggressive express pipeline, due to a shorter bypass pipeline.
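Equations (8) and (9) tabulate directly; the helpers below are a sketch with our own names, defaulting to c = k as assumed in the text.

    def t_crt_nvc(p, r):
        """Eq. (8): one cycle credit upstream + p cycles processing + one
        cycle flit downstream + r non-express pipeline stages."""
        return p + r + 2

    def t_crt_evc(k, p, r, c=None):
        """Eq. (9): c cycles for the credit to reach the EVC source (c = k
        when credits share the data metal layer), p cycles processing, one
        lookahead cycle, one link cycle, 2(k - 1) bypass cycles, and r
        cycles at the downstream sink."""
        if c is None:
            c = k
        return c + p + 2 * k + r

    # With c = k, a k-hop EVC needs 3k - 2 more buffers per VC than an NVC,
    # e.g. seven extra for a three-hop EVC.
    assert t_crt_evc(3, p=2, r=3) - t_crt_nvc(p=2, r=3) == 7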

While static buffer management is easy to implement, it is inefficient in allocating buffers in the case of adversarial traffic. Consider the case when the majority of network traffic is nearest-neighbor, i.e., each node is communicating only with its immediate neighbor. In this case, EVCs are never used and the buffer space statically assigned to EVCs goes unutilized. Moreover, with Tcrt,e being directly proportional to k, longer EVCs require more buffers. This overhead can be large, especially in the case of dynamic EVCs, where each node sources and sinks multiple EVCs of varying lengths.

Dynamically-managed buffers: To reduce buffer overhead and improve utilization, dynamic buffer management can be used, in which buffers at each router are shared between all VCs. In this scheme, each router port maintains a pool of free buffers which can be allocated to any incoming EVC or NVC flit. Flow control is performed using threshold-based management, where a node sends a stop token to an upstream node to which it is connected (either virtually through EVCs or physically through NVCs) when the number of free buffers falls below a pre-calculated threshold, Thrk. On receiving this token, the upstream node stops sending flits to the corresponding downstream node. Conversely, as downstream buffers are freed and their number exceeds Thrk, a start token is sent upstream to signal a restart. To ensure freedom from deadlocks, one buffer slot is reserved for each VC. This ensures that a packet can make progress even if there are no buffers left in the free pool.

The value of Thrk is calculated based on the hop distance between communicating nodes, which for a virtual path depends on the corresponding EVC length k, and is given by:

Thrk = c + 2k − 1 (10)

where c is the number of cycles taken by the token to propagate to the upstream EVC source, while there can be a maximum of 2k − 1 flits in flight which have already left the EVC source and need to be ensured of a buffer downstream (assuming the express pipeline and one cycle for token processing). Again, the value of Thrk decreases when using the aggressive express pipeline, because of a shorter bypass pipeline and a correspondingly lower number of in-flight flits, and even though global wires can be used to reduce c, in this work we assume c = k. For NVCs, the threshold can be calculated by setting k to one. In the case of dynamic EVCs, multiple threshold values are used corresponding to each EVC length, with longer EVC paths being turned off first, followed by shorter EVCs, and finally NVCs, whereas the reverse happens as buffers become free.
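The stop/start logic implied by Eq. (10) can be sketched as below; the class and its interface are ours, and the per-VC reserved slot described above is deliberately omitted.

    def stop_threshold(k, c=None):
        """Eq. (10): free-buffer count below which a stop token is sent for
        a k-hop virtual path (k = 1 for NVCs); c = k when tokens share the
        data metal layer."""
        if c is None:
            c = k
        return c + 2 * k - 1

    class SharedBufferPool:
        """Threshold-based management of one port's shared buffer pool."""
        def __init__(self, free_buffers, lmax):
            self.free = free_buffers
            self.thr = {k: stop_threshold(k) for k in range(1, lmax + 1)}

        def tokens(self):
            # Longer EVC paths are throttled first as buffers run low and
            # restarted last as they are freed, then shorter EVCs, then NVCs.
            return {k: ("stop" if self.free < t else "start")
                    for k, t in self.thr.items()}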

To overcome the buffer overhead of static buffer management, we use dynamically-managed buffers in this work.

3.4 Starvation avoidance

In any network that pre-reserves bandwidth for specific message flows, a starvation scenario may arise when messages traveling on a pre-established circuit block other messages. Similarly, in the case of a network employing EVCs, the higher priority given to EVC flits can lead to a starvation scenario. More specifically, if a node along the path of an EVC always has incoming EVC flits to service, flits buffered locally at the node may never get a chance to use the physical channel.

[Figure 7: Credit round-trip delay with EVCs. (a) NVC credit flow between adjacent nodes. (b) EVC credit flow (for a three-hop EVC).]

A typical starvation scenario for static EVCs is depicted in Fig. 8(a), where NVC flits buffered to go out at the east output port of node 03 are starved by the EVC (shown in bold). One simple algorithm to avoid such scenarios involves each bypass node maintaining a count of the number of consecutive cycles for which it has served an EVC. After serving EVC flits for n consecutive cycles, a bypass node sends a Starvation_on token upstream to the EVC source node if it has NVC flits which are getting starved. Upon receiving this token, the EVC source node stops sending EVC flits on the corresponding link for the next p consecutive cycles. Hence, starved locally-buffered NVC flits can now be serviced. Here, n and p are design parameters which can be set empirically.

In the case of dynamic EVCs, starvation signaling needs to be done across more than one EVC source, as multiple EVC paths may be bypassing a node. Fig. 8(b) shows a typical starvation scenario (for a network with lmax = 3), where flits buffered to go out at the east output port of node 03 get starved by the bypassing EVC traffic. EVC flows which can potentially lead to this starvation scenario are depicted in bold. The starvation avoidance algorithm used is similar to the one used for static EVCs, with every node maintaining a count of the number of consecutive cycles for which it has served EVC flits. When this count exceeds a threshold n, and if that node has locally-buffered flits waiting to use the physical channel, it sends out a Starvation_on token. Considering the example in Fig. 8(b), when the count at node 03 exceeds n, it sends such a token to the neighboring node 02, which stops sending locally-buffered EVC flits (if any) along its east output port for the next p cycles. Node 02 then forwards the Starvation_on token to its next neighbor, which then stops sending EVC flits on its east output port as well. In general, forwarding of Starvation_on tokens is done for lmax − 1 hops to cover all potential bypassing EVC paths. Moreover, to reduce the wiring overhead of starvation signaling, if more than one node is getting starved simultaneously, Starvation_on tokens can be merged. For instance, if node 02 in the above example receives a Starvation_on token from node 03 and detects starvation for its local flits at the same time, it can merge its own Starvation_on token with the existing token (instead of creating a new token), and send it for forwarding to the next lmax − 1 hops (up to node 00).
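The counting scheme common to both variants can be sketched as follows; the class name and interface are ours, with n and p the empirically set design parameters from the text.

    class StarvationMonitor:
        """Per-node counter of consecutive cycles spent serving EVC flits."""
        def __init__(self, n, p):
            self.n, self.p = n, p
            self.consecutive_evc = 0

        def after_cycle(self, served_evc_flit, local_flits_waiting):
            """Returns True when a Starvation_on token should be sent
            upstream (and, for dynamic EVCs, forwarded lmax - 1 hops); the
            notified EVC source(s) then pause EVC traffic for p cycles."""
            self.consecutive_evc = self.consecutive_evc + 1 if served_evc_flit else 0
            return self.consecutive_evc >= self.n and local_flits_waiting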

Previously proposed deadlock-avoidance techniques can be readily used with EVCs. For instance, escape routes [10] can be created using only NVCs (similar to traditional VC flow control). Even though NVCs have a lower priority compared to EVCs, they are guaranteed to make progress using the starvation-avoidance algorithm of the EVC design.

4. EVC ROUTER MICROARCHITECTURE DESIGN

Fig. 9 shows the microarchitecture of a router in an EVC-based design. The differences from a generic router are shaded. For a bypass node in the static EVC design, the changes include the VC allocator, which now only handles NVCs, and the EVC latch, which holds the flit arriving on an EVC. The crossbar switch can remain unchanged, or can also be aggressively designed to bypass ST as well for EVC flits. On the other hand, for a source/sink node in the static EVC design, the entire set of VCs at each input port is divided between EVCs and NVCs, each with a separate allocator. The EVC latch is not required, as flits are not allowed to bypass source/sink nodes. In the case of dynamic EVCs, nodes are not classified as bypass and source/sink. All nodes have the same microarchitecture, with the differences from a generic router including the EVC latch and the division of VCs at each port into EVCs and NVCs (with a separate allocator for each), while the crossbar switch can again remain unchanged or be aggressively designed. Since connections between non-adjacent nodes in an EVC design are virtual as opposed to physical, an EVC design does not require any additional router ports as compared to the baseline (even though each node may be connected to many other nodes using multiple express connections), thereby imposing no significant increase in the area/energy overhead of individual routers.

We next detail the design of each microarchitectural component of the EVC-based router. Note that the same design and pipeline sizing methodology we used for the baseline router is applied to the EVC router.

Virtual channel allocator: In order to prevent head-of-line blocking, two separate sets of VC allocators are used at every node in a dynamic EVC design and at source/sink nodes in a static EVC design: one which allocates EVCs and the other NVCs. Depending on the output port and the number of hops left in the packet's next dimension, the packet either places a request to the NVC allocator (if the number of hops left in the next dimension is less than the uniform EVC length for static EVCs, or the smallest EVC length for dynamic EVCs) or otherwise to the EVC allocator. Bypass nodes in a static EVC design, on the other hand, allocate only NVCs and, hence, have only one VC allocator.

Switch design: For the express pipeline in Fig. 5(b), no modifications need to be made to the crossbar switch design, since the EVC latch in the input port is multiplexed with the NVC input buffers, sharing a single input port to the crossbar. However, to achieve the aggressive express pipeline in Fig. 5(c), where EVC flits bypass the switch altogether, the EVC latch needs to be physically located near the center of the router. This EVC latch acts as a staging latch, breaking the flit's path through the network and effectively bypassing the entire data-path of the router.

Buffer management circuit: Fig. 10 shows the block diagram for dynamically-managed buffers. At each input port, a router maintains a pool of free flit buffers. When a new flit arrives, it is assigned the buffer slot at the head of this pool in the BW pipeline stage, and the assigned buffer slot is stored in the packet control block. The function of the packet control block is to store the buffer slot pointers for tracking the buffers assigned to all flits corresponding to different input VCs. This pointer information is used to read out the flit from the buffer pool before it traverses the switch in the ST stage, after which the buffer slot is returned to the free pool. Since the pointer value can be prefetched before the ST stage begins, it does not add to the critical path. Apart from this, the buffer slot occupied by the header of every packet is marked as reserved for that packet and is not added to the free pool until the tail of that packet leaves the router. This is done to prevent deadlocks and ensure forward progress within each packet.
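A behavioral sketch of this free-list mechanism follows; the names are ours, and the header-slot reservation is noted but simplified away.

    from collections import deque

    class InputPortBufferPool:
        """Sketch of Fig. 10: one free list shared by all VCs at an input
        port, with per-VC pointer FIFOs playing the role of the packet
        control block."""
        def __init__(self, num_slots):
            self.free_list = deque(range(num_slots))
            self.vc_pointers = {}          # input VC -> FIFO of (slot, flit)

        def buffer_write(self, vc, flit):
            """BW stage: assign the slot at the head of the free list."""
            slot = self.free_list.popleft()
            self.vc_pointers.setdefault(vc, deque()).append((slot, flit))

        def switch_traversal(self, vc):
            """ST stage: read the flit out and return its slot to the pool.
            (The real circuit holds back each packet's header slot until the
            tail leaves the router; that detail is omitted here.)"""
            slot, flit = self.vc_pointers[vc].popleft()
            self.free_list.append(slot)
            return flit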

[Figure 8: Starvation scenario, with flits at the east output port of node 03 starved by bypassing EVC traffic (shown in bold). (a) Static EVCs. (b) Dynamic EVCs.]

[Figure 9: EVC router microarchitecture. (a) Bypass node in a static EVC network. (b) Source/sink node in a static EVC network. (c) Dynamic EVC network node.]

[Figure 10: Dynamic buffer management circuit.]

Reverse signaling overhead: As compared to the baseline network, an EVC-based design requires extra reverse wiring between nodes which are virtually connected, for flow control and starvation signaling. This overhead can be broken into the following categories.

VC signaling: These wires are used to signal the availability of VCs across nodes. For a baseline design with v VCs per port, the number of wires at any cross-section required for VC signaling along a particular direction, Wvc_signaling, is given by:

W_{vc\_signaling} = \lceil \log_2 v \rceil \qquad (11)

On the other hand, the corresponding W_{vc\_signaling} for a static EVC design with v VCs partitioned into n NVCs and e EVCs is given by:

W_{vc\_signaling} = \lceil \log_2 n \rceil + \lceil \log_2 e \rceil \qquad (12)

For a dynamic EVC network with a maximum EVC length of lmax, communication of VC availability has to be done across lmax nodes on either side. In this case, W_{vc\_signaling} is given by:

W_{vc\_signaling} = \lceil \log_2 n \rceil + \left\lceil \frac{(\log_2 m)\, l_{max} (l_{max}-1)}{2} \right\rceil, \quad m = \frac{e}{l_{max}-1} \qquad (13)

where m is the number of VCs for EVCs of a particular length (assuming all e EVCs are uniformly partitioned among EVCs with lengths of two through lmax hops).

Buffer threshold signaling: To signal buffer availability using on/off threshold tokens, the baseline network requires a single wire in each direction connecting adjacent nodes, while a static EVC design requires two wires (one connecting adjacent nodes and the other connecting to the node with which a fixed-length EVC connection exists). On the other hand, in a dynamic EVC design, each node needs to communicate buffer availability to lmax neighbors. We use 1-hot decoded wires for this purpose, giving a wiring overhead W_{thr\_signaling} in a particular direction of:

W_{thr\_signaling} = l_{max} \qquad (14)

Starvation signaling: Again, while a single wire between any pair of virtually connected nodes is sufficient for starvation signaling in static EVCs and in dynamic EVCs with lmax = 2, in general a dynamic EVC network requires starvation tokens to be communicated to lmax − 1 neighbors (as explained in Section 3.4), giving a starvation wiring overhead W_{starvation} of:

W_{starvation} = \lceil \log_2 (l_{max} - 1) \rceil \qquad (15)

Hence, while the baseline and static EVCs have comparable, constant reverse signaling overhead, this overhead for dynamic EVCs depends on lmax, with W_{vc\_signaling} increasing quadratically while W_{thr\_signaling} and W_{starvation} have a linear and logarithmic dependence on lmax, respectively. This wiring overhead as a percentage of the forward wiring (assuming 128-bit wide channels and eight VCs per port, uniformly partitioned between all VC bins) is 3% (4 wires) for the baseline, 5% (7 wires) for the static EVC design, 7% (9 wires) for a dynamic EVC design with lmax = 2, 14% (18 wires) for lmax = 3, and 20% (26 wires) for lmax = 4. Hence, the reverse wiring for a dynamic EVC network with lmax = 2 is comparable to that of the baseline and static EVCs and is a small fraction of the forward flit-wide wiring. This overhead, however, grows slowly with higher values of lmax.
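The following sketch (our own helper names) mechanically evaluates the closed-form expressions in Eqs. (11)-(15); it is illustrative only, since the totals quoted in the text combine these per-category counts for each configuration:

```python
import math

def w_vc_baseline(v: int) -> int:
    """Eq. (11): VC-availability signaling wires, baseline router
    with v VCs per port."""
    return math.ceil(math.log2(v))

def w_vc_static(n: int, e: int) -> int:
    """Eq. (12): static EVCs, v VCs split into n NVCs and e EVCs."""
    return math.ceil(math.log2(n)) + math.ceil(math.log2(e))

def w_vc_dynamic(n: int, e: int, lmax: int) -> int:
    """Eq. (13): dynamic EVCs; m VCs per EVC-length bin, with the
    e EVCs split uniformly over lengths 2..lmax."""
    m = e / (lmax - 1)
    return math.ceil(math.log2(n)) + math.ceil(
        math.log2(m) * lmax * (lmax - 1) / 2)

def w_thr_dynamic(lmax: int) -> int:
    """Eq. (14): 1-hot buffer-threshold wires to lmax neighbors."""
    return lmax

def w_starvation_dynamic(lmax: int) -> int:
    """Eq. (15): starvation tokens to lmax - 1 neighbors; a single
    wire suffices when lmax = 2, as noted in the text."""
    return max(1, math.ceil(math.log2(lmax - 1)))
```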

5. EVALUATION

In this section, we present a detailed evaluation of EVCs, exploring its design space while comparing it against the state-of-the-art baseline design (BASE+LA+BY+SP) and the ideal interconnect yardsticks described in Section 2.

5.1 Simulation infrastructure

To model network performance, we use a cycle-accurate microarchitecture simulator. The model may be stimulated either (a) by a synthetic traffic generator or (b) in trace mode using traffic traces. The network models all major components of the router pipeline at clock granularity, viz., buffer management, routing algorithm, VC and switch allocation, and flow control between routers, with each component's circuit schematics designed and pipeline stages sized via detailed critical path analysis (see Section 2). We use Orion [19], an architecture-level network energy model, to evaluate the energy consumption of routers and links. Orion includes detailed dynamic and leakage energy models [20] for all major router microarchitecture components, including input buffers, crossbar and arbitration logic, as well as network links.

Table 2 lists the EVC-specific parameters used in this study, while Table 1 lists the general process parameters and those of the baseline network. The entire set of VCs in an EVC network is partitioned into NVCs and EVCs. For static EVCs, the entire EVC set acts as a contiguous pool, whereas for dynamic EVCs it is further partitioned into lmax − 1 bins (Section 3.2). For better utilization of buffers, we use a dynamically-managed design (Section 3.3), with the entire buffer set at any router port shared among all VCs. The buffer size is kept the same as that in the baseline design. Unless otherwise mentioned, all EVC experiments use the aggressive EVC pipeline, with an EVC length of two hops for static EVCs. To keep the reverse wiring overhead comparable to the baseline, an lmax of two hops is also used for dynamic EVCs.

Evaluation using synthetic traffic uses packets of two sizes in equal proportion: single-flit short packets and long packets consisting of five flits. Each simulation run for synthetic traffic is 1 million cycles long. For all runs, a warm-up period of 100K cycles is assumed. Network saturation is defined as the point at which packet latency is three times the zero-load latency.
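The saturation rule above is simple to apply to a load sweep; a minimal sketch (our own function name, assuming a sorted sweep of average latencies) follows:

```python
def saturation_load(loads, latencies, zero_load_latency):
    """Return the first injected load at which average packet latency
    exceeds three times the zero-load latency (the saturation
    definition used here); None if the sweep never saturates."""
    for load, lat in zip(loads, latencies):
        if lat > 3 * zero_load_latency:
            return load
    return None
```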

The SPLASH-2 [14] traces were gathered by running the benchmarks on Bochs [21], a multiprocessor simulator with an embedded Linux 2.4 kernel. Each benchmark was run in Bochs with N concurrent threads, where N is the size of the chip/network, and the memory trace captured. This memory trace is then applied to a memory system simulator that models the classic MSI (Modified, Shared, Invalid) directory-based cache coherence protocol, with the home directory nodes statically assigned based on the least significant bits of the tag and distributed across all processors in the entire chip. Each processor node has a two-level cache (2MB L2 cache per node) that interfaces with a network router, and 4GB of off-chip main memory. Access latency to the L2 cache is derived from Cacti to be six cycles, whereas off-chip main memory access delay is assumed to be 200 cycles.
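A minimal sketch of the static home-node assignment described above (our own function name; for a non-power-of-two node count such as the 49-node 7×7 mesh, a modulo plays the role of taking the least significant bits of the tag):

```python
def home_directory_node(tag: int, num_nodes: int) -> int:
    """Statically assign a block's home directory from the low-order
    bits of its address tag, distributing directories across all
    processors in the chip."""
    return tag % num_nodes
```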

5.2 Uniform random traffic

Uniform random traffic assumes each node uniformly injects packets into the network with randomly distributed destinations. This highlights the best-case throughput for the baseline network. Here, we use uniform traffic to present a comparison of EVCs against the baseline. In all results, percentage improvements are reported with respect to the baseline.

Performance evaluation: Fig. 11(a) plots flit latencies for uniform random traffic for a 7×7 network as a function of the injected load. EVC lengths are assumed to be two hops. Clearly, the EVC-based design outperforms the baseline in terms of throughput as well as latency under all levels of network traffic. The latency reduction of static EVCs as compared to the baseline is 29.2% before the baseline saturates, whereas the corresponding reduction due to dynamic EVCs is 44.7%. Moreover, dynamic EVCs also lead to a significant improvement in throughput, with the network saturating at around 82% of network capacity.

When the network load is low, both EVCs and the baseline perform well, as packets are able to use pipeline bypassing to shorten the critical path. However, as the load increases, the latency gap between the baseline and EVCs widens. Although the baseline relies on aggressive speculation to reduce delay, such techniques fail more often and show a diminishing advantage with increasing traffic and contention. EVCs, on the other hand, are able to sustain low latencies with increasing traffic by creating virtual dedicated wires in the network, allowing packets to bypass intermediate nodes along their path (and hence reduce latency) in a completely non-speculative manner. Moreover, EVCs lower the average per-hop contention seen by a packet, thereby pushing throughput higher. Analyzing this further, we see that dynamic EVCs perform better than static EVCs by increasing the usage of the virtual lanes created by EVCs and, hence, the average number of nodes a packet bypasses; the latency reduction of dynamic EVCs over the static EVC design is around 67% before the static design saturates. Using EVCs, the no-load latency is reduced to 14.5 cycles, approaching the 11-cycle ideal interconnect delay.

Energy evaluation: Fig. 11(b) shows the normalized router energy consumption at 70% capacity (before the baseline saturates) for a 7×7 mesh network. As can be seen, EVCs significantly reduce overall router energy. This reduction is 21% for static EVCs and 24.5% for dynamic EVCs as compared to the baseline. It is mainly due to a reduction in buffer energy [by 25% (30%) for static (dynamic) EVCs] and crossbar energy [by 29% (33%) for static (dynamic) EVCs], owing to the bypassing of intermediate nodes when using EVCs. Fig. 11(c) highlights this fact by comparing the energy of different router components. Arbitration energy (which is mostly dominated by control logic) was found to be insignificant and is omitted from the figure. When compared with a baseline without the straight-through buffer and cut-through crossbar energy optimizations discussed in Section 2.1, the energy reduction using power-optimized dynamic EVCs increases to 53.4%.

It should be noted that our default parameters assume equal network buffer capacity for both EVCs and the baseline. Although EVCs achieve a much higher throughput using this configuration, the designer can make an energy-performance trade-off by reducing the buffer size for EVCs and, hence, lowering both dynamic and leakage energy at the expense of some loss in throughput.

5.3 Impact of different express pipelines

When using the non-aggressive express pipeline, where bypassing EVC flits need to go through switch traversal, improvements in packet energy and delay diminish by a small amount. Fig. 12 plots packet latency as a function of network load for a 7×7 mesh, with EVCs using a non-aggressive pipeline and assuming uniform random traffic. In the case of dynamic EVCs, the non-aggressive design leads to an 11.6% higher latency near saturation as compared to the aggressive pipeline. The router energy consumption goes up by 8%, mainly due to an increase in crossbar energy.

5.4 Impact on network contention

By allowing packets to bypass nodes virtually, EVCs ease contention for resources in the network. To study this effect, Fig. 13 compares the average contention delay, or the time which packets spend waiting for network resources, near saturation for the different designs in a 7×7 network. It can be seen that EVCs significantly reduce the level of contention in the network, with the contention delay reduced by 28.5% (47%) for static (dynamic) EVCs. This is because EVC flits do not need to win switch ports and VCs while bypassing a node, as they are prioritized to directly use the switch port, which is equivalent to 100% resource allocation efficiency while bypassing. This reduction in contention for network resources translates directly to higher throughput.

5.5 Effect of a larger network

To study the scalability of EVCs, we consider a 10×10 network with longer EVC lengths of three hops for static EVCs and lmax = 3 for dynamic EVCs (assuming uniform VC partitioning between EVCs of different lengths and the aggressive express pipeline). Fig. 14(a) plots packet latency as a function of network traffic (assuming uniform random traffic) for this network. As can be seen, EVCs continue to show a considerable performance gain as compared to the baseline, with the reduction in latency being 34.4% for static EVCs and 52.8% for dynamic EVCs just before the baseline saturates.

Table 1: Baseline process and network parameters

  Technology                          65 nm
  Vdd                                 1.1 V
  Vthreshold                          0.17 V
  Frequency                           3 GHz
  Topology                            7-ary 2-mesh
  Routing                             Dimension-ordered (DOR)
  Traffic                             Uniform random
  Number of router ports              5
  VCs per port                        8
  Buffers per port                    24
  Flit size/channel width (cwidth)    128 bits
  Link length                         1 mm
  Wire pitch (Wpitch)                 0.45 µm

Table 2: EVC-specific parameters

  EVC pipeline                        Aggressive express pipeline
  Buffer management                   Dynamic
  Buffers per port                    24

  Static EVC-specific parameters
    EVC length                        2 hops
    NVCs per port                     4
    EVCs per port                     4

  Dynamic EVC-specific parameters
    lmax                              2
    NVCs per port                     2
    EVCs per bin                      6

  Starvation-avoidance parameters
    n                                 20
    p                                 3

[Figure 11: Uniform random traffic results. (a) Network performance: latency (cycles) vs. injected load (fraction of capacity); (b) router energy, normalized; (c) energy saving from different router components (buffer, crossbar, leakage); each panel compares the baseline, static EVCs and dynamic EVCs.]

Moreover, the throughput for dynamic EVCs approaches 88% of network capacity (a 23% improvement over the baseline). This is mainly attributable to the fact that with longer and flexible EVC lengths, packets are able to bypass more hops along their path. Fig. 14(b) shows the normalized router energy at 75% capacity (before the baseline saturates). In this case, EVCs show a more pronounced reduction in energy, with the average router energy lowered by 23.5% for static EVCs and 38% for dynamic EVCs. Again, this is mainly attributable to a reduction in buffer energy [reduced by 29% (47%) using static (dynamic) EVCs] and crossbar energy [reduced by 32% (50%) for static (dynamic) EVCs]. However, as explained before, a network with lmax = 3 incurs a slightly higher reverse wiring overhead as compared to the baseline.

5.6 Dynamic EVC design space

In this section, we explore the design space of dynamic EVCs, focusing in particular on the routing flexibility in using EVCs and on trade-offs related to EVC lengths.

EVC routing flexibility: When using a dynamic EVC design with lmax > 2, performance can be further increased by allowing flexibility in choosing EVCs, as explained in Section 3.2. This is especially true for non-uniform traffic, as shown in Fig. 15, which plots latency as a function of injected load for a 7×7 network, assuming the shuffle traffic pattern. lmax is chosen to be four hops, thereby allowing packets to choose between EVCs of lengths two, three and four hops when route flexibility is used. The partitioning of VCs is made non-uniform, with a higher number of VCs given to smaller EVC lengths. Out of the total of eight VCs per port, two are assigned to NVCs, whereas the rest are divided as three VCs, two VCs and one VC for the two-, three- and four-hop EVC bins, respectively. As compared to a design without route flexibility, a latency reduction of 26% is seen near saturation, along with a slight improvement in throughput. This is mainly attributable to a significant lowering of contention for VCs due to the spreading of packets among different virtual paths, owing to the flexibility in choosing EVCs of different lengths; a sketch of such a selection policy follows.
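The sketch below (our own function and names, a minimal illustration rather than the actual router logic; the fallback to NVCs when all bins are busy is our assumption) shows how a packet could pick an EVC-length bin with route flexibility:

```python
def choose_evc_bin(hops_left: int, free_vcs: dict, lmax: int):
    """Pick an EVC-length bin for a packet with `hops_left` hops
    remaining in its current dimension. free_vcs maps EVC length
    (2..lmax) to the number of currently free VCs in that bin.
    With route flexibility, fall back to shorter EVCs when longer
    ones are busy; return None to request a normal VC instead."""
    longest_useful = min(hops_left, lmax)
    if longest_useful < 2:
        return None                # too close to the destination: use an NVC
    # Prefer the longest EVC (most nodes bypassed), but accept any
    # shorter express lane with a free VC.
    for length in range(longest_useful, 1, -1):
        if free_vcs.get(length, 0) > 0:
            return length
    return None                    # all EVC bins busy: fall back to NVCs
```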

Maximum EVC length: We next explore the trade-offs related to the maximum EVC length lmax in a dynamic EVC network. Assuming a 7×7 network using the aggressive express pipeline and uniform random traffic, the no-load latency was found to be 14.5 cycles for lmax = 2, 13.6 cycles for lmax = 3 and 13.2 cycles for lmax = 4, with the saturation throughput as a fraction of capacity being 82%, 84% and 86% for lmax = 2, 3 and 4, respectively. It can be seen that larger values of lmax lead to better performance, with the no-load latency of the lmax = 4 network approaching the 11-cycle ideal interconnect latency. This is because longer EVCs allow packets to bypass more nodes along their paths (shown in Fig. 16(a)). However, the gain in performance is not proportional to the increase in the number of nodes bypassed. Moreover, increasing lmax comes at the cost of a higher reverse wiring overhead, as explained in Section 4. To study this further, we analyze the average contention delay Tc, or the time which packets spend waiting at intermediate routers, near the saturation point (shown in Fig. 16(b)). It can be seen that whereas Tc reduces with increasing lmax, leading to lower packet latencies, not all components of Tc decrease. Increasing lmax significantly lowers switch contention, or the time spent waiting to win switch ports at intermediate hops, due to a higher number of nodes bypassed and, hence, fewer switch allocations which a packet has to go through. However, as dynamic EVCs lead to a partitioning of the entire set of VCs at a port into lmax bins, a larger lmax implies more bins with fewer VCs per bin (as the total number of VCs is kept constant). Hence, the contention for VCs within a bin increases, leading to an increase in the delay due to VC contention, assuming routing flexibility among EVCs is not used. Moreover, the number of EVC paths bypassing a node increases in proportion to lmax, leading to a higher probability of a flit waiting at a router because of a bypassing EVC flit using the output link (shown as wait delay). As EVCs do not add extra physical channels, this wait delay is unavoidable, because the same physical channel is shared between waiting and bypassing flits. It should be noted, however, that we found Tc to be mainly dominated by switch and VC contention delays, with wait delay contributing only a small fraction.

Fig. 16 also highlights the effect of routing flexibility (in a network with lmax = 4) on Tc. Packets use routing flexibility to switch to smaller-length EVCs if longer ones are busy, which causes the number of bypassed nodes to decrease slightly, with a corresponding increase in switch contention. However, the contention for VCs is reduced significantly due to the distribution of packets among paths of different EVC lengths based on the relative contention within each EVC bin, leading to a reduction in the overall contention delay.

5.7 SPLASH results

In this section, we present evaluation results using benchmarks from the SPLASH-2 [14] suite for a 7×7 network, assuming an EVC length of two hops for static EVCs and lmax = 2 for dynamic EVCs (with the aggressive express pipeline). Fig. 17(a) shows the comparison of normalized delay for each benchmark against the baseline. It can be seen that EVCs show a latency improvement over the baseline for all benchmarks.

[Figure 12: Impact of different express pipelines on latency. Latency (cycles) vs. injected load (fraction of capacity) for the baseline and for non-aggressive/aggressive static and dynamic EVCs.]

[Figure 13: Relative contention for network resources. Normalized average contention delay for the baseline, static EVCs and dynamic EVCs.]

[Figure 14: EVCs in a larger network. (a) 10×10 network performance: latency vs. injected load; (b) 10×10 network energy, normalized.]

[Figure 15: Impact of routing flexibility when using EVCs. Latency vs. injected load (packets/node/cycle) for dynamic EVCs with and without route flexibility.]

[Figure 16: Analysis with varying lmax. (a) Comparison of average bypassed vs. non-bypassed nodes; (b) packet contention delay analysis (switch, VC and wait delay) for lmax = 2, 3, 4 and for lmax = 4 with routing flexibility.]

[Figure 17: SPLASH results. (a) Normalized latency and (b) normalized energy for fft, lu, radix, water-nsquared, water-spatial, barnes, ocean and raytrace, comparing the baseline, static EVCs and dynamic EVCs.]

The largest reductions were seen for water-spatial and water-nsquared, with static EVCs reducing latency by 84% (31.3%) for water-spatial (water-nsquared), the corresponding reductions for dynamic EVCs being 84% (40.3%). This is mainly attributable to the high traffic levels in these benchmarks, under which the speculative techniques used by the baseline fail, whereas EVCs, being non-speculative, show large gains. On the other hand, fft, lu, radix, barnes and ocean represent low to medium traffic with moderate hop-counts, leading to latency reductions of 20.8%, 22.9%, 34.6%, 19.4% and 35.8%, respectively, using the dynamic EVC design. Raytrace, in contrast, represents very low traffic but with a high hop-count. Hence, whereas the baseline performs well due to pipeline bypassing and speculation, packet latencies are reduced further using EVCs (by 28.2% using dynamic EVCs) due to their higher utilization and the use of aggressive switch bypassing.

Fig. 17(b) shows the normalized router energy for the different benchmarks. Again, EVCs show energy gains, the reduction using dynamic EVCs being around 9% on average and up to 11% for water-spatial.

6. RELATED WORK

Topological and routing techniques. Substantial prior work has explored alternative topologies for reducing network latency through lowering of hop-count [10]. Higher-radix topologies lead to a lower hop-count, but present challenges in router design [22]. Numerous routing algorithms have also been proposed [23, 24]. These are all orthogonal to EVCs, which target per-hop latency and energy rather than the hop-count. EVCs can complement any topology and routing algorithm to further drive network latency, throughput and energy towards the ideal. However, EVCs are inspired by the express cubes topology [11]. In express cubes, extra physical links that span multiple hops are proposed to allow packets to skip intermediate router pipelines, and these have been shown to lead to significant energy savings when the upper metal interconnect is used for such express links in on-chip networks [17]. EVCs essentially virtualize the physical express links of express cubes, thereby obviating the need for the extra metal interconnects and highly-ported routers at the sources and sinks of express links. A smaller number of ports directly translates to a smaller router area footprint and lower energy consumption. Virtualizing these physical express links also means that there is no wastage of physical link bandwidth when the traffic pattern is adversarial. In addition, EVCs lead to a higher reduction in packet latencies as compared to express cubes under low loads, when network contention is low. If the total amount of wiring is assumed to be constant, an express cube design would lead to longer packet sizes due to reduced wiring per channel. More specifically, a flat express cube with an interchange every k nodes would lead to a doubling of packet sizes, giving a latency for a packet traveling H hops (assuming no contention and equal router and interchange delays, which favors express cubes, and that k divides H) of

T = D/v + 2 \cdot L/b + (H/k + k) \cdot T_{router},

whereas the corresponding latency in a k-hop static EVC design (assuming the aggressive express pipeline) is given by

T = D/v + L/b + (H/k + k - 1) \cdot T_{router}.

It can be seen that express cubes increase the serialization delay due to longer packet sizes. For hierarchical express cubes, this overhead grows in direct proportion to the number of interchange levels. Moreover, using dynamic EVCs, virtual express paths of a range of lengths can be created from every network node with a low overhead in reverse wiring, as compared to the wiring overhead of a similar design with multiple physical express channels originating from every node.
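To make the comparison concrete, the following sketch (our own helper names; D/v is the wire propagation delay and L/b the serialization delay, following the expressions above) evaluates the two latencies; their difference is L/b + Trouter in favor of EVCs, independent of the hop-count H:

```python
def latency_express_cube(D, v, L, b, H, k, t_router):
    """Flat express cube: an interchange every k nodes, with packet
    size doubled by the halved channel width (2L/b serialization)."""
    return D / v + 2 * L / b + (H / k + k) * t_router

def latency_static_evc(D, v, L, b, H, k, t_router):
    """k-hop static EVCs with the aggressive express pipeline."""
    return D / v + L / b + (H / k + k - 1) * t_router

# latency_express_cube(...) - latency_static_evc(...) == L/b + t_router
```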

Ogras et al. [25] proposed long-range link insertion to connect non-adjacent nodes in an application-aware fashion. This again requires additional flit-wide physical wiring and fatter routers with a larger number of ports and, hence, larger crossbar switches. In contrast, EVCs avoid these overheads by using virtual connections between distant nodes.

Microarchitectural techniques targeting Trouter and throughput. Circuit switching first allocates channels to form a pre-reserved circuit prior to communication, so that actual communication can occur with close-to-ideal network latency, incurring just switch and link traversal latencies within the router pipeline. Researchers have also looked at hybrid flow control mechanisms, such as wave switching [26] and pipelined circuit switching [27], which combine the advantages of circuit and packet switching. However, circuit switching and its variants incur a large circuit setup time, which increases latency for short transfers. Circuit switching also suffers in throughput, as links are not shared efficiently amongst different message flows. Flit-reservation flow control [28] sends out control flits in advance to schedule buffers and channels along the way for subsequent data flits. This can lower the router pipeline delay to close to the ideal while improving throughput, but requires control flits to be sent along fast upper metal wires, contending for metal layers with other global wiring. Besides, it requires a complex router microarchitecture that leads to significant hardware overhead.

Our baseline router already incorporates several mechanisms that target Trouter, such as bypassing and speculation. Mullins et al. [12] propose a doubly-speculative single-stage design in which arbitration decisions are pre-computed to further reduce pipeline dependencies. While speculative designs work well under low loads, they perform poorly under heavy loads, when network contention is high. Moreover, speculation often employs redundant logic, which increases energy consumption. Our proposed EVCs are completely non-speculative and, hence, can improve performance under all traffic conditions while simultaneously lowering energy consumption. Kim et al. [29] target throughput by proposing path-sensitive packet buffering and partitioning the crossbar into separate row and column modules to reduce contention. However, this scheme relies on buffering of flits based on their output path and, hence, assumes multi-ported buffers, which incur significant overhead in resource-constrained on-chip designs. EVCs, on the other hand, push both latency and throughput towards the ideal using single-ported buffers and crossbars that are area- and energy-efficient.

7. CONCLUSION

Although the multiplexing of network channels over multiple packet flows in packet-switched on-chip network designs leads to throughput gains, this comes with a significant latency, energy and area overhead in the form of complex routers. In this paper, we target this overhead in an attempt to approach the ideal interconnection fabric of dedicated wires between all nodes. We propose EVCs, a novel flow control and router microarchitecture design, which uses virtual lanes in the network to allow packets to bypass nodes along their path in a non-speculative fashion and, hence, significantly reduces delay and energy consumption. The virtual paths created by EVCs also help lower the level of contention in the network, thereby allowing the network to push through more packets before saturation and, hence, improve throughput. Since network nodes are virtually connected using existing physical channels, EVCs improve performance with a negligible wiring area overhead and minimal hardware complexity. We presented a detailed microarchitecture for EVCs, analyzing the hardware and pipeline complexity of each of its components. A detailed evaluation, using both synthetic and actual workloads, showed EVCs significantly improving network energy/delay and throughput as compared to a state-of-the-art packet-switched network, and closely approaching the ideal interconnect.

Acknowledgments

The authors would like to thank William J. Dally of Stanford University for useful feedback on this work. We would also like to thank Ted Tabe and David V. James of Intel Corp. for their help with the modeling infrastructure and comments on the microarchitecture. This work was supported in part by the MARCO Gigascale Systems Research Center, the Alfred P. Sloan Research Foundation, a grant from Intel Corporation, an Intel PhD Fellowship, and NSF under grant no. CNS-0613074.

References

[1] “International Technology Roadmap for Semiconductors,” http://public.itrs.net.
[2] R. Ho, K. Mai, and M. Horowitz, “The future of wires,” Proc. IEEE, vol. 89, no. 4, Apr. 2001.
[3] K. Sankaralingam et al., “Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture,” in Proc. Int. Symp. Computer Architecture, June 2003, pp. 422–433.
[4] M. B. Taylor et al., “Evaluation of the Raw microprocessor: An exposed-wire-delay architecture for ILP and streams,” in Proc. Int. Symp. Computer Architecture, June 2004.
[5] L. Benini and G. De Micheli, “Networks on chips: A new SoC paradigm,” IEEE Computer, vol. 35, no. 1, pp. 70–78, Jan. 2002.
[6] W. J. Dally and B. Towles, “Route packets not wires: On-chip interconnection networks,” in Proc. Design Automation Conf., June 2001.
[7] J. A. Kahle et al., “Introduction to the Cell multiprocessor,” IBM Journal of Research and Development, vol. 49, no. 4/5, 2005.
[8] M. Sgroi et al., “Addressing the system-on-a-chip interconnection woes through communication-based design,” in Proc. Design Automation Conf., June 2001.
[9] W. J. Dally, “Virtual-channel flow control,” in Proc. Int. Symp. Computer Architecture, May 1990, pp. 60–68.
[10] W. J. Dally and B. Towles, Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers, 2004.
[11] W. J. Dally, “Express cubes: Improving the performance of k-ary n-cube interconnection networks,” IEEE Trans. Computers, vol. 40, no. 9, Sept. 1991.
[12] R. Mullins, A. West, and S. Moore, “Low-latency virtual-channel routers for on-chip networks,” in Proc. Int. Symp. Computer Architecture, June 2004, pp. 188–197.
[13] L.-S. Peh and W. J. Dally, “A delay model and speculative architecture for pipelined routers,” in Proc. Int. Symp. High Performance Computer Architecture, Jan. 2001, pp. 255–266.
[14] “SPLASH-2,” http://www-flash.stanford.edu/apps/SPLASH/.
[15] M. Galles, “Scalable pipelined interconnect for distributed endpoint routing: The SGI SPIDER chip,” in Proc. Hot Interconnects 4, Aug. 1996, pp. 141–146.
[16] S. S. Mukherjee et al., “The Alpha 21364 network architecture,” IEEE Micro, vol. 22, no. 1, pp. 26–35, Jan./Feb. 2002.
[17] H.-S. Wang, L.-S. Peh, and S. Malik, “Power-driven design of router microarchitectures in on-chip networks,” in Proc. Int. Symp. Microarchitecture, Nov. 2003, pp. 105–116.
[18] W. Liao and L. He, “Full-chip interconnect power estimation and simulation considering repeater insertion and flip-flop insertion,” in Proc. Int. Conf. Computer-Aided Design, Nov. 2003, pp. 574–580.
[19] H.-S. Wang et al., “Orion: A power-performance simulator for interconnection networks,” in Proc. Int. Symp. Microarchitecture, Nov. 2002, pp. 294–305.
[20] X.-N. Chen and L.-S. Peh, “Leakage power modeling and optimization of interconnection networks,” in Proc. Int. Symp. Low Power Electronics and Design, Aug. 2003, pp. 90–95.
[21] K. P. Lawton, “Bochs: A portable PC emulator for Unix/X,” Linux J., vol. 1996, no. 29, p. 7, 1996.
[22] J. Kim et al., “Microarchitecture of a high-radix router,” in Proc. Int. Symp. Computer Architecture, June 2006, pp. 420–431.
[23] J. Hu and R. Marculescu, “DyAD - Smart routing for networks-on-chip,” in Proc. Design Automation Conf., June 2004.
[24] D. Seo et al., “Near-optimal worst-case throughput routing for two-dimensional mesh networks,” in Proc. Int. Symp. Computer Architecture, June 2005.
[25] U. Y. Ogras and R. Marculescu, “It's a small world after all: NoC performance optimization via long-range link insertion,” IEEE Trans. Very Large Scale Integration Systems, vol. 14, no. 7, pp. 693–706, July 2006.
[26] J. Duato et al., “A high performance router architecture for interconnection networks,” in Proc. Int. Conf. Parallel Processing, Aug. 1996, pp. 61–68.
[27] P. T. Gaughan and S. Yalamanchili, “A family of fault-tolerant routing protocols for direct multiprocessor networks,” IEEE Trans. Parallel and Distributed Systems, vol. 6, no. 5, May 1995.
[28] L.-S. Peh and W. J. Dally, “Flit-reservation flow control,” in Proc. Int. Symp. High Performance Computer Architecture, Jan. 2000, pp. 73–84.
[29] J. Kim et al., “A gracefully degrading and energy-efficient modular router architecture for on-chip networks,” in Proc. Int. Symp. Computer Architecture, June 2006, pp. 4–15.

