Destination-Based Adaptive Routing for 2D Mesh Networks ANCS 2010 Rohit Sunkam Ramanujam Bill Lin...

Destination-Based Adaptive Routing for 2D Mesh Networks

ANCS 2010

Rohit Sunkam Ramanujam, Bill Lin

Electrical and Computer Engineering, University of California, San Diego

Networks-on-Chip

• Chip-multiprocessors (CMPs) increasingly popular
• 2D-mesh networks often used as on-chip fabric
• Routing algorithm central in determining performance

[Figures: Tilera Tile64; Intel 48-core data center on die (ISSCC 2010)]

Classes of Routing Algorithms

• Oblivious routing
+ Simple and fast router designs
– Poor load balancing under bursty traffic

• Adaptive routing
+ Better performance (throughput, latency)
+ Better fault tolerance
– Higher router complexity

Related Work

• Oblivious routing [Valiant, ROMM, O1TURN, optimal oblivious routing]
– Optimize for worst-case and average-case performance

• Adaptive routing commercially used in multiprocessors from IBM, Cray, Compaq

• On-chip routing very different from off-chip:
– Lower power
– Lower area
– Lower router complexity

Outline

• Introduction
• Motivation
• Destination-Based Adaptive Routing (DAR)
• Evaluation

Minimal Adaptive Routing

• Model
– Adaptive routing along minimal directions

[Figure: mesh with source S and destination D]
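As a minimal sketch of what "minimal directions" means, the hypothetical helper below lists the output ports that move a packet closer to its destination. It assumes row-major node numbering with node 0 in the bottom-left corner, which matches the 4x4 example used in these slides. The function name and the port labels are illustrative and not taken from the authors' implementation.

```python
def candidate_ports(cur, dst, width=4):
    """Return the output ports on a minimal path from node cur to node dst.

    Assumption: nodes are numbered row by row, node 0 at the bottom-left,
    so moving east adds 1 and moving north adds `width`.
    """
    cx, cy = cur % width, cur // width
    dx, dy = dst % width, dst // width
    if (cx, cy) == (dx, dy):
        return ["Ej"]  # packet has arrived: eject to the local node
    ports = []
    if dx > cx:
        ports.append("E")
    elif dx < cx:
        ports.append("W")
    if dy > cy:
        ports.append("N")
    elif dy < cy:
        ports.append("S")
    return ports


print(candidate_ports(5, 10))  # two candidates: east and north
print(candidate_ports(9, 10))  # one candidate: east
```

With at most one productive direction per dimension, a minimal adaptive router in a 2D mesh never chooses among more than two output ports.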

Granularity of Congestion Estimation

[Figure: coarse-to-fine granularity axis; marked point: local congestion]

Local Congestion

• Local adaptive
– Measure local congestion metric (free VCs, free buffers)

[Figure: mesh from S to D with low, moderate, and high congestion regions; legend: optimal path, local adaptive path]
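A minimal sketch of local adaptive port selection, assuming the congestion metric is the number of free virtual channels at each output port. The function name and the tie-breaking rule are assumptions made for illustration.

```python
def pick_local(free_vcs, candidates):
    """Pick the candidate output port with the most free VCs.

    Ties go to the first port listed in `candidates` (an arbitrary choice).
    """
    return max(candidates, key=lambda port: free_vcs[port])


free_vcs = {"N": 3, "E": 5, "S": 8, "W": 0}
print(pick_local(free_vcs, ["N", "E"]))  # only N and E are minimal here
```

Because the decision uses only the state of the current router, it cannot see congestion further along the path.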

[Figure: coarse-to-fine granularity axis; marked points: local congestion, dimension-based congestion]

Dimension-based Congestion

• RCA-1D (Gratz et al., HPCA '08)
– Exponential moving average of congestion to all nodes along a dimension

[Figure: mesh from S to D with low and moderate congestion regions; path chosen by RCA-1D]
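One way to read "exponential moving average of congestion to all nodes along a dimension" is that each node blends its own congestion with the aggregate received from the next node along that dimension, so distant nodes contribute exponentially less. The sketch below follows that reading. The 0.5 weight and the function name are assumptions, not values taken from the slides.

```python
def aggregate_along_dimension(congestion, weight=0.5):
    """Fold per-node congestion values into one estimate.

    congestion: local congestion values along one dimension, nearest node first.
    weight: share given to the nearer node at each hop (assumed, not from the slides).
    """
    estimate = None
    for local in reversed(congestion):  # start at the farthest node
        if estimate is None:
            estimate = local
        else:
            estimate = weight * local + (1 - weight) * estimate
    return estimate


print(aggregate_along_dimension([4, 8, 16]))
```

The single value per dimension is cheap to propagate, but it mixes in congestion at nodes the packet may never visit.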

[Figure: coarse-to-fine granularity axis; marked points: local congestion, quadrant-based congestion]

Quadrant-based Congestion

• RCA-Quadrant (Gratz et al., HPCA '08)
– Exponential moving average of congestion to all nodes in the destination quadrant

[Figure: mesh from S to D with low and moderate congestion regions; path chosen by RCA-quad]

[Figure: coarse-to-fine granularity axis; marked points: local congestion, destination-based congestion]

Ideally …

• On a per-destination basis:
– Estimate end-to-end delay along all minimal paths to the destination
– Choose the path with the least delay

[Figure: mesh from S to D with low and moderate congestion regions]


Challenges

• Limited bandwidth for congestion updates
– Congestion notification not instantaneous

• Limited storage in on-chip routers
– Exponential number of paths to each destination

• Limited hardware resources for computations

How can we practically emulate ideal adaptive routing?
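To illustrate the storage challenge: the number of minimal paths between two mesh nodes that are dx hops apart in X and dy hops apart in Y is the binomial coefficient C(dx + dy, dx). A small sketch (the function name is hypothetical):

```python
from math import comb


def minimal_paths(dx, dy):
    """Count minimal paths between nodes dx hops apart in X and dy hops apart in Y."""
    return comb(dx + dy, dx)


# Corner-to-corner traffic in an 8x8 and a 16x16 mesh
print(minimal_paths(7, 7))    # 3432
print(minimal_paths(15, 15))  # 155117520
```

Tracking a delay estimate per path is therefore not an option for an on-chip router, which motivates keeping per-destination state instead.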

Destination-based adaptive routing (DAR)

• A node estimates delay to all other nodes through candidate outputs every T cycles

[Figure: node S with two candidate outputs toward D; estimated delays L[N][D] = 20 and L[E][D] = 30]

DAR: High Level

• Traffic distribution to output ports controlled using per-destination split ratios W

1. Start with initial set of split ratios
2. Estimate delay to destination through candidate outputs
3. Shift traffic from more congested port to less congested port

[Figure: node S routing to D with W[N][D] = 0.6, W[E][D] = 0.4, L[N][D] = 20, L[E][D] = 30]

DAR: High Level

• Traffic distribution to output ports controlled using per-destination split ratios W

1. Start with initial set of split ratios
2. Estimate delay to destination through candidate outputs
3. Shift traffic from more congested port to less congested port

[Figure: node S routing to D after traffic is shifted toward the less congested port: W[N][D] = 0.8, W[E][D] = 0.2, L[N][D] = 20, L[E][D] = 30]

Outline

• Introduction
• Motivation
• Destination-Based Adaptive Routing (DAR)
– Distributed delay measurement
– Split ratio adaptation
– Scaling
• Evaluation

Distributed Delay Measurement

• A node maintains:
– Per-destination traffic split ratio through candidate output ports: W[p][j]
– Delay to next-hop router/ejection interface through each output port (N, S, E, W, Ej): l[p]

Distributed Delay Measurement

• Every node estimates average delay to all other nodes in the network

[Figure: 4x4 mesh with nodes numbered as below; node 10 sends Avg10[10] to its four neighbors]

12 13 14 15
 8  9 10 11
 4  5  6  7
 0  1  2  3

1. Delay from node 10 to itself: Avg10[10] = l10[Ej]
2. Avg10[10] propagated to neighbors
3. Nodes 6, 9, 14, 11 add local delay to Avg10[10] to compute delay to node 10
4. For example, at node 9: L[E][10] = l[E] + Avg10[10], and Avg9[10] = L[E][10]

Distributed Delay Measurement

• Every node estimates delay to all other nodes in the network

[Figure: 4x4 mesh; nodes 6, 9, 11, and 14 send Avg6[10], Avg9[10], Avg11[10], and Avg14[10] to their upstream neighbors]

1. Nodes 6, 9, 14, 11 propagate estimated delay to node 10 to upstream neighbors
2. For example, node 5 receives two delay updates, from nodes 9 and 6: A[E][10] = Avg6[10], A[N][10] = Avg9[10]
3. Node 5 adds local link delay to received delay update: L[E][10] = A[E][10] + l[E], L[N][10] = A[N][10] + l[N]
4. Finally, average delay from node 5 to node 10 is computed as: Avg5[10] = W[E][10] L[E][10] + W[N][10] L[N][10]
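The computation at node 5 can be sketched as follows. The formulas are the ones on the slide; the numeric values and function names are made up for illustration.

```python
def path_delays(received, local_delay):
    """L[p][j] = A[p][j] + l[p] for every candidate port p."""
    return {p: received[p] + local_delay[p] for p in received}


def average_delay(split, delays):
    """Avg[j] = sum over candidate ports p of W[p][j] * L[p][j]."""
    return sum(split[p] * delays[p] for p in delays)


# Node 5 routing to node 10 (all numbers are hypothetical)
A = {"E": 12, "N": 8}        # A[E][10] = Avg6[10], A[N][10] = Avg9[10]
l = {"E": 4, "N": 6}         # local delay through each output port
W = {"E": 0.25, "N": 0.75}   # split ratios for destination 10

L = path_delays(A, l)        # {'E': 16, 'N': 14}
print(L)
print(average_delay(W, L))   # 0.25 * 16 + 0.75 * 14 = 14.5
```

Node 5 would then send this average to its own upstream neighbors, so each update carries one value per destination rather than one per path.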


Outline

• Introduction
• Motivation
• Destination-Based Adaptive Routing (DAR)
– Distributed delay measurement
– Split ratio adaptation
– Scaling
• Evaluation

Adaptation of Split ratio

• Objective: Equalize delay on candidate output ports

• If only one candidate output, split ratio is 1

• If two candidate outputs:
– Let ph be the port with higher delay to destination j
– Let pl be the port with lower delay to destination j
– W[ph][j] + W[pl][j] = 1
– Δ traffic shifted from ph to pl every T cycles
– Δ proportional to (L[ph][j] - L[pl][j]) / L[ph][j]
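A minimal sketch of one adaptation step under the rule above. The slides state only that Δ is proportional to the relative delay difference, so the proportionality constant `gain`, the clamp that keeps ratios in [0, 1], and the function name are all assumptions. The default gain of 0.6 is picked only so that the example moves split ratios of 0.6/0.4 to roughly 0.8/0.2 for delays of 20 and 30, as in the earlier slide.

```python
def adapt_split(split, delay, gain=0.6):
    """Shift traffic from the higher-delay port to the lower-delay port.

    split: {port: W[port][j]} for the candidate ports of one destination j
    delay: {port: L[port][j]} for the same ports
    gain:  assumed proportionality constant (not given in the slides)
    """
    ports = list(split)
    if len(ports) == 1:
        return {ports[0]: 1.0}          # single candidate: split ratio is 1
    ph = max(ports, key=lambda p: delay[p])
    pl = min(ports, key=lambda p: delay[p])
    if ph == pl:                        # equal delays: nothing to shift
        return dict(split)
    delta = gain * (delay[ph] - delay[pl]) / delay[ph]
    delta = min(delta, split[ph])       # never shift more traffic than ph carries
    return {ph: split[ph] - delta, pl: split[pl] + delta}


new_split = adapt_split({"N": 0.6, "E": 0.4}, {"N": 20, "E": 30})
print({p: round(w, 3) for p, w in new_split.items()})
```

Because Δ shrinks as the two delays converge, repeated steps drive the ports toward equal delay, which is the stated objective.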

[Figure: coarse-to-fine granularity axis; marked point: local congestion; annotation: "Does not scale!"]

[Figure: coarse-to-fine granularity axis; marked points: local congestion, scalable destination-based congestion]

Outline

• Introduction
• Motivation
• Destination-Based Adaptive Routing (DAR)
– Distributed delay measurement
– Split ratio adaptation
– Scaling
• Evaluation

Look-ahead Window

[Figure: look-ahead window centered at S, with labeled nodes A, B, C and PA, PB, PC]

• Node S maintains delay estimates for an MxM window centered at S
• Any node outside the window is mapped to the closest node within the window
• A packet's look-ahead window shifts as it is routed from source to destination
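A minimal sketch of the mapping step, assuming nodes are addressed by (x, y) coordinates and M is odd. Clamping each coordinate to the window bounds yields the closest node inside the window. The function name and coordinate convention are assumptions for illustration.

```python
def map_to_window(src, dst, m=7):
    """Map destination dst to the closest node inside the m x m window centered at src.

    src, dst: (x, y) mesh coordinates. A destination already inside the
    window is returned unchanged.
    """
    half = (m - 1) // 2
    sx, sy = src
    dx, dy = dst
    wx = max(sx - half, min(sx + half, dx))  # clamp x to the window
    wy = max(sy - half, min(sy + half, dy))  # clamp y to the window
    return (wx, wy)


print(map_to_window((0, 0), (15, 15)))  # far corner maps to the window corner
print(map_to_window((8, 8), (9, 10)))   # already inside the window
```

Since the clamped coordinate always lies between the source and the true destination, the mapped node stays inside the mesh and on a minimal path.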

Window Size

• Destination D guaranteed to be within window when packet is (M-1)/2 hops away from D

• Intuition: Packet has (M-1)/2 hops to route around congestion hot spots

• In a 16x16 mesh, a 7x7 look-ahead window performs comparably to DAR (DAR is equivalent to a 31x31 look-ahead window)
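The numbers on this slide follow from simple arithmetic, sketched below with hypothetical helper names. A window centered at a corner node must reach N-1 hops in each direction to cover an NxN mesh, which gives a window side of 2(N-1) + 1 = 2N - 1.

```python
def guaranteed_hops(m):
    """Hops from the destination at which it is guaranteed to be inside an m x m window."""
    return (m - 1) // 2


def covering_window(n):
    """Smallest odd window side that covers an n x n mesh from any node."""
    return 2 * n - 1


print(guaranteed_hops(7))    # 3
print(covering_window(16))   # 31
print(7 * 7, 16 * 16)        # 49 256: window entries vs. nodes in the mesh
```

So a node using a 7x7 window keeps estimates for at most 49 positions instead of all 256 nodes of a 16x16 mesh.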

Outline

• Introduction
• Related work
• Destination-Based Adaptive Routing (DAR)
• Evaluation

Experimental setup

• Compare DAR with RCA-1D, RCA-quadrant, Local adaptive

• SPLASH-2 benchmarks + synthetic traffic patterns (uniform, transpose, shuffle)

• Cycle-accurate NoC simulator models 3-stage router pipeline

• 8 VCs, each 5 flits deep

• 1 VC used as escape VC for deadlock prevention

Splash results – 7x7 mesh

[Figure: normalized packet latency for DAR, RCA-quadrant, RCA-1D, and Local on fft, lu, waterns, waters, radix, raytrace, barnes, and ocean, plus average and geometric mean; callout: 41%]

Splash results – 7x7 mesh

[Figure: normalized packet latency for DAR, RCA-quadrant, RCA-1D, and Local on fft, lu, waterns, waters, radix, raytrace, barnes, and ocean, plus average and geometric mean; callout: 65%]

Uniform traffic – 8x8 mesh

Transpose traffic – 8x8 mesh

Shuffle traffic – 8x8 mesh

SDAR - 16x16 mesh, 7x7 window

[Figure: average packet latency (cycles) for DAR, SDAR, RCA-quad, RCA-1D, and Local; average latency over 100 permutation traffic patterns at 18% injection load]

[Figure: network saturation statistics at 18% injection load for DAR, SDAR, RCA-quad, RCA-1D, and Local; bars from 0% to 100% split into "network saturates" and "network below saturation"]

Summary

• Destination-based Adaptive Routing (DAR) for 2D mesh networks

• Scalable DAR (SDAR) uses a look-ahead window and easily scales to large networks

• DAR outperforms existing adaptive and oblivious routing

• SDAR achieves comparable performance with significantly less overhead

Thank you!!

Key implementation details

• Simple router implementation: low storage, low bandwidth

• Synchronize delay updates to reuse delay computation and weight adaptation hardware

• Approximate computations to simplify implementation

Router architecture – Kim et al., DAC '05

[Figure: router block diagram; labels: input ports (N, S, E, W, Ej), virtual channels VC-1 ... VC-v, routing unit, quadrant, port pre-select, VC allocator, XB allocator, preferred output registers, congestion value registers, credits, override]

DAR Router

[Figure: DAR router block diagram, annotated with storage overhead and logic overhead; labels: destination, port pre-select, VC allocator, XB allocator, virtual channels VC-1 ... VC-v, input ports (N, S, E, W, Ej), preferred output registers, per-destination split ratios W[px, py][0] ... W[px, py][N-1], adapt weights, latency measurement with counters cnt[0] ... cnt[P-1] (increment/decrement), exponentially averaged local delay l[0] ... l[P-1], latency propagation with A[px][j], A[py][j], L[px][j], L[py][j], and Avg[j] for j = 0 ... N-1]

Distributed delay measurement

• A node maintains:
– Per-destination traffic split ratio through candidate output ports: W[p][j]
– Delay to next-hop router/ejection interface through each output port (N, S, E, W, Ej): l[p]
• Using updates received from downstream nodes, a node computes:
– L[p][j]: Average delay from current node to node j through output port p
– Avg[j]: Average delay from current node to node j

Destination-based Adaptive Routing (DAR)

• Every router maintains per-destination split ratios which control traffic distribution to output ports
• Split ratios adjusted every T cycles based on measured delay to D through the two ports

[Figure: mesh from S to D with low, moderate, and high congestion regions; split ratios shown along the paths: 0.8 / 0.2, 0.7 / 0.3, 1, 1]
