Destination-Based Adaptive Routing for 2D Mesh Networks
ANCS 2010
Rohit Sunkam Ramanujam, Bill Lin
Electrical and Computer Engineering, University of California, San Diego
Networks-on-Chip
• Chip-multiprocessors (CMPs) increasingly popular
• 2D-mesh networks often used as on-chip fabric
• Routing algorithm central in determining performance
[Figures: Tilera Tile64; Intel 48-core data center on die (ISSCC 2010)]
Classes of Routing Algorithms
• Oblivious routing
+ Simple and fast router designs
– Poor load balancing under bursty traffic
• Adaptive routing
+ Better performance (throughput, latency)
+ Better fault tolerance
– Higher router complexity
Related Work
• Oblivious routing [Valiant, ROMM, O1TURN, optimal oblivious routing]
– Optimize for worst-case and average-case performance
• Adaptive routing commercially used in multiprocessors from IBM, Cray, Compaq
• On-chip routing very different from off-chip:
– Lower power
– Lower area
– Lower router complexity
Outline
• Introduction
• Motivation
• Destination-Based Adaptive Routing (DAR)
• Evaluation
Minimal Adaptive Routing
• Model
– Adaptive routing along minimal directions
[Figure: mesh with source S and destination D]
Granularity of Congestion Estimation
[Figure: coarse-to-fine axis; shown so far: local congestion]
Local Congestion
• Local adaptive
– Measure local congestion metric (free VCs, free buffers)
[Figure: routes from S to D through regions of low, moderate, and high congestion; optimal route vs. local adaptive route]
Granularity of Congestion Estimation
[Figure: coarse-to-fine axis; shown so far: local congestion, dimension-based congestion]
Dimension-based Congestion
• RCA-1D (Gratz et al., HPCA '08)
– Exponential moving average of congestion to all nodes along a dimension
[Figure: routes from S to D through regions of low, moderate, and high congestion; optimal route vs. RCA-1D route]
Granularity of Congestion Estimation
[Figure: coarse-to-fine axis; shown so far: local congestion, dimension-based congestion, quadrant-based congestion]
Quadrant-based Congestion
• RCA-Quadrant (Gratz et al., HPCA '08)
– Exponential moving average of congestion to all nodes in the destination quadrant
[Figure: routes from S to D through regions of low, moderate, and high congestion; optimal route vs. RCA-quad route]
Granularity of Congestion Estimation
[Figure: coarse-to-fine axis: local congestion, dimension-based congestion, quadrant-based congestion, destination-based congestion]
Ideally …
• On a per-destination basis:
– Estimate end-to-end delay along all minimal paths to destination
– Choose path with least delay
[Figure: routes from S to D through regions of low, moderate, and high congestion; optimal route highlighted]
Challenges
• Limited bandwidth for congestion updates
– Congestion notification not instantaneous
• Limited storage in on-chip routers
– Exponential number of paths to each destination
• Limited hardware resources for computations
How can we practically emulate ideal adaptive routing?
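To make the storage challenge concrete: the number of minimal paths between two mesh nodes that are dx hops apart in X and dy hops apart in Y is the binomial coefficient C(dx + dy, dx), which grows exponentially with distance. A minimal sketch (the function name is illustrative and does not come from the slides):

```python
from math import comb


def minimal_path_count(dx: int, dy: int) -> int:
    """Count shortest paths between two 2D-mesh nodes.

    A minimal path is any ordering of dx hops in X and dy hops in Y,
    so the count is the binomial coefficient C(dx + dy, dx).
    """
    return comb(dx + dy, dx)


# Corner to corner in an 8x8 mesh: 7 hops in each dimension.
print(minimal_path_count(7, 7))    # 3432
# Corner to corner in a 16x16 mesh: 15 hops in each dimension.
print(minimal_path_count(15, 15))  # 155117520
```

Tracking a delay estimate per path is therefore impractical in a router, which is why a per-destination summary is attractive.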
Destination-based adaptive routing (DAR)
• A node estimates delay to all other nodes through candidate outputs every T cycles
[Figure: source S with two candidate outputs toward destination D; estimated delays L[N][D] = 20 and L[E][D] = 30]
DAR-High Level
• Traffic distribution to output ports controlled using per-destination split ratios W
• Adaptation loop:
1. Start with initial set of split ratios
2. Estimate delay to destination through candidate outputs
3. Shift traffic from more congested port to less congested port, then repeat from step 2
[Figure: at S, with L[N][D] = 20 and L[E][D] = 30, the split ratios move from W[N][D] = 0.6, W[E][D] = 0.4 to W[N][D] = 0.8, W[E][D] = 0.2]
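One way to picture how split ratios steer traffic is a weighted choice per packet. The slides do not say how a router turns W into a per-packet port decision, so the random selection below is purely an assumption for illustration, and `pick_output_port` is a hypothetical name:

```python
import random


def pick_output_port(split_ratios: dict[str, float], rng: random.Random) -> str:
    """Pick an output port with probability equal to its split ratio.

    Assumption: weighted random selection. Real hardware may use a
    different (for example counter-based) mechanism.
    """
    ports = list(split_ratios)
    weights = [split_ratios[p] for p in ports]
    return rng.choices(ports, weights=weights, k=1)[0]


# Split ratios toward destination D, taken from the slide example.
W_D = {"N": 0.6, "E": 0.4}

rng = random.Random(0)  # fixed seed so the run is repeatable
counts = {"N": 0, "E": 0}
for _ in range(10_000):
    counts[pick_output_port(W_D, rng)] += 1

# Roughly 60% of packets should leave through N and 40% through E.
print(counts)
```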
Outline
• Introduction
• Motivation
• Destination-Based Adaptive Routing (DAR)
– Distributed delay measurement
– Split ratio adaptation
– Scaling
• Evaluation
Distributed Delay Measurement
• A node maintains:
– Per-destination traffic split ratio through candidate output ports: W[p][j]
– Delay to next-hop router/ejection interface through each output port (N, S, E, W, Ej): l[p]
Distributed Delay Measurement
• Every node estimates average delay to all other nodes in the network
[Figure: 4x4 mesh, nodes numbered row by row from the bottom-left; node 10 sends Avg10[10] to its four neighbors]
12 13 14 15
 8  9 10 11
 4  5  6  7
 0  1  2  3
(Avgk[j]: average delay from node k to node j; lk[p]: local delay at node k through port p)
1. Delay from node 10 to itself: Avg10[10] = l10[Ej]
2. Avg10[10] propagated to neighbors
3. Nodes 6, 9, 14, 11 add local delay to Avg10[10] to compute delay to node 10
4. For example, at node 9:
L[E][10] = l[E] + Avg10[10]
Avg9[10] = L[E][10]
Distributed Delay Measurement
• Every node estimates delay to all other nodes in the network
[Figure: 4x4 mesh, nodes 0 to 15; nodes 14, 11, 9, and 6 send Avg14[10], Avg11[10], Avg9[10], and Avg6[10] to their upstream neighbors]
1. Nodes 6, 9, 14, 11 propagate estimated delay to node 10 to upstream neighbors
2. For example, node 5 receives two delay updates, from nodes 9 and 6:
A[E][10] = Avg6[10]
A[N][10] = Avg9[10]
3. Node 5 adds local link delay to each received delay update:
L[E][10] = A[E][10] + l[E]
L[N][10] = A[N][10] + l[N]
4. Finally, average delay from node 5 to node 10 is computed as:
Avg5[10] = W[E][10]·L[E][10] + W[N][10]·L[N][10]
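The four steps can be traced with a small numeric sketch. All delay values and split ratios below are made up for illustration, and the helper names are hypothetical; only the formulas L[p][j] = A[p][j] + l[p] and Avg[j] = Σ W[p][j]·L[p][j] come from the slides.

```python
def delay_through_port(received_avg: float, local_delay: float) -> float:
    """L[p][j] = A[p][j] + l[p]: downstream estimate plus local delay on port p."""
    return received_avg + local_delay


def average_delay(split: dict[str, float], delay: dict[str, float]) -> float:
    """Avg[j] = sum over candidate ports p of W[p][j] * L[p][j]."""
    return sum(split[p] * delay[p] for p in split)


# Step 1: node 10 reports its own ejection delay (illustrative value).
avg10_to_10 = 3.0

# Steps 2-3: nodes 9 and 6 each have a single minimal port toward node 10,
# so their average equals the delay through that port.
avg9_to_10 = delay_through_port(avg10_to_10, local_delay=5.0)  # node 9, port E
avg6_to_10 = delay_through_port(avg10_to_10, local_delay=7.0)  # node 6, port N

# Node 5 receives updates from node 6 (east) and node 9 (north).
A = {"E": avg6_to_10, "N": avg9_to_10}
l = {"E": 4.0, "N": 2.0}    # local delays at node 5 (illustrative)
W = {"E": 0.25, "N": 0.75}  # split ratios at node 5 (illustrative)

L = {p: delay_through_port(A[p], l[p]) for p in A}
avg5_to_10 = average_delay(W, L)

print(L)           # {'E': 14.0, 'N': 10.0}
print(avg5_to_10)  # 11.0
```

Each node only needs its neighbors' averages, never the per-path delays, which keeps the update traffic to one value per destination per link.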
Outline
• Introduction
• Motivation
• Destination-Based Adaptive Routing (DAR)
– Distributed delay measurement
– Split ratio adaptation
– Scaling
• Evaluation
Adaptation of Split ratio
• Objective: Equalize delay on candidate output ports
• If only one candidate output, split ratio is 1
• If two candidate outputs:
– Let ph be the port with higher delay to destination j
– Let pl be the port with lower delay to destination j
– W[ph][j] + W[pl][j] = 1
– Δ traffic shifted from ph to pl every T cycles
– Δ proportional to (L[ph][j]-L[pl][j])/L[ph][j]
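A minimal sketch of one adaptation step under these rules. The proportionality constant (`gain`) is not given in the slides; it is a hypothetical parameter here, and the value 0.6 in the usage example is chosen only because it reproduces the 0.6/0.4 to 0.8/0.2 shift from the DAR high-level example.

```python
def adapt_split_ratios(W: dict[str, float], L: dict[str, float],
                       gain: float = 0.5) -> dict[str, float]:
    """One adaptation step for a destination with two candidate ports.

    W: current split ratios (sum to 1). L: estimated delays per port.
    gain: hypothetical proportionality constant for the shifted amount.
    """
    p1, p2 = W  # exactly two candidate ports expected
    if L[p1] == L[p2]:
        return dict(W)  # delays already equal, nothing to shift
    ph, pl = (p1, p2) if L[p1] > L[p2] else (p2, p1)
    # Delta is proportional to the relative delay difference.
    delta = gain * (L[ph] - L[pl]) / L[ph]
    # Cannot shift more traffic than the slower port currently carries.
    delta = min(delta, W[ph])
    return {ph: W[ph] - delta, pl: W[pl] + delta}


W = {"N": 0.6, "E": 0.4}
L = {"N": 20.0, "E": 30.0}
new_W = adapt_split_ratios(W, L, gain=0.6)
print({p: round(r, 3) for p, r in new_W.items()})  # {'E': 0.2, 'N': 0.8}
```

Because Δ shrinks as the two delays converge, the update settles once the delays through both ports are equal, which is the stated objective.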
Granularity of Congestion Estimation
[Figure: coarse-to-fine axis: local congestion, dimension-based congestion, quadrant-based congestion, destination-based congestion (annotated "Does not scale!"), with scalable destination-based congestion added]
Outline
• Introduction
• Motivation
• Destination-Based Adaptive Routing (DAR)
– Distributed delay measurement
– Split ratio adaptation
– Scaling
• Evaluation
Look-ahead Window
[Figure: look-ahead window centered at S, with nodes A, B, C and corresponding labels PA, PB, PC]
• Node S maintains delay estimates for an M×M window centered at S
• Any node outside the window is mapped to the closest node within the window
• A packet's look-ahead window shifts as it is routed from source to destination
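The mapping rule can be sketched as a per-coordinate clamp. This assumes an odd window size M, (x, y) node coordinates, and ignores mesh edges; `map_to_window` is a hypothetical helper, not a name from the slides.

```python
def map_to_window(current: tuple[int, int], dest: tuple[int, int],
                  M: int) -> tuple[int, int]:
    """Map dest to the closest node inside the MxM window centered at current.

    Assumes M is odd, so the window extends (M - 1) // 2 hops in each
    direction. Clamping each coordinate independently yields the closest
    in-window node.
    """
    r = (M - 1) // 2
    cx, cy = current
    dx, dy = dest
    mapped_x = max(cx - r, min(dx, cx + r))
    mapped_y = max(cy - r, min(dy, cy + r))
    return (mapped_x, mapped_y)


# 7x7 window centered at (8, 8) covers x and y in 5..11.
print(map_to_window((8, 8), (15, 9), M=7))  # (11, 9)
print(map_to_window((8, 8), (0, 0), M=7))   # (5, 5)
print(map_to_window((8, 8), (9, 10), M=7))  # (9, 10), already inside
```

As the packet moves, `current` changes and the same call is repeated, so the true destination eventually falls inside the window.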
Window Size
• Destination D guaranteed to be within window once packet is within (M-1)/2 hops of D
• Intuition: Packet has (M-1)/2 hops to route around congestion hot spots
• 7x7 look-ahead window in 16x16 mesh performs comparably to full DAR, which is equivalent to a 31x31 look-ahead window
Outline
• Introduction
• Related work
• Destination-Based Adaptive Routing (DAR)
• Evaluation
Experimental setup
• Compare DAR with RCA-1D, RCA-quadrant, Local adaptive
• SPLASH-2 benchmarks + synthetic traffic patterns (uniform, transpose, shuffle)
• Cycle-accurate NoC simulator models 3-stage router pipeline
• 8 virtual channels (VCs), each 5 flits deep
• 1 VC used as an escape VC for deadlock prevention
Splash results – 7x7 mesh
[Figure: normalized packet latency (axis 0 to 1.2) for DAR, RCA-quadrant, RCA-1D, and Local on fft, lu, water-ns, water-s, radix, raytrace, barnes, and ocean, plus average and geomean; annotation: 41%]
Splash results – 7x7 mesh
[Figure: normalized packet latency (axis 0 to 1.2) for DAR, RCA-quadrant, RCA-1D, and Local on fft, lu, water-ns, water-s, radix, raytrace, barnes, and ocean, plus average and geomean; annotation: 65%]
Uniform traffic – 8x8 mesh
Transpose traffic – 8x8 mesh
Shuffle traffic – 8x8 mesh
SDAR - 16x16 mesh, 7x7 window
[Figure: average packet latency in cycles (axis 0 to 300) for DAR, SDAR, RCA-quad, RCA-1D, and Local]
Average latency over 100 permutation traffic patterns at 18% injection load
[Figure: percentage of traffic patterns (axis 0% to 100%) for which the network saturates vs. stays below saturation, for DAR, SDAR, RCA-quad, RCA-1D, and Local]
Network saturation statistics at 18% injection load
Summary
• Destination-based Adaptive Routing (DAR) for 2D mesh networks
• Scalable DAR (SDAR) uses a look-ahead window and easily scales to large networks
• DAR outperforms existing adaptive and oblivious routing
• SDAR achieves comparable performance with significantly lower overhead
Thank you!!
Key implementation details
• Simple router implementation: low storage, low bandwidth
• Synchronize delay updates to reuse delay computation and weight adaptation hardware
• Approximate computations to simplify implementation
Router architecture – Kim et al., DAC '05
[Figure: router block diagram with quadrant-based port pre-selection. Labels: ports In, N, S, E, W, Ej; VC-1 … VC-v; preferred output registers; congestion value registers; routing unit; VC allocator; XB allocator; credits; override]
DAR Router
[Figure: DAR router block diagram with destination-based port pre-selection; legend distinguishes storage overhead from logic overhead. Labels: ports In, N, S, E, W, Ej; VC-1 … VC-v; VC allocator; XB allocator; preferred output registers p[0] … p[N-1]; per-destination split ratios W[px, py][0] … W[px, py][N-1]; adapt weights; latency measurement; counters cnt[0] … cnt[P-1] with increment/decrement; exponentially averaged local delay l[0] … l[P-1]; received delays A[px][0] … A[px][N-1] and A[py][0] … A[py][N-1]; port delays L[px][0] … L[px][N-1] and L[py][0] … L[py][N-1]; latency propagation of Avg[0] … Avg[N-1]; W; λ]
Distributed delay measurement
• A node maintains:
– Per-destination traffic split ratio through candidate output ports: W[p][j]
– Delay to next-hop router/ejection interface through each output port (N, S, E, W, Ej): l[p]
• Using updates received from downstream nodes, a node computes:
– L[p][j]: Average delay from current node to node j through output port p
– Avg[j]: Average delay from current node to node j
Destination-based Adaptive Routing (DAR)
• Every router maintains per-destination split ratios which control traffic distribution to output ports
• Split ratios adjusted every T cycles based on measured delay to D through the two ports
[Figure: routes from S to D through regions of low, moderate, and high congestion, annotated with split ratios 0.8/0.2, 0.7/0.3, 1, and 1]