Zurich Research Laboratory
GLOBECOM 2004 © 2004 IBM Corporation
Low-Latency Pipelined Crossbar Arbitration
Cyriel Minkenberg, Ilias Iliadis, François Abel
IBM Research, Zurich Research Laboratory
Outline
Context: the OSMOSIS project
Problem: low-latency, high-throughput crossbar arbitration in FPGAs
Approach: a new way to pipeline parallel iterative matching algorithms
Simulation results: latency-throughput as a function of pipeline depth
Conclusions
OSMOSIS
Optical Shared MemOry Supercomputer Interconnect System
Sponsored by DoE & NNSA as part of ASCI
Joint 2½-year project
• Corning: optics and packaging
• IBM: electronics (arbiter, input and output adapters) and system integration
High-Performance Computing (HPC)
Massively parallel computers (e.g. Earth Simulator, Blue Gene)
Low-latency, high-bandwidth, scalable interconnection networks
Main sponsor objective
Solving the technical challenges and accelerating the cost reduction of all-optical packet switches for HPCS interconnects by
• building a full-function all-optical packet switch demonstrator system
• showing the scalability, performance and cost paths for a potential commercial system
Key requirements:
• Latency: < 1 μs end-to-end
• Line rate: 40 Gb/s per port
• Number of ports: 64
• Efficiency: 75% user data
• Error rate: BER < 10⁻²¹
• Implementation: FPGA-only
• Scalability: 2048 nodes via a 3-stage Fat Tree topology
OSMOSIS System Architecture
Broadcast-and-select architecture (crossbar)
Combination of wavelength- and space-division multiplexing
Fast switching based on SOAs
Electronic input and output adapters
Electronic arbitration
[Figure: OSMOSIS datapath. 64 ingress adapters (VOQs, control, Tx) feed the all-optical switch, which is built from 8 broadcast units (combiner, WDM mux, optical amplifier, star coupler) and 128 select units (fast SOA 1x8 fiber-selector and wavelength-selector gates); 64 egress adapters (2 Rx, EQ, control) terminate the switch outputs. Control links connect all adapters to the central arbiter, which runs a bipartite graph matching algorithm.]
OSMOSIS Arbitration
Crossbar arbitration: heuristic parallel iterative matching algorithms
RRM, PIM, i-SLIP, FIRM, DRRM, etc.
These require I = log2 N iterations to achieve good performance
Mean latency decreases as the number of iterations increases
OSMOSIS: N = 64, so I = 6 iterations
Problem
• An iteration takes too long (Ti) to complete all I iterations within one time slot (Tc)
• VHDL experiments indicate that Ti ≤ Tc ≤ 2Ti
• Poor performance…
Solution
• Pipelining
• however, in general this incurs a latency penalty
Parallel Matching: PMM
K parallel matching units (allocators)
Every allocator now has K time slots to compute a matching: K = ⌈I · Ti / Tc⌉
Requests/grants issued in round-robin TDM fashion
In every time slot, one allocator receives a set of requests, and one allocator issues a set of grants (and is reset)
Drawbacks
• Minimum arbitration latency equals K time slots
• Allocators cannot take the most recent arrivals into account in subsequent iterations
[Figure: PMM pipeline timing. Allocators A0 … AK-1 receive requests and issue grants in round-robin TDM fashion; each allocator works on its matching Mk[t] for K time slots (iteration time Ti, time slot Tc), so the arbitration latency spans K time slots.]
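As a rough illustration of the PMM timing described above (a minimal sketch, not the OSMOSIS implementation; the ratio Ti_over_Tc and the helper name are assumptions for illustration only):

import math

# Illustrative parameters: OSMOSIS has N = 64 ports, so I = log2(N) = 6 iterations.
# Ti_over_Tc (iteration time relative to the time-slot duration) is an assumed value.
N, I = 64, 6
Ti_over_Tc = 0.8
K = math.ceil(I * Ti_over_Tc)  # allocators needed so that each one gets K time slots

def pmm_turn(t, K):
    """PMM round-robin TDM: in time slot t, allocator (t mod K) issues the matching
    it started K slots earlier, is reset, and is loaded with a fresh request snapshot
    (in this sketch both roles fall to the same allocator). The minimum arbitration
    latency is therefore K time slots."""
    return t % K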
FLPPR: Fast Low-latency Parallel Pipelined aRbitration
[Figure: FLPPR pipeline with K = 4 allocators A0 … A3. Requests derived from the VOQ state enter all pipeline stages; matchings are shifted stage by stage from A3 toward A0, which issues the final matching.]
REQUEST: requests are issued, depending on the VOQ state, to all or a subset of the allocators
MATCH: every allocator Ai adds new edges based on the current requests and its existing matching
UPDATE: the new edges are accounted for by updating the VOQ state
SHIFT: at the end of every time slot, each Ai with i > 0 forwards its resulting matching to Ai-1
• AK-1 starts with an empty matching
• each Ai with i < K-1 starts with the previous matching of Ai+1
• A0 issues the final matching
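A minimal per-time-slot sketch of these four steps (assumptions: plain Python lists for the VOQ counters and matchings, iterate_matching standing in for one iteration of any parallel iterative matching algorithm, and request_filter for the request-filtering function discussed below; how aggressively UPDATE decrements the counters depends on the filtering method):

def flppr_time_slot(voq, matchings, iterate_matching, request_filter):
    """One FLPPR time slot over K pipelined allocators (a sketch).
    voq[i][j]        : pending-cell counter of VOQ(i, j)
    matchings[k]     : current matching of allocator Ak (matchings[0] is the oldest)
    iterate_matching : (matching, requests) -> list of new (input, output) edges
    request_filter   : (voq, matchings, k)  -> boolean request matrix for Ak
    """
    K, N = len(matchings), len(voq)
    for k in range(K):
        # REQUEST: issue requests for allocator k, depending on the VOQ state.
        requests = request_filter(voq, matchings, k)
        # MATCH: Ak adds new edges to its existing matching.
        for (i, j) in iterate_matching(matchings[k], requests):
            matchings[k][i][j] = 1
            # UPDATE: account for the new edge in the VOQ state.
            voq[i][j] = max(voq[i][j] - 1, 0)
    # A0 issues the final matching of this time slot.
    final_matching = matchings[0]
    # SHIFT: Ai (i > 0) forwards its matching to Ai-1; AK-1 restarts empty.
    matchings[:] = matchings[1:] + [[[0] * N for _ in range(N)]]
    return final_matching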
Request and Grant Filtering
PMM = parallel allocators, TDM requests; FLPPR = pipelined allocators, parallel requests
FLPPR allows requests to be issued to any allocator in any time slot
A request filtering function determines the subset of allocators for every VOQ
Opportunity for performance optimization by selectively submitting requests to and accepting grants from specific allocators
Request and grant filtering define a general class of algorithms:
• Request filter Rk determines the mapping between allocators and requests (selective requests depending on Lij, Mk, and k)
• Grant filter Fk can remove excess grants
[Figure: FLPPR with filtering. Line-card requests pass through request filters R0 … R3 (driven by the VOQ state) before reaching allocators A0 … AK-1; grant filters F0 … F3 are applied to the resulting matchings, one per allocator.]
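A hedged sketch of the two filter hooks named above, compatible with the flppr_time_slot sketch earlier (the function names are illustrative, not the paper's notation; the grant filter is shown here in a simplified aggregate form applied to the grant counts Gij before they are returned to the line cards):

def broadcast_request_filter(voq, matchings, k):
    """Rk without filtering: VOQ(i, j) requests every allocator while it holds cells."""
    N = len(voq)
    return [[1 if voq[i][j] > 0 else 0 for j in range(N)] for i in range(N)]

def grant_filter(voq, grants):
    """Fk-style post-filtering: never grant VOQ(i, j) more cells than it holds,
    removing excess grants before they reach the line cards."""
    N = len(voq)
    return [[min(grants[i][j], voq[i][j]) for j in range(N)] for i in range(N)]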
Example: N = 4, K = 2, without filtering
VOQ counters Lij:
0 3 0 1
1 2 0 0
0 6 2 4
2 0 0 1

Requests Rk (identical for both allocators, k = 0, 1):
0 1 0 1
1 1 0 0
0 1 1 1
1 0 0 1

Matches M0:
0 1 0 0
1 0 0 0
0 0 0 1
0 0 0 0

Matches M1:
0 1 0 0
1 0 0 0
0 0 1 0
0 0 0 1

Grants Gij:
0 2 0 0
2 0 0 0
0 0 1 1
0 0 0 1

Updated VOQ counters Lij:
0 1 0 1
0 2 0 0
0 6 1 3
2 0 0 0

VOQ(1, 0) holds one cell but receives two grants, so one grant is wasted.
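Read this way, the numbers are self-consistent: the grant matrix is the sum of the two matchings, and each VOQ counter drops by the number of grants it can actually use (the clamp at zero is where the wasted grant on VOQ(1, 0) appears). A quick plain-Python check:

L  = [[0, 3, 0, 1], [1, 2, 0, 0], [0, 6, 2, 4], [2, 0, 0, 1]]   # VOQ counters Lij
M0 = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 0, 0]]   # matching of A0
M1 = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]   # matching of A1

G      = [[M0[i][j] + M1[i][j] for j in range(4)] for i in range(4)]        # grants Gij
L_next = [[max(L[i][j] - G[i][j], 0) for j in range(4)] for i in range(4)]  # updated Lij

assert G      == [[0, 2, 0, 0], [2, 0, 0, 0], [0, 0, 1, 1], [0, 0, 0, 1]]
assert L_next == [[0, 1, 0, 1], [0, 2, 0, 0], [0, 6, 1, 3], [2, 0, 0, 0]]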
Example: N = 4, K = 2, with request filtering
VOQ counters Lij:
0 3 0 1
1 2 0 0
0 6 2 4
2 0 0 1

Requests R0:
0 1 0 1
1 1 0 0
0 1 1 1
1 0 0 1

Requests R1 (filtered: a VOQ holding a single cell requests only one allocator):
0 1 0 0
0 1 0 0
0 1 1 1
1 0 0 0

Matches M0:
0 1 0 0
1 0 0 0
0 0 0 1
0 0 0 0

Matches M1:
0 1 0 0
0 0 0 0
0 0 1 0
1 0 0 0

Grants Gij:
0 2 0 0
1 0 0 0
0 0 1 1
1 0 0 0

Updated VOQ counters Lij:
0 1 0 1
0 2 0 0
0 6 1 3
1 0 0 1

With request filtering no grants are wasted: every granted VOQ holds enough cells.
FLPPR Methods
We define three FLPPR variants
Method 1: Broadcast requests, selective post-filtering
• Requests sent to all allocators; excess grants are cancelled
Method 2: Broadcast requests, no post-filtering
• Requests sent to all allocators; no check for excess grants
• May lead to “wasted” grants
Method 3: Selective requests, no post-filtering
• Requests sent selectively (no more requests than current VOQ occupancy); no check for excess grants
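As a sketch of how the three methods map onto the filter hooks from the earlier sketches (again with hypothetical helper names): Method 1 combines broadcast requests with the grant filter, Method 2 uses broadcast requests and skips the grant filter, and Method 3 only needs a selective request filter such as:

def method3_request_filters(voq, K):
    """Selective requests (Method 3, one possible rule): VOQ(i, j) requests at most
    as many allocators as it currently holds cells, here simply the first
    min(Lij, K) allocators, so no more grants can arrive than there are cells."""
    N = len(voq)
    return [[[1 if voq[i][j] > k else 0 for j in range(N)] for i in range(N)]
            for k in range(K)]

With K = 2 this rule reproduces the request matrices of the filtered example above: VOQs holding a single cell request only allocator 0, while those holding two or more cells request both.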
FLPPR performance – Uniform Bernoulli traffic
[Figure: mean latency (time slots, log scale 0.01–1000) vs. normalized throughput under uniform Bernoulli traffic; four panels: FLPPR Method 1, Method 2, and Method 3 (each with K = 1 … 5 and 5-SLIP with K = 1 for reference), plus a comparison of FLPPR methods 1–3 with PMM at K = 5.]
FLPPR performance – Nonuniform Bernoulli traffic
[Figure: normalized throughput (0.5–1) vs. nonuniformity w under nonuniform Bernoulli traffic; three panels: FLPPR Method 1, Method 2, and Method 3, each with K = 1 … 5 and 5-SLIP (K = 1) for reference.]
Arbiter Implementation
[Figure: arbiter block diagram with per-port channel-controller interfaces CC[00] … CC[63] (Rx/Tx, VOQ state), request filters Rj, allocators A[0] … A[K-1], grant filters Fj, SCI links to switch command controllers SCC[00] … SCC[15], and SYSCLK & CTRL distribution.]
Conclusions
Problem: Short packet duration makes it hard to complete enough iterations
Pipelining achieves a high matching rate with a highly distributed implementation
FLPPR (pipelining with parallel requests) has performance advantages:
• Eliminates the pipelining latency at low load
• Achieves 100% throughput with uniform traffic
• Reduces latency with respect to PMM at high load as well
• Can improve throughput with nonuniform traffic
Request pre- and post-filtering allows performance optimization
• Different traffic types may require different filtering rules
Future work: Find filtering functions that optimize uniform and non-uniform performance
Highly amenable to distributed implementation in FPGAs
Can be applied to any existing iterative matching algorithm