NoCs 2010, Grenoble, France, May 4, 2010. Ramanujam, Soteriou, Lin, Peh
Design of a High-Throughput
Distributed Shared-Buffer NoC Router
Rohit Sunkam Ramanujam*, Vassos Soteriou†,
Bill Lin*, Li-Shiuan Peh‡
*Dept. of Electrical Engineering, UCSD, USA
†Dept. of Electrical Engineering, CUT, Cyprus
‡Dept. of Electrical Eng. and Computer Science, MIT, USA
Chip Multiprocessors are a reality
[Figure labels: Uniprocessor, Chip Multiprocessor]
Sources: Intel Inc. and Tilera Inc.
The need for a Network on Chip (NoC)
• Scalable communication
• Modular design
• Efficient use of wires
• A new way to organize and build VLSI systems
[Figure labels: Compute Unit, Router]
The Problem – Delivering high throughput in NoCs
• Why Care?
– NoCs in CMPs connect general-purpose processors.
– Future applications unknown → traffic unknown.
– Exploiting parallelism needs fine-grained interaction
between cores.
– Can expect high traffic volume for current and
future applications running on many-core
processors.
– E.g. Cache coherence between large number of
distributed shared L2 caches.
An important design choice that
affects throughput
• Router microarchitecture
– How well does a router multiplex packets onto its
output links?
NoC routers – Current design
Input Buffered Routers (IBRs) – Flits buffered at the input ports
[Figure: Input 1 and Input 2 connected through a crossbar to Output 1 and Output 2, shown at cycle = 1, cycle = 2 and cycle = 3]
Maximal Matching: Input 1 → Output 1
Maximal Matching: Input 2 → Output 1
Output 2 is unutilized in cycle 3 although there is a flit destined for output 2.
Bottleneck: Maximal matching used for arbitration is not good enough (70-80% efficiency).
Output queueing to the rescue …
Output buffered router (OBR) – Flits buffered at the output ports
[Figure: Input 1 and Input 2 connected through a crossbar to Output 1 and Output 2, shown at cycle = 1, cycle = 2 and cycle = 3]
Output links are always utilized when there are flits available.
Better multiplexing of flits onto output links ⇒ higher throughput.
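The difference between the two buffering schemes can be illustrated with a toy model. The sketch below is not the simulator used in the talk: the function names, the fixed-priority arbiter and the example queues are all assumptions chosen for illustration. It counts how many cycles it takes to drain a set of flits when only the head flit of each input FIFO may compete for an output (IBR), versus when every flit is already queued at its output (an idealized OBR).

```python
from collections import deque

def ibr_cycles(queues):
    """Cycles needed to drain FIFO input queues when only head flits compete.

    queues: one list per input port; each entry is the output port a flit wants.
    Arbitration is fixed-priority over inputs (an assumption), which yields a
    maximal matching between head flits and outputs in every cycle.
    """
    qs = [deque(q) for q in queues]
    cycles = 0
    while any(qs):
        granted = set()                 # outputs already taken this cycle
        for q in qs:
            if q and q[0] not in granted:
                granted.add(q.popleft())
        cycles += 1
    return cycles

def obr_cycles(queues):
    """Cycles needed when every flit already sits in its output queue.

    Idealized: each output sends one flit per cycle, so the busiest output
    determines the drain time.
    """
    counts = {}
    for q in queues:
        for out in q:
            counts[out] = counts.get(out, 0) + 1
    return max(counts.values(), default=0)

# Hypothetical example: Input 1 holds flits for outputs 1, 1, 2; Input 2 for 1, 2.
example = [[1, 1, 2], [1, 2]]
print(ibr_cycles(example), obr_cycles(example))
```

In this example the flit for output 2 at Input 2 waits behind a blocked head flit, so output 2 sits idle while work for it exists, which is the effect the IBR slide describes.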
How much difference does it make?
• Uniform Traffic: A throughput gap of 18%!
• Complement Traffic: A throughput gap of 12%!
• Tornado Traffic: A throughput gap of 22%!
Performance impact on real applications
Up to 98% reduction in average packet latency
Splash 2 benchmark applications
Output Buffering is great …
• OBRs offer much higher throughput than IBRs.
• OBRs have predictable delay.
– Queuing delay modeled using M/D/1 queues.
• Packet delays not predictable for IBRs.
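For reference, the standard M/D/1 result (not derived on the slides) gives the mean queueing delay in closed form. The symbols below are assumptions introduced here: Poisson arrivals at rate λ, a fixed service time of 1/μ, and utilization ρ = λ/μ < 1.

```latex
W_q = \frac{\rho}{2\mu(1-\rho)}, \qquad \rho = \frac{\lambda}{\mu} < 1
```

If an output link is assumed to serve one flit per cycle (μ = 1), this reduces to W_q = ρ / (2(1 − ρ)) cycles; at 50% load, for example, the mean wait is 0.5 cycles.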
So why aren’t OBRs used in NoCs?
• Implementing Output Buffering requires either:
– Crossbar speedup of P, where P is the number of ports.
Not practical for aggressively clocked designs.
– Output buffers with P write ports and a PxP² crossbar.
Has huge area and power penalties.
[Figure: Inputs 1, 2, …, P-1 connected through a crossbar to Outputs 1, …, P-1]
Our approach: Emulate Output
Queueing without any speedup
[Figure: Inputs 1-3 → Crossbar 1 → Middle Memories → Crossbar 2 → Outputs 1-3; animation frames for current time = 1 through 6, with flits carrying timestamps 4, 5 and 6]
Step 1: Timestamp the flits. Assign a future time at which a flit would depart the router assuming output buffering.
Step 2: Find a conflict-free middle memory.
Step 3: Move flits from input buffers to middle memories.
Step 4: When current time == timestamp, read flit from middle memory to output port.
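The four steps can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the router's actual logic: the class and method names are hypothetical, each middle memory is assumed to accept one write and one read per cycle, timestamps are handed out first-come-first-served per output, and flits that fail to find a memory are simply reported (no retry or flow control is modeled).

```python
class DSBSketch:
    """Toy model of timestamping and middle-memory assignment."""

    def __init__(self, ports, memories):
        # Earliest departure slot still free at each output port.
        self.next_free = [0] * ports
        # One dict per middle memory: timestamp -> output port of the stored flit.
        self.memories = [dict() for _ in range(memories)]

    def accept(self, arrivals, now):
        """Steps 1-3 for the flits arriving in cycle `now`.

        arrivals: list of (input_port, output_port), at most one per input.
        Returns (input_port, timestamp, memory) per flit; memory is None
        when no conflict-free memory exists.
        """
        written = set()  # memories already written this cycle (arrival conflicts)
        placed = []
        for inp, out in arrivals:
            # Step 1: departure time this flit would get with output buffering.
            ts = max(now + 1, self.next_free[out])
            # Step 2: a memory not written this cycle and holding no flit
            # with the same timestamp (departure conflict).
            mem = next((m for m, slots in enumerate(self.memories)
                        if m not in written and ts not in slots), None)
            if mem is not None:
                # Step 3: move the flit into the chosen middle memory.
                written.add(mem)
                self.memories[mem][ts] = out
                self.next_free[out] = ts + 1
            placed.append((inp, ts, mem))
        return placed

    def depart(self, now):
        """Step 4: output ports that receive a flit in cycle `now`."""
        return sorted(slots.pop(now) for slots in self.memories if now in slots)


router = DSBSketch(ports=3, memories=5)   # 2P-1 memories for P = 3
print(router.accept([(0, 0), (1, 0), (2, 1)], now=1))
print(router.depart(2), router.depart(3))
```

Two flits bound for output 0 receive consecutive timestamps, while the flit for output 1 shares the first timestamp but lands in a different memory, so each memory is read at most once per cycle.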
Arrival and Departure Conflicts
• Arrival Conflicts – With P input ports, a flit can
have an arrival conflict with P-1 other flits.
• Departure Conflicts – With P output ports, a flit
can have a departure conflict with P-1 other
flits.
• By the pigeonhole principle, 2P-1 middle memories
are needed to avoid all arrival and departure
conflicts.
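The count can be written out explicitly. In the worst case, every one of the P-1 flits that conflict on arrival and every one of the P-1 flits that conflict on departure blocks a different middle memory, so one more memory guarantees a free choice:

```latex
\underbrace{(P-1)}_{\text{arrival conflicts}}
+ \underbrace{(P-1)}_{\text{departure conflicts}}
+ \underbrace{1}_{\text{free memory}}
= 2P - 1,
\qquad P = 5 \;\Rightarrow\; 2P - 1 = 9 .
```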
We Propose: The Distributed
Shared-Buffer Router (DSB)
• Aims at emulating the packet servicing scheme of an OBR with limited buffers and no speedup.
– First-Come-First-Served servicing of flits.
Objectives:
– Close the performance gap between OBRs with infinite buffers and IBRs (high throughput).
– Make a feasible design → low power and area overhead.
– Make packet delays more predictable for delay-sensitive NoC applications.
DSB Router
Innovations
– Router pipeline with new stages for:
• Timestamping flits
• Finding a conflict-free middle memory
– Complexity and delay-balanced pipeline stages for a
high-clocked, high-performance implementation.
– New flow control to prevent packet dropping when
resources are unavailable.
– Evaluate power-performance tradeoff of DSB
architectures with fewer than 2P-1 middle memories.
Evaluation
• Cycle-accurate, flit-level simulator.
• Mesh topology – Each router has 5 ports,
NSEW + Injection/Ejection.
• Dimension Ordered Routing (DOR) – decouple
effects of routing algorithm on network
performance.
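As a reminder of how dimension-ordered routing picks an output port, here is a minimal sketch of XY routing on a 2D mesh. The coordinate convention (x grows to the east, y grows to the north), the port names and the function names are assumptions made for this example; the slides do not specify them.

```python
# Coordinate change for each mesh direction (assumed convention).
STEP = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}

def dor_next_port(cur, dst):
    """Output port chosen at router `cur` for a flit headed to `dst`.

    The X dimension is fully corrected before the Y dimension, so every
    source/destination pair uses exactly one deterministic path.
    """
    (cx, cy), (dx, dy) = cur, dst
    if dx != cx:
        return "E" if dx > cx else "W"
    if dy != cy:
        return "N" if dy > cy else "S"
    return "EJECT"

def dor_path(src, dst):
    """Sequence of output ports taken from `src` until the flit is ejected."""
    cur, ports = src, []
    while True:
        port = dor_next_port(cur, dst)
        ports.append(port)
        if port == "EJECT":
            return ports
        sx, sy = STEP[port]
        cur = (cur[0] + sx, cur[1] + sy)

print(dor_path((0, 0), (2, 1)))
```

Because the path is fixed by the source and destination alone, differences measured between routers come from the router microarchitecture and not from routing decisions.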
Evaluation – Traffic traces
• 3 Synthetic traffic traces:
– Uniform
– Bit Complement (Complement)
– Tornado
• Real traffic/memory traces from running
multiple threads (49 threads ⇒ 7x7 Mesh) of
eight SPLASH-2 benchmarks:
– Complex 1D FFT, LU decomposition, Water-
nsquared, Water-spatial, Ray tracer, Barnes-Hut,
Integer Radix sort, Ocean simulation.
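The slides do not define the synthetic patterns, so the sketch below uses common textbook forms as an assumption: uniform picks a random destination, complement mirrors each coordinate across the k x k mesh (a coordinate-wise form of bit complement that also works when k is not a power of two), and tornado shifts the x coordinate by ⌈k/2⌉ - 1 with wrap-around. The function names and the choice to apply tornado along x only are illustrative.

```python
import math
import random

def uniform_dest(src, k, rng):
    """Destination chosen uniformly at random over the k x k mesh.

    The draw may return `src` itself; a simulator would usually redraw.
    """
    return (rng.randrange(k), rng.randrange(k))

def complement_dest(src, k):
    """Mirror each coordinate: node (x, y) sends to (k-1-x, k-1-y)."""
    x, y = src
    return (k - 1 - x, k - 1 - y)

def tornado_dest(src, k):
    """Shift x by ceil(k/2) - 1 positions, wrapping around; y is unchanged."""
    x, y = src
    return ((x + math.ceil(k / 2) - 1) % k, y)

k = 7  # 7x7 mesh, matching the 49-thread traces
print(complement_dest((0, 0), k), tornado_dest((5, 2), k))
print(uniform_dest((0, 0), k, random.Random(0)))
```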
Performance on Uniform traffic
A throughput gap of just 9%
Performance on Complement traffic
A throughput gap of just 4%
Performance on Tornado traffic
A throughput gap of just 8%
Performance of DSB on SPLASH-2 benchmarks
Performance of DSB is very close to an OBR with the same number of pipeline stages.
Small difference in packet latency between OBR and DSB routers is mainly due to the limited buffering in the DSB router.
Huge performance improvements over IBR in traces exhibiting high contention and demanding high bandwidth.
[Figure annotations: 97%, 64%, 72%]
Raytrace, Barnes and Ocean traces have very little contention.
For these traces, IBR has lower latency because of a shorter pipeline.
Input Buffered Router (IBR) pipeline
RC → VA → SA → ST → LT
• Route Computation (RC): Determine the output port of the flit based on the destination coordinates.
• Virtual Channel Allocation (VA): Reserve an output Virtual Channel (buffering) at the next-hop router.
• Switch Arbitration (SA): Acquire access to the output port through the crossbar.
• Switch Traversal (ST): Traverse the crossbar to reach the output link.
• Link Traversal (LT): Traverse the link to reach the input buffer of the next-hop router.
[Figure: Input 1, Input 2 → Crossbar → Output 1, Output 2]
Distributed Shared-Buffer Router pipeline
RC → TS → CR + VA → XB1 + MM_WR → MM_RD + XB2 → LT
• Route Computation (RC): Determine the output port of the flit based on the destination coordinates.
• Timestamp Allocation (TS): Assign a timestamp to a flit for the output port requested. Timestamp is the future time (cycle) at which the flit can depart the middle memory buffer.
• Conflict Resolution (CR) + Virtual Channel Allocation (VA):
– Conflict Resolution: Find a conflict-free middle memory.
– Virtual Channel Allocation: Reserve a virtual channel at the input of the next-hop router.
• Crossbar 1 + Middle Memory Write (XB1 + MM_WR): Flit traverses the first crossbar and gets written into the assigned middle memory.
• Middle Memory Read + Crossbar 2 (MM_RD + XB2): When the current time equals the timestamp, the flit is read from the middle memory and traverses the second crossbar.
• Link Traversal (LT): Flit traverses the output link to reach the input buffer of the next-hop router.
[Figure: Input 1, Input 2 → Crossbar 1 → Middle Memory → Crossbar 2 → Output 1, Output 2; annotation: "If CR or VA fails"]
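The pipeline depths of the two routers can be compared with a back-of-the-envelope calculation. The sketch below assumes one cycle per stage, no contention, and no waiting in the middle memory, none of which is stated on the slides; the names are hypothetical.

```python
# Stage lists as named on the IBR and DSB pipeline slides.
IBR_STAGES = ["RC", "VA", "SA", "ST", "LT"]
DSB_STAGES = ["RC", "TS", "CR+VA", "XB1+MM_WR", "MM_RD+XB2", "LT"]

def zero_load_cycles(stages, hops):
    """Cycles to cross `hops` routers at one cycle per stage (assumed)."""
    return len(stages) * hops

hops = 4  # hypothetical path length
print(zero_load_cycles(IBR_STAGES, hops), zero_load_cycles(DSB_STAGES, hops))
```

Under these assumptions the DSB router costs one extra cycle per hop, which matches the observation that the IBR has lower latency on traces with very little contention.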
Higher throughput – At what cost?
Extra power!!
[Figure: DSB router pipeline (RC → TS → CR + VA → XB1 + MM_WR → MM_RD + XB2 → LT) and datapath (Input 1, Input 2 → Crossbar 1 → Middle Memory → Crossbar 2 → Output 1, Output 2)]
• TS stage instead of Switch Arbitration in IBRs
• Extra stage for Conflict Resolution
• Middle memory buffers – Can have fewer input buffers to compensate for extra middle memory buffers.
• Two crossbars instead of one: With N middle memories, need one PxN and one NxP crossbar.
Power-Performance tradeoff
• Theoretically, 2P-1 middle memories needed
to resolve all conflicts.
• For a 5-port mesh router, need 9 middle
memories, a 5x9 and a 9x5 crossbar – large
power overhead.
• We used 5 middle memories and two 5x5
crossbars and saw almost no difference in
performance but some reduction in power.
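To get a feel for the crossbar cost, the sketch below counts crosspoints for the two DSB configurations and for a single 5x5 crossbar. Crosspoint count is only a rough proxy for crossbar area and power, and the function name is hypothetical.

```python
def dsb_crosspoints(ports, memories):
    """Crosspoints in one ports x memories plus one memories x ports crossbar."""
    return 2 * ports * memories

P = 5
print(dsb_crosspoints(P, 2 * P - 1))  # 9 middle memories: 5x9 and 9x5
print(dsb_crosspoints(P, 5))          # 5 middle memories: two 5x5
print(P * P)                          # single 5x5 crossbar
```

With 9 middle memories the two crossbars have 90 crosspoints in total; with 5 middle memories they have 50, against 25 for a single 5x5 crossbar.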
Power and Area Comparison
• Router power overhead of 50% for DSB-5 router
• If NoC consumes 10% of tile power, tile power overhead of only 3.5% for DSB-5 router
• If NoC consumes 20% of tile power, tile power overhead of only 7% for DSB-5 router
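The tile-level figures follow from scaling an NoC-level overhead by the NoC's share of tile power. The sketch below works backwards from the quoted numbers; the function name is hypothetical. Both tile-level figures imply the same NoC-level overhead of 35%. The slide does not say how this relates to the 50% router figure; one possible reading (an assumption) is that NoC power includes components other than the routers, such as links.

```python
def implied_noc_overhead(noc_share, tile_overhead):
    """NoC-level overhead implied by a tile-level overhead.

    Assumes tile_overhead = noc_share * noc_overhead, i.e. only the NoC's
    portion of tile power grows.
    """
    return tile_overhead / noc_share

print(round(implied_noc_overhead(0.10, 0.035), 3))
print(round(implied_noc_overhead(0.20, 0.07), 3))
```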
Conclusions
• DSB offers better throughput than IBR with
comparable buffering
• DSB can emulate OBR and offer near-theoretical
throughput with limited buffering
• Simulations with synthetic and real-application
traces prove benefits of DSB
• Downside is the larger CMOS area and relatively
increased NoC energy expenditure
Thank you
• Questions?