
NoCs 2010, Grenoble, France, May 4, 2010

Design of a High-Throughput Distributed Shared-Buffer NoC Router

Rohit Sunkam Ramanujam*, Vassos Soteriou†, Bill Lin*, Li-Shiuan Peh‡

*Dept. of Electrical Engineering, UCSD, USA

†Dept. of Electrical Engineering, CUT, Cyprus

‡Dept. of Electrical Eng. and Computer Science, MIT, USA


Chip Multiprocessors are a reality

[Figure: uniprocessor versus chip multiprocessor. Sources: Intel Inc. and Tilera Inc.]


The need for a Network on Chip (NoC)

• Scalable communication

• Modular design

• Efficient use of wires

• A new way to organize and build VLSI systems

[Figure: compute units connected through routers.]


The Problem – Delivering high throughput in NoCs

• Why care?

– NoCs in CMPs connect general-purpose processors.

– Future applications unknown → traffic unknown.

– Exploiting parallelism needs fine-grained interaction between cores.

– Can expect high traffic volume for current and future applications running on many-core processors.

– E.g. cache coherence between a large number of distributed shared L2 caches.


An important design choice that affects throughput

• Router microarchitecture

– How well does a router multiplex packets onto its output links?


NoC routers – Current design

Input Buffered Routers (IBRs) – Flits buffered at the input ports

[Figure: Input 1 and Input 2 connected to Output 1 and Output 2 through a crossbar, animated over cycle = 1, cycle = 2 and cycle = 3.]

Maximal Matching: Input 1 → Output 1

Maximal Matching: Input 2 → Output 1

Output 2 is unutilized in cycle 3 although there is a flit destined for output 2.

Bottleneck: Maximal matching used for arbitration is not good enough (70-80% efficiency).


Output queueing to the rescue …

Output buffered router (OBR) – Flits buffered at the output ports

[Figure: Input 1 and Input 2 connected to Output 1 and Output 2 through a crossbar, animated over cycle = 1, cycle = 2 and cycle = 3.]

Output links are always utilized when there are flits available.

Better multiplexing of flits onto output links ⇒ higher throughput.
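The difference between the two buffering schemes can be illustrated with a toy model. The sketch below is not the simulator used in the talk: it assumes FIFO input queues, a fixed-priority greedy match on head flits for the IBR, and an idealized OBR that can accept several flits into one output queue per cycle. The function names and the arrival pattern are hypothetical.

```python
from collections import deque

def run_ibr(arrivals, num_inputs):
    """Input-buffered router: one FIFO per input, only head flits may compete."""
    queues = [deque() for _ in range(num_inputs)]
    busy = []  # busy[t] = outputs that forwarded a flit in cycle t
    t = 0
    while t < len(arrivals) or any(queues):
        if t < len(arrivals):
            for i, dest in enumerate(arrivals[t]):
                if dest is not None:
                    queues[i].append(dest)
        granted = set()
        for q in queues:  # fixed-priority greedy match (an assumption)
            if q and q[0] not in granted:
                granted.add(q.popleft())
        busy.append(sorted(granted))
        t += 1
    return busy

def run_obr(arrivals, num_outputs):
    """Idealized output-buffered router: flits go straight to their output queue."""
    queues = [deque() for _ in range(num_outputs)]
    busy = []
    t = 0
    while t < len(arrivals) or any(queues):
        if t < len(arrivals):
            for dest in arrivals[t]:
                if dest is not None:
                    # Several flits may enter one queue in the same cycle;
                    # this is what needs speedup or multi-ported buffers.
                    queues[dest].append(t)
        sent = []
        for out, q in enumerate(queues):
            if q:
                q.popleft()
                sent.append(out)
        busy.append(sent)
        t += 1
    return busy

# arrivals[t][i] is the output requested by the flit arriving on input i in cycle t.
arrivals = [(0, 0), (0, 1)]
print(run_ibr(arrivals, 2))  # output 1 idles while its flit waits behind a blocked head flit
print(run_obr(arrivals, 2))
```

In this example the IBR needs four cycles to drain the four flits because the flit for output 1 is stuck behind a head flit that lost arbitration for output 0, while the idealized OBR drains them in three.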


How much difference does it make?

A throughput gap of 18%!

Uniform Traffic


How much difference does it make?

A throughput gap of 12%!

Complement Traffic


How much difference does it make?

A throughput gap of 22%!

Tornado Traffic


Performance impact on real applications

Up to 98% reduction in average packet latency

SPLASH-2 benchmark applications


Output Buffering is great …

• OBRs offer much higher throughput than IBRs.

• OBRs have predictable delay.

– Queuing delay modeled using M/D/1 queues.

• Packet delays not predictable for IBRs.
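As a rough illustration of why OBR delay is predictable, the mean waiting time of an M/D/1 queue has a closed form (the Pollaczek-Khinchine result with deterministic service). The sketch below assumes Poisson flit arrivals at one output and a service time of one cycle per flit; neither assumption is spelled out on the slide, and the function name is hypothetical.

```python
def md1_mean_wait(rho, service_time=1.0):
    """Mean time a flit waits in an M/D/1 queue before service starts.

    rho is the utilization of the output link (arrival rate times service time).
    """
    if not 0 <= rho < 1:
        raise ValueError("utilization must satisfy 0 <= rho < 1")
    return rho * service_time / (2.0 * (1.0 - rho))

# Waiting time grows sharply as the link approaches saturation.
for load in (0.2, 0.5, 0.8, 0.95):
    print(load, md1_mean_wait(load))
```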


So why aren’t OBRs used in NoCs?

• Implementing Output Buffering requires either:

– Crossbar speedup of P, where P is the number of ports. Not practical for aggressively clocked designs.

– Output buffers with P write ports and a P×P² crossbar. Has huge area and power penalties.

[Figure: crossbar connecting Input 1, Input 2, …, Input P-1 to Output 1, …, Output P-1.]


Our approach: Emulate Output Queueing without any speedup

Step 1: Timestamp the flits. Assign a future time at which a flit would depart the router assuming output buffering.

Step 2: Find a conflict-free middle memory.

Step 3: Move flits from input buffers to middle memories.

Step 4: When current time == timestamp, read flit from middle memory to output port.

[Figure: Inputs 1 to 3, Crossbar 1, Middle Memories, Crossbar 2 and Outputs 1 to 3; flits carry timestamps 4, 5 and 6, animated from current time = 1 to current time = 6.]
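Step 1 can be sketched as a per-output counter that hands out departure slots in arrival order. This is only one plausible way to emulate first-come-first-served output queueing, not necessarily the allocator used in the DSB router; the class name, the `min_delay` parameter and the tie-breaking by call order are all assumptions.

```python
class TimestampAllocator:
    """Assigns each flit the cycle at which it would leave an output queue."""

    def __init__(self, num_outputs, min_delay=1):
        # next_free[o] is the earliest departure slot still unassigned on output o.
        self.next_free = [0] * num_outputs
        self.min_delay = min_delay  # assumed minimum cycles spent inside the router

    def assign(self, now, output):
        # A flit cannot leave before now + min_delay, nor before earlier
        # flits that were already scheduled on the same output.
        ts = max(now + self.min_delay, self.next_free[output])
        self.next_free[output] = ts + 1
        return ts

alloc = TimestampAllocator(num_outputs=3)
# Two flits for output 0 and one for output 1 arrive in cycle 1.
print(alloc.assign(1, 0), alloc.assign(1, 0), alloc.assign(1, 1))
```

Each output link sends at most one flit per cycle, so no two flits for the same output ever receive the same timestamp.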


Arrival and Departure Conflicts

• Arrival Conflicts – With P input ports, a flit can have an arrival conflict with P-1 other flits.

• Departure Conflicts – With P output ports, a flit can have a departure conflict with P-1 other flits.

• By the pigeonhole principle, 2P-1 middle memories are needed to avoid all arrival and departure conflicts.
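The counting argument can be checked with a small sketch: at most P-1 memories are blocked by arrival conflicts and at most P-1 by departure conflicts, so with 2P-1 memories at least one always remains. The code assumes each middle memory accepts one write and one read per cycle; the function name and argument names are hypothetical.

```python
def find_conflict_free_memory(num_memories, written_this_cycle, holding_same_timestamp):
    """Return the index of a middle memory with no arrival or departure conflict.

    written_this_cycle: memories already receiving a flit in this cycle
        (arrival conflicts).
    holding_same_timestamp: memories already holding a flit with the same
        departure timestamp (departure conflicts).
    Returns None if every memory is blocked.
    """
    blocked = set(written_this_cycle) | set(holding_same_timestamp)
    for m in range(num_memories):
        if m not in blocked:
            return m
    return None

P = 5
# Worst case: the two conflict sets are disjoint and as large as possible.
arrival_conflicts = range(0, P - 1)               # P-1 memories
departure_conflicts = range(P - 1, 2 * (P - 1))   # another P-1 memories
print(find_conflict_free_memory(2 * P - 1, arrival_conflicts, departure_conflicts))
print(find_conflict_free_memory(2 * P - 2, arrival_conflicts, departure_conflicts))
```

With 2P-1 = 9 memories the worst case still leaves memory 8 free; with only 8 memories the same case finds none, which is the situation the flow control has to handle when fewer memories are provisioned.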


We Propose: The Distributed Shared-Buffer Router (DSB)

• Aims at emulating the packet servicing scheme of an OBR with limited buffers and no speedup.

– First-Come-First-Served servicing of flits.

Objectives:

– Close the performance gap between OBRs with infinite buffers and IBRs (high throughput).

– Make a feasible design → low power and area overhead.

– Make packet delays more predictable for delay-sensitive NoC applications.


DSB Router Innovations

– Router pipeline with new stages for:

• Timestamping flits

• Finding a conflict-free middle memory

– Complexity and delay-balanced pipeline stages for a high-clocked, high-performance implementation.

– New flow control to prevent packet dropping when resources are unavailable.

– Evaluate power-performance tradeoff of DSB architectures with fewer than 2P-1 middle memories.


Evaluation

• Cycle-accurate, flit-level simulator.

• Mesh topology – Each router has 5 ports, NSEW + Injection/Ejection.

• Dimension Ordered Routing (DOR) – to decouple the effects of the routing algorithm on network performance.
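Dimension-ordered routing on a mesh can be sketched in a few lines. The version below assumes X-first ordering, coordinates given as (x, y) with north as increasing y, and a hypothetical "EJECT" label for the ejection port; the slides do not state which dimension is routed first.

```python
def dor_next_port(cur, dst):
    """Return the output port a flit takes at router `cur` on its way to `dst`."""
    (cx, cy), (dx, dy) = cur, dst
    # Correct the X offset completely before touching Y.
    if dx > cx:
        return "E"
    if dx < cx:
        return "W"
    if dy > cy:
        return "N"
    if dy < cy:
        return "S"
    return "EJECT"  # already at the destination router

print(dor_next_port((0, 0), (2, 3)))  # still needs to move in X
print(dor_next_port((2, 0), (2, 3)))  # X done, now moves in Y
```

Because the path depends only on source and destination, every router microarchitecture sees the same set of routes, which keeps the comparison focused on the routers themselves.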


Evaluation – Traffic traces

• 3 Synthetic traffic traces:

– Uniform

– Bit Complement (Complement)

– Tornado

• Real traffic/memory traces from running multiple threads (49 threads ⇒ 7x7 Mesh) of eight SPLASH-2 benchmarks:

– Complex 1D FFT, LU decomposition, Water-nsquared, Water-spatial, Ray tracer, Barnes-Hut, Integer Radix sort, Ocean simulation.
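The three synthetic patterns map each source node to a destination. The slides do not define them, so the sketch below uses common textbook-style definitions on a k x k mesh as an assumption: complement mirrors each coordinate (which equals bit complement when k is a power of two), and tornado shifts each coordinate by ceil(k/2) - 1 modulo k. Other variants exist, and the function names are hypothetical.

```python
import math
import random

def uniform_dest(src, k, rng=random):
    """Uniform traffic: every node is an equally likely destination.

    This simple version may return the source itself.
    """
    return (rng.randrange(k), rng.randrange(k))

def complement_dest(src, k):
    """Complement traffic: mirror each coordinate across the mesh."""
    x, y = src
    return (k - 1 - x, k - 1 - y)

def tornado_dest(src, k):
    """Tornado traffic: shift each coordinate almost halfway around."""
    x, y = src
    shift = math.ceil(k / 2) - 1
    return ((x + shift) % k, (y + shift) % k)

k = 7  # mesh radix, matching the 7x7 mesh used for the SPLASH-2 traces
print(complement_dest((0, 0), k))
print(tornado_dest((0, 0), k))
```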


Performance on Uniform traffic

A throughput gap of just 9%


Performance on Complement traffic

A throughput gap of just 4%


Performance on Tornado traffic

A throughput gap of just 8%


Performance of DSB on SPLASH-2 benchmarks

Performance of DSB is very close to an OBR with the same number of pipeline stages.

Small difference in packet latency between OBR and DSB routers is mainly due to the limited buffering in the DSB router.

Huge performance improvements over IBR in traces exhibiting high contention and demanding high bandwidth.

[Figure: packet latency on the SPLASH-2 traces; annotations: 97%, 64%, 72%.]

Raytrace, Barnes and Ocean traces have very little contention. For these traces, IBR has lower latency because of a shorter pipeline.


Input Buffered Router (IBR) pipeline

RC → VA → SA → ST → LT

Route Computation (RC): Determine the output port of the flit based on the destination coordinates.

Virtual Channel Allocation (VA): Reserve an output Virtual Channel (buffering) at the next hop router.

Switch Arbitration (SA): Acquire access to the output port through the crossbar.

Switch Traversal (ST): Traverse the crossbar to reach the output link.

Link Traversal (LT): Traverse the link to reach the input buffer of the next hop router.

[Figure: Input 1 and Input 2 connected to Output 1 and Output 2 through a crossbar.]


Distributed Shared-Buffer Router pipeline

RC → TS → CR + VA → XB1 + MM_WR → MM_RD + XB2 → LT

Route Computation (RC): Determine the output port of the flit based on the destination coordinates.

Timestamp Allocation (TS): Assign a timestamp to a flit for the output port requested. Timestamp is the future time (cycle) at which the flit can depart the middle memory buffer.

Conflict Resolution + Virtual Channel Allocation (CR + VA):

– Conflict Resolution: Find a conflict-free middle memory.

– Virtual Channel Allocation: Reserve a virtual channel at the input of the next hop router.

Crossbar 1 + Middle Memory Write (XB1 + MM_WR): Flit traverses the first crossbar and gets written into the assigned middle memory.

Middle Memory Read + Crossbar 2 (MM_RD + XB2): When the current time equals the timestamp, the flit is read from the middle memory and traverses the second crossbar.

Link Traversal (LT): Flit traverses the output link to reach the input buffer of the next-hop router.

[Figure: Input 1 and Input 2, Crossbar 1, Middle Memory, Crossbar 2, Output 1 and Output 2; pipeline diagram label: "If CR or VA fails".]


Higher throughput – At what cost? Extra power!

RC → TS → CR + VA → XB1 + MM_WR → MM_RD + XB2 → LT

[Figure: Input 1 and Input 2, Crossbar 1, Middle Memory, Crossbar 2, Output 1 and Output 2.]

TS stage instead of Switch Arbitration in IBRs.

Extra stage for Conflict Resolution.

Middle memory buffers – Can have fewer input buffers to compensate for extra middle memory buffers.

Two crossbars instead of one: With N middle memories, need one PxN and one NxP crossbar.


Power-Performance tradeoff

• Theoretically, 2P-1 middle memories are needed to resolve all conflicts.

• For a 5-port mesh router, need 9 middle memories, a 5x9 and a 9x5 crossbar – large power overhead.

• We used 5 middle memories and two 5x5 crossbars and saw almost no difference in performance but some reduction in power.
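The crossbar sizes above follow directly from the port and middle-memory counts. As a rough illustration, the sketch below counts crosspoints in the two crossbars; using crosspoint count as a stand-in for cost is an assumption made here, not a figure from the talk, and the function name is hypothetical.

```python
def dsb_crossbar_crosspoints(num_ports, num_memories=None):
    """Total crosspoints in the PxN and NxP crossbars of a DSB router.

    If num_memories is omitted, the conflict-free bound N = 2P - 1 is used.
    """
    if num_memories is None:
        num_memories = 2 * num_ports - 1
    return 2 * num_ports * num_memories

print(dsb_crossbar_crosspoints(5))      # 5x9 plus 9x5
print(dsb_crossbar_crosspoints(5, 5))   # two 5x5 crossbars
```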


Power and Area Comparison

Router power overhead of 50% for DSB-5 router

If NoC consumes 10% of tile power, tile power overhead of only 3.5% for DSB-5 router

If NoC consumes 20% of tile power, tile power overhead of only 7% for DSB-5 router


Conclusions

• DSB offers better throughput than IBR with comparable buffering

• DSB can emulate OBR and offer near-theoretical throughput with limited buffering

• Simulations with synthetic and real-application traces demonstrate the benefits of DSB

• Downside is the larger CMOS area and relatively increased NoC energy expenditure


Thank you

• Questions?

