F.Gilabert †, D.Ludovici § , S.Medardoni ‡, D.Bertozzi ‡, L.Benini †† , ...

Date post: 04-Jan-2016
Designing Regular Network-on-Chip Topologies under Technology, Architecture and Software Constraints. F.Gilabert†, D.Ludovici§, S.Medardoni‡, D.Bertozzi‡, L.Benini††, G.N.Gaydadjiev§. ‡University of Ferrara. ††University of Bologna. †Universidad Politecnica de Valencia. §Delft University of Technology.
Transcript
Page 1: F.Gilabert †,  D.Ludovici  § ,  S.Medardoni ‡,  D.Bertozzi ‡,  L.Benini †† ,  G.N.Gaydadjiev §

Designing Regular Network-on-Chip Topologies under Technology,

Architecture and Software Constraints

F.Gilabert†, D.Ludovici §, S.Medardoni‡, D.Bertozzi‡, L.Benini††, G.N.Gaydadjiev§

‡University of Ferrara. ††University of Bologna. †Universidad Politecnica de Valencia. §Delft University of Technology.

Page 2

Multi-dimension topologies

2D mesh frequently used for NoC design:
- perfectly matches 2D silicon surface
- high level of modularity
- controllability of electrical parameters

But its average latency and resource consumption scale poorly with network size.

Topologies with more than 2 dimensions are attractive:
- higher bandwidth and lower average latency
- on-chip wiring more cost-effective than off-chip

But physical design issues might impact their effectiveness and even feasibility (decreased operating frequency, higher link latency).

Page 3

Objective
Explore the effectiveness and feasibility of multi-dimensional topologies

Under realistic technological constraints

1. Physical synthesis impact over performance

• Over-the-cell routing?
• Latency in injection links?
• Latency in express links?
• Which switch operating frequency?
• Regularity broken by asymmetric tile size or heterogeneous tiles!

Our approach
• Physical parameters from the physical synthesis are applied to system-level simulations
• Silicon-aware performance analysis

Page 4

Objective
Explore the effectiveness and feasibility of multi-dimensional topologies

Under realistic architectural constraints

Our approach
• Chip I/O interface modeling
• Capture the implications of I/O performance on topology performance differentiation

1. Physical synthesis impact over performance
2. Impact of chip I/O interface over topology performance

May introduce an upper bound to the topology performance, affecting the performance differentiation between topologies

Page 5

Objective
Explore the effectiveness and feasibility of multi-dimensional topologies

Software constraints: communication semantics of the middleware

Traffic pattern usually abstracted as an average link bandwidth utilization or as a synthetic traffic pattern

May lead to highly inaccurate performance predictions (traffic peaks, different kinds of messaging, synchronization mismatches)

Our approach
• Project network traffic based on latest advances in MPSoC communication middleware
• Generate traffic patterns for the NoC "shaped" by the above communication middleware (e.g., synchronization, communication semantics)

1. Physical synthesis impact over performance
2. Impact of chip I/O interface over topology performance
3. Realistically capture traffic behavior

Page 6

• Backend synthesis flow
• Communication semantics
• Topologies under test
• Physical synthesis
• Layout-aware topology performance
• Conclusions

Page 7

[Figure: backend synthesis flow. Blocks shown: Topology specification; Topology generation; RTL SystemC/Verilog; Simulation (OCP Traffic Generator, Transactional Simulator); VCD Trace; Physical Synthesis (Floorplan, Placement, Clock Tree Synthesis, Power Grid, Routing, Post-routing optimization); Netlist and Parasitic Extraction; PrimeTime SDF (timing); PrimeTime power estimation]

Page 8

• Backend synthesis flow
• Communication semantics
• Topologies under test
• Physical synthesis
• Layout-aware topology performance
• Conclusions

Page 9

Tile Architecture

• Processor core
  – Connected through a Network Interface Initiator
• Local memory core
  – Connected through a Network Interface Target
• Two network interfaces can be used in parallel

[Figure: tile containing a Processor Core, a Memory Core, a Network IF Initiator and a Network IF Target]

Page 10

Communication protocol

• Step 1: Producer checks local semaphores for pending messages for the destination
  – If there are none, it writes data to the local tile memory and unblocks a semaphore at the consumer tile
  – The producer is free to carry out other tasks
• Step 2: Consumer detects unblocked semaphore
  – Requests producer for data
• Step 3: Consumer reads data from the producer
• Step 4: Consumer sends a notification upon completion
  – This allows the producer to send another message to this consumer

[Figure: Producer Tile and Consumer Tile; labels: Local Polling, Write Message, Reset Semaphore, Read Operation; steps numbered 1 to 4]

• Message sent only when consumer is ready to read it
• Only one outstanding message for a producer-consumer pair
• Low network bandwidth utilization
• Tight latency constraints on the topology

Dalla Torre, A. et al., "MP-Queue: an Efficient Communication Library for Embedded Streaming Multimedia Platform", IEEE Workshop on Embedded Systems for Real-Time Multimedia, 2007.
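The four-step handshake on this slide can be sketched as a small single-threaded model. This is only an illustration of the semantics described here, not the MP-Queue library itself: the class `Tile` and the names `send`, `receive`, `pending` and `ready` are hypothetical, and steps 2 and 3 (request and remote read) are collapsed into a single dictionary lookup.

```python
class Tile:
    """Hypothetical model of a tile taking part in the handshake."""

    def __init__(self, name):
        self.name = name
        self.memory = {}   # local tile memory: consumer name -> message
        self.pending = {}  # producer-side semaphores: consumer name -> bool
        self.ready = {}    # consumer-side semaphores: producer name -> bool

    def send(self, consumer, data):
        """Step 1: refuse to send while a message to this consumer is pending."""
        if self.pending.get(consumer.name, False):
            return False
        self.memory[consumer.name] = data    # write to local tile memory
        self.pending[consumer.name] = True
        consumer.ready[self.name] = True     # unblock semaphore at the consumer
        return True                          # producer may now do other work

    def receive(self, producer):
        """Steps 2 to 4: poll, read from the producer, then notify it."""
        if not self.ready.get(producer.name, False):
            return None                      # nothing to read yet
        data = producer.memory[self.name]    # steps 2-3: request and read data
        self.ready[producer.name] = False
        producer.pending[self.name] = False  # step 4: completion notification
        return data


producer, consumer = Tile("producer"), Tile("consumer")
print(producer.send(consumer, "frame-0"))  # first message is accepted
print(producer.send(consumer, "frame-1"))  # rejected: one outstanding message
print(consumer.receive(producer))          # consumer pulls the first message
print(producer.send(consumer, "frame-1"))  # accepted after the notification
```

The model makes the "only one outstanding message per producer-consumer pair" property visible: a second `send` fails until the consumer has completed its read, which is why the protocol is sensitive to network latency rather than bandwidth.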

Page 11

• Backend synthesis flow
• Communication semantics
• Topologies under test
• Physical synthesis
• Layout-aware topology performance
• Conclusions

Page 12

Topologies Under Test – 16 tiles

                  4-ary 2-mesh
                  (2D Mesh)
Switches          16
Bis. Band.        4
Tiles x Switch    1
Switch Arity      6
Max. Hops         6

4-ary 2-mesh: baseline topology

[Figure legend: Tile, Switch]

Page 13

Topologies Under Test – 16 tiles

                  4-ary 2-mesh   2-ary 4-mesh
                  (2D Mesh)      (Hypercube)
Switches          16             16
Bis. Band.        4              8
Tiles x Switch    1              1
Switch Arity      6              6
Max. Hops         6              4

4-ary 2-mesh: baseline topology
2-ary 4-mesh: high bandwidth

[Figure legend: Tile, Switch]

Page 14

Topologies Under Test – 16 tiles

                  4-ary 2-mesh   2-ary 4-mesh   2-ary 2-mesh
                  (2D Mesh)      (Hypercube)    (Concentrated)
Switches          16             16             4
Bis. Band.        4              8              2
Tiles x Switch    1              1              4
Switch Arity      6              6              10
Max. Hops         6              4              2

4-ary 2-mesh: baseline topology
2-ary 2-mesh: low latency

[Figure legend: Tile, Switch]

Page 15

Topologies Under Test – 64 tiles

                  8-ary 2-mesh
                  (2D Mesh)
Switches          64
Bis. Band.        8
Tiles x Switch    1
Switch Arity      6
Max. Hops         14

8-ary 2-mesh: baseline topology

Page 16

Topologies Under Test – 64 tiles

                  8-ary 2-mesh   2-ary 6-mesh
                  (2D Mesh)      (Hypercube)
Switches          64             64
Bis. Band.        8              32
Tiles x Switch    1              1
Switch Arity      6              8
Max. Hops         14             6

8-ary 2-mesh: baseline topology
2-ary 6-mesh: high bandwidth

Page 17

Topologies Under Test – 64 tiles

                  8-ary 2-mesh   2-ary 6-mesh   2-ary 4-mesh
                  (2D Mesh)      (Hypercube)    (Concentrated)
Switches          64             64             16
Bis. Band.        8              32             8
Tiles x Switch    1              1              4
Switch Arity      6              8              12
Max. Hops         14             6              4

8-ary 2-mesh: baseline topology
2-ary 4-mesh: low latency
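The table entries for the topologies under test follow from the usual k-ary n-mesh relations, which the sketch below reproduces. The function name `mesh_properties` and its parameters are hypothetical. Two assumptions are made: each tile occupies two switch ports (one per network interface, as in the tile architecture slide), and bisection bandwidth is counted in links for an even k. Under these assumptions the computed values agree with the 16-tile and 64-tile tables.

```python
def mesh_properties(k, n, tiles_per_switch=1, ports_per_tile=2):
    """Structural properties of a k-ary n-mesh (k assumed even)."""
    switches = k ** n
    # A 2-ary mesh has a single neighbour per dimension; for k > 2 an
    # interior switch has two neighbours per dimension.
    neighbours = n if k == 2 else 2 * n
    return {
        "switches": switches,
        "tiles": switches * tiles_per_switch,
        "bisection": k ** (n - 1),            # links cut by an even bisection
        "arity": neighbours + ports_per_tile * tiles_per_switch,
        "max_hops": n * (k - 1),              # corner-to-corner distance
    }


for label, args in [
    ("4-ary 2-mesh", (4, 2, 1)),
    ("2-ary 4-mesh", (2, 4, 1)),
    ("2-ary 2-mesh, concentrated", (2, 2, 4)),
    ("8-ary 2-mesh", (8, 2, 1)),
    ("2-ary 6-mesh", (2, 6, 1)),
    ("2-ary 4-mesh, concentrated", (2, 4, 4)),
]:
    print(label, mesh_properties(*args))
```

Concentration trades switch count for radix: attaching four tiles to each switch cuts the number of switches by four but raises the arity from 6 to 10 (16 tiles) or from 8 to 12 (64 tiles).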

Page 18

• Backend synthesis flow
• Communication semantics
• Topologies under test
• Physical synthesis
• Layout-aware topology performance
• Conclusions

Page 19

Physical Synthesis

• Link latency and maximum frequency
  – Performance, area and power
  – Quantified by post-layout analysis
• For 16 tile systems
  – Real physical parameter values were obtained
• For 64 tile systems
  – Physical parameter values extrapolated based on 16 tiles results
  – Synthesis time constraints

Page 20

Physical Synthesis – 16 Tiles

• Network building blocks synthesized for maximum performance
• Timing path in network logic
  – Ignore switch-to-switch links.
• Critical paths are in the switches, never in the network interfaces
  – Network speed closely reflects the maximum switch radix

                        4-ary 2-mesh   2-ary 4-mesh   2-ary 2-mesh
                        (2D Mesh)      (Hypercube)    (Concentrated)
Switch Arity            6              6              10
Post-synthesis freq.    1 GHz          1 GHz          850 MHz
Post-layout freq.       786 MHz        640 MHz        600 MHz
Core speed (max. 500)   393 MHz        320 MHz        300 MHz
Cell Area               949k μm²       1108k μm²      733k μm²

Page 21

Physical Synthesis – 16 Tiles

• Inter-switch wiring reduces performance
• The connectivity pattern of the 2-ary 4-mesh results in a larger frequency drop than the 2D mesh
• The 2-ary 2-mesh pays for its lower number of switching resources with a larger switch-to-switch separation
  – Severe degradation of network performance


Page 22

Physical Synthesis – 16 Tiles

• Frequency-ratioed clock domain crossing in network interface
  – Network speed affects core speed.
• Maximum core speed of 500 MHz is assumed
• Post-layout speed drop
  – Cores cannot sustain the network speed
  – A divider of 2 is applied

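The core speeds reported for the 16-tile systems are consistent with dividing the post-layout network frequency by an integer ratio so that the result stays at or below the assumed 500 MHz core limit. A minimal sketch of that arithmetic follows; the function names are hypothetical, and picking the smallest integer divider is an assumption about how the ratio is chosen.

```python
def smallest_divider(network_mhz, core_max_mhz=500):
    """Smallest integer clock ratio that keeps the core within its limit."""
    divider = 1
    while network_mhz / divider > core_max_mhz:
        divider += 1
    return divider


def core_speed_mhz(network_mhz, core_max_mhz=500):
    """Core frequency under a frequency-ratioed clock domain crossing."""
    return network_mhz / smallest_divider(network_mhz, core_max_mhz)


# Post-layout network frequencies reported for the 16-tile systems.
for name, f_net in [("4-ary 2-mesh", 786), ("2-ary 4-mesh", 640), ("2-ary 2-mesh", 600)]:
    print(name, smallest_divider(f_net), core_speed_mhz(f_net))
```

For all three 16-tile topologies the divider comes out as 2, so any loss in network frequency translates directly into a proportional loss in core frequency.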

Page 23

Physical Synthesis – 16 Tiles

• 2-ary 4-mesh has a larger area footprint than the 2D mesh
• 2-ary 2-mesh reduces the number of switches
  – Larger radix
  – Area not halved


Page 24

Physical Synthesis – 64 tiles

• 64 tile hypercubes present very long links
  – Switch-to-switch link delay impacts overall network speed
  – Overall network speed unacceptably low for 64 tile systems
• Link pipelining becomes mandatory
  – Allows network speed to be sustained even in the presence of long links
• Number of pipeline stages depends on the link length on the layout
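The dependence of pipeline depth on link length can be sketched as follows. This is a simplified model, not the authors' flow: it assumes the wire delay of each link is known from layout, and that a link needs as many clock cycles as it takes to cover that delay at the target network frequency. The function names and the delay values in the usage lines are hypothetical.

```python
def link_latency_cycles(wire_delay_ps, network_mhz):
    """Cycles needed to traverse a link: ceil(wire delay / clock period)."""
    # Clock period in ps is 1e6 / network_mhz, so the ratio is
    # wire_delay_ps * network_mhz / 1e6, rounded up with integer arithmetic.
    cycles = -(-wire_delay_ps * network_mhz // 1_000_000)
    return max(1, cycles)


def pipeline_stages(wire_delay_ps, network_mhz):
    """Retiming stages to insert so every link segment fits in one cycle."""
    return link_latency_cycles(wire_delay_ps, network_mhz) - 1


# Hypothetical wire delays at a 786 MHz network clock (period about 1272 ps).
print(pipeline_stages(1000, 786))  # short link: fits in one cycle
print(pipeline_stages(3000, 786))  # long link: needs extra stages
```

Each added stage preserves the clock frequency at the cost of one more cycle of link latency, plus the area and power of the retiming registers.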

Page 25

Physical Synthesis – 64 tiles

[Figure: concentrated 2-ary 4-mesh]

Page 26

Physical Synthesis – 64 tiles

                        8-ary 2-mesh   2-ary 6-mesh   2-ary 4-mesh
                        (2D Mesh)      (Hypercube)    (Concentrated)
Switch Arity            6              8              12
Post-synthesis freq.    1 GHz          900 MHz        790 MHz
Post-layout freq.       786 MHz        640 MHz        500 MHz
Core speed (max. 500)   393 MHz        320 MHz        250 MHz
Cell Area               4461k μm²      7356k μm²      2610k μm²
Latency on top dimensions
  Dimension 3           -              1              1
  Dimension 4           -              1              2
  Dimension 5           -              2              -
  Dimension 6           -              3              -

Page 27

Physical Synthesis – 64 tiles

                        8-ary 2-mesh   2-ary 6-mesh   2-ary 4-mesh     Reduced
                        (2D Mesh)      (Hypercube)    (Concentrated)   2-ary 4-mesh
Switch Arity            6              8              12               12
Post-synthesis freq.    1 GHz          900 MHz        790 MHz          790 MHz
Post-layout freq.       786 MHz        640 MHz        500 MHz          500 MHz
Core speed (max. 500)   393 MHz        320 MHz        250 MHz          500 MHz
Cell Area               4461k μm²      7356k μm²      2610k μm²        2610k μm²
Latency on top dimensions
  Dimension 3           -              1              1                1
  Dimension 4           -              1              2                2
  Dimension 5           -              2              -                -
  Dimension 6           -              3              -                -

Page 28

Physical Synthesis – 64 tiles

                        8-ary 2-mesh   2-ary 6-mesh   2-ary 6-mesh
                        (2D Mesh)      (Hypercube)    High-Speed
Switch Arity            6              8              8
Post-synthesis freq.    1 GHz          900 MHz        900 MHz
Post-layout freq.       786 MHz        640 MHz        786 MHz
Core speed (max. 500)   393 MHz        320 MHz        393 MHz
Cell Area               4461k μm²      7356k μm²      22784k μm²
Latency on top dimensions
  Dimension 3           -              1              2
  Dimension 4           -              1              2
  Dimension 5           -              2              2
  Dimension 6           -              3              3

• Aggressive link pipelining
• 200% area overhead for 20% improvement in performance
• Not usable
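The "200% for 20%" trade-off can be checked against the figures reported on this slide for the two 2-ary 6-mesh variants. The sketch below only does the percentage arithmetic; the helper name `percent_change` is hypothetical, and it assumes the quoted performance improvement refers to the post-layout frequency gain, which the slide does not state explicitly.

```python
def percent_change(new, base):
    """Relative change of `new` with respect to `base`, in percent."""
    return (new - base) / base * 100.0


# Cell area in k um^2 and post-layout frequency in MHz, as reported for the
# baseline hypercube and its high-speed (aggressively pipelined) variant.
area_overhead = percent_change(22784, 7356)
freq_gain = percent_change(786, 640)

print(f"area overhead: {area_overhead:.1f}%")
print(f"frequency gain: {freq_gain:.1f}%")
```

The exact values come out slightly above the rounded figures on the slide, which does not change the conclusion that the area cost is out of proportion to the gain.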

Page 29

• Backend synthesis flow
• Communication semantics
• Topologies under test
• Physical synthesis
• Layout-aware topology performance
• Conclusions

Page 30

Workload distribution

• Producer, worker and consumer tasks
• I/O devices dedicated to input OR output data

[Figure label: External I/O]

Page 31

Topology performance

• 1 input and 1 output port to the external memory are assumed for 16 tile systems
• 4 input and 4 output ports to the external memory are assumed for 64 tile systems
• I/O ports are accessed through sidewall tiles
  – The mapping of producer(s) and consumer(s) tasks is therefore constrained to these tiles

Page 32

Topology performance

• Several I/O mapping strategies were considered
• For the sake of space, only the most significant one is shown here
  – OneSided: all the I/O tiles are placed on the same side of the chip.

Page 33

Topology performance - 16 tiles

• 2-ary 4-mesh reduces total number of cycles by 27.4%
• 2-ary 2-mesh reduces cycles only by 1.6% over the hypercube
  – Chip I/O becomes the bottleneck
• Real operating frequency of each topology changes conclusions
  – Physical degradation is too severe to be compensated
• 2-ary 2-mesh shows superior energy saving properties
  – 50% over the 2D mesh

Page 34

Topology performance - 64 tiles

• 2D mesh outperforms the non-reduced hypercubes
• Systems under test are I/O constrained
  – Computation tiles spend around 50% of their time waiting to send data to the consumer tile
  – Upper bound to topology-related performance optimization
• Improvement in terms of execution cycles
  – Performance improvements in cycles are not enough to offset the lower operating speed

Removal of the I/O bottleneck has to be considered mandatory to achieve performance differentiation between topologies

Page 35

Topology performance - 64 tiles

• Network and tiles work at the same frequency
  – Maximum frequency for all tiles: I/O tiles and processing tiles.
• Very similar performance
  – Reduced number of cycles
  – Low network frequency
• Reduced hardware resources
  – 4 times fewer switches, half the number of ports, and works at half the frequency

Page 36

• Backend synthesis flow
• Communication semantics
• Topologies under test
• Physical synthesis
• Layout-aware topology performance
• Conclusions

Page 37

Conclusions

• Bottom-up approach to assess k-ary n-mesh topologies
• A number of real-life issues are considered:
  – Physical constraints of nanoscale technologies
  – Impact of I/O interface
  – Communication semantics of the middleware
• The intricate wiring of multi-dimension topologies or the long links required by concentrated k-ary n-meshes can be turned into 2 different kinds of performance overhead by means of proper design techniques:

Page 38

Conclusions

• Operating frequency reduction: in spite of a lower number of execution cycles, multi-dimension topologies lose in terms of RET due to their lower working frequency.
  – Concentrated topologies provide a way to trade performance for power/area
• Increase of link latency: the use of retiming stages sustains the operating frequency while increasing the network latency.
  – Area and power overhead have to be taken into account
  – Link pipelining cannot deliver a frequency higher than the one allowed by the switch radix itself
• For 64 tile systems we found that, in general, the 2D mesh outperforms the hypercubes.
  – In spite of a better execution cycle count, the real elapsed time is worse because of a lower operating frequency
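The interplay between cycle count and operating frequency can be made concrete with a short sketch. It assumes RET stands for the real elapsed time mentioned on the same slide and that it equals cycles divided by frequency; the function names and the cycle counts are hypothetical, while 786 MHz and 500 MHz are the post-layout frequencies reported for the 64-tile 2D mesh and the concentrated 2-ary 4-mesh.

```python
def elapsed_time_us(cycles, freq_mhz):
    """Elapsed time in microseconds for a given cycle count and clock."""
    return cycles / freq_mhz


def break_even_cycle_reduction(base_mhz, new_mhz):
    """Fraction of cycles a slower topology must save to match the baseline."""
    return 1.0 - new_mhz / base_mhz


# Hypothetical cycle counts: the slower topology saves 20% of the cycles.
baseline = elapsed_time_us(1_000_000, 786)
slower_clock = elapsed_time_us(800_000, 500)

print(f"baseline: {baseline:.1f} us, slower clock: {slower_clock:.1f} us")
print(f"break-even saving: {break_even_cycle_reduction(786, 500):.1%}")
```

In this illustrative case a 20% cycle saving is not enough: at 500 MHz against 786 MHz, more than a third of the cycles would have to be saved before elapsed time improves.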

Page 39

Conclusions

• Unexpected results for the reduced 2-ary 4-mesh:
  – Expected: low cost, low performance solution
  – Results: low cost with similar performance as the 2D mesh
  – The increase in core speed reduces the impact of:
      I/O tile congestion
      Processing tiles
• Possible solution to hypercube physical degradation issues:
  – Decouple network speed from core speed (GALS)
• Other solutions:
  – High performance, high radix switches
