22.04.2015
Centralized and software-based run-time traffic management inside configurable regions of interest in mesh-based Networks-on-Chip
Philipp Gorski, Tim Wegner, Dirk Timmermann
Institute of Applied Microelectronics and Computer Engineering
University of Rostock
Outline
Fundamentals and trends
Chip-Multi-Processor
Network-on-Chip
Quadrant-based mesh (QMesh)
Overview
Traffic management
Experiments
Setup & flow
Results
Conclusion
Fundamentals and trends - CMPs
Modern Chip-Multi-Processors (CMPs)
Modular design
High degree of parallelization (core/thread level)
Challenge: efficient on-chip communication despite rising core count
[Figure: on-chip components of a CMP: computation IP cores (GPP, GPU, DSP, …), local memory (L1/L2 cache, regs, FIFO, …), global memory (L3/L4 cache, eDRAM, …), communication infrastructure (P2P, busses, on-chip networks), system I/O, and external memory]
Fundamentals and trends - CMPs
Key trends and issues
Starting point: on-chip communication infrastructure (here: Networks-on-Chip)
Architecture
IP core count increases: heterogeneous, on-chip memory
On-chip communication dominant
Workload
Application diversity: multiple domains, latency: BW, comp., virtualization
Interferences: memory access, communication
Technology scaling
Utilization wall: power/temperature, bandwidth (BW), dark silicon
PVT variation increases: reliability
[Figure: mesh tile coordinates (0,0) to (3,3) with a source (xSRC, ySRC) and a destination (xDST, yDST)]
Fundamentals and trends - NoCs
Networks-on-Chip (NoCs): approach for scalable on-chip communication
Packet-based communication, GALS principle
Topology: interconnection between components
Path: E2E packet route through NoC determined by routing algorithm
[Figure legend: router (R), link, network interface (NI); tile: voltage & frequency island]
Dimension-ordered XY/YX routing
Minimal path length
Deterministic
Dead-/livelock-free
Minimal HW effort
Non-adaptive
4x4 2D-mesh
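The dimension-ordered XY routing listed above can be sketched in a few lines. This is a minimal illustration, not the simulator's implementation; the function name `xy_route` and the (x, y) tuple representation of tiles are assumptions made for this example.

```python
def xy_route(src, dst):
    """Return the tiles visited by XY routing from src to dst.

    Tiles are (x, y) tuples. The packet first travels along the X
    dimension until the column matches, then along the Y dimension.
    """
    x, y = src
    x_dst, y_dst = dst
    path = [(x, y)]
    # Phase 1: correct the X coordinate.
    while x != x_dst:
        x += 1 if x_dst > x else -1
        path.append((x, y))
    # Phase 2: correct the Y coordinate.
    while y != y_dst:
        y += 1 if y_dst > y else -1
        path.append((x, y))
    return path


if __name__ == "__main__":
    route = xy_route((0, 0), (2, 1))
    print(route)           # tiles visited, source and destination included
    print(len(route) - 1)  # number of link hops (Manhattan distance)
```

The route depends only on source and destination, which is why the scheme is deterministic and non-adaptive; its length always equals the Manhattan distance, so it is minimal.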
QMesh - overview
Idea: improve IP core connectivity
Increase number of NIs per IP core (Q0 – Q3)
Connect each core to all surrounding routers
QMesh: quadrant-based mesh + XY routing
Dual-path routing
Spatially independent paths
[Figure: IP core with four NIs (Q0 – Q3), each attached to one of the surrounding routers (R); router ports North, South, East, West]
QMesh - overview
DST at same row (L, R) or column (U, D)
DST in quadrant (Q0 – Q3)
[Figure: example SRC/DST placements in a grid of IP cores, showing the directions L, R, U, D and the quadrants Q0 – Q3 around SRC]
L, R, U, D: path length reduced by 1 hop (compared to 2D-mesh)
Q0 – Q3: path length reduced by up to 2 hops (worst case: same as 2D-mesh)
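The hop reductions stated above can be reproduced with a small geometric sketch. It rests on an assumed layout that the slides only imply: the IP core at (x, y) is attached to the four routers at its corners, (x, y), (x+1, y), (x, y+1) and (x+1, y+1), and a packet may enter and leave the network at the most favorable of these routers. All function names are hypothetical. The sketch covers only the best-case NI choice, not the worst case mentioned above.

```python
from itertools import product


def mesh_hops(src, dst):
    """Link hops between two points of a 2D-mesh (Manhattan distance)."""
    return abs(dst[0] - src[0]) + abs(dst[1] - src[1])


def attached_routers(core):
    """Assumed QMesh layout: a core is attached to its four corner routers."""
    x, y = core
    return [(x + dx, y + dy) for dx, dy in product((0, 1), repeat=2)]


def qmesh_hops(src, dst):
    """Best-case link hops when any NI pair (Qout at SRC, Qin at DST) may be used."""
    return min(
        mesh_hops(a, b)
        for a in attached_routers(src)
        for b in attached_routers(dst)
    )


if __name__ == "__main__":
    same_row = ((2, 2), (5, 2))
    quadrant = ((2, 2), (5, 4))
    for src, dst in (same_row, quadrant):
        print(src, dst, mesh_hops(src, dst), qmesh_hops(src, dst))
```

Under these assumptions a destination in the same row or column saves one hop and a destination inside a quadrant saves two, matching the figures quoted on the slide.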
QMesh - traffic management
Periodic algorithm for dynamic adaptation of routing paths (traffic (re-)balancing)
Performed in entire NoC or Region Of Interest (ROI)
Centralized calculation on Master Tile (MT)
Sensing: local HW counters measure activity (routers, links, NIs)
Evaluation & update: SW on the MT chooses the path with the smallest workload
[Figure: 5x5 grid of IP cores with a Master Tile (MT) and a Region Of Interest (ROI); periodic loop: activity sensing, traffic evaluation, path update]
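The evaluation step ("choose path with smallest workload") might look roughly like the sketch below. The slides do not specify the workload metric or the data layout, so everything here is an assumption: the workload of a path is taken as the sum of the activity counters of the routers it crosses, and `candidates` maps each (SRC, DST) pair to its two alternative paths. All names are hypothetical.

```python
def path_workload(path, activity):
    """Assumed metric: sum of the activity counters along a path."""
    return sum(activity.get(router, 0) for router in path)


def select_paths(candidates, activity):
    """For each (src, dst) pair, return the index of the least-loaded path.

    candidates: {(src, dst): [path_0, path_1]}, each path a list of router ids
    activity:   {router id: counter value collected in the last period}
    """
    selection = {}
    for pair, paths in candidates.items():
        workloads = [path_workload(p, activity) for p in paths]
        # Ties resolve to the lower index, which keeps the choice deterministic.
        selection[pair] = workloads.index(min(workloads))
    return selection


if __name__ == "__main__":
    # Hypothetical counter values sensed inside the ROI.
    activity = {"r1": 10, "r2": 3, "r3": 7, "r4": 1}
    candidates = {("A", "B"): [["r1", "r2"], ["r3", "r4"]]}
    print(select_paths(candidates, activity))
```

The resulting index per pair is what the MT would write into the path tables during the update step; other metrics (for example the maximum counter on a path) could be substituted in `path_workload` without changing the rest.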
Simulation flow
Sniper: IP core simulation → SW timing
SystemC simulator: NoC activity statistics
DSENT/HOTSPOT: provision of power/temperature calculation profiles
Post-processing: activity → power → temperature → wear-out
[Figure: simulation flow: Sniper (software timing) feeds the NoC simulator, which produces activity statistics for post-processing; DSENT and HOTSPOT take the NoC setup and provide the calculation functions power = f(activity) and temperature = f(power)]
Experimental setup
Tools:
NoC simulator: SystemC-based, cycle-accurate
Sniper, DSENT, HOTSPOT (third-party)
Parameters
2D-mesh (XY routing) and QMesh (XY routing + TM + PA)
8x8 NoC (1 GHz frequency), 9-flit FIFO depth (router), 64-bit link width
Synthetic traffic patterns
Single-threaded applications (bit complement/reverse, transpose, shuffle)
Multi-threaded applications (nearest neighbor, hotspot, rentian)
Evaluated parameters
Packet delay (∆DELAY) vs. power overhead (∆POWER)
Reliability: wear-out acceleration factor $a_{MTTF}$
Network saturation margin
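For reference, the bit-permutation patterns named in the setup (bit complement, bit reverse, transpose, shuffle) are usually defined on the binary node address. The sketch below uses those common textbook definitions; the slides do not state which exact variants were simulated, and the node numbering `id = y * 8 + x` for the 8x8 NoC is an assumption.

```python
BITS = 6                 # 8x8 NoC -> 64 nodes -> 6 address bits
MASK = (1 << BITS) - 1


def bit_complement(src):
    """Destination is the bitwise inverse of the source address."""
    return ~src & MASK


def bit_reverse(src):
    """Destination is the source address with its bit order reversed."""
    return int(format(src, f"0{BITS}b")[::-1], 2)


def transpose(src):
    """Swap the upper and lower address halves, i.e. (x, y) -> (y, x)."""
    half = BITS // 2
    return ((src << half) | (src >> half)) & MASK


def shuffle(src):
    """Rotate the source address left by one bit."""
    return ((src << 1) | (src >> (BITS - 1))) & MASK


if __name__ == "__main__":
    for pattern in (bit_complement, bit_reverse, transpose, shuffle):
        print(pattern.__name__, [pattern(s) for s in (0, 1, 9, 63)])
```

Each function is a permutation of the 64 node addresses, so every node sends to exactly one destination and receives from exactly one source.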
Experimental results – power and delay
Increased power consumption but reduced packet delay
Locality/fewer hops & dynamic path adaptations
Reduced & balanced traffic
Lower & evenly distributed activity
Reduced Pdyn & thermal hot spots
∆POWER lower than expected (~100%)
Experimental results - reliability
General wear-out decrease through QMesh
Increase of mean router lifetime: 10% for low PIR, 60% for high PIR (avg. ~ 35%)
$$a_{MTTF} = \frac{MTTF_{QMesh}}{MTTF_{2D\text{-}mesh}}$$
Wear-out increase: $a_{MTTF} < 1$
Wear-out decrease: $a_{MTTF} > 1$
Conclusions
Modern CMPs require efficient architecture for on-chip communication
NoCs provide appropriate infrastructure
QMesh topology: integration of multiple NIs per IP core to improve connectivity
Preservation of basic NoC structure and associated benefits
Improvements over standard 2D-mesh
Increase of network saturation margin
Reduction of avg. packet delay
Reliability: increased router lifetime due to lower max. temperatures
Higher robustness due to dual-path routing (spatially independent)
Tolerable costs
Traffic monitoring (HW) and path adaptations (SW) for QMesh at runtime
Dynamic traffic (re-)balancing & hotspot reduction
Thank you for your attention!
Questions?
QMesh - overview
QMesh characteristics
Preservation of basic 2D-mesh structure
Dual-path routing
Spatially independent paths
Required modifications / additional HW costs
8-ported router
4 NIs per IP core
1 programmable Path Table (PT) per IP core
4 bit addressing extension (for Qin and Qout)
Advantages
Costs comparable to 2D-mesh with XY/YX routing
Reduced average path length
Mitigation of traffic interferences
Increased traffic locality
Benefits of 2D-mesh maintained (e.g. deterministic routing)
College of Computer Science and Electrical Engineering
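[Figure: QMesh tile: processing element (IP core) with CX, path table (PT) and four NIs (NI Q0 – NI Q3), each attached to one of the four surrounding routers (R); directions Left, Right, Up, Down. 4-bit address extension consisting of Qin and Qout, with quadrant encoding Q0 = 11, Q1 = 10, Q2 = 00, Q3 = 01; example value 1010]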
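The 4-bit addressing extension can be illustrated with a small encode/decode sketch. The 2-bit quadrant codes (Q0 = 11, Q1 = 10, Q2 = 00, Q3 = 01) are taken from the figure; the field order (Qin in the upper two bits, Qout in the lower two) and all function names are assumptions made for this example.

```python
# Quadrant codes as given in the QMesh tile figure.
Q_CODE = {"Q0": 0b11, "Q1": 0b10, "Q2": 0b00, "Q3": 0b01}
Q_NAME = {code: name for name, code in Q_CODE.items()}


def pack_extension(q_in, q_out):
    """Build the 4-bit extension (assumed layout: Qin high bits, Qout low bits)."""
    return (Q_CODE[q_in] << 2) | Q_CODE[q_out]


def unpack_extension(ext):
    """Split a 4-bit extension back into (Qin, Qout) quadrant names."""
    return Q_NAME[(ext >> 2) & 0b11], Q_NAME[ext & 0b11]


if __name__ == "__main__":
    ext = pack_extension("Q3", "Q0")
    print(format(ext, "04b"), unpack_extension(ext))
```

With this encoding the value 1010 shown in the figure decodes to Q1 for both fields, independent of the assumed field order.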
Traffic Management – Evaluation and Update
SNoC: transmission of monitoring data
Evaluation done by master tile (MT)
Basically: choose the path with the smallest workload (balancing)
SNoC: transmission of path table update data
Experimental results - reliability
Evaluation via acceleration factor of Mean-Time-To-Failure (MTTF): $a_{MTTF}$
Wear-out increase: $a_{MTTF} < 1$
Wear-out decrease: $a_{MTTF} > 1$
$$a_{MTTF} = \frac{MTTF_{QMesh}}{MTTF_{2D\text{-}mesh}} = e^{\frac{E_a}{k} \cdot \left( \frac{1}{T_{QMesh}} - \frac{1}{T_{2D\text{-}mesh}} \right)}$$
$t_{QMesh}$, $t_{2D\text{-}mesh}$: MTTF of QMesh/2D-mesh (written $MTTF_{QMesh}$, $MTTF_{2D\text{-}mesh}$ in the formula)
$T_{QMesh}$, $T_{2D\text{-}mesh}$: avg. router temperature for QMesh/2D-mesh
$k$: Boltzmann's constant ($8.6 \times 10^{-5}$ eV/K)
$E_a$: activation energy of the CMOS devices (here: 0.7 eV at 45nm CMOS)
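The acceleration factor can be evaluated numerically as sketched below. The constants k and E_a are the values given on this slide; the function name, the use of Kelvin for the temperatures, and the example temperatures are assumptions chosen purely for illustration, not measured results.

```python
import math

K_BOLTZMANN_EV = 8.6e-5   # Boltzmann's constant in eV/K (value from the slide)
E_A_EV = 0.7              # activation energy in eV (45nm CMOS, from the slide)


def a_mttf(t_qmesh_k, t_mesh_k, e_a=E_A_EV, k=K_BOLTZMANN_EV):
    """MTTF acceleration factor of QMesh relative to the 2D-mesh.

    t_qmesh_k, t_mesh_k: average router temperatures in Kelvin.
    Values > 1 mean slower wear-out (longer lifetime) for QMesh.
    """
    return math.exp((e_a / k) * (1.0 / t_qmesh_k - 1.0 / t_mesh_k))


if __name__ == "__main__":
    # Hypothetical temperatures: QMesh routers run 3 K cooler.
    print(a_mttf(347.0, 350.0))
```

Because the temperatures enter through their reciprocals in the exponent, a cooler QMesh router always yields a factor above 1, and equal temperatures give exactly 1.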
Experimental results – network saturation
∆SAT: relative improvement of network saturation margin
Due to hop reduction and dual-path options