Date post: | 17-Feb-2017 |
EMERGING TECHNOLOGIES IN ON-CHIP AND OFF-CHIP INTERCONNECTION NETWORKS
Md Ashif Iqbal Sikder
School of Electrical Engineering and Computer Science
Ohio University, Athens, OH 45701
E-mail: [email protected]
Overview
• Motivation and Background
• Heterogeneous Network-on-Chip
• On-Chip and Off-Chip Interconnection Network
• Performance Evaluation and Results
• Conclusions and Future Work
Introduction: Network-on-Chip (NoC)
Pasricha, Sudeep, and Nikil Dutt. On-chip communication architectures: system on chip interconnect. Morgan Kaufmann, 2010.
[Figure: evolution of on-chip communication architectures over time (1990 to 2010): Custom, Shared Bus, Hierarchical Bus, Bus Matrix, NoC]
Multi-Cores & Network-on-Chips
• As the number of cores increases, the communication-centric design paradigm (Network-on-Chip) faces challenges due to:
  • Higher energy dissipation: long metallic wires
  • Area overhead: more router components
  • Increased latency: complex multi-hop routing
[Figure: TILE-Mx100 [1], Kalray MPPA-256 [2], Nvidia GF100 512-core [3]]
1. http://www.tilera.com/products/?ezchip=585&spage=686
2. http://www.kalrayinc.com/kalray/products/
3. http://www.nvidia.com/object/IO_86775.html
Latency in Multi-Core Processor
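[Figure: off-chip memory access in a NoC-based multicore. Each node has a router, a core with an L1 cache, and an L2 bank; memory controllers MC0 to MC3 connect to DRAM banks (Bank 0, Bank 1, Bank 2). The request message travels (1) L1 to L2 and (2) L2 to memory; (3) the memory is accessed; the response message travels (4) memory to L2 and (5) L2 to L1.]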
Sharifi, A.; Kultursay, E.; Kandemir, M.; Das, C.R., "Addressing End-to-End Memory Access Latency in NoC-Based Multicores," Microarchitecture (MICRO), 2012 45th Annual IEEE/ACM International Symposium on , vol., no., pp.294,304, 1-5 Dec. 2012.
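The five numbered stages in the figure (L1 to L2, L2 to memory, memory access, memory to L2, L2 to L1) add up to the end-to-end latency of one off-chip access. A minimal sketch of that sum follows; the class name and all cycle counts are hypothetical and chosen only for illustration, not taken from the cited paper.

```python
from dataclasses import dataclass


@dataclass
class MemoryAccessLatency:
    """Cycle counts for the five stages of one off-chip access (hypothetical values)."""
    l1_to_l2: int    # 1. request travels from L1 to the L2 bank over the NoC
    l2_to_mem: int   # 2. request travels from the L2 bank to the memory controller
    mem: int         # 3. queuing and DRAM access at the memory controller
    mem_to_l2: int   # 4. response travels from the memory controller to the L2 bank
    l2_to_l1: int    # 5. response travels from the L2 bank back to L1

    def on_chip(self) -> int:
        """Cycles spent traversing the on-chip network (stages 1, 2, 4, 5)."""
        return self.l1_to_l2 + self.l2_to_mem + self.mem_to_l2 + self.l2_to_l1

    def total(self) -> int:
        """End-to-end latency: on-chip traversal plus the memory access itself."""
        return self.on_chip() + self.mem


# Illustrative numbers only.
example = MemoryAccessLatency(l1_to_l2=12, l2_to_mem=20, mem=150, mem_to_l2=20, l2_to_l1=12)
print(example.on_chip(), example.total())  # 64 214
```

Separating the on-chip share from the memory share makes it easy to see how much of the access a faster interconnect could remove.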
[Figure labels: Off-Chip Memory Module (DRAM); off-chip memory access]
Energy in Multi-Core Processor
=> Potential solution: Emerging technologies such as optics and wireless
S. Borkar, “Exascale computing-a fact or affliction?” Keynote presentation at IPDPS, 2013.
[Figure: interconnect energy breakdown (router, memory controller, link; dynamic and static). Annotated shares: 36% and 40%.]
=> Data movement energy will start to dominate
Optical Network-on-Chip
Optical NoC offers several advantages:
• Low energy (~7.9 fJ/bit)
• Low latency (~500 ps)
• High bandwidth (~40 Gbps)
• CMOS compatibility
Disadvantages of optical NoC:
• Optical-only crossbar is not scalable for large core networks
• Multi-hop networks with smaller crossbar have increased latency for large core networks
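To put the quoted optical figures (~7.9 fJ/bit, ~500 ps, ~40 Gbps) in perspective, the sketch below estimates the time and energy to move one packet over a single optical channel. It is a back-of-the-envelope model under stated assumptions: the 512-bit packet size is hypothetical, the three constants are treated as per-channel values, and the function name `optical_transfer` is introduced here for illustration.

```python
from typing import Tuple

ENERGY_PER_BIT_J = 7.9e-15   # ~7.9 fJ/bit, from the slide
LINK_LATENCY_S = 500e-12     # ~500 ps, from the slide
BANDWIDTH_BPS = 40e9         # ~40 Gbps, from the slide


def optical_transfer(packet_bits: int) -> Tuple[float, float]:
    """Return (time in seconds, energy in joules) for one packet on one channel.

    Assumes time = fixed link latency + serialization delay, and that energy
    scales linearly with the number of bits sent.
    """
    serialization_s = packet_bits / BANDWIDTH_BPS
    return LINK_LATENCY_S + serialization_s, packet_bits * ENERGY_PER_BIT_J


# Hypothetical 512-bit packet.
time_s, energy_j = optical_transfer(512)
print(f"{time_s * 1e9:.1f} ns, {energy_j * 1e12:.2f} pJ")  # 13.3 ns, 4.04 pJ
```

In this simple model the serialization delay (12.8 ns) dominates the link latency, which is why multiple wavelengths or wider channels matter for large packets.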
1. L. Xu, W. Zhang, Q. Li, J. Chan, H. L. R. Lira, M. Lipson, and K. Bergman, "40-Gb/s DPSK Data Transmission Through a Silicon Microring Switch," IEEE Photonics Technology Letters, vol. 24, no. 6, pp. 473-475, March 15, 2012.
2. S. Manipatruni, K. Preston, L. Chen, and M. Lipson, "Ultra-low voltage, ultra-small mode volume silicon microring modulator," Opt. Express 18, 18235-18242 (2010).
3. J. Cunningham, R. Ho, X. Zheng, J. Lexau, H. Thacker, J. Yao, Y. Luo, G. Li, I. Shubin, F. Liu et al., "Overview of short-reach optical interconnects: from VCSELs to silicon nanophotonics."
4. F. Xia, L. Sekaric, and Y. Vlasov, "Ultracompact optical buffers on a silicon chip," Nature Photonics 1.1 (2007): 65-71.
Wireless Interconnection Network
• Advantages of Wireless Technology:
  • CMOS compatibility
  • Omnidirectional communication without wires using multicasting and broadcasting
  • Bandwidth extension using Frequency Division Multiplexing (FDM), Time Division Multiplexing (TDM), and Space Division Multiplexing (SDM)
• Disadvantages of Wireless Technology:
  • High transceiver area and energy/bit
  • Low wireless bandwidth at a 60 GHz center frequency for CMOS technology
  • Latency due to resource sharing
[Figure: RF-CMOS transceiver trend for WiNoC [1]; illustrations of multicasting and broadcasting]
1. D. DiTomaso, A. Kodi, D. Matolak, S. Kaya, S. Laha, and W. Rayess, "Energy-efficient adaptive wireless NoCs architecture," in Networks on Chip (NoCS), 2013 Seventh IEEE/ACM International Symposium on. IEEE, 2013, pp. 1-8.
Optical & Wireless NoC: OWN
• OWN combines the benefits of photonics and wireless to overcome the disadvantages of each technology
• Smaller optical crossbar to provide one-hop communication and reduce area and energy overhead
• Connect the optical domains via wireless to facilitate one-hop communication between the domains
[Figure labels: optical domain, optical waveguide, wireless antenna]
A Tile consists of 4 Cores => 16 Tiles form a Cluster => 4 Clusters create a Group => 4 Groups are on the Chip
Tile: 4 cores; Cluster: 64 cores; Group: 256 cores; Chip: 1024 cores
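The tile, cluster, group, and chip hierarchy can be read as a mixed-radix address for each core. The sketch below decomposes a core index into its position at each level; the linear numbering of cores (consecutive cores share a tile, consecutive tiles share a cluster, and so on) and the function name `locate` are assumptions made for illustration, since the slides do not specify a numbering scheme.

```python
CORES_PER_TILE = 4
TILES_PER_CLUSTER = 16
CLUSTERS_PER_GROUP = 4
GROUPS_PER_CHIP = 4

TOTAL_CORES = CORES_PER_TILE * TILES_PER_CLUSTER * CLUSTERS_PER_GROUP * GROUPS_PER_CHIP


def locate(core_id: int):
    """Return (group, cluster, tile, core) for a core index in [0, TOTAL_CORES).

    Assumes cores are numbered linearly, innermost level first.
    """
    if not 0 <= core_id < TOTAL_CORES:
        raise ValueError(f"core_id must be in [0, {TOTAL_CORES})")
    tile_index, core = divmod(core_id, CORES_PER_TILE)
    cluster_index, tile = divmod(tile_index, TILES_PER_CLUSTER)
    group, cluster = divmod(cluster_index, CLUSTERS_PER_GROUP)
    return group, cluster, tile, core


print(TOTAL_CORES)    # 1024
print(locate(64))     # (0, 1, 0, 0): first core of the second cluster
print(locate(1023))   # (3, 3, 15, 3): last core on the chip
```

Comparing the group and cluster fields of a source and a destination indicates whether a packet stays inside one optical domain or needs a wireless hop.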
OWN Architecture & Communication (1/4)
OWN Architecture & Communication (2/4)
Cluster & Intra-cluster optical communication
[Figure labels: inactive modulators, active modulators, demodulators, waveguide, arbitration waveguide]
OWN Architecture & Communication (3/4)
Group & Intra-group wireless communication
[Figure labels: Tx, Rx, optical router, wireless router, wait for token]
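The "wait for token" step suggests that routers sharing a wireless channel take turns by passing a token. The slides do not describe the mechanism in detail, so the following is only a generic token-passing sketch: the class `TokenChannel`, the round-robin order, and the router names are all hypothetical.

```python
from collections import deque


class TokenChannel:
    """A shared channel on which only the current token holder may transmit."""

    def __init__(self, router_ids):
        # ring[0] is the router that currently holds the token (assumed round-robin order).
        self.ring = deque(router_ids)

    def holder(self):
        return self.ring[0]

    def pass_token(self):
        """Hand the token to the next router in the ring."""
        self.ring.rotate(-1)

    def try_send(self, router_id, packet, delivered):
        """Transmit if router_id holds the token; otherwise the router keeps waiting."""
        if router_id != self.holder():
            return False
        delivered.append((router_id, packet))
        self.pass_token()
        return True


channel = TokenChannel(["A", "B", "C", "D"])
delivered = []
first_try = channel.try_send("B", "pkt0", delivered)    # B must wait: A holds the token
second_try = channel.try_send("A", "pkt1", delivered)   # A sends and passes the token to B
third_try = channel.try_send("B", "pkt0", delivered)    # now B can send
print(first_try, second_try, third_try, channel.holder())
```

In a fuller model an idle token holder would also call `pass_token()` so that waiting routers are not blocked behind a router with nothing to send.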
OWN Architecture & Communication (4/4)
Chip & Inter-group wireless communication
[Figure: four groups arranged as G0, G1 (top row) and G3, G2 (bottom row), with wireless routers labeled A through P; senders wait for a token before transmitting.]
Network diameter: 3 hops
OWN Deadlocks & Solution
[Figure: circular dependency between Group 2 and Group 3 involving wireless routers C, D, G, H, J, L, N, and P. Legend: wireless router, optical link, intra-group wireless link, inter-group wireless link; cores and clusters are shown at the input and output.]
Solution => VC allocation based on packet type, which requires 4 VCs per port:
• VC0 = Intra-cluster & Intra-group
• VC1 = Inter-group horizontal
• VC2 = Inter-group vertical
• VC3 = Inter-group diagonal
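The virtual channel (VC) assignment by packet type (VC0 for intra-cluster and intra-group, VC1 horizontal, VC2 vertical, VC3 diagonal) can be sketched as a lookup on the source and destination groups. The grid positions below assume the 2x2 layout drawn in the chip figure (G0 and G1 on the top row, G3 and G2 on the bottom row); the function name `select_vc` and the `GROUP_POS` table are hypothetical.

```python
# (row, column) of each group, assuming the layout  G0 G1 / G3 G2.
GROUP_POS = {0: (0, 0), 1: (0, 1), 3: (1, 0), 2: (1, 1)}


def select_vc(src_group: int, dst_group: int) -> int:
    """Pick the virtual channel for a packet from its source and destination groups."""
    if src_group == dst_group:
        return 0  # VC0: intra-cluster and intra-group traffic
    (src_row, src_col) = GROUP_POS[src_group]
    (dst_row, dst_col) = GROUP_POS[dst_group]
    if src_row == dst_row:
        return 1  # VC1: inter-group horizontal
    if src_col == dst_col:
        return 2  # VC2: inter-group vertical
    return 3      # VC3: inter-group diagonal


print(select_vc(0, 0), select_vc(0, 1), select_vc(0, 3), select_vc(0, 2))  # 0 1 2 3
```

Because each packet type is confined to its own VC, packets of different types never wait on each other's buffers, which removes the circular wait between them.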
Extending OWN to R-OWN
• Limitations of OWN:
  • Low wireless link utilization
  • Low saturation throughput
• Overcoming OWN's Limitations:
  • Reconfigurable wireless links
  • Proper utilization of wireless technology
  • High saturation throughput
=> Extend OWN to Reconfigurable Optical and Wireless Network-on-Chip (R-OWN)
R-OWN Architecture & Communication
[Figure labels: wireless channels H0 to H3, V0 to V3, and D0 to D3; LA 0 to LA 3; adaptive wireless antennas; fixed wireless antennas]
Off-Chip Memory Access Limitations
• Increased Energy and Latency: Multiple (on-chip) hops to access off-chip memory
• Hot-Spot: Corner routers are connected to the memory controller (MC)
• Connection Inflexibility: How to connect distant routers to the MC
• Long Trace Length: Increased off-chip memory access latency and energy cost
[Figure: memory controllers MC0 to MC3 connected to off-chip DRAM chips; the trace length between a memory controller and its DRAM chips is marked.]
=> Potential solution: Emerging technologies such as wireless
M/W-M/W-X-X Network: 16 Core
[Figure: 4x4 grid of routers R0 to R15 with memory controllers MC0 to MC3, each connected to an off-chip DRAM.]
Naming convention: (On-chip)-(Off-chip)-(Antenna Type)-(BW)
• M = Metallic
• W = Wireless
• O = Omnidirectional Antenna
• D = Directional Antenna
• C = Conservative (128 Gbps)
• A = Aggressive (512 Gbps)
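The (On-chip)-(Off-chip)-(Antenna Type)-(BW) naming convention can be decoded mechanically. The sketch below parses names such as W-W-D-A into their fields. Treating X as "not applicable" is an assumption (the legend does not define X), and the names `NetworkConfig` and `parse_config` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

LINK = {"M": "metallic", "W": "wireless"}
ANTENNA = {"O": "omnidirectional", "D": "directional", "X": None}   # X assumed to mean "not applicable"
BANDWIDTH_GBPS = {"C": 128, "A": 512, "X": None}                    # conservative / aggressive


@dataclass(frozen=True)
class NetworkConfig:
    on_chip: str
    off_chip: str
    antenna: Optional[str]
    bandwidth_gbps: Optional[int]


def parse_config(name: str) -> NetworkConfig:
    """Decode a name of the form (On-chip)-(Off-chip)-(Antenna Type)-(BW)."""
    on_chip, off_chip, antenna, bandwidth = name.split("-")
    return NetworkConfig(LINK[on_chip], LINK[off_chip], ANTENNA[antenna], BANDWIDTH_GBPS[bandwidth])


print(parse_config("W-W-D-A"))
print(parse_config("M-M-X-X"))
```

Here W-W-D-A decodes to wireless on-chip and off-chip links with directional antennas at the aggressive 512 Gbps setting, while M-M-X-X is the all-metallic baseline.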
– Architectures: OWN, R-OWN, CMesh (wired only), WCube (hybrid-wireless), ATAC (hybrid-optical), and On & Off Chip Networks
– Number of Cores: 1024 (OWN), 256 (R-OWN), 16 (On & Off Chip)
– Network Simulation: OptiSim [1] (OWN & R-OWN), Multi2Sim [2] (On & Off Chip)
– Synthetic Benchmarks: Uniform (UN), Bit-Reversal (BR), Complement (COMP), Matrix Transpose (MT), Perfect Shuffle (PS), and Neighbor (NBR)
– Real Benchmarks: PARSEC 2.1 (Blackscholes)
– Area and Energy Analysis:
  • DSENT [3] to calculate wired link and router area and energy at bulk 45 nm LVT
  • Optical link area and energy (arbitration and data waveguides, micro-ring resonators, laser power)
  • Wireless transceiver area is 0.62 mm² and energy is 1 pJ/bit [4]
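The synthetic benchmarks listed above are usually defined as functions from a source node to a destination node. The sketch below uses the common textbook bit-permutation definitions for a network of 2^bits nodes; the exact variants used in the simulations are not given in the slides, so these definitions (in particular the one-dimensional neighbor pattern) are assumptions.

```python
import random


def complement(src: int, bits: int) -> int:
    """COMP: invert every address bit."""
    return ~src & ((1 << bits) - 1)


def bit_reversal(src: int, bits: int) -> int:
    """BR: reverse the order of the address bits."""
    return int(format(src, f"0{bits}b")[::-1], 2)


def perfect_shuffle(src: int, bits: int) -> int:
    """PS: rotate the address left by one bit."""
    mask = (1 << bits) - 1
    return ((src << 1) | (src >> (bits - 1))) & mask


def matrix_transpose(src: int, bits: int) -> int:
    """MT: swap the upper and lower halves of the address (bits must be even)."""
    if bits % 2:
        raise ValueError("matrix transpose needs an even number of address bits")
    half = bits // 2
    low = src & ((1 << half) - 1)
    return (low << half) | (src >> half)


def neighbor(src: int, nodes: int) -> int:
    """NBR: send to the next node (simplified one-dimensional variant)."""
    return (src + 1) % nodes


def uniform(nodes: int, rng: random.Random) -> int:
    """UN: pick any destination with equal probability."""
    return rng.randrange(nodes)


# 16-node example (4 address bits).
print(complement(0b0011, 4), bit_reversal(0b0001, 4),
      perfect_shuffle(0b1001, 4), matrix_transpose(0b0111, 4), neighbor(15, 16))
# 12 8 3 13 0
```

For the 1024-core and 256-core networks the same functions apply with 10 and 8 address bits respectively.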
1. A. Kodi and A. Louri, "A system simulation methodology of optical interconnects for high-performance computing systems," J. Opt. Netw., vol. 6, no. 12, pp. 1282-1300, 2007.
2. R. Ubal, B. Jang, P. Mistry, D. Schaa, and D. Kaeli, "The Multi2Sim Simulation Framework: A CPU-GPU Model for Heterogeneous Computing," 2012.
3. C. Sun, C.-H. Chen, G. Kurian, L. Wei, J. Miller, A. Agarwal, L.-S. Peh, and V. Stojanovic, "DSENT - a tool connecting emerging photonics with electronics for opto-electronic networks-on-chip modeling," in Networks on Chip (NoCS), 2012 Sixth IEEE/ACM International Symposium on. IEEE, 2012, pp. 201-210.
4. D. DiTomaso, A. Kodi, D. Matolak, S. Kaya, S. Laha, and W. Rayess, "Energy-efficient adaptive wireless NoCs architecture," in Networks on Chip (NoCS), 2013 Seventh IEEE/ACM International Symposium on. IEEE, 2013, pp. 1-8.
Performance Analysis
1024-Core OWN: Area
[Figure: stacked bar chart of area (mm²) for CMESH, WCUBE, OWN, and ATAC, split into router area, photonic link area, wireless link area, and wired link area. Annotated differences: 35.5% and 34.14%.]
OWN requires about 35.5% less area than ATAC
1024-Core OWN: Energy per bit
[Figure: stacked bar chart of energy (pJ/bit) for CMESH, ATAC, OWN, and WCUBE, split into wireless link energy, router energy, wire link energy, and photonic energy. Annotated differences: 40.2% and 23.2%.]
OWN consumes about 40.2% less energy than WCube
1024-Core OWN: Latency
[Figure: latency (number of cycles) versus network load for CMESH, WCUBE, ATAC, and OWN. Annotated difference: 67%.]
OWN lowered the latency by about 67% and 11% relative to WCube and ATAC respectively
1024-Core OWN: Saturation Throughput
[Figure: saturation throughput (flits/cycle/core) for CMESH, WCUBE, ATAC, and OWN under UN, BR, COMP, MT, PS, and NBR traffic, plus the geometric mean (GM).]
OWN outperforms WCube and CMesh on average by about 8% and 28% respectively
256-Core R-OWN: Area
[Figure: stacked bar chart of area (mm²) for OWN, R-OWN, CMesh, and WCube, split into router, wireless link, photonic link, and wired link. Annotated differences: 12.47% and 3.88%.]
R-OWN requires 12.47% and 3.88% less area compared to WCube and CMesh respectively
256-Core R-OWN: Energy per bit
R-OWN consumes about 50.56% and 43.70% less energy compared to WCube and CMesh respectively
[Figure: stacked bar chart of energy (pJ/bit) for OWN, R-OWN, CMesh, and WCube under UN, BR, MT, and PS traffic, plus the geometric mean (GM), split into wired link, wireless link, router, and photonic link.]
256-Core R-OWN: Latency
[Figure: latency (number of cycles) versus network load for CMesh, WCube, OWN, and R-OWN. Annotated difference: 101.08%.]
R-OWN lowers latency relative to WCube: WCube's latency is about 101.08% higher than R-OWN's
256-Core R-OWN: Saturation Throughput
[Figure: saturation throughput (flits/cycle/core) for CMESH, WCUBE, OWN, and R-OWN under UN, BR, MT, PS, and NBR traffic, plus the geometric mean (GM).]
R-OWN outperforms OWN, WCube, and CMesh on average (GM) by about 13.07%, 27.48%, and 31.08% respectively
On & Off Chip: Execution Time
[Figure: bar chart of execution time (y-axis labeled "Second") for W-W-D-A, W-M-O-A, W-M-D-A, W-M-D-C, M-M-X-X, M-W-O-A, and W-W-D-C. Annotated differences: 2.70%, 12.59%, 2.29%, 7.95%, 8.22%, and 10.91%.]
The On & Off Chip hybrid-wireless architecture W-W-D-A requires about 10.91% less execution time compared to the baseline M-M-X-X
On & Off Chip: Energy per Byte
[Figure: stacked bar chart of energy (pJ/Byte) for M-W-O-A, W-W-D-C, W-W-D-A, M-M-X-X, W-M-D-C, W-M-D-A, and W-M-O-A, split into M-On, W-On, M-Off, W-Off, Router, and MC.]
Geometric Mean:
• X-W-X-X: 22.95 pJ/Byte
• X-M-X-X: 110.35 pJ/Byte
• 79.314% Improvement
W-W-D-A requires about 78.76% less energy per byte compared to M-M-X-X
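The improvement figures on this slide are relative reductions with respect to a baseline. A minimal sketch of that calculation follows, using the two geometric means printed above; the function name `percent_reduction` is introduced here for illustration.

```python
def percent_reduction(baseline: float, improved: float) -> float:
    """Relative reduction of `improved` with respect to `baseline`, in percent."""
    return (baseline - improved) / baseline * 100.0


# Geometric means from the slide, in pJ/Byte.
wired_off_chip = 110.35     # X-M-X-X
wireless_off_chip = 22.95   # X-W-X-X

reduction = percent_reduction(wired_off_chip, wireless_off_chip)
print(f"{reduction:.1f}% less energy per byte")
```

With the rounded means printed on the slide this gives about 79.2%, slightly below the 79.314% quoted, which was presumably computed from unrounded values.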
Conclusions and Future Work
• OWN
  – 30.36% less energy/bit than WCube
  – Higher saturation throughput and lower latency compared to wired, wireless, and optical networks
• R-OWN
  – 13.07%, 27.48%, and 31.08% higher saturation throughput than OWN, WCube, and CMesh respectively
• On and off chip hybrid-wireless architecture
  – 10.91% less execution time and 78.76% less energy/byte compared to the baseline architecture
• Future work
  – Explore the design space for the OWN architecture
  – Explore optical interconnects for off-chip memory access
Thank You
Questions?