IBM Research
Optical Technologies for Data Communication in Large Parallel Systems
Mark B. Ritter, Yurii Vlasov, Jeffrey A. Kash, and Alan Benner* IBM T.J. Watson Research Center, *IBM Poughkeepsie [email protected]
Outline
• HPC Performance Scaling and Bandwidth
• Anatomy of a Link
• Electrical and Optical Interconnect Limits
• Promise of Nanophotonic Technology
• Potential Insertion Points
• Summary
Performance Scaling Now Driven by Communication
System performance gains no longer principally from lithography-driven uniprocessor performance
Performance gains now from parallelism exploited at chip, system level
BW requirements must scale with System Performance, ~1B/FLOP (memory & network)
Requires exponential increases in communication bandwidth at all levels of the system
• Inter-rack, backplane, card, chip
[Charts: chip performance over time (Olukotun et al.) and system performance over time]
Bandwidth: the Bane of the Multicore Paradigm
Logic flops continue to scale faster than interconnect BW
• A constant Byte/Flop ratio with N cores means:
Bandwidth(N cores) = N × Bandwidth(single core)
• 3Di (3D integration) will only exacerbate bottlenecks
Assumptions: 3 GHz clock, ~3 IPC, 10 Gb/s I/O, 1 B/Flop memory, 0.1 B/Flop data, 0.05 B/Flop I/O
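To make the scaling concrete, a minimal Python sketch using the assumptions above; the one-FLOP-per-instruction and one-reference-pin-per-lane ratios are my own illustrative assumptions, so the absolute pin counts will not match the chart exactly, but the linear growth with core count does.

```python
# Hypothetical scaling sketch (not the chart's exact data): off-chip pin count
# vs. core count under the stated assumptions, further assuming 1 FLOP per
# instruction, differential signaling (2 pins/lane), and ~1 reference pin/lane.

CLOCK_GHZ = 3.0                      # core clock
IPC = 3.0                            # instructions per cycle
LANE_GBPS = 10.0                     # per-lane I/O rate
BYTES_PER_FLOP = 1.0 + 0.1 + 0.05    # memory + data + I/O traffic

def pins_per_chip(n_cores):
    gflops = n_cores * CLOCK_GHZ * IPC     # GFLOP/s (1 FLOP/instruction assumed)
    gbps = gflops * BYTES_PER_FLOP * 8     # required off-chip Gb/s
    lanes = gbps / LANE_GBPS               # number of 10 Gb/s lanes
    return round(lanes * 3)                # 2 signal pins + 1 reference pin per lane

for n in (1, 2, 4, 8, 16, 32, 64, 128):
    print(f"{n:4d} cores -> ~{pins_per_chip(n):6d} signal + reference pins")
```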
[Chart: signal + reference pins per chip (0–18,000) vs. number of cores (1–128)]
Implications of BW Scaling:
Only several generations left before module I/O limit is hit… what sets limits?
Module Escape Bottleneck
Card Escape Bottleneck (already breached)
[Chart: signal + reference pins per chip vs. number of cores, with escape limits marked — chip escape at 200 µm pitch, module escape at 1 mm pitch, card escape at 8 pair/mm (QCM with 8 cores)]
Anatomy of Communication Links:
[Diagrams: three link types sharing the same serializer/deserializer structure —
ELECTRICAL I/O: serializer + Tx FFE, electrical channel, Rx + DFE, deserializer; L <~ 1 m of PCB, a few meters of cable @ 10 Gb/s.
OFF-CHIP OPTICAL MODULE: serializer + laser drive, 850 nm III-V VCSEL in an off-chip OE module, fiber or waveguide, PD, Rx amp, deserializer; L: cm to 300 m.
INTEGRATED SILICON NANOPHOTONICS: serializer + modulator drive, Si modulator at 1300 or 1550 nm (DC laser off chip), silicon waveguide, Ge PD, Rx amp, deserializer; L: cm to km.]
All links have the same basic features; the differences are in modulation and detection, and these determine power efficiency, distance × bandwidth, and density…
Electrical Interconnect Modeling
[Diagram: IC 1 on Module 1 connected to IC 2 on Module 2 via high-speed links (15 to 60 cm)]
[Plot: single-ended insertion loss (dB, 0 to −100) vs. frequency (0–50 GHz), simulation versus measurement, for a channel comprising module, C4, BGA, LGA, and PCB transmission lines]
Modeling accuracy confirmed by measurement; these models are used to project the limits of electrical links…
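For intuition about why the insertion-loss curves roll off, a hedged sketch of the standard copper-trace loss model — skin-effect loss scaling as √f and dielectric loss scaling as f, both proportional to length. The coefficients below are placeholders, not values fitted to the Megtron 6 measurements.

```python
import math

# Illustrative PCB trace loss model (placeholder coefficients, not fitted to
# the measured channels): skin-effect term ~ sqrt(f), dielectric term ~ f,
# both proportional to trace length.

SKIN_DB_PER_CM_SQRT_GHZ = 0.03   # assumed skin-effect coefficient
DIEL_DB_PER_CM_GHZ = 0.02        # assumed dielectric-loss coefficient

def trace_loss_db(freq_ghz, length_cm):
    skin = SKIN_DB_PER_CM_SQRT_GHZ * math.sqrt(freq_ghz) * length_cm
    dielectric = DIEL_DB_PER_CM_GHZ * freq_ghz * length_cm
    return skin + dielectric

for f in (5, 12.5, 25):   # e.g. Nyquist of 10, 25, 50 Gb/s NRZ
    print(f"{f:5.1f} GHz: {trace_loss_db(f, 45):5.1f} dB over 45 cm")
```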
Electrical Interconnect Limits
• Module-to-module on-board limits (see chart below):
• Off-board (backplane):
– Limits board-to-board bitrates to ~6.4 Gb/s for typical server configurations
• Rack-to-rack – already optical
[Chart: throughput (Gb/s, 0–40) vs. link distance (15–120 cm) on Megtron 6 for NRZ, duobinary, and PAM4 signaling]
NRZ with FFE and DFE (and/or CTLE) is the best modulation for dense buses.
Achieves 25 Gb/s @ 45 cm…
Costly dielectrics are needed for > 25 Gb/s…
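As a reference for what the FFE/DFE equalization above does, a toy NRZ sketch: a feed-forward (de-emphasis) filter at the transmitter plus a decision-feedback loop at the receiver that subtracts post-cursor ISI using past decisions. The channel and tap values are illustrative, not the hardware's.

```python
import numpy as np

# Toy NRZ link with transmit FFE (de-emphasis) and receive DFE.
rng = np.random.default_rng(0)
symbols = 2.0 * rng.integers(0, 2, 2000) - 1.0   # NRZ {-1, +1}

ffe = np.array([1.0, -0.3])             # 2-tap TX de-emphasis (main, post-cursor)
channel = np.array([1.0, 0.45, 0.2])    # assumed pulse response with post-cursor ISI
dfe = np.array([0.15, 0.065])           # taps chosen to cancel the residual post-cursors

tx = np.convolve(symbols, ffe)[: len(symbols)]
rx = np.convolve(tx, channel)[: len(symbols)]

decisions = np.zeros(len(rx))
for n in range(len(rx)):
    fb = sum(dfe[k] * decisions[n - 1 - k] for k in range(len(dfe)) if n - 1 - k >= 0)
    decisions[n] = 1.0 if rx[n] - fb > 0 else -1.0   # slicer after ISI subtraction

print("bit errors:", int(np.count_nonzero(decisions != symbols)))
```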
Optical Interconnect Modeling
• Lowest-power links use optics as “analog repeaters” of signal with no clock recovery
– E-O-E modeling required: jitter adds up over two electrical links, one optical link
[Photos/diagram: Terabus "Optomodule" (Kash et al.) — a transceiver IC with OEs mounted on an SLC carrier with a lens array, BGA-attached through a cutout to the Optocard, which carries the optical waveguides and a second lens array]
Optical Interconnect Modeling
• Two electrical on-module links, each with ~30 GHz media BW
• One WG optical link with ~40 GHz media BW
• Assuming electrical and optical I/O do not limit the link BW, the composite BW is 26 GHz
• The actual link includes electrical and optical I/O BW, driving the system BW to < ~18 GHz
– This limits the overall EOE link bitrate to ~26 Gb/s @ 1 meter
• Our models include the EOE link, using a full dual-Dirac jitter convolution for the end-to-end composite link
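The dual-Dirac bookkeeping mentioned above can be illustrated with its usual closed form: at a target BER, deterministic jitter adds linearly across stages while random jitter adds in RSS. The per-stage numbers below are assumptions for illustration; the real model convolves the full jitter distributions.

```python
import math

# Simplified dual-Dirac jitter bookkeeping for a cascaded E-O-E link.
# Closed-form stand-in for the full convolution: DJ adds linearly, RJ in RSS.
Q_BER_1E12 = 7.03   # Q factor for BER = 1e-12

def total_jitter_ui(dj_ui, rj_sigma_ui, q=Q_BER_1E12):
    dj = sum(dj_ui)                                   # deterministic jitter, pk-pk
    rj = math.sqrt(sum(s * s for s in rj_sigma_ui))   # random jitter, 1-sigma
    return dj + 2 * q * rj

# Stages: TX electrical line, optical waveguide link, RX electrical line (assumed values).
dj_stages = [0.08, 0.05, 0.08]      # UI pk-pk
rj_stages = [0.010, 0.008, 0.010]   # UI rms

tj = total_jitter_ui(dj_stages, rj_stages)
print(f"composite TJ @ 1e-12 BER: {tj:.3f} UI -> eye closure {100*tj:.1f}% of the bit period")
```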
[Diagram: E-O-E link model — CPU module, on-module transmission line, Tx OE, optical waveguide, Rx OE, transmission line, CPU; a 50 mm organic module carrying the CPU, CMOS TRX, SLC, OEs, and a lens array sits on a base circuit board with optical waveguides and turning mirrors]
Electrical and Optical Link Reach
[Plot: maximum electrical data rate (Gb/s) vs. distance (20–120 cm) for FFE + DFE links — TELL hardware measurements, simulations with I/O limitations, and simulations with no IC parasitics]
[Plot: maximum data rate (Gb/s) vs. distance (20–160 cm) — EOE with 10G Terabus optics, EOE with 20G Terabus optics, 20G Terabus optics only, the ideal channel-only limit, and the optical WG limit, with the 25 Gb/s level marked]
Predicted link reach @ 25 Gb/s: ~45 cm for electrical links (Megtron 6), ~100 cm for optical WG links
EOE links double the reach (at current WG loss)
It's All About Bandwidth Escape:
Bandwidth of elements (Tb/s):
Optical path: OE escape 46–100 | optical WG 64–166 | C4 90–211 | module 56–112 | card 73–136
Electrical path: C4 90–211 | module 56–112 | LGA 12–23.5 | LGA escape 17–29
[Diagrams: the actual Terabus package and a notional design with OEs around the module perimeter — chip and module on a PCB through an LGA connector, with OEs-on-IC over a cutout and BGA attach]
Multimode optical transceivers could provide ~4x module escape BW
Assume a 60 mm module, 1 mm LGA pitch, and 62.5 µm WG pitch: what is the escape BW?
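The escape question reduces to pitch arithmetic. A minimal sketch, assuming a single 20 Gb/s rate per differential pair and per waveguide, that roughly half the LGA pads carry signal, and that waveguides leave along the full module perimeter; these are illustrative assumptions, not the chart's exact inputs.

```python
# Back-of-envelope module escape bandwidth (60 mm module, assumed 20 Gb/s lanes).
MODULE_MM = 60.0
GBPS_PER_LANE = 20.0          # assumed per-pair / per-waveguide rate

# Electrical: LGA area array at 1 mm pitch; ~half the pads carry signal, used as diff pairs.
lga_pads  = (MODULE_MM / 1.0) ** 2
lga_pairs = 0.5 * lga_pads / 2
lga_tbps  = lga_pairs * GBPS_PER_LANE / 1000

# Optical: waveguides escape along the module perimeter at 62.5 um pitch.
wg_count = 4 * MODULE_MM / 0.0625
wg_tbps  = wg_count * GBPS_PER_LANE / 1000

print(f"LGA escape    : {lga_tbps:5.1f} Tb/s ({lga_pairs:.0f} diff pairs)")
print(f"Optical escape: {wg_tbps:5.1f} Tb/s ({wg_count:.0f} waveguides)")
print(f"ratio ~ {wg_tbps / lga_tbps:.1f}x")
```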
Escape Bandwidth Conclusions
For the high-end of HPC:
• Hit the rack-to-rack electrical BW limit in the early 2000's
• Hitting the off-board BW limit now; P7 IH chose optics for off-board links
• Likely to hit the electrical off-module BW limit soon; some packaging “fixes” exist – e.g. larger modules (module S21 is better than board S21) – but they are a tradeoff
[Photo: P7 IH node — optical off-card links, 12 Tb/s per hub]
P7 IH System Hardware – Node Front View (~1000 Nodes in Blue Waters)
• P7 QCM (8x)
• Hub module (8x) – MLC module hub assembly
• D-Link optical interface – connects to other Super Nodes
• L-Link optical interface – connects 4 nodes to form a Super Node
• Memory DIMMs (64x)
• PCIe interconnect
• 360 VDC input power supplies
• Water connection
• Node dimensions: 1 m W x 1.8 m D x 10 cm H
• Avago microPOD™ transceivers – all off-node communication is optical
• IBM's HPCS Program partially supported by DARPA
Integrated Storage – 384 2.5″ drives/drawer, 0–6 drawers/rack; 230 TBytes/drawer (with 600 GB 10K SAS disks), full RAID, 154 GB/s BW/drawer. Storage drawers replace server drawers 2-for-1 (up to 1.38 PetaBytes/rack).
Integrated Cooling – water pumps and heat exchangers; all thermal load transferred directly to building chilled water – no load on the room.
Integrated Power Regulation, Control, & Distribution – runs off any building voltage supply worldwide (200–480 VAC or 370–575 VDC), converting to 360 VDC for in-rack distribution. Full in-rack redundancy and automatic fail-over, 4 line cords. Up to 252 kW/rack max, 163 kW typical.
• All data center power & cooling infrastructure included in the compute/storage/network rack – no need for external power distribution or computer-room air handling equipment.
– All components correctly sized for maximum efficiency – extremely good 1.18 Power Utilization Efficiency.
– Integrated management for all compute, storage, network, power, & thermal resources.
– Scales to 512K P7 cores (192 racks) – without any extraneous hardware except optical fiber cables.
Servers – 256 Power7 cores/drawer, 1–12 drawers/rack
Compute: 8-core Power7 CPU chip, 3.7 GHz, 12s technology, 32 MB L3 eDRAM/chip, 4-way SMT, 4 FPUs/core, quad-chip module (QCM); >90 TF/rack. No accelerators: normal CPU instruction set, robust cache/memory hierarchy; easy programmability, predictable performance, mature compilers & libraries.
Memory: 512 GBytes/s per QCM (0.5 Byte/FLOP), 12 Terabytes/rack
External I/O: 16 PCIe Gen2 x16 slots/drawer; SAS or external connections
Network: integrated hub (HCA/NIC & switch) per QCM (8/drawer), with a 54-port switch, totaling 12 Tbit/s (1.1 TByte/s net BW) per hub:
• Host connection: 4 links, (96+96) GB/s aggregate (0.2 Byte/FLOP)
• On-card electrical links: 7 links to other hubs, (168+168) GB/s aggregate
• Local-remote optical links: 24 links to near hubs, (120+120) GB/s aggregate
• Distant optical links: 16 links to far hubs (to 100 m), (160+160) GB/s aggregate
• PCI-Express: 2–3 per hub, (16+16) to (20+20) GB/s aggregate
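As a consistency check, summing the per-hub link aggregates listed above (taking PCIe at the lower (16+16) GB/s end) lands at the quoted ~1.1 TByte/s net hub bandwidth.

```python
# Per-hub aggregate bandwidth from the link classes listed above (GB/s totals).
links_gb_per_s = {
    "host connection (4 links)":        96 + 96,
    "on-card electrical (7 links)":     168 + 168,
    "local-remote optical (24 links)":  120 + 120,
    "distant optical (16 links)":       160 + 160,
    "PCI-Express (2 links, low end)":   16 + 16,
}
total = sum(links_gb_per_s.values())
for name, bw in links_gb_per_s.items():
    print(f"{name:32s} {bw:5d} GB/s")
print(f"{'total per hub':32s} {total:5d} GB/s  (~{total/1000:.1f} TB/s net)")
```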
PERCS/Power7-IH System - Data-Center-In-A-Rack
[Diagram: 3D integrated chip with a logic plane, memory plane, and photonic plane; optical I/O carries off-chip optical signals and on-chip optical traffic]
Goal: Integrate Ultra-dense Photonic Circuits with electronics
– Increase off-chip BW
– Allow on-chip optical routing and interconnect
[Diagram: WDM bit-parallel messaging — at each core, a serializer feeds N modulators at different wavelengths (electrical-to-optical) to form a message with ~1 Tb/s aggregate BW; messages traverse a switch fabric, and at the destination core N detectors and a deserializer (optical-to-electrical) recover the N parallel channels]
Vision for 2020: Silicon Nanophotonics for an optically connected 3-D supercomputer chip — targets ~1 mW/Gb/s and ~$0.025/Gb/s.
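The ~1 Tb/s per-message aggregate in this vision is simply channel count times per-channel rate; the combinations below are illustrative choices, not a stated design point.

```python
# Bit-parallel WDM message bandwidth: aggregate = wavelength channels x per-channel rate.
def message_bw_gbps(n_wavelengths, gbps_per_channel):
    return n_wavelengths * gbps_per_channel

# Illustrative combinations reaching the ~1 Tb/s aggregate target.
for n, rate in [(25, 40), (50, 20), (64, 16)]:
    print(f"{n:3d} wavelengths x {rate:2d} Gb/s = {message_bw_gbps(n, rate):5d} Gb/s")
```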
CMOS front end (FEOL) photonic integration for compatibility
Advantages of deeply scaled nanophotonics — best-quality silicon and lithography:
• Highest performance photonic devices • Lowest loss waveguides• Lowest power consumption• Most accurate passive λ control
• Densest integration with CMOS
• Same mask set, standard processing
• Same design environment (e.g. Cadence)
Photonics sharing Si layer with FET body
[Images: photonics sharing the Si layer with the FET body — a 6-channel WDM demultiplexer (channels 1…6, ~50 µm scale) and a cross-section showing the optical waveguide and FET in the SOI layer above the BOX, beneath M1]
FEOL-integrated nanophotonic devices from the IBM Si Photonics Group (Y. Vlasov, S. Assefa, W. Green, F. Xia, F. Horst, …; www.research.ibm.com/photonics)
1. Tx: ultra-compact 10 Gb/s modulator — Optics Express, Dec. 2007; 10 Gb/s with L_MZM = 200 µm
2. Rx: Ge waveguide photodetector — Optical Fiber Communications, March 2009; 40 Gb/s at 1 V, 8 fF capacitance
3. Ultra-compact WDM multiplexers — Optics Express May 2007, SPIE March 2008; temperature insensitive, 30×40 µm
4. High-throughput nanophotonic switch — Nature Photonics, April 2008; error-free switching at 40 Gb/s
Broad optical bandwidth for thermal and process stability and efficient use of optical spectrum
B. G. Lee, A. Biberman, P. Dong, M. Lipson, K. Bergman, PTL 20, 767-769 (2008).
P. Dong, S. F. Preble, M. Lipson, Opt. Express 15, 9600-9605 (2007).
W. M. J. Green, M. J. Rooks, L. Sekaric, and Y. A. Vlasov, Opt. Express 15, 17106-17113 (2007).
Y. Vlasov, W. M. J. Green, and F. Xia, Nature Photonics 2, 242-246 (2008).
J. Van Campenhout, W. Green, S. Assefa, and Y. A. Vlasov, Opt. Express 17, 24020-24029 (2009).
• Resonant or frequency-sensitive devices, e.g. ring-resonator comb filters:
– Very narrow filter bands
– Sensitive to temperature variations
• Conventional MZ devices:
– Wavelength-sensitive coupler; wider filter band
– Potential drawbacks: unused regions between bands reduce spectral efficiency; usable wavelengths are restricted by the dispersion-induced variation of the FSR; fabrication tolerances may demand post-fabrication trimming
• Broadband solution: design a filter with a single ultra-wide band, filled with multiple uniformly spaced WDM channels
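To see why the narrow resonant bands are thermally fragile, a back-of-envelope sketch using commonly quoted silicon thermo-optic values; the coefficient and group index below are assumptions for illustration, not measurements from this work.

```python
# Resonance drift of a Si microring vs. temperature, using typical silicon values.
WAVELENGTH_NM = 1550.0
DN_DT = 1.86e-4          # silicon thermo-optic coefficient, 1/K (approximate)
GROUP_INDEX = 4.2        # typical Si wire waveguide group index (assumed)

drift_nm_per_K = WAVELENGTH_NM * DN_DT / GROUP_INDEX
print(f"ring resonance drift: ~{drift_nm_per_K * 1000:.0f} pm/K")

# A ~0.1 nm-wide resonance detunes by its own width within a degree or two,
# while a single band tens of nm wide tolerates large temperature swings.
for bw_nm in (0.1, 30.0):
    print(f"{bw_nm:5.1f} nm band tolerates ~{bw_nm / drift_nm_per_K:6.1f} K of drift")
```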
The work at IBM has been partially supported by DARPA through the Advanced Photonic Switch (APS) program.
WIMZ Provides Large Bandwidth, Low Crosstalk, and a CMOS-Compatible Drive Voltage
[Plots: transmittance (dB) vs. wavelength (1.35–1.65 µm) for the WIMZ and a reference MZ, showing T11 and T12 in the ON (1 V) and OFF (0 V) states at VON = 1 V, ION = 3.5 mA, T = 23 °C. The WIMZ exhibits ~110 nm of optical bandwidth with −18 dB crosstalk, versus ~30 nm for the reference MZ.]
• Measured with a TE-polarized broadband LED and an OSA
• Normalized to total OFF-state power in both outputs
[J. Van Campenhout et al., Optics Express 17 (26) 2009]
Coupling Light to Si Photonics
Tapered glass waveguides match the pitch, cross section, and NA of the chip; a standard 250-µm-pitch SM/PM fiber array aligned in a V-groove is butt-coupled to the glass waveguides.
B. G. Lee, F. E. Doany, S. Assefa, W. M. J. Green, M. Yang, C. L. Schow, C. V. Jahnes, S. Zhang, J. Singer, V. I. Kopp, J. A. Kash, and Y. A. Vlasov, Proceedings of OFC 2010, paper PDPA4.
Multichannel tapered coupler allows interfacing 250-µm-pitch PM fiber array with 20-µm-pitch silicon waveguide array
8-channel coupling demonstrated with < 1 dB optical loss at the interface; uniform insertion loss and crosstalk
In collaboration with Chiral Photonics
[Image: tapered WG coupler at the chip edge]
Power Issues
Exascale system example
• 0.2 B/Flop comm BW, 1 B/Flop memory BW
• 1 mW/Gb/s × 2×10⁹ GF → 0.2 MW I/O power
• Typical I/O is 10+ mW/Gb/s … 2 MW of I/O power! Ouch!
Link Type             | Power Efficiency (pJ/bit) | Distance | 50 Tb/s Off-module I/O Power (W)
Electrical            | 3                         | < 1 m    | 150
EOE                   | 7                         | < 100 m  | 350
Silicon nanophotonics | 2–3                       | km       | 100–150
Power will ultimately limit module escape BW to <~ 25–50 Tb/s
Reducing the power of Si nanophotonic devices is key…
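The power arithmetic is easy to reproduce. A sketch using the table's efficiencies; the 2×10⁸ Gb/s aggregate bandwidth is an assumption implied by the 0.2 MW result at 1 mW/Gb/s (note 1 mW/Gb/s = 1 pJ/bit).

```python
# Reproducing the slide's I/O power arithmetic.
# System level: aggregate bandwidth implied by 0.2 MW at 1 mW/Gb/s (assumption).
AGG_GBPS = 2e8
for eff_mw_per_gbps in (1.0, 10.0):
    print(f"{eff_mw_per_gbps:4.0f} mW/Gb/s -> {AGG_GBPS * eff_mw_per_gbps / 1e9:.1f} MW I/O power")

# Per module: 50 Tb/s of escape bandwidth at the table's pJ/bit efficiencies.
OFF_MODULE_TBPS = 50.0
for name, pj_per_bit in [("Electrical", 3), ("EOE", 7), ("Si nanophotonics", 2.5)]:
    watts = OFF_MODULE_TBPS * 1e12 * pj_per_bit * 1e-12
    print(f"{name:18s}: {watts:5.0f} W per module")
```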
[Chart: electrical link power efficiency progress — mW/Gb/s (0–45) vs. year (2000–2010)]
Cost Analysis
• Cost comparisons need a system approach
• Design + BOM + Assembly + Test
• Same system optimized for electrical or optical technology – difficult
• HPC system power and cost needs are clear:
[Figure: copper vs. optics interconnect comparison]
Year | Peak Performance | (Bidi) Optical Bandwidth   | Optics Power Efficiency (mW/Gb/s) | Optics Power Consumption | Cost ($/Gb/s, aggressive) | Optics Cost
2008 | 1 PF             | 0.012 PB/s (1.2×10⁵ Gb/s)  | 100                               | 0.010 MW                 | 10                        | $1 M
2012 | 10 PF            | 1 PB/s (10⁷ Gb/s)          | 60                                | 0.50 MW                  | 1.0                       | $9 M
2016 | 100 PF           | 20 PB/s (2×10⁸ Gb/s)       | 10                                | 2 MW                     | 0.2                       | $30 M
2020 | 1000 PF (1 EF)   | 400 PB/s (4×10⁹ Gb/s)      | 3                                 | 10 MW                    | 0.025                     | $80 M
(Table after Benner, 2009.) At exascale, I/O consumes a significant share of the system's power and cost…
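The power and cost columns follow from the bandwidth column; a short sketch multiplying the per-Gb/s figures by the aggregate bandwidth roughly reproduces them (the table's entries appear to be rounded, so small discrepancies remain).

```python
# Optics power and cost derived from the per-Gb/s figures in the table above.
rows = [
    # year, optical BW (Gb/s), mW/Gb/s, $/Gb/s (aggressive)
    (2008, 1.2e5, 100, 10.0),
    (2012, 1.0e7,  60,  1.0),
    (2016, 2.0e8,  10,  0.2),
    (2020, 4.0e9,   3,  0.025),
]
for year, gbps, mw_per_gbps, usd_per_gbps in rows:
    power_mw = gbps * mw_per_gbps / 1e9       # MW
    cost_musd = gbps * usd_per_gbps / 1e6     # $M
    print(f"{year}: {power_mw:6.3f} MW optics power, ${cost_musd:6.1f} M optics cost")
```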
Summary
• Optical links provide escape bandwidth advantage
• Rack-to-rack links have been optical
• Off-card links are just hitting the BW wall (P7 IH)
• Off-module links will hit wall this decade
• Costly to exceed 25 Gb/s in dense electrical buses
• Power issues will dominate I/O bandwidth growth
• Electrical I/O power efficiency improving greatly
• EOE solutions provide distance advantage, but higher power
• Si nanophotonics solutions need focus to reduce I/O power
• Optics will see increasing use as electrical BW limits are approached
• Electrical solutions growing more costly
• Optical technology cost takedowns must continue for Exascale and general use
• Integrated Silicon Nanophotonics offers greatest promise to push BW
References
1) Barcelona Supercomputing Center (BSC), Barcelona, Spain; IBM Mare Nostrum system installed in 2004
2) T.J. Beukema, “Link Modeling Tool,” Challenges in Serial Electrical Interconnects, IEEE SSCS Seminar Fort Collins, CO, March 2007.
3) S. Rylov, S. Reynolds, D. Storaska, B. Floyd, M. Kapur, T. Zwick, S. Gowda, and M. Sorna, “10+ Gb/s 90-nm CMOS serial link demo in CBGA package,” IEEE Journal of Solid-State Circuits, vol. 40, no. 9, pp. 1987-1991, Sept. 2005.
4) L. Shan, Y. Kwark, P. Pepeljugoski, M. Meghelli, T. Beukema, J. Trewhella, M. Ritter, “Design, analysis and experimental verification of an equalized 10 Gbps link,” DesignCon 2006.
5) J.F. Bulzacchelli, M. Meghelli, S.V. Rylov, W. Rhee, A.V. Rylyakov, H.A. Ainspan, B.D. Parker, M.P. Beakes, A. Chung, T.J. Beukema, P.K. Pepeljugoski, L. Shan, Y.H. Kwark, S. Gowda, and D.J. Friedman, “A 10-Gb/s 5-tap DFE/4-tap FFE transceiver in 90-nm CMOS technology,” IEEE Journal of Solid-State Circuits, vol. 41, no. 12, pp. 2885-2900, Dec. 2006.
6) D.G. Kam, D. R. Stauffer, T.J. Beukema and M. B. Ritter, “Performance comparison of CEI-25 signaling options and sensitivity analysis,” OIF Physical Link Layer (PLL) working group presentation, November 2007.
7) M.B. Ritter et al., “The Viability of 25 Gb/s On-board Signaling,” 58th ECTC, May 2008.
8) F. Doany et al., “Measurement of optical dispersion in multimode polymer waveguides,” IEEE LEOS Summer Topical Meetings, June 2004.
9) A. Benner, “Optics in Servers – HPC Interconnect and Other Opportunities,” IEEE Photonics Society Winter Topicals 2010, Photonics for Routing and Interconnects, January 11, 2010.
This work was supported in part by Defense Advanced Research Projects Agency under the contract numbers HR0011-06-C-0074, HR0011-07-9-002 and MDA972-03-0004.