Page 1: Optical Technologies for Data Communication in Large Parallel Systems

IBM Research

Optical Technologies for Data Communication in Large Parallel Systems

Mark B. Ritter, Yurii Vlasov, Jeffrey A. Kash, and Alan Benner* IBM T.J. Watson Research Center, *IBM Poughkeepsie [email protected]

Page 2: Optical Technologies for Data Communication in Large Parallel Systems

2

Outline

• HPC Performance Scaling and Bandwidth

• Anatomy of a Link

• Electrical and Optical Interconnect Limits

• Promise of Nanophotonic Technology

• Potential Insertion Points

• Summary

Page 3: Optical Technologies for Data Communication in Large Parallel Systems

3

Performance Scaling Now Driven by Communication

System performance gains no longer principally from lithography-driven uniprocessor performance

Performance gains now from parallelism exploited at chip, system level

BW requirements must scale with System Performance, ~1B/FLOP (memory & network)

Requires exponential increases in communication bandwidth at all levels of the system

• Inter-rack, backplane, card, chip

[Figures: chip performance scaling (Olukotun et al.) and system performance scaling over time.]

Page 4: Optical Technologies for Data Communication in Large Parallel Systems

4

Bandwidth: the Bane of the Multicore Paradigm:

Logic flops continue to scale faster than interconnect BW

• Constant Byte/Flop ratio with N cores means:

Bandwidth(N cores) = N × Bandwidth(single core)

• 3Di (3D integration) will only exacerbate bottlenecks

Assumptions (see the sketch below):
• 3 GHz clock
• ~3 IPC
• 10 Gb/s I/O
• 1 B/Flop mem
• 0.1 B/Flop data
• 0.05 B/Flop I/O

[Figure: signal + reference pins per chip vs. number of cores (1 to 128), growing linearly with core count (axis up to 18,000 pins).]
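To make the scaling concrete, here is a minimal Python sketch of the pin-count trend under the assumptions listed above. The differential-signaling and 1:1 signal-to-reference ratios are added assumptions, so absolute counts are indicative only; the linear growth with core count is the point of the chart.

```python
# Pin-count scaling sketch using the slide's assumptions (3 GHz clock, ~3 IPC,
# 10 Gb/s per pin, 1.15 B/Flop of total traffic). Differential signaling and a
# 1:1 signal-to-reference pin ratio are added assumptions, so absolute counts
# are indicative only.

CLOCK_GHZ = 3.0
IPC = 3.0
IO_GBPS_PER_PIN = 10.0
BYTES_PER_FLOP = 1.0 + 0.1 + 0.05   # memory + data + I/O traffic

def pins_per_chip(n_cores: int) -> int:
    """Signal + reference pins needed to hold Byte/Flop constant with N cores."""
    gflops = n_cores * CLOCK_GHZ * IPC           # chip GFLOP/s
    bw_gbps = gflops * BYTES_PER_FLOP * 8        # required off-chip bandwidth, Gb/s
    lanes = bw_gbps / IO_GBPS_PER_PIN            # 10 Gb/s lanes
    signal_pins = 2 * lanes                      # differential pairs (assumption)
    return round(signal_pins * 2)                # + reference pins, assumed 1:1

for n in (1, 2, 4, 8, 16, 32, 64, 128):
    print(f"{n:3d} cores -> ~{pins_per_chip(n):5d} signal + reference pins")
```

Whatever overheads are assumed, the count scales linearly with N, which is what drives the escape bottlenecks on the next slide.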

Page 5: Optical Technologies for Data Communication in Large Parallel Systems

5

Implications of BW Scaling:

Only several generations left before the module I/O limit is hit… what sets the limits?

[Figure: signal + reference pins per chip vs. number of cores, annotated with packaging escape limits: chip escape limit (200 µm pitch), module escape bottleneck (1 mm pitch), and card escape bottleneck (8 pairs/mm; already breached for a QCM with 8 cores).]

Page 6: Optical Technologies for Data Communication in Large Parallel Systems

6

Anatomy of Communication Links:

[Figure: block diagrams of the three link types; each has a serializer and deserializer at the chip boundaries.]

• ELECTRICAL I/O: serializer → Tx FFE → electrical channel → DFE / Rx → deserializer. Reach L <~ 1 m PCB, a few meters of cable @ 10 Gb/s.

• OFF-CHIP OPTICAL MODULE: serializer → laser driver → III-V VCSEL (850 nm) → fiber or WG → PD → Rx amp → deserializer. Reach L: cm to 300 m.

• INTEGRATED SILICON NANOPHOTONICS: serializer → modulator driver → Si modulator (fed by an off-chip DC laser, 1300 or 1550 nm) → silicon WG → Ge PD → Rx amp → deserializer. Reach L: cm to km.

All links share the same basic features; the differences lie in modulation and detection, and these determine power efficiency, distance × bandwidth, and density…


Page 7: Optical Technologies for Data Communication in Large Parallel Systems

7

Electrical Interconnect Modeling

[Figure: link topology from IC 1 on Module 1 to IC 2 on Module 2, through module C4, BGA/LGA, and PCB transmission lines; high-speed links of 15 to 60 cm. Plot of single-ended insertion loss (dB) vs. frequency (0 to 50 GHz), simulation versus measurement for Megtron 6 channels.]

Modeling accuracy is confirmed against measurement; the models then establish the limits of electrical links…

Page 8: Optical Technologies for Data Communication in Large Parallel Systems

8

Electrical Interconnect Limits

• Module-to-module on-board limits:

• Off-board (backplane):

– Limits board-to-board bitrates to ~6.4 Gb/s for typical server configurations

• Rack-to-rack – already optical

[Figure: throughput (Gb/s) vs. link distance (15 to 120 cm) on Megtron 6 for NRZ, duobinary, and PAM4 signaling.]

NRZ with FFE and DFE (and/or CTLE) is the best modulation for dense buses.

Achieve 25 Gb/s @ 45cm…

Costly dielectrics for > 25 Gb/s…

Page 9: Optical Technologies for Data Communication in Large Parallel Systems

9

Optical Interconnect Modeling

• Lowest-power links use optics as "analog repeaters" of the signal, with no clock recovery

– E-O-E modeling is required: jitter accumulates over two electrical links and one optical link

[Figure: Terabus "Optomodule" (Kash et al.): transceiver IC with OEs on an SLC carrier and lens array, BGA-attached over a cutout in the Optocard, which carries the optical waveguides; top and bottom views shown.]

Page 10: Optical Technologies for Data Communication in Large Parallel Systems

10

Optical Interconnect Modeling

• Two electrical on-module links, each with ~ 30 GHz media BW

• One WG optical link with ~ 40 GHz media BW

• Assuming the electrical and optical I/O do not limit the link BW, the composite media BW is ~26 GHz

• The actual link also includes the electrical and optical I/O BW, which drives the composite BW to < ~18 GHz

– This limits the overall EOE link bitrate to ~26 Gb/s @ 1 meter

• Our models treat the EOE link with a full dual-Dirac jitter convolution of the end-to-end composite link (see the sketch below)
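As a rough illustration of how a dual-Dirac jitter budget treats a composite link, here is a minimal Python sketch. The per-element jitter numbers are hypothetical placeholders, not values from our models; only the combination rule (deterministic jitter adds linearly, random jitter adds in root-sum-square, TJ = DJ + 2·Q·RJ) reflects the dual-Dirac method.

```python
import math

# Minimal dual-Dirac jitter budget for a composite E-O-E link.
# The numeric jitter values are placeholders, not taken from the slide.

Q_BER_1E12 = 7.03   # Q factor corresponding to a 1e-12 bit error ratio

def total_jitter_ui(elements, ui_ps):
    """Combine per-element (DJ_pp, RJ_rms) jitter in ps; return total jitter in UI."""
    dj = sum(d for d, _ in elements)                 # deterministic jitter: linear sum
    rj = math.sqrt(sum(r * r for _, r in elements))  # random jitter: root-sum-square
    tj_ps = dj + 2 * Q_BER_1E12 * rj                 # dual-Dirac total jitter
    return tj_ps / ui_ps

# Hypothetical 25 Gb/s EOE link (40 ps UI): Tx electrical, optical Tx+Rx, Rx electrical
elements = [(3.0, 0.25), (4.0, 0.40), (3.0, 0.25)]   # (DJ_pp ps, RJ_rms ps) per element
print(f"Total jitter: {total_jitter_ui(elements, ui_ps=40.0):.2f} UI at BER 1e-12")
```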

[Figure: E-O-E link cross-section: CPU CMOS transceiver and SLC with OE and lens array on a 50 mm organic module, over a base circuit board with optical waveguides and turning mirrors; link path CPU module → Tx transmission line → OE → waveguide → OE → Rx transmission line → CPU module.]

Page 11: Optical Technologies for Data Communication in Large Parallel Systems

11

Electrical and Optical Link Reach

[Figure: (left) maximum data rate vs. distance for electrical links with FFE + DFE, comparing TELL hardware, simulations with I/O limitations, and the no-IC-parasitics case; (right) maximum data rate vs. distance for EOE links with 10G and 20G Terabus optics, 20G Terabus optics only, and the ideal channel-limit-only and optical-WG-limit curves, with the 25 Gb/s level marked.]

Optical WG limit

Predicted link reach @ 25 Gb/s: ~45 cm for electrical links (Megtron 6), ~100 cm for optical WG links.

EOE links double the reach (at current WG loss).

Page 12: Optical Technologies for Data Communication in Large Parallel Systems

12

It's All About Bandwidth Escape:

Bandwidth of Elements (Tb/s):

Electrical: Module 56 - 112; C4 90 - 211; LGA 12 - 23.5; LGA escape 17 - 29; Card 73 - 136

Optical: Module 56 - 112; C4 90 - 211; OE escape 46 - 100; Optical WG 64 - 166

[Figure: chip-on-module with LGA connector to the PCB and the module's signal/power pin map; photos of the actual Terabus optomodule (IC over a cutout with OEs on the IC, BGA attach) next to a notional design with OEs placed around the module perimeter.]

Multimode optical transceivers could provide ~4x the module escape BW.

Assume a 60 mm module, 1 mm LGA pitch, and 62.5 µm WG pitch: what is the escape BW? (See the sketch below.)
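A back-of-the-envelope Python sketch of that escape-bandwidth question; the signal-pair fraction and per-lane data rates below are assumptions, so the results are indicative of the ranges quoted above rather than exact reproductions.

```python
# Rough escape bandwidth for a 60 mm module. The 2/3 signal fraction and the
# per-lane data rates are assumptions, so numbers are indicative only.

MODULE_MM = 60.0
PERIMETER_MM = 4 * MODULE_MM                      # OEs/waveguides around the perimeter

def lga_escape_tbps(pitch_mm=1.0, signal_fraction=0.66, gbps_per_pair=10.0):
    """Electrical escape through the LGA area array under the module."""
    contacts = (MODULE_MM / pitch_mm) ** 2        # 60 x 60 grid at 1 mm pitch
    pairs = contacts * signal_fraction / 2        # assumed 2/3 signal, differential
    return pairs * gbps_per_pair / 1000.0         # Tb/s

def optical_wg_escape_tbps(pitch_um=62.5, gbps_per_wg=20.0):
    """Optical escape via waveguides routed off the module perimeter."""
    waveguides = PERIMETER_MM * 1000.0 / pitch_um
    return waveguides * gbps_per_wg / 1000.0      # Tb/s

print(f"LGA escape       : ~{lga_escape_tbps():.0f} Tb/s at 10 Gb/s per pair")
print(f"Optical WG escape: ~{optical_wg_escape_tbps():.0f} Tb/s at 20 Gb/s per waveguide")
```

With these assumptions the electrical LGA escape lands near the low end of the 12-29 Tb/s range above, while the waveguide escape falls within the 64-166 Tb/s range; higher per-lane rates move both toward the upper ends.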

Page 13: Optical Technologies for Data Communication in Large Parallel Systems

13

Escape Bandwidth Conclusions

For the high end of HPC:

• Hit the rack-level electrical BW limit in the early 2000s

• Hitting the off-board BW limit now; P7 IH chose optics for off-board links

• Likely to hit the electrical off-module BW limit soon (some packaging "fixes" exist, such as larger modules, at the cost of higher on-module S21 loss than on the board)

[Figure: P7 IH node with optical off-card links, 12 Tb/s per Hub.]

Page 14: Optical Technologies for Data Communication in Large Parallel Systems

14

P7 IH System Hardware – Node Front View (~1000 Nodes in Blue Waters)

[Figure: node front view with callouts]
• P7 QCM (8x)
• Hub module (8x): MLC module / hub assembly with Avago microPOD™ optics – all off-node communication is optical
• D-Link optical interface – connects to other Super Nodes
• L-Link optical interface – connects 4 nodes to form a Super Node
• Memory DIMMs (64x)
• PCIe interconnect
• 360 VDC input power supplies
• Water connection
• Node dimensions: 1 m W x 1.8 m D x 10 cm H
• IBM's HPCS Program (partially supported by …)

Page 15: Optical Technologies for Data Communication in Large Parallel Systems

15

PERCS/Power7-IH System – Data-Center-In-A-Rack

Integrated Storage – 384 2.5" drives/drawer, 0-6 drawers/rack
• 230 TBytes/drawer (with 600 GB 10K SAS disks), full RAID, 154 GB/s BW/drawer
• Storage drawers replace server drawers 2-for-1 (up to 1.38 PetaBytes/rack)

Integrated Cooling – water pumps and heat exchangers
• All thermal load is transferred directly to building chilled water – no load on the room

Integrated Power Regulation, Control, & Distribution
• Runs off any building voltage supply worldwide (200-480 VAC or 370-575 VDC), converting to 360 VDC for in-rack distribution
• Full in-rack redundancy and automatic fail-over, 4 line cords; up to 252 kW/rack max, 163 kW typical

• All data-center power & cooling infrastructure is included in the compute/storage/network rack
– No need for external power distribution or computer-room air-handling equipment
– All components sized for maximum efficiency – extremely good 1.18 power utilization efficiency
– Integrated management for all compute, storage, network, power, & thermal resources
– Scales to 512K P7 cores (192 racks) without any extraneous hardware except optical fiber cables

Servers – 256 Power7 cores/drawer, 1-12 drawers/rack
• Compute: 8-core Power7 CPU chip, 3.7 GHz, 12s technology, 32 MB L3 eDRAM/chip, 4-way SMT, 4 FPUs/core, quad-chip module (QCM); >90 TF/rack
• No accelerators: normal CPU instruction set, robust cache/memory hierarchy; easy programmability, predictable performance, mature compilers & libraries
• Memory: 512 GBytes/s per QCM (0.5 Byte/FLOP), 12 Terabytes/rack
• External I/O: 16 PCIe Gen2 x16 slots/drawer; SAS or external connections
• Network: integrated Hub (HCA/NIC & switch) per QCM (8/drawer), each with a 54-port switch and a total of 12 Tbit/s (1.1 TByte/s net BW) per Hub:
– Host connection: 4 links, (96+96) GB/s aggregate (0.2 Byte/FLOP)
– On-card electrical links: 7 links to other hubs, (168+168) GB/s aggregate
– Local-remote optical links: 24 links to near hubs, (120+120) GB/s aggregate
– Distant optical links: 16 links to far hubs (to 100 m), (160+160) GB/s aggregate
– PCI-Express: 2-3 per hub, (16+16) to (20+20) GB/s aggregate
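A quick consistency check of the per-QCM ratios quoted above, as a minimal Python sketch; the 8 flops/cycle/core figure (4 FPUs per core with fused multiply-add) is an assumption that is consistent with the >90 TF/rack number but not stated explicitly on the slide.

```python
# Sanity check of the per-QCM Byte/FLOP ratios quoted above.
# Assumption (not on the slide): 8 flops/cycle/core, i.e. 4 FPUs with fused multiply-add.

CORES_PER_QCM = 4 * 8          # quad-chip module: 4 chips x 8 Power7 cores
CLOCK_GHZ = 3.7
FLOPS_PER_CYCLE_PER_CORE = 8   # assumption, see above

qcm_gflops = CORES_PER_QCM * CLOCK_GHZ * FLOPS_PER_CYCLE_PER_CORE   # ~947 GFLOP/s
mem_bw_gbs = 512               # GB/s per QCM (0.5 Byte/FLOP figure above)
host_bw_gbs = 96 + 96          # GB/s hub host connection (figure above)
rack_tflops = qcm_gflops * 8 * 12 / 1000   # 8 QCMs/drawer, 12 drawers/rack

print(f"QCM peak         : {qcm_gflops:.0f} GFLOP/s")
print(f"Memory B/FLOP    : {mem_bw_gbs / qcm_gflops:.2f}")    # ~0.5
print(f"Host link B/FLOP : {host_bw_gbs / qcm_gflops:.2f}")   # ~0.2
print(f"Rack peak        : {rack_tflops:.0f} TFLOP/s")        # >90 TF/rack
```

Running this gives ~947 GFLOP/s per QCM, ~0.54 B/FLOP of memory bandwidth, ~0.20 B/FLOP on the host links, and ~91 TFLOP/s per rack, matching the figures above.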

Page 16: Optical Technologies for Data Communication in Large Parallel Systems

16

Vision for 2020: Silicon Nanophotonics for an Optically Connected 3-D Supercomputer Chip – targets of 1 mW/Gb/s and $0.025/Gb/s by 2020

Goal: integrate ultra-dense photonic circuits with electronics
– Increase off-chip BW
– Allow on-chip optical routing and interconnect

[Figure: 3D integrated chip with logic, memory, and photonic planes; the photonic plane carries both off-chip optical signals and on-chip optical traffic. In the on-chip network, Core 1 feeds a serializer and N modulators at different wavelengths (electrical-to-optical conversion); the WDM bit-parallel message (1 Tbps aggregate BW over N parallel channels) crosses a WDM switch fabric; detectors and a deserializer (optical-to-electrical conversion) deliver it to Core N.]
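For a feel of what the 1 Tbps bit-parallel WDM message above implies for channel count, a tiny Python sketch; the per-wavelength line rates are illustrative assumptions, not values from the slide.

```python
# Channel count for a 1 Tb/s bit-parallel WDM message (illustrative only;
# the per-wavelength line rates below are assumptions, not from the slide).

AGGREGATE_GBPS = 1000.0   # 1 Tb/s aggregate message bandwidth

for per_lambda_gbps in (10, 25, 40):
    n_channels = AGGREGATE_GBPS / per_lambda_gbps
    print(f"{per_lambda_gbps:>2} Gb/s per wavelength -> "
          f"{n_channels:.0f} WDM channels (modulators) per message")
```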

Page 17: Optical Technologies for Data Communication in Large Parallel Systems

17

CMOS front end (FEOL) photonic integration for compatibility

Advantages: deeply scaled nanophotonics, best-quality silicon and lithography
• Highest-performance photonic devices
• Lowest-loss waveguides
• Lowest power consumption
• Most accurate passive λ control
• Densest integration with CMOS: same mask set, standard processing, same design environment (e.g., Cadence)

[Figure: photonics sharing the SOI silicon layer with the FET body (optical waveguide, FET, M1 metal over BOX); micrograph of a 6-channel WDM device, channels 1…6, 50 µm scale bar.]

Page 18: Optical Technologies for Data Communication in Large Parallel Systems

18

FEOL-integrated nanophotonic devices from the IBM Si Photonics Group (Y. Vlasov, S. Assefa, W. Green, F. Xia, F. Horst, …; www.research.ibm.com/photonics)

1. Tx: ultra-compact 10 Gbps modulator (Optics Express, Dec. 2007) – 10 Gbps; L_MZM = 200 µm

2. Rx: Ge waveguide photodetector (Optical Fiber Communications, March 2009) – 40 Gbps at 1 V; 8 fF capacitance [micrograph: Si waveguide, Ge PD]

3. Ultra-compact WDM multiplexers (Optics Express, May 2007; SPIE, March 2008) – temperature insensitive; 30x40 µm

4. High-throughput nanophotonic switch (Nature Photonics, April 2008) – error-free switching at 40 Gbps

Page 19: Optical Technologies for Data Communication in Large Parallel Systems

19

Broad optical bandwidth for thermal and process stability and efficient use of optical spectrum

B. G. Lee, A. Biberman, P. Dong, M. Lipson, K. Bergman, PTL 20, 767-769 (2008).
P. Dong, S. F. Preble, M. Lipson, Opt. Express 15, 9600-9605 (2007).
W. M. J. Green, M. J. Rooks, L. Sekaric, and Y. A. Vlasov, Opt. Express 15, 17106-17113 (2007).
Y. Vlasov, W. M. J. Green, and F. Xia, Nature Photonics 2, 242-246 (2008).
J. Van Campenhout, W. Green, S. Assefa, and Y. A. Vlasov, Opt. Express 17, 24020-24029 (2009).

• Resonant or frequency-sensitive devices:
  • Ring-resonator comb filter:
    – Very narrow filter bands
    – Sensitive to temperature variations
  • Conventional MZ device:
    – Wavelength-sensitive coupler
    – Wider filter band
    – Potential drawbacks:
      – Unused regions between bands, reducing spectral efficiency
      – Wavelengths restricted by the FSR varying with dispersion (see the sketch below)
      – Fabrication tolerances may demand post-fabrication trimming

• Broadband solution:
  – Design a filter with a single ultra-wide band, filled with multiple uniformly-spaced WDM channels
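To illustrate why a varying FSR restricts the usable wavelength grid, here is a small Python sketch of the ring-resonator free spectral range, FSR = λ²/(n_g·L); the ring radius and group-index values are illustrative numbers for a silicon wire waveguide, not taken from the slide.

```python
import math

# Free spectral range of a ring resonator: FSR = lambda^2 / (n_g * L).
# Radius and group-index values are illustrative for a Si wire waveguide,
# not taken from the slide; the point is that FSR shifts with n_g (dispersion).

def ring_fsr_nm(wavelength_nm: float, group_index: float, radius_um: float) -> float:
    wavelength_m = wavelength_nm * 1e-9
    round_trip_m = 2 * math.pi * radius_um * 1e-6
    return wavelength_m ** 2 / (group_index * round_trip_m) * 1e9   # back to nm

# The group index of a Si wire waveguide changes noticeably across the band:
for lam, ng in ((1530, 4.35), (1550, 4.30), (1570, 4.25)):
    print(f"lambda = {lam} nm, n_g = {ng}: FSR = {ring_fsr_nm(lam, ng, 5.0):.2f} nm")
```

The computed FSR drifts by more than a nanometer across this band, so a fixed, uniformly-spaced channel grid cannot line up with every resonance – which is what motivates the single ultra-wide-band filter above.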

Page 20: Optical Technologies for Data Communication in Large Parallel Systems

20

WIMZ Provides Large Bandwidth, Low Crosstalk, and a CMOS-Compatible Drive Voltage

The work at IBM has been partially supported by DARPA through the Advanced Photonic Switch (APS) program.

[Figure: measured transmittance (dB) vs. wavelength (µm) for the WIMZ switch and a reference MZ, showing ports T11 and T12 in the ON and OFF states; the WIMZ shows ~110 nm optical bandwidth with ~-18 dB crosstalk versus ~30 nm for the reference MZ. Drive conditions: V_ON = 1 V, I_ON = 3.5 mA, T = 23 °C.]

• Measured with a TE-polarized broadband LED and an OSA
• Normalized to the total OFF-state power in both outputs

[J. Van Campenhout et al., Optics Express 17 (26), 2009]

Page 21: Optical Technologies for Data Communication in Large Parallel Systems

21

Coupling Light to Si Photonics

Tapered glass waveguides match the pitch, cross-section, and NA of the chip.

A standard 250-µm-pitch SM/PM fiber array, aligned in a V-groove, is butt-coupled to the glass waveguides.

B. G. Lee, F. E. Doany, S. Assefa, W. M. J. Green, M. Yang, C. L. Schow, C. V. Jahnes, S. Zhang, J. Singer, V. I. Kopp, J. A. Kash, and Y. A. Vlasov, Proceedings of OFC 2010, paper PDPA4.

The work at IBM has been partially supported by DARPA through the Advanced Photonic Switch (APS) program.

Multichannel tapered coupler allows interfacing 250-µm-pitch PM fiber array with 20-µm-pitch silicon waveguide array

8-channel coupling demonstrated: < 1 dB optical loss at the interface, with uniform insertion loss and crosstalk

In collaboration with Chiral Photonics.

[Figure: chip edge with the WG coupler.]

Page 22: Optical Technologies for Data Communication in Large Parallel Systems

22

Power Issues

Exascale system example

• 0.2 B/Flop comm BW, 1 B/Flop memory BW

• 1 mW/Gb/s x 2x10^9 GF = 0.2 MW I/O power

• Typical I/O is 10+ mW/Gb/s … 2 MW of I/O power! Ouch!

Link Type               Power Efficiency (pJ/bit)   Distance   50 Tb/s Off-module I/O Power (W)
Electrical              3                           < 1 m      150
EOE                     7                           < 100 m    350
Silicon Nanophotonics   2 - 3                       km         100 - 150
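The off-module I/O power column follows directly from efficiency × bandwidth (1 pJ/bit at 1 Tb/s dissipates 1 W); a minimal Python sketch reproducing it:

```python
# Off-module I/O power = (pJ/bit) x (Tb/s): 1 pJ/bit at 1 Tb/s dissipates 1 W.
LINKS_PJ_PER_BIT = {
    "Electrical": (3, 3),
    "EOE": (7, 7),
    "Silicon Nanophotonics": (2, 3),
}
ESCAPE_TBPS = 50   # 50 Tb/s of off-module bandwidth, as in the table above

for name, (lo, hi) in LINKS_PJ_PER_BIT.items():
    p_lo, p_hi = lo * ESCAPE_TBPS, hi * ESCAPE_TBPS        # pJ/bit * Tb/s = W
    rng = f"{p_lo} W" if p_lo == p_hi else f"{p_lo} - {p_hi} W"
    print(f"{name:22s}: {rng} at {ESCAPE_TBPS} Tb/s")
```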

Power will ultimately limit module escape BW to <~ 25 – 50 Tb/s.

Reducing the power of Si nanophotonic devices is key…

[Figure: Electrical Link Power Efficiency Progress – mW/Gb/s vs. year, 2000-2010 (y-axis 0 to 45 mW/Gb/s).]

Page 23: Optical Technologies for Data Communication in Large Parallel Systems

23

Cost Analysis

• Cost comparisons need a system approach

• Design + BOM + Assembly + Test

• Comparing the same system optimized for electrical versus optical technology is difficult

• HPC system power and cost needs are clear:

[Figure: copper vs. optics cost comparison.]

Year   Peak Performance   (Bidi) Optical Bandwidth     Optics Power Efficiency (mW/Gb/s)   Optics Power Consumption   Cost ($/Gb/s, aggressive)   Optics Cost
2008   1 PF               0.012 PB/s (1.2x10^5 Gb/s)   100                                 0.010 MW                   10                          $1 M
2012   10 PF              1 PB/s (10^7 Gb/s)           60                                  0.50 MW                    1.0                         $9 M
2016   100 PF             20 PB/s (2x10^8 Gb/s)        10                                  2 MW                       0.2                         $30 M
2020   1000 PF (1 EF)     400 PB/s (4x10^9 Gb/s)       3                                   10 MW                      0.025                       $80 M

(Table after Benner, 2009.) At exascale, I/O consumes a significant share of the power and cost of the system…

Page 24: Optical Technologies for Data Communication in Large Parallel Systems

24

Summary

• Optical links provide escape bandwidth advantage

• Rack-to-rack links have been optical

• Off-card links are just hitting the BW wall (P7 IH)

• Off-module links will hit wall this decade

• Costly to exceed 25 Gb/s in dense electrical buses

• Power issues will dominate I/O bandwidth growth

• Electrical I/O power efficiency improving greatly

• EOE solutions provide distance advantage, but higher power

• Si nanophotonics solutions need focus to reduce I/O power

• Optics will see increasing use as electrical BW limits are approached

• Electrical solutions growing more costly

• Optical technology cost takedowns must continue for Exascale and general use

• Integrated Silicon Nanophotonics offers greatest promise to push BW

Page 25: Optical Technologies for Data Communication in Large Parallel Systems

25

References

1) Barcelona Supercomputing Center (BSC), Barcelona, Spain; IBM MareNostrum system installed in 2004.

2) T.J. Beukema, “Link Modeling Tool,” Challenges in Serial Electrical Interconnects, IEEE SSCS Seminar Fort Collins, CO, March 2007.

3) S. Rylov, S. Raynolds, D. Storaska, B. Floyd, M. Kapur, T. Zwick, S. Gowda, and M. Sorna, “10+ Gb/s 90-nm CMOS serial link demo in CBGA package,” IEEE Journal of Solid-State Circuits, vol. 40, no. 9 , pp. 1987-1991, Sept. 2005.

4) L. Shan, Y. Kwark, P. Pepeljugoski, M. Meghelli, T. Beukema, J. Trewhella, M. Ritter, “Design, analysis and experimental verification of an equalized 10 Gbps link,” DesignCon 2006.

5) J.F. Bulzacchelli, M. Meghelli, S.V. Rylov, W. Rhee, A.V. Rylyakov, H.A. Ainspan, B.D. Parker, M.P. Beakes, A. Chung, T.J. Beukema, P.K. Pepeljugoski, L. Shan, Y.H. Kwark, S. Gowda, and D.J. Friedman, “A 10-Gb/s 5-tap DFE/4-tap FFE transceiver in 90-nm CMOS technology,” IEEE Journal of Solid-State Circuits, vol. 41, no. 12, pp. 2885-2900, Dec. 2006.

6) D.G. Kam, D. R. Stauffer, T.J. Beukema and M. B. Ritter, “Performance comparison of CEI-25 signaling options and sensitivity analysis,” OIF Physical Link Layer (PLL) working group presentation, November 2007.

7) M.B. Ritter et al., "The Viability of 25 Gb/s On-board Signaling," 28th ECTC, May 2008.

8) F. Doany et al., "Measurement of optical dispersion in multimode polymer waveguides," LEOS Summer Topical Meetings, June 2004.

9) Alan Benner, "Optics in Servers – HPC Interconnect and Other Opportunities," IEEE Photonics Society Winter Topicals 2010, Photonics for Routing and Interconnects, January 11, 2010.

This work was supported in part by Defense Advanced Research Projects Agency under the contract numbers HR0011-06-C-0074, HR0011-07-9-002 and MDA972-03-0004.

