Page 1: Photonic Networks for Intra-Chip, Inter-Chip, and Box ...lightwave.ee.columbia.edu/files/Bergman_ECOC2006.pdf · Box Interconnects in High-Performance Computing ... 50:50 50:50 70:30

Photonic Networks for Intra-Chip, Inter-Chip, and

Box Interconnects in High-Performance Computing

Keren Bergman

Columbia University, Department of Electrical Engineering


Lightwave Research Laboratory, Columbia University

Table of Contents

1. Introduction
2. Large Scale Supercomputing Systems
3. Photonics Design Considerations
4. Interconnection Network Architectures
5. Implementations: OSMOSIS, Data Vortex
6. Off-Chip Bottlenecks
7. Photonic Network-on-Chip
8. Emerging Enabling Technologies
9. SPINet Design and Implementation
10. Intra-Chip Challenges
11. Future Directions and Opportunities


Top 10 Supercomputers (June 2006)

Rank | Site | Manuf. | Computer | Country | Processors | RPeak (GFlops) | Processor | Proc. Freq. (MHz) | System Family | Arch. | Interconnect
1 | DOE/NNSA/LLNL | IBM | eServer Blue Gene Solution | United States | 131072 | 367000 | PowerPC 440 | 700 | IBM BlueGene/L | MPP | Proprietary
2 | IBM Thomas J. Watson Research Center | IBM | eServer Blue Gene Solution | United States | 40960 | 114688 | PowerPC 440 | 700 | IBM BlueGene/L | MPP | Proprietary
3 | DOE/NNSA/LLNL | IBM | eServer pSeries p5 575 1.9 GHz | United States | 12208 | 92781 | POWER5 | 1900 | IBM pSeries | MPP | SP Switch Federation
4 | NASA/Ames Research Center/NAS | SGI | SGI Altix 1.5 GHz, Voltaire Infiniband | United States | 10160 | 60960 | Intel IA-64 Itanium 2 | 1500 | SGI Altix | MPP | Numalink/Infiniband
5 | Commissariat a l'Energie Atomique (CEA) | Bull SA | NovaScale 5160, Itanium2 1.6 GHz, Quadrics | France | 8704 | 55705.6 | Intel IA-64 Itanium 2 | 1600 | Bull SMP Cluster | Constellations | Quadrics
6 | Sandia National Laboratories | Dell | PowerEdge 1850, 3.6 GHz, Infiniband | United States | 9024 | 64972.8 | Intel EM64T Xeon EM64T | 3600 | Dell PowerEdge Cluster | Cluster | Infiniband
7 | GSIC Center, Tokyo Institute of Technology | NEC/Sun | Sun Fire X4600 Cluster, Opteron 2.4/2.6 GHz, Infiniband | Japan | 10368 | 49868.8 | AMD x86_64 Opteron Dual Core | 2400 | Sun Fire - Cluster | Cluster | Infiniband
8 | Forschungszentrum Juelich (FZJ) | IBM | eServer Blue Gene Solution | Germany | 16384 | 45875 | PowerPC 440 | 700 | IBM BlueGene/L | MPP | Proprietary
9 | Sandia National Laboratories | Cray Inc. | Red Storm Cray XT3, 2.0 GHz | United States | 10880 | 43520 | AMD x86_64 Opteron | 2000 | Cray XT3 | MPP | Cray XT3 Internal Interconnect
10 | The Earth Simulator Center | NEC | Earth-Simulator | Japan | 5120 | 40960 | NEC | 1000 | NEC Vector | MPP | Multi-stage crossbar


System Performance GFlops/Watt

Optimized performance/Watt:
• Performance/rack = performance/Watt × Watt/rack
• Watt/rack ~ constant for air cooled, ~20 kW
• Use low power, low frequency processor cores

Key system metric: peak Flops/total power
• Large number of moderate frequency processors requires EXTREME SCALING
• Network must scale in performance + packaging


IBM BG/L Interconnect
• System Peak 360 TFlops
• 65,536 (2^16) dual core
• 1024 dual core processor nodes per rack
  – 27.5 kW
  – ~0.25 GFlop/Watt
  – ~85% of inter-node connectivity
• Main compute interconnect: 3D torus (64x32x32)
• BW: 2.1 GB/s inter-node
• MPI Latency: > 2 µs (strong load dependence)


Cray XT3

• System Peak 43.5 TFlop
• Interconnect:
  – 3D torus
  – 6 switch ports per SeaStar, 7.6 GB/s each (45.6 GB/s)
• 10,880 compute PEs
• Interconnect bisection bandwidth 11.7 TB/s
• 64-bit AMD Opteron 100
• 96 dual core 2.6 GHz µproc
• 998 GFlops per cabinet
• 14.5 kW per cabinet: ~0.07 GFlops/Watt
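The performance-per-Watt figures on these slides follow from simple division. The sketch below is illustrative only: the function names are hypothetical, and the inputs are the per-cabinet and per-rack numbers quoted on the slides.

```python
def gflops_per_watt(peak_gflops: float, power_kw: float) -> float:
    """Peak GFlops divided by power in Watts."""
    return peak_gflops / (power_kw * 1000.0)


def gflops_per_rack(gflops_per_w: float, rack_kw: float) -> float:
    """Performance/rack = performance/Watt x Watt/rack."""
    return gflops_per_w * rack_kw * 1000.0


# Cray XT3 cabinet figures from the slide: 998 GFlops, 14.5 kW.
xt3_efficiency = gflops_per_watt(998.0, 14.5)

# BG/L rack figures from the slide: ~0.25 GFlop/Watt at 27.5 kW.
bgl_rack_gflops = gflops_per_rack(0.25, 27.5)

print(f"Cray XT3: {xt3_efficiency:.3f} GFlops/W per cabinet")
print(f"BG/L: {bgl_rack_gflops:.0f} GFlops per rack at the quoted efficiency")
```

The XT3 result rounds to the ~0.07 GFlops/Watt quoted on the slide.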


Bull SA NovaScale (Tera 10)

CEA: Commissariat à l’Énergie Atomique (France's Atomic Energy Authority)

Europe’s top Supercomputer
System Peak 55.7 TFlops
Interconnect:
• Quadrics QsNet/QsNetII
• 8-port ASIC routers
• Fat-tree topology
• Scalability: up to 4K nodes
• BW: 900 MBytes/s per node
• MPI Latency: ~2-3 µs


Box Interconnection Networks
• HPCS require interconnection networks to deliver
  – ultra-low latency message exchange
  – dynamic, bursty bandwidth
  – self-conflict resolving packet routing
  – capacity approaching Pbytes
  – port count scalability (>1k to 10k)
  – flexible packet sizes
  – small messages efficiency (GUPS)
  – High-bandwidth processing
  – Significantly lower power consumption
• Broader applications
  – optical interconnections for chips or chipsets (cost, footprint, power dissipation critical)
  – optical backplanes (cost, scalability)
  – high-capacity routers

[Figure label: OPS network]


Performance Trends & Photonics Opportunity

• Increases in performance of individual CPU chips will come from:
  – number of processor cores per chip
  – number of parallel functional units
• One of the most important features of a massively parallel supercomputer is the network that connects the processors together and allows:
  – machine to operate as a large coherent entity
• Interconnection network must SCALE in highly parallel system
• Address power consumption with scaling
• Scalability: bandwidth, latency, throughput
• Photonic Opportunity: bandwidth (WDM), throughput, power efficiency, latency


Box Interconnect: key metrics
• high-bandwidth low latency communication performance between nodes necessary to provide superb parallel efficiency for applications running on petaFLOPS-scale computing systems.
• Communication networks characterized by five critical performance metrics:
  – Message Latency (end-to-end message exchange)
  – Message Throughput (messages processed at node)
  – Message Bandwidth (exchanged message bandwidth)
  – Load/Store Bandwidth (load/store operations per second)
  – Bisectional Bandwidth (global network bandwidth)


Architectural Considerations

• Constraints:
  – O/E and E/O conversions expensive
  – nontrivial attenuation, signal degradation
  – no buffering (FDL only)
  – poor signal processing
• Features:
  – wavelength parallelism
  – high channel bandwidth

Leverage unique features of both photonics and electronics.


Architectural Considerations

Leverage unique features of both photonics and electronics.

• Optics:
  – ultrahigh-bandwidth transmission
  – speed of light latency
  – efficient propagation
• Electronics:
  – digital logic
  – signal processing
  – buffering


Figures of Merit

System:
• power consumption
• cost
• reliability
• serviceability

Network:
• acceptance rate
• throughput
• latency

Physical:
• transparency
• dynamic range
• stability


Architectural Foundations

[Figure labels: Photonic Interconnection Network; Electronic Control]


Architectural Foundations

• No buffering →
  – deflection routing
  – judicious I/O queuing
  – over-provisioning of paths
• high-bandwidth multiple-wavelength encoding
• low-latency electronic routing control
• Simplicity →
  – banyan routing
  – MIN topology
  – modularity


Switching Node Design

[Figure: switching node with ports labeled North, East, West, South; 70:30 and 50:50 couplers; λF and λH taps feeding O/E converters and control logic; two SOAs. Captions: multi-λ packet; wavelength-parallel transparent routing.]


Packet Structure

[Figure: multiple-wavelength packet. Timing labels: slot (25.7 ns), packet (22.4 ns), payload (19.3 ns), deadtime (3.3 ns), guardtime (1.6 ns). Wavelength channels: Frame, H0 to H3, P0 to P15.]
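As a rough check on the timing labels, the sketch below verifies that the slot equals packet plus deadtime and estimates how much of the slot carries payload. The constant names are hypothetical. The 10 Gb/s per-wavelength rate is an assumption taken from the experimental setup slide later in the deck, applied to the 16 payload channels P0 to P15.

```python
# Timing labels from the Packet Structure slide (nanoseconds).
SLOT_NS = 25.7
PACKET_NS = 22.4
PAYLOAD_NS = 19.3
DEADTIME_NS = 3.3

# Assumptions: 16 payload wavelengths (P0..P15), each modulated at 10 Gb/s.
PAYLOAD_CHANNELS = 16
CHANNEL_RATE_GBPS = 10.0

# Slot = packet + deadtime.
slot_from_parts_ns = PACKET_NS + DEADTIME_NS

# ns * Gb/s = bits, so this is the payload carried per slot.
payload_bits_per_slot = PAYLOAD_NS * CHANNEL_RATE_GBPS * PAYLOAD_CHANNELS

# Peak rate while payload is on the line, and average over the full slot.
peak_rate_gbps = CHANNEL_RATE_GBPS * PAYLOAD_CHANNELS
effective_rate_gbps = payload_bits_per_slot / SLOT_NS
payload_fraction = PAYLOAD_NS / SLOT_NS

print(f"slot from parts: {slot_from_parts_ns:.1f} ns")
print(f"payload per slot: {payload_bits_per_slot:.0f} bits")
print(f"peak {peak_rate_gbps:.0f} Gb/s, effective {effective_rate_gbps:.1f} Gb/s")
print(f"payload fraction of slot: {payload_fraction:.2f}")
```

Under these assumptions the peak rate matches the 160 Gbps per port quoted for the experimental system, while the slot-averaged payload rate is lower because of deadtime and guard times.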


Implemented Switching Node


Packet Structure

[Figure: packet spectrum, average power (dBm) versus wavelength (nm), with channels labeled H2, H3, H1, H0, F.]


Multi-wavelength Switch Block

Truly broadband switching of multi-wavelength packets using a single switch

[Figure panels: Single Wavelength Switch; Multi-Wavelength Switch]

P_dissipated (single wavelength) = P_dissipated (multi-wavelength)


Topologies

simple banyan (e.g. omega)

n = ½ N log N


Topologies

Clos network (e.g. Beneš network)

n = ½ N (2 log N–1)


Topologies

augmented banyan (e.g. omega)

n = ½ N (log N+K)


Topologies

Data Vortex (cyclic butterfly)

input nodes

output nodes

n_{1×2} = A N (log N + 1)
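The element-count formulas on the Topologies slides can be compared numerically. This is a minimal sketch with hypothetical function names. It assumes base-2 logarithms and power-of-two sizes, reads K as the number of extra stages in the augmented banyan, and takes A and N in the Data Vortex formula exactly as written on the slide.

```python
def _log2_exact(n: int) -> int:
    """Return log2(n) for a power of two n >= 2, else raise ValueError."""
    if n < 2 or n & (n - 1):
        raise ValueError("size must be a power of two >= 2")
    return n.bit_length() - 1


def banyan_elements(n_ports: int) -> int:
    """n = 1/2 N log N (simple banyan, e.g. omega)."""
    return n_ports // 2 * _log2_exact(n_ports)


def benes_elements(n_ports: int) -> int:
    """n = 1/2 N (2 log N - 1) (Benes network)."""
    return n_ports // 2 * (2 * _log2_exact(n_ports) - 1)


def augmented_banyan_elements(n_ports: int, k: int) -> int:
    """n = 1/2 N (log N + K) (augmented banyan)."""
    return n_ports // 2 * (_log2_exact(n_ports) + k)


def data_vortex_nodes(a: int, n: int) -> int:
    """n_{1x2} = A N (log N + 1) (Data Vortex), A and N as on the slide."""
    return a * n * (_log2_exact(n) + 1)


for ports in (8, 64, 1024):
    print(ports, banyan_elements(ports), benes_elements(ports),
          augmented_banyan_elements(ports, 2))
print("Data Vortex with A=3, N=4:", data_vortex_nodes(3, 4))
```

Note that the first three formulas count 2×2 elements, whereas the Data Vortex formula counts 1×2 nodes, so the totals are not directly comparable.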


Data Vortex Topology

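[Figure: routing example from an input node to output node [1110]. Node labels along the path: (0,0,0), (0,2,1), (1,2,2), (1,3,0), (2,3,1). Address labels: [0xxx], [1xxx], [10xx], [11xx], [1110].]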


Data Vortex Topology

[Figure: node with ports labeled west, east, north, south, shown between an input node and an output node; deflection structure.]


System Implementation

DV Experimental System
• 12×12 switch
• ~100 ns routing latency
• 160 Gbps per port
• Terabit capacity


System Implementation


Experimental Setup

[Figure: experimental setup. Labels: PPG, DTG, BERT, EDFA, modulators (mod) for wavelength channels F, H0 to H3 and P0 to P15, gating SOA, booster SOA, Rx, 10 Gbps, ~39 Mbps, 5-node path, 12×12 Data Vortex Network.]


Routing Demonstration

[Figure: oscilloscope traces (50 ns/div) for channels P, F, H0, H1, H2, H3. Address labels: 0001, 0010, 0011, 0101, 0110, 0111, 1001, 1010, 1011, 1101, 1110, 1111.]

7 hops ≈ 160 ns

3 hops ≈ 60 ns


Deflection Demonstration

[Figure: oscilloscope traces (20 ns/div) for channels F, H0, H1, H2, H3. Labels: input #7, input #4, address 1101, +3 hop deflection.]


Error-Free Transmission


OSNR Degradation

[Figure: spectrum with channels labeled H2, H3, H1, H0, F.]


Dynamic Power Range


Bergman, ECOC ’06

Just-in-Time Optical Cell Switching
• Fast optical switch for fixed size data packets (cells)
• Transparent data path with multiple cells in flight
• Out of band electronic control path
• Just-in-time switching as cells arrive

[Figure: Tx nodes (with electrical VOQ buffers), Just-in-Time Optical Switching, Rx nodes; traces of control and switched 40G optical cells, 5 ns/div.]


Bufferless Crossbar Design: Implemented via 2 Stage Broadcast and Select Architecture

[Figure: 40 Gb/s transmitters (#0a,b to #63a,b) with laser-integrated modulators, 8x1 combiners, EDFAs, 1x128 star couplers, fast SOA fiber-selector gates, fast SOA color-selector gates, muxes, and 40 Gb/s packet receivers.]

• Optical gain: Semiconductor Optical Amplifier on-off gates
• High sensitivity receivers
• S: Multiple fibers (8 scaling to 40+)
• λ: Multiple colors per fiber (8 scaling to 100+)
• T: Switching time (~2 ns scaling to <0.1 ns)
• High bit rates (40G scaling to 100G+)


OSMOSIS demonstrator prototype

[Figure labels: Optical Switch; Source/Sink Cluster Proxies + I/O Adapters; Arbiter prototype; Management GUI]


Error-free Cell Transmission with Wide Dynamic Range: 20 Mcell/s, 64 ports, 40 Gb/s per port

[Figure: switched cell trace (20 ns/div) showing 5 dB cell-cell dynamic range with error-free recovery, +2.0 dBm; 40 Gb/s eye diagram (25 ps).]

[Figure: programmable cell structure (500 ps/div). Labels: Packet 1, Data, Post-amble, Inter-packet gap, Preamble, Data, Packet 2.]

[Figure: BER (1E-12 to 0.01) versus received power (dBm, -14 to 6) for OSNR ~45 dB, 29.8 dB, 28.8 dB, 27.8 dB, 26.8 dB. Annotation: 7.5 dB @ 28 dB OSNR sensitivity.]

High sensitivity receiver for scaling


Measured Performance Summary
• Channel Performance:
  – Data path bit rate: 40 Gbit/sec/port, 64 ports
  – Control path bit rate: 2.5 Gbit/sec/port
  – Received OSNR: >35 dB
  – Data cell size: 2048 bits
  – Data cell structure: fully programmable (via FPGA)
  – Latency: <500 ns
  – Efficiency: 75%
  – Bit Error Rate: <10^-14 switched, uncorrected
    • Correctable by FEC and protocol to <10^-21 BER
• System Performance:
  – Switch Size: 64 ports at initial implementation
  – Out of band control channel at 20 Megacell/sec/port
  – Switching at every cell boundary under full load
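The summary figures can be cross-checked with a few lines of arithmetic. The sketch below uses hypothetical constant names. It assumes that "Efficiency: 75%" means the fraction of line rate available to user data, which the slide does not define.

```python
# Figures quoted in the Measured Performance Summary.
CELL_BITS = 2048
PORT_RATE_GBPS = 40.0
PORTS = 64
EFFICIENCY = 0.75  # assumed to mean user-data fraction of the line rate

# bits / (Gb/s) = ns: duration of one cell on the data path.
cell_time_ns = CELL_BITS / PORT_RATE_GBPS

# Cells per second per port if cells are sent back to back.
cell_rate_mcells = 1e3 / cell_time_ns

# Aggregate line rate and the assumed user-data share of it.
aggregate_tbps = PORTS * PORT_RATE_GBPS / 1000.0
effective_tbps = aggregate_tbps * EFFICIENCY

print(f"cell time: {cell_time_ns:.1f} ns")
print(f"cell rate: {cell_rate_mcells:.1f} Mcell/s per port")
print(f"aggregate: {aggregate_tbps:.2f} Tb/s, effective: {effective_tbps:.2f} Tb/s")
```

The back-to-back cell rate comes out close to the 20 Megacell/sec/port quoted for the control channel.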


Semiconductor Optical Amplifier Switches
• Switches entire data cells
• Fast switching: 0.1 – 2 nanoseconds
• Inherent gain (~20 dB)
• High on-off ratio (>45 dB)
• Low polarization sensitivity (<0.6 dB)
• Low noise figure (<6.5 dB)
• Broadband & WDM friendly (>80 nm)
• Monolithically integratable
• Future ultra-fast all-optical capability

[Figure: electrically switched SOA at 1 GHz (monolithic SOA array); optically switched SOA at 80 GHz (discrete SOA).]

[Figure: Q factor (dB, 12 to 19) versus total power into SOA (dBm, -20 to 10), 8 channels. Annotations: 10^-12 BER, 20 dB dynamic range, 8x40 Gb/s capacity.]


Towards Commercialization: Optical integration provides 10-30X benefit

[Figure: Feasibility Demonstrator (dimension labels 450 mm, 450 mm) and Integrated Prototype (InP 8x1 combiner, monolithic SOA array, monolithic optical interface, silicon arrayed waveguide); 10 Tbit/sec per shelf.]

Metric | Discrete Devices | Integrated | Benefit
Power (W) | ~250 | ~8 | >30X
Complexity (parts) | ~2000 | ~100 | 20X
Size (sq. m) | ~0.2 | ~0.015 | >10X


Off-Chip Interconnects

Inter-Chip Interconnects (chip-to-chip)

• Current challenges
• Photonic Networks-on-Chip
• Design Considerations
• SPINet


International Technology Roadmap for Semiconductors

CPU/off chip bandwidth performance gap
High performance computing systems
Distributed Shared-Memory (DSM) Microprocessors

• Shared address space by physically distributing memory among many processors

• Fundamental DSM communications bottleneck: remote memory access latency

• Emerging performance gap between CPU bandwidth and off-chip clock rates; fundamental limits reached on multi-GHz electronic signaling (power dissipation)


Photonic Integrated Networks

• But: Packet size doesn’t scale down
  – Typical packet > 10 ns ≈ 2 m (silica fiber)
• Large Scale Photonic (O/E) Integration
  – Break the optical cost barrier
  – Very low power dissipation
• Novel, buffer-less architecture, using transparent lightpaths for acknowledge echo

Message size ~ 0.1 to 1 meter
NoC size ~ 100 µm to 1 cm

[Figure labels: Message head, Message tail]
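The "10 ns ≈ 2 m" figure is the physical length a packet occupies while propagating. A minimal sketch follows, with a hypothetical function name and an assumed group index of about 1.47 for silica fiber (the slide does not give an index):

```python
C_VACUUM_M_PER_S = 299_792_458.0
SILICA_GROUP_INDEX = 1.47  # assumed typical value for silica fiber


def packet_length_m(duration_ns: float,
                    group_index: float = SILICA_GROUP_INDEX) -> float:
    """Spatial extent of a packet of the given duration inside the medium."""
    return C_VACUUM_M_PER_S * duration_ns * 1e-9 / group_index


length_10ns = packet_length_m(10.0)
noc_size_m = 0.01  # upper end of the NoC size range on the slide (1 cm)

print(f"10 ns packet: {length_10ns:.2f} m in silica fiber")
print(f"ratio to a 1 cm NoC: {length_10ns / noc_size_m:.0f}x")
```

A message is therefore far longer than the network it crosses, so its head can reach the destination while its tail is still at the source, which is what the head and tail labels on the slide illustrate.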


Rationale for Integrated OIN

• 64-port network-on-chip with >4 Tbit/sec
• MPI Latency 100 ns range
• Power dissipation 10X-100X below current electronic interconnect networks that deliver a fraction of the throughput bandwidth
• Integration of universal programmable 2x2 multi-wavelength switching building block


Programmable Multi-Wavelength Switching Building Block

[Figure: 2×2 switch with inputs in0, in1 and outputs out0, out1; 70:30 and 50:50 couplers; wavelength taps λF0, λA0, λF1, λA1 feeding O/E converters and a CPLD; SOA gates.]


Prototype Switching Node

[Figure: 2×2 switching node with inputs in0, in1 and outputs out0, out1; 70:30 and 50:50 couplers; wavelength taps λF0, λA0, λF1, λA1 feeding O/E converters and control logic; SOA gates.]

Six switching states:
• interchange
• straight
• upper straight
• upper interchange
• lower straight
• lower interchange


SPINet: Scalable Photonic Integrated Network

• Ultra-broadband: each message encompasses entire WDM bandwidth
• Multi-wavelength 2×2 switch elements
• Simple WDM address, on-the-fly single bit routing, self de-conflicting, no buffers, low latency
• Instantaneous lightpaths every time slot
• Contentions resolved by dropping
• Physical layer acknowledgements

Multistage interconnection network (MIN)
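The combination of single-bit routing per stage, no buffers, and contention resolved by dropping can be illustrated with a small slot-level simulation. This is only a sketch under stated assumptions: the slides do not specify the stage wiring, so an omega (perfect shuffle) network is used as a stand-in, the function names are hypothetical, and the rule that the message on the lower-numbered line wins a contention is arbitrary. Outcomes on the actual SPINet topology may differ.

```python
def perfect_shuffle(pos: int, n_bits: int) -> int:
    """Rotate the n_bits-wide line index left by one bit."""
    msb = (pos >> (n_bits - 1)) & 1
    return ((pos << 1) & ((1 << n_bits) - 1)) | msb


def route_slot(messages, n_ports: int = 8):
    """Route (src, dst) pairs through a bufferless omega network in one slot.

    Each stage examines one destination bit, most significant first.
    When two messages request the same switch output, the later one
    (higher line index) is dropped. Returns (delivered, dropped).
    """
    n_bits = n_ports.bit_length() - 1
    in_flight = {src: (src, dst) for src, dst in messages}
    dropped = []
    for stage in range(n_bits):
        next_flight = {}
        for pos in sorted(in_flight):
            src, dst = in_flight[pos]
            out_bit = (dst >> (n_bits - 1 - stage)) & 1
            out_pos = (perfect_shuffle(pos, n_bits) & ~1) | out_bit
            if out_pos in next_flight:
                dropped.append((src, dst))  # contention: no buffer, so drop
            else:
                next_flight[out_pos] = (src, dst)
        in_flight = next_flight
    # After the last stage every surviving message sits on line dst,
    # which is where a physical-layer acknowledgement would originate.
    return sorted(in_flight.values()), dropped


delivered, dropped = route_slot([(0, 1), (2, 1), (3, 5), (5, 4), (7, 3)])
print("delivered:", delivered)
print("dropped (sources would retransmit):", dropped)
```

In this sketch a source learns of a drop only by the absence of an acknowledgement, so any retransmission policy would live at the network edge rather than inside the switch fabric.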


Demonstration

[Figure: 8-port, three-stage network with switch elements 0.0 to 0.3, 1.0 to 1.3, 2.0 to 2.3 and ports 0 to 7 on each side. Messages: 0 1; 2 1; 3 5; 5 4; 7 3. Ports 1, 3, 4, 5 highlighted. Captions: Acks sent; Paths torn down.]


Acceptance Rate

64-port network

Average BW/port: 0.25 · 0.83 · 320 Gb/s ≈ 64 Gb/s

Network BW: 64 Gb/s · 64 ≈ 4 Tb/s!
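The bandwidth estimate on this slide is a product of three factors. In the sketch below the constant names are hypothetical, and reading 0.25 as offered load and 0.83 as acceptance rate is an assumption based on the slide title and the load value quoted on the Latency slide.

```python
OFFERED_LOAD = 0.25           # assumed meaning of the first factor
ACCEPTANCE_RATE = 0.83        # assumed meaning of the second factor
PORT_LINE_RATE_GBPS = 320.0   # per-port line rate from the slide
PORTS = 64

avg_bw_per_port_gbps = OFFERED_LOAD * ACCEPTANCE_RATE * PORT_LINE_RATE_GBPS
network_bw_tbps = avg_bw_per_port_gbps * PORTS / 1000.0

print(f"average BW/port: {avg_bw_per_port_gbps:.1f} Gb/s")
print(f"network BW: {network_bw_tbps:.2f} Tb/s")
```

The exact product is about 66 Gb/s per port, which the slide rounds to 64 Gb/s; either way the total is roughly 4 Tb/s for 64 ports.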


Latency

64-port network

Mean Queuing latency (load=0.25): 0.46 slots


4 node experimental implementation

• Optical waveforms of signals at network’s input and output ports
• Demonstrated correct routing and contention resolution between optical packets


Intra-Chip Interconnects (on-chip)

• Current on-chip interconnect challenges
• Emergence of Multi-Cores
• Photonic Opportunities
• Design Considerations
• Intra-Chip Networking


Paradigm shift in high-performance processor chip design

• Before: exponential performance acceleration with each generation via increased clock frequencies and integration densities

• As clock frequencies rise, the fraction of the chip reachable in a single clock cycle decreases at the same exponential rate

• Diminishing returns: increasing processor frequency increases instruction execution latencies, so performance can degrade

• Now: Designers are limited not by the number of transistors integrated on a single die but by the logic reachable within one clock cycle

• Power dissipation and optimization of performance per Watt are driving the trend toward multi-core parallel processors

• High-performance processor chips are distributed systems

• Evolution from computation towards communication-bound design


Critical Roadblocks and the Photonics Opportunity

• On-chip communications latency
  – Time-of-flight: RC delay worse with each process generation
  – “The intrinsic interconnect delay of a 1-mm interconnect for a 35-nm technology will be longer than the MOSFET switching delay by two orders of magnitude” [Davis et al., IEEE Proc. ’01]
  – Optical signal velocity independent of data rate
  – Serialization latency: optical TDM compression
  – Queueing latency: bufferless interconnection network with guaranteed queueing-free paths for latency sensitive packets
  – Latency insensitive design (LID): EDA tools
• Exacerbated growth in power dissipation
  – Propagation power dissipation independent of optical signal rate
  – Power efficient design in photonic switching


On-chip interconnect latency

[Figure: percentage of reachable die (0 to 100) versus technology node (250, 180, 130, 100, 80, 60 nm), with curves for 1, 2, 4, 8, and 16 cycles.]

• “For a 60-nanometer process a signal can reach only 5% of the die’s length in a clock cycle” [D. Matzke (Texas Instruments), IEEE Computer Sept. 97]
• Shift from function-centric to communication-centric design


Processor Chips Become Distributed Systems

• Interconnect Latency
  – Interconnect delays can be an order of magnitude larger than switching delays
  – Hard to estimate because affected by many phenomena
    • process variations, cross-talk, power-supply drop variations
  – Breaks the synchronous assumption
    • that lies at the basis of design automation tool flows

Local (scaled-length) wires
• span a fixed number of gates, scale well together with logic

Global (fixed-length) wires
• span a fixed fraction of a die, do not scale

[Figure: wire scaling, Ho et al., 2001]


Interconnect Power Dissipation
• Power dissipation is arguably the most critical problem in high-performance chip design
• Over the last two decades microprocessor power dissipation has grown exponentially, with the primary contribution from interconnects

[Figure: Power (Watts) versus Year, Horowitz et al., 2005]

Interconnect responsible for 50% of dynamic power dissipation [Magen et al., 2004]


The Rise of Multi-Core Architectures
• Rise of parallel multi-core architectures to mitigate power dissipation
• Parallel architectures with multiple simpler processing cores provide better performance per watt than architectures based on a single complex processor
• State-of-the-art commercial chips feature more parallel and distributed architectures that are essentially multi-core chips
  – Montecito (Intel)
  – Cell (IBM, Toshiba, Sony)
• Key is to design robust, scalable, fast, and power-efficient intra-chip communication networks


Optical interconnection networks-on-chip
• Photonic intra-chip interconnection networks create potentially disruptive technology:
  – Ultra-high throughput
  – Minimal access latencies
  – Low power dissipation, independent of capacity
• Globally shared optical network, regular topology
  – Local electronic interconnect
  – Electronic computation
• Architecture, data routing designed for photonics:
  – Optical buffering not practical on chip
  – No significant processing in optical domain


Manhattan Street Network (MSN) Optical Interconnection

• Regular shared topology:
– Torus replaced by mesh
– Dense grid of unidirectional waveguides
– Adjacent waveguides are directed in opposite directions (like Manhattan's streets and avenues)
• Simple 2x2 switch elements:
– 2-state operation
– No buffering
– Routing logic in parallel electronic control plane
– Power dissipation during transitions
• Tx/Rx pair for major on-chip modules (processor cores) at specific grid addresses
• Asynchronous operation:
– Modules transmit/receive packets
– Very simple photonic switching elements
– Asynchronous messages stretch over several switching elements (SEs)
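The alternating waveguide directions of the MSN grid can be sketched in a few lines of Python. This is only an illustration: the names `Direction`, `row_direction`, `column_direction`, and `switch_outputs` are hypothetical, and the convention that even rows run east and even columns run south is an assumption, not something the slides specify.

```python
from enum import Enum


class Direction(Enum):
    EAST = (1, 0)
    WEST = (-1, 0)
    SOUTH = (0, 1)
    NORTH = (0, -1)


def row_direction(y: int) -> Direction:
    # Assumed convention: even rows carry light eastward, odd rows westward,
    # so adjacent horizontal waveguides point in opposite directions.
    return Direction.EAST if y % 2 == 0 else Direction.WEST


def column_direction(x: int) -> Direction:
    # Assumed convention: even columns run south, odd columns run north.
    return Direction.SOUTH if x % 2 == 0 else Direction.NORTH


def switch_outputs(x: int, y: int) -> tuple:
    # Each intersection hosts a 2x2 switch: one row output, one column output.
    return row_direction(y), column_direction(x)
```

Under these assumptions, the switch at grid address (1, 1) forwards light either west or north, while its neighbor at (0, 0) forwards east or south.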


Asynchronous electronic/optical master/slave routing

• Optical interconnection network functions in synergy with an electronic control plane that mimics the photonic network topology

• Parallel electronic control network:
– Control packets exchanged to provide path setup/release requests and acknowledgments
– Employ path diversity, deflect around used paths
– Electronic router controlling every photonic switching element (PSE) at every intersection of the MSN grid


Routing and Data Flow Control

• Setup of paths accomplished as in optical burst switching:
– Source wants to send a message
– Sends an electronic path setup packet (PSP)
  • Encodes several fields, including control packet type, destination address (X and Y coordinates), priority, and source address
– Electronic PSP travels through the parallel control network, setting up routers and PSEs on its way
– No buffering takes place at any point
– Each router has only two inputs and two outputs, so an available output port (for deflection) always exists
• Decision at every router computed by comparing the coordinates of the router/PSE with the destination coordinates of the packet
• Message payload transmission in the optical domain immediately follows, as a comet tail, the electronic PSP that sets up the path
• After the optical payload transmission ends, a path release packet (PRP) is sent to reset all the routers and PSEs
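A minimal Python sketch of the per-router decision described above, under stated assumptions. The field list of the control packet follows the slide, but the class and function names (`ControlPacket`, `choose_output`), the port labels `"row"` and `"col"`, and the tie-breaking order are hypothetical. `row_step` and `col_step` (+1 or -1) stand for the direction of the two unidirectional waveguides leaving the router.

```python
from dataclasses import dataclass
from enum import Enum


class PacketType(Enum):
    SETUP = "PSP"    # path setup packet
    RELEASE = "PRP"  # path release packet


@dataclass(frozen=True)
class ControlPacket:
    kind: PacketType
    dest_x: int
    dest_y: int
    priority: int
    src_x: int
    src_y: int


def choose_output(router_x: int, router_y: int,
                  row_step: int, col_step: int,
                  packet: ControlPacket,
                  row_free: bool = True, col_free: bool = True) -> str:
    """Pick the 'row' or 'col' output for a PSP; no buffering is modeled."""
    dx = packet.dest_x - router_x
    dy = packet.dest_y - router_y
    # A port is productive if its waveguide direction reduces the offset.
    productive = {"row": dx * row_step > 0, "col": dy * col_step > 0}
    free = {"row": row_free, "col": col_free}
    candidates = [port for port in ("row", "col") if free[port]]
    if not candidates:
        # Not expected for a 2x2 router whose other input holds at most one port.
        raise ValueError("no free output port")
    for port in candidates:
        if productive[port]:
            return port
    # No free productive port: deflect onto whatever is available.
    return candidates[0]
```

For example, a PSP at router (0, 0) heading for (3, 0) takes the row output when it is free and is deflected onto the column output when the row is already held by another lightpath.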


[Figure: TX/RX modules on the MSN grid. An electronic signal detects a used path and alters its route; if paths overlap, one must be rerouted.]


Throughput, Latency, Dynamic Programmability

• Packet payload transparency enables enormous scaling in capacity
– Dynamic support of variable packet sizes
– A message can extend over many PSEs, creating lightpath circuits
• Asynchronous design, shared multiple global communications
• Heavily path-diversified network
• Different algorithms (X-first, Y-first, mixture) used to balance load and avoid local congestion
• Programmable fields in the electronic PSP enable multiple classes of service
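The difference between X-first and Y-first ordering can be illustrated with a short Python sketch. It is deliberately simplified: it ignores the unidirectional waveguides and deflections of the MSN and walks a plain grid. The function names `order_for` and `route` are hypothetical, and reading "mixture" as alternating the order per packet is an assumption, since the slides do not define it.

```python
def order_for(policy: str, packet_index: int = 0) -> tuple:
    """Return the axis order used when routing one packet."""
    if policy == "x-first":
        return ("x", "y")
    if policy == "y-first":
        return ("y", "x")
    if policy == "mixture":
        # Assumed interpretation: alternate the order from packet to packet.
        return ("x", "y") if packet_index % 2 == 0 else ("y", "x")
    raise ValueError(f"unknown policy: {policy}")


def route(src: tuple, dst: tuple, order: tuple) -> list:
    """List the grid points visited after src, one axis at a time."""
    x, y = src
    hops = []
    for axis in order:
        if axis == "x":
            while x != dst[0]:
                x += 1 if dst[0] > x else -1
                hops.append((x, y))
        else:
            while y != dst[1]:
                y += 1 if dst[1] > y else -1
                hops.append((x, y))
    return hops
```

Two packets between the same endpoints then traverse disjoint intermediate points, which is the load-balancing effect the slide refers to: from (0, 0) to (2, 1), X-first passes through (1, 0) and (2, 0), while Y-first passes through (0, 1) and (1, 1).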


Asynchronous, Bufferless MSN

• Asynchronous bufferless network, based on Manhattan Street Network topology – deflection routing

• Can provide a guaranteed queueing-free path for latency-sensitive signals

• Path diversity to reduce load
– Differentiated services to different classes

• Latency guarantees verified by simulations


Differentiated Classes of Service

• Provides guaranteed queueing-free path for latency-sensitive signals
• Different classes of service defined:
– Real-time signaling: dedicated path, preempting any other traffic
  • path re-used to route other traffic when not used by the real-time signal
– Guaranteed bandwidth and CBR (constant bit rate)
  • some paths designed to be time-multiplexed with long-lasting connections that require a guaranteed bit rate
– Best effort: for non-latency-sensitive applications
  • the vast bandwidth offered by the network
• Latency addressed by: fast propagation velocity, path diversity, and the complete avoidance of buffering
• Can secure deflection-free, buffering-free paths for a small number of high-priority signals
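One way to picture how the priority field could separate these classes at a contested output port is the Python sketch below. It is an assumption-laden illustration, not the arbitration logic of the design: the names `ServiceClass` and `arbitrate`, the numeric ordering of the classes, and the three outcomes are hypothetical. Only the rule that real-time signaling preempts other traffic comes from the slide.

```python
from enum import IntEnum
from typing import Optional


class ServiceClass(IntEnum):
    BEST_EFFORT = 0
    GUARANTEED_BANDWIDTH = 1
    REAL_TIME = 2


def arbitrate(holder: Optional[ServiceClass], requester: ServiceClass) -> str:
    """Decide what happens when `requester` wants a port held by `holder`."""
    if holder is None:
        return "grant"      # port is idle
    if requester is ServiceClass.REAL_TIME and holder < requester:
        return "preempt"    # real-time signaling preempts other traffic
    return "deflect"        # bufferless network: the loser is deflected
```

In this sketch a best-effort request that finds the port occupied is always deflected, while a real-time request takes the port from any lower class.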


Initial Simulation Results


Summary and Conclusions

Multiple opportunities for insertion of photonics, leveraging unique features of both photonics and electronics:

• unique architectures to allow for synergy
• multiple-wavelength transmission to maximize bandwidth
• transparent optical pathways
• power efficiency
• design for high acceptance rates
• design enabling technologies in Si photonics for network
• simplicity, repeatability since photonics are still nascent



