Photonic Technologies for Datacom – Optical Communication Networks
Evolution of Optical Links
As distances go down, the number of links goes up, putting pressure on power efficiency, density, and cost
Why Optical Interconnects?
Beat copper in the bandwidth × distance product
Can further improve TCO, power, and density
Optical Interconnects Applications: Consumer Electronics
Intel, in collaboration with Apple, introduced Thunderbolt technology in 2011. It was originally named Light Peak and targeted optical connection of external peripherals to a computer.
Optical Thunderbolt cables were introduced in mid-April 2012 by Sumitomo Electric Industries.
Current operating speed: 20 Gb/s (Gen 2)
Cost: €230 for a 10 m cable (Corning)
Optical Interconnects Applications: Datacenters – the heart of a content-centric internet
Data center
on-premise hardware that stores data within an organization's local network
Cloud data center
an off-premise form of computing that stores data on the Internet
Global data center IP traffic will reach 554 exabytes per month in 2016 (up from 146 exabytes per month in 2011)
By 2016, nearly two-thirds of all workloads will be processed in the cloud
Datacenter Applications and Types: Applications (commercial & consumer)
handle the core business and operational data of the organization (ERP, CRM)
multiple services & users (multi-tenancy): databases, file servers, application servers, remote data storage etc.
high performance computing (HPC)
personal content locker
Datacenter Key Considerations: Mechanical engineering – save space and costs
Modularity and flexibility
Technology infrastructure design - IT
Environmental control
Electrical Power
Security
Data center in a box
self-contained computing facility that is manufactured in a factory and shipped to a location in a container
modular unit, can be used for assembly of larger structures
A bank of batteries supplies power until the diesel generators can start
Datacenter Hierarchical Architecture
Optical interconnects (now)
Electrical interconnects (now)
Optical interconnects (>2020)
(Figure: east-west vs. north-south traffic directions across the datacenter hierarchy)
Datacenter Hierarchy Levels
Rack-to-Rack
Board-to-board
On-board
On-chip
Rack-to-Rack
Active Optical Cable (AOC)
Broad commercial uptake
Arrays of optical transceivers
Total throughput up to 400 Gb/s
AOC: What’s inside?
(Figure: AOC transceiver cross-section – ground/signal/ground pads, VCSEL anode and cathode, fiber, TSVs, driver IC, TX/RX array, and an on-chip taper)
AOC commercial state of the art: VCSEL – TE Connectivity 400 Gb/s zCD AOC
16x28 Gb/s over 100 m MMF
850 nm VCSEL
zCD interface compatible to CDFP MSA
Silicon photonics – Molex (Luxtera) 400 Gb/s zCD AOC
16x28 Gb/s over 4 km SMF
1550 nm lasers & silicon photonics
zCD interface compatible to CDFP MSA
AOC Block Diagram
CTLE: continuous-time linear equalizer
CML: current mode logic
CDR: clock and data recovery
(Block diagram: the electrical signal enters from the host SERDES and exits at the far-end host connector)
Digital & analogue ICs
Photonic chips
Packaging dominates cost
Electronics dominate power consumption
AOC challenges: front panel density -> aggregate bandwidth
Transition from QSFP to CDFP yields over 20% upgrade in front-panel capacity
CDFP (400 Gb/s per AOC): 11 ports, 4.4 Tb/s
QSFP (100 Gb/s per AOC): 32 ports, 3.6 Tb/s
AOC challenges: form factor evolution -> aggregate bandwidth
Example: how do we get to 400 Gb/s per AOC?
Evolutionary steps followed by the industry in AOC development
Increase bitrate per lane – but with less power!
Reduce form factor -> increase front panel density
AOC challenges: scaling the rate
compromised electrical signal integrity
more equalization -> higher power
Potential solution: mid-board modules
bonus: higher bandwidth density per panel area
what about reliability? technology? assembly efforts?
(Figure: board with memory and processor)
Mid-Board Modules
SNAP12 module
12x10 Gb/s over 300 m MMF
850 nm VCSELs
MTO/MTP fiber ribbon
BGA attachment on a MEG-array connector (works up to 28 Gb/s)
signal conditioning for higher speeds
Mid-Board Modules
IBM holey optochip
24x12.5 Gb/s over MMF, demonstrated up to 36 Gb/s
850 nm VCSELs AND photodiodes
PGA electrical interface on CoreEZ PCB
Avago MicroPOD
12x14 Gb/s over MMF
850 nm VCSELs
PRIZM® LightTurn® optical turn 1×12 ribbon fiber connector
micro-LGA attachment electrical interface
What’s the next step? Optically-enabled ASIC
Bring the optics even closer to the processor
multi-chip module or 3D integrated chip implementation
Power dissipation comparison (pJ/bit):
at board edge (AOC): switch SerDes 7.0, retimer 24, total 29
on board (MBM): switch SerDes 7.0, retimer 12, total 19
on processor package: switch SerDes 6.5, retimer N/A, total 6.5
A. Ghiasi, “Is there a need for on-chip photonic integration for large data warehouse switches”, in Proc. IEEE 9th International Conference on Group IV Photonics (GFP), pp. 27-29, 29-31 Aug. 2012.
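To make these figures concrete, here is a minimal sketch (illustrative only; the 400 Gb/s link rate is an assumption, not taken from the table) of how energy per bit translates into link power, using the fact that 1 pJ/bit is the same as 1 mW per Gb/s:

```python
# Convert energy-per-bit figures (pJ/bit) into link power at a given rate.
# 1 pJ/bit * 1 Gb/s = 1e-12 J/bit * 1e9 bit/s = 1 mW, i.e. pJ/bit == mW/(Gb/s).
def link_power_watts(energy_pj_per_bit: float, rate_gbps: float) -> float:
    return energy_pj_per_bit * rate_gbps * 1e-3   # result in watts

# Totals from the table above, evaluated at an assumed 400 Gb/s link rate.
for label, pj in [("board edge (AOC)", 29), ("on board (MBM)", 19), ("on package", 6.5)]:
    print(f"{label}: {link_power_watts(pj, 400):.1f} W at 400 Gb/s")
# board edge (AOC): 11.6 W, on board (MBM): 7.6 W, on package: 2.6 W
```

At 400 Gb/s this works out to roughly 11.6 W at the board edge versus 2.6 W on the processor package, which is the motivation for moving the optics onto the package.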
Board-to-board: optoelectronic router
Optoelectronic routers:
Large scale parallel optical interconnect
Optical transceiver is assembled on CMOS
1.3 Tb/s throughput (SOTA)
Large 12×14 back-illuminating VCSEL matrix
On-board interconnects
PCB embedded optical waveguides (OPCB)
Single-layer polymer waveguide process supporting a 300×350 mm form factor on rigid MLB PCBs (TTM Technologies)
225 Gb/s Bi-Directional Integrated Optical PCB Link (IBM)
On-chip interconnects
3D integration enabling on-chip optical interconnection
Optical “superhighways” (blue) inside an IBM Silicon Nanophotonics chip (25 Gb/s per Tx/Rx)
Si Photonics:
integration of different optical components side-by-side with electrical circuits on a single silicon chip in a standard 90 nm semiconductor fabrication process
Source: IBM Research
Scaling Interconnect Speed – How?
The phenomenal surge of online data is pushing “traditional” OI technologies to their speed boundaries
Scaling the capacity of parallel optical links has been regularly addressed by either:
increasing the number of parallel “lanes”
enhancing the line rate of each lane
Why Not Follow Traditional Recipe?
                 current   target
cost ($/Gb/s)      <12        1
power (mW/Gb/s)    <30        5
A higher-speed I/O generation emerges every ~3.5 years
an alternative path is necessary
technological maturity cannot keep pace with rapid demand
information density?
cost, power scaling?
Multiplexing to the Rescue
Which Way to Go?
Datacenter Networking: What does a typical datacenter look like?
Fat tree architecture
Servers organized in racks
Top-of-Rack switches handling rack traffic
Racks organized in clusters, or pods (often in a container)
Usually different networks for storage and compute
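For concreteness, here is a minimal sizing sketch assuming the canonical k-ary fat-tree construction (after Al-Fares et al.) built from identical k-port switches; the slides describe the fat-tree only qualitatively, so the formulas below are an assumption about that specific construction:

```python
# Sizing of a k-ary fat-tree built from identical k-port switches
# (assumed construction after Al-Fares et al., not spelled out in the slides).
def fat_tree_sizing(k: int) -> dict:
    assert k % 2 == 0, "k must be even"
    return {
        "pods": k,
        "edge (ToR) switches": k * (k // 2),   # k/2 edge switches per pod
        "aggregation switches": k * (k // 2),  # k/2 aggregation switches per pod
        "core switches": (k // 2) ** 2,
        "hosts": (k ** 3) // 4,                # k/2 hosts per edge switch
    }

print(fat_tree_sizing(8))   # k=8 -> 8 pods, 128 hosts, 16 core switches
```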
Datacenter Networking: Architecture – fat-tree network implementation with oversubscription
Oversubscription: the practice of deploying smaller link (and switch) capacity at the higher layers of the fat-tree than the total bandwidth aggregated from the lower layers
Exploits the bursty nature of traffic on the links (traffic statistics)
A crucial tradeoff: it lowers the cost of the overall infrastructure while still aiming to give application servers the I/O channel resources their applications need
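A small illustrative calculation of the oversubscription ratio (the port counts and rates below are hypothetical, not taken from the slides):

```python
# Oversubscription ratio of a ToR switch: aggregate server-facing bandwidth
# divided by aggregate uplink bandwidth (hypothetical example values).
def oversubscription(downlink_ports: int, downlink_gbps: float,
                     uplink_ports: int, uplink_gbps: float) -> float:
    return (downlink_ports * downlink_gbps) / (uplink_ports * uplink_gbps)

# e.g. 40 servers at 10 Gb/s behind 4 x 40 Gb/s uplinks -> 400/160 = 2.5:1
print(oversubscription(40, 10, 4, 40))   # 2.5
```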
(Figure: fat-tree with pods POD 1 … POD P; each pod contains ToR switches, each serving Z hosts; Level 1: Top of Rack, Level 2: Aggregation, Level 3: Core)
Datacenter Networking: Metrics
(Figure: east-west vs. north-south traffic directions)
Key performance metrics:
Bisection bandwidth
split the N nodes into two groups of N/2 nodes such that the bandwidth between the two groups is minimized: that minimum is the bisection bandwidth (worst-case scenario)
why it is relevant: if traffic is completely random, the probability of a message crossing between the two halves is ½ – so if all N nodes send a message, on average N/2 messages must cross the bisection, and the bisection bandwidth has to carry them
Latency
the time interval between a stimulus and its response
Causes of latency: propagation delay, serialization, data protocols, routing and switching, queuing and buffering (a rough budget sketch follows this list)
Both are crucial for determining QoS (together with bit error rate, availability, and throughput)
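As a rough single-hop latency budget (all values below are assumed, typical numbers rather than figures from the slides): propagation in fiber is about 5 ns per metre, and serialization delay is the packet size divided by the line rate:

```python
# Back-of-the-envelope single-hop latency budget (illustrative, assumed values).
def serialization_ns(packet_bytes: int, rate_gbps: float) -> float:
    return packet_bytes * 8 / rate_gbps          # bits / (Gb/s) gives ns

def propagation_ns(fiber_m: float) -> float:
    return fiber_m * 5.0                         # ~5 ns/m (light travels at ~2e8 m/s in fiber)

packet, rate, distance, switch_ns = 1500, 10.0, 100.0, 500.0  # assumed packet size, rate, reach, switch delay
total = serialization_ns(packet, rate) + propagation_ns(distance) + switch_ns
print(f"serialization {serialization_ns(packet, rate):.0f} ns, "
      f"propagation {propagation_ns(distance):.0f} ns, switching {switch_ns:.0f} ns "
      f"-> total {total:.0f} ns")
# 1500 B at 10 Gb/s = 1200 ns, 100 m of fiber = 500 ns, plus an assumed 500 ns of switching/queuing
```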
Limitations of current DC architectures
super-linear scaling
high energy consumption independent of load
cable spaghetti
huge waste of capex and opex since network is underutilized on average
full fat-tree is very expensive
oversubscribed fat-tree (still underutilized on average) can exhibit congestion (hotspot) problems
rigid bandwidth allocation
large latencies for east-west traffic, which is dominant (it has to travel north-south up and down the tree)
Hybrid optical electrical network architectures
MEMS based hybrid switch datacenter
(Figure: pods of hosts attach over 10G copper to ToR electrical packet switches; transceivers and 10G fiber links, aggregated by WDM muxes into 20G superlinks, connect the pods to an optical circuit switch in parallel with the electrical packet-switched network)
3D MEMS switch
MEMS switching time: ~1 ms
optical switches for long-lived flows (circuit switched)
electrical switches for bursts (packet switched)
MEMS based hybrid switch datacenter
optical circuit switched fabric provides essentially unlimited bandwidth that scales without the need for equipment upgrades as network speeds increase
BUT a considerable portion of the traffic in the datacenter is short-lived
data type classification is quite demanding: involves traffic monitoring-prediction over an extremely large network
network reconfiguration (control plane) adds considerable delays: need to inform end-host or switches of how to split the traffic into circuits
Scalability issues: radix of MEMS switches is quite limited (up to 320 port switches commercially available)
commercial products (Plexxi/Calient) based on ring-configuration variant
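To illustrate the classification problem mentioned above, here is a deliberately naive sketch of an “elephant flow” detector of the kind a hybrid scheduler needs; the byte threshold and flow key are assumptions for illustration only:

```python
# Naive long-lived ("elephant") flow detector for a hybrid electrical/optical fabric.
# Thresholds and flow keys are illustrative assumptions, not from the slides.
from collections import defaultdict

BYTES_THRESHOLD = 100e6        # flows above ~100 MB are treated as long-lived (assumption)
flow_bytes = defaultdict(int)  # key: (src, dst), value: bytes observed so far

def record_packet(src: str, dst: str, size_bytes: int) -> str:
    """Update per-flow counters and return which fabric should carry the flow."""
    flow_bytes[(src, dst)] += size_bytes
    if flow_bytes[(src, dst)] >= BYTES_THRESHOLD:
        return "optical circuit switch"   # bulk flow: worth setting up a MEMS circuit (~1 ms)
    return "electrical packet switch"     # short/bursty traffic stays on the packet network

# toy usage
print(record_packet("pod1.host3", "pod7.host9", 1500))
```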
Wavelength switched datacenter architectures
(Figure: wavelength-switched datacenter – N hosts grouped into pods of w; each host has a tunable-laser Tx that selects among wavelengths 1…w, and passive AWGs route each wavelength to a different destination)
Arrayed Waveguide Grating (AWG): passive device
Tunable laser Tx: ≥ 50 ns switching speed
Wavelength switched datacenter architectures
tunable lasers can have ns-scale tuning speed: operation at packet granularity
AWGs are passive components: cheap, very low crosstalk, no impairments to optical signal
scalability issues: limited number of wavelengths available (80 in the C-band)
cost issues: tunable lasers are more expensive than fixed-wavelength ones (partly compensated by savings on switch costs)
smart scheduling and synchronization necessary (control plane) to reap the benefits of fast laser tunability
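For intuition, a minimal sketch of the routing rule of an idealized cyclic N×N AWG (port/wavelength numbering conventions vary between devices, so treat the exact formula as an assumption): light entering input port i on wavelength index w exits output port (i + w) mod N, so a host reaches a different destination simply by retuning its laser:

```python
# Idealized routing rule of a cyclic N x N AWG router
# (assumed convention: input port i, wavelength index w -> output port (i + w) mod N).
def awg_output_port(input_port: int, wavelength_index: int, n_ports: int) -> int:
    return (input_port + wavelength_index) % n_ports

# With an 8x8 AWG, a host on input 2 reaches output 5 by tuning its laser to wavelength index 3:
print(awg_output_port(2, 3, 8))   # 5
# Switching destinations therefore only requires retuning the laser (ns scale), with no moving parts.
```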
WSS based datacenter architectures
(Figure: pods with electrical packet switches attached through Wavelength Selective Switches (WSS) to a WDM ring)
WSS reconfiguration time: ~100 ms or lower
optical switches for long-lived flows (circuit switched)
electrical switches for bursts (packet switched)
WDM ring: physical mesh architecture
WSS based datacenter architectures
logical mesh architecture, well-suited to east-west traffic profiles
WDM+WSS provide ample reconfigurability
similar issues with MEMS-based architectures regarding data classification and network control
fast WSS technologies can be employed but development is necessary
very high cost of WSS components (shared among a small number of end hosts)
scalability issues: typical WSSs have small port count (1x4, 1x9, 1x23)
Optical packet switched architectures
(Figure: optical packet switch fabric with F input/output ports, IN 1…IN F and OUT 1…OUT F, connecting ToRs, each port carrying wavelengths λ1…λM; a label processor at each input controls a 1×f switch, and cascaded photonic switch stages with wavelength selectors and wavelength converters (WC at λ1) perform contention resolution)
Semiconductor Optical Amplifier (SOA)
PLZT switch
few ps switching time
Optical packet switched architectures
very fast reconfiguration time, enables optical packet switching
scalability issues: SOA and PLZT switches are small-scale. MEMS switches are used in parallel to handle circuit-switched traffic, which is separated from packet-switched at the ToR.
noise (SOAs) and crosstalk (both) limit cascadeability
power consumption (SOAs)
no established supply chain yet
smart scheduling and synchronization necessary (control plane) to reap the benefits of fast switching time and enable optical buffering
Is optical switching the way to go?
optical switching architectures have the potential for significant performance enhancement and cost savings (in CAPEX and OPEX)
first commercial systems are already on the market
main open issues:
scalability
component cost: telecom technologies too expensive for datacom
control plane & scheduling: it is essential to maintain maximum backwards compatibility with current standards and overall datacenter ecosystem
Software-Defined Networking (SDN)
(Figure: from mainframe to PCs – a vertically integrated, closed, proprietary stack (specialized applications, specialized operating system, specialized hardware) with slow innovation and a small industry, versus a horizontal stack with open interfaces (apps / Linux, Mac OS, or Windows / microprocessor) with rapid innovation and a huge industry)
Source: N. McKeown et al., “How SDN will shape networking”
SDN: Gear shift in networking
Source: N. McKeown et al., “How SDN will shape networking”
(Figure: the same gear shift applied to networking – a vertically integrated, closed, proprietary stack (specialized features, specialized control plane, specialized hardware) with slow innovation, versus a horizontal stack with open interfaces (apps / control plane implementations / merchant switching chips) with rapid innovation)
SDN: centralizing network control
Distributed network control:
routers make localized decisions trying to optimize their packet throughput
they take in packets, look up the forwarding address and send them on the shortest route to the next router
Centralized network control (SDN):
a centralized controller orchestrates decisions based on a holistic view of the network
the majority of packets are grouped into higher-layer flows such as over-the-top video, database backups and virtual machine movements
Software Defined Networking Approach
(Figure: control programs running on a Network OS that maintains a global network view, talking through an open interface to distributed packet-forwarding elements)
1. Open interface to packet forwarding
2. At least one Network OS, probably many; open- and closed-source
true decoupling of the control and forwarding planes
enables network virtualization
network virtualization creates logical, virtual networks that are decoupled from the underlying network hardware to ensure the network can better integrate with and support increasingly virtual environments.
Software Defined Networking Approach
(Figure: control programs on a Network OS push entries through an open interface into flow tables that reside in the packet-forwarding elements)
Example flow table entries:
“If header = p, send to port 4”
“If header = ?, send it to me for further processing”
“If header = q, overwrite header with r, add header s, and send to ports 5,6”
Open interfaces
add a flow table to any generic switching element (e.g. router)
the flow table resides in the switch, but only the flow controller is allowed to make changes to it
flows are streams of packets with similar parameters: use wildcards to define and operate on the flows
share workload between controller and local switch: centralized controller issues flow rules with wildcards (large scale decisions) while the switch maintains the flow table and makes packet level decisions
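A minimal sketch of the flow-table idea described above, in plain Python rather than real OpenFlow data structures; the field names, priorities and actions are illustrative assumptions:

```python
# Toy flow table with wildcard matching (illustrative only; real OpenFlow tables
# match on many more header fields and are managed through the OpenFlow protocol).
FLOW_TABLE = [
    # (priority, match dict with None as wildcard, action)
    (20, {"dst_ip": "10.0.0.4", "tcp_port": None}, "output:4"),
    (10, {"dst_ip": None, "tcp_port": 80},         "rewrite_header;output:5,6"),
]

def lookup(packet: dict) -> str:
    """Return the action of the highest-priority matching entry, else punt to the controller."""
    for _, match, action in sorted(FLOW_TABLE, key=lambda e: -e[0]):
        if all(v is None or packet.get(k) == v for k, v in match.items()):
            return action
    return "send_to_controller"   # the controller can then install a new rule for this flow

print(lookup({"dst_ip": "10.0.0.4", "tcp_port": 22}))   # output:4
print(lookup({"dst_ip": "10.0.0.9", "tcp_port": 443}))  # send_to_controller
```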
OpenFlow Protocol
The OpenFlow protocol describes the message exchanges that take place between an OpenFlow controller and an OpenFlow switch.
The OpenFlow protocol enables the network controller to add, update, and delete entries in flow tables.
For an OpenFlow switch, a flow is a sequence of packets that matches a specific entry in a flow table. The definition is packet-oriented, in the sense that it is a function of the values of header fields of the packets that constitute the flow, and not a function of the path they follow through the network.
Northbound and Southbound Interfaces
Northbound interface: communication from application and network orchestrator to the SDN controller (not standardized yet, proprietary implementations exist).
Southbound interface: communication from SDN controller to the network elements (e.g. OpenFlow).
Software Defined Datacenter
datacenter storage and compute resources are already virtualized and controlled by a hypervisor; the network layer is virtualized through SDN
an orchestration layer is used to integrate different controllers
virtualization makes layering, paralleling and coordination of different elements easy
next stop: disaggregation (Intel, Facebook).
physically decouple computing, memory, storage and communication resources of servers in a datacenter
share these resources and use them on-demand for optimum resource utilization
more demand for high-capacity, low latency and low cost optical links (e.g. silicon photonics)