Transport Layer Enhancements for Unified Ethernet in Data Centers

Date post: 12-Jan-2016
Page 1: Transport Layer Enhancements for Unified Ethernet in Data Centers

Transport Layer Enhancements for Unified Ethernet in Data Centers

K. Kant, Raj Ramanujan
Intel Corp

Exploratory work only, not a committed Intel position

Page 2: Transport Layer Enhancements for Unified Ethernet in Data Centers

*Third party marks and brands are the property of their respective owners

Context
Data center is evolving -> Fabric should too.
Last talk:
– Enhancements to Ethernet, already on track
This talk:
– Enhancements to Transport Layer
– Exploratory, not in any standards track.

Page 3: Transport Layer Enhancements for Unified Ethernet in Data Centers

Outline
– Data Center evolution & transport impact
– Transport deficiencies & remedies
  – Many areas of deficiencies …
  – Only Congestion Control and QoS addressed in detail
– Summary & Call to Action

Page 4: Transport Layer Enhancements for Unified Ethernet in Data Centers

Data Center Today
Tiered structure
Multiple incompatible fabrics
– Ethernet, Fibre Channel, IBA, Myrinet, etc.
– Management complexity
Dedicated servers for applications -> Inflexible resource usage
[Figure: tiered data center. Labels: client req/resp, business trans, database query, network fabric, IPC fabric, storage fabric, SAN storage]

Page 5: Transport Layer Enhancements for Unified Ethernet in Data Centers

Future DC: Stage 1 – Fabric Unification
Enet dominant, but convergence really on IP.
– New layer 2: PCI-Exp, Optical, WLAN, UWB, …
Most ULPs run over transport over IP -> Need to comprehend transport implications
[Figure: unified fabric. Labels: client req/resp, business trans, database query, iSCSI storage]

Page 6: Transport Layer Enhancements for Unified Ethernet in Data Centers

Future DC: Stage 2 – Clustering & Virtualization
SMP -> Cluster (cost, flexibility, …)
Virtualization
– Nodes, network, storage, …
Virtual clusters (VC)
– Each VC may have multiple traffic types inside
[Figure: Sub-clusters 1 to 3 and storage nodes on an IP network, grouped into Virtual Clusters 1 to 3]

Page 7: Transport Layer Enhancements for Unified Ethernet in Data Centers

Future DC: New Usage Models
Dynamically provisioned virtual clusters
Distributed storage (per node)
Streaming traffic (VoIP/IPTV + data services)
HPC in DC
– Data mining for focused advertising, pricing, …
Special purpose nodes
– Protocol accelerators (XML, authentication, etc.)
New models -> New fabric requirements

Page 8: Transport Layer Enhancements for Unified Ethernet in Data Centers

Fabric Impact
More types of traffic, more demanding needs.
Protocol impact at all levels
– Ethernet: Previous presentation.
– IP: Change affects entire infrastructure.
– Transport: This talk
Why transport focus?
– Change primarily confined to endpoints.
– Many app needs relate to transport layer
– App. interface (Sockets/RDMA) mostly unchanged.
DC evolution -> Transport evolution

Page 9: Transport Layer Enhancements for Unified Ethernet in Data Centers

Transport Issues & Enhancements
Transport (TCP) enhancement areas
– Better congestion control and QoS
– Support media evolution
– Support for high availability
– Many others
  – Message based & unordered data delivery.
  – Connection migration in virtual clusters.
  – Transport layer multicasting.
How do we enhance transport?
– New TCP compatible protocol?
– Use an existing protocol (SCTP)?
– Evolutionary changes to TCP from DC perspective.

Page 10: Transport Layer Enhancements for Unified Ethernet in Data Centers

What's wrong with TCP Congestion Control
TCP congestion control (CC) works independently for each connection
– By default TCP equalizes throughput -> undesirable
– Sophisticated QoS can change this, but …
Lower level CC -> Backpressure on transport
– Transport layer congestion control is crucial
[Figure: two end hosts (App / transport / IP / MAC stacks) connected through switch, router, switch; congestion feedback (ECN/ICMP) reaches the transport layer (TL) congestion control]

Page 11: Transport Layer Enhancements for Unified Ethernet in Data Centers

What's wrong with QoS?
Elaborate mechanisms
– Intserv (RSVP), Diffserv, BW broker, …
… But a nightmare to use
– App knowledge, many parameters, sensitivity, …
What do we need?
– Simple/intuitive parameters
  – e.g., streaming or not, normal vs. premium, etc.
– Automatic estimation of BW needs.
– Application focus, not flow focus!
QoS relevant primarily under congestion
Fix TCP congestion control, use IP QoS sparingly.

Page 12: Transport Layer Enhancements for Unified Ethernet in Data Centers

TCP Congestion Control Enhancements
1) Collective control of all flows of an app
– Applicable to both TCP & UDP
– Ensures proportional fairness of multiple inter-related flows
– Tagging of connections to identify related flows.
2) Packet loss highly undesirable in DC
– Move towards a delay based TCP variant.
3) Multilevel Coordination
– Socket vs. RDMA apps, TCP vs. UDP, …
– A layer above transport for coordination
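Item 2 points toward a delay based TCP variant without naming one. As a rough illustration only, the sketch below follows the well-known TCP Vegas idea: compare the rate expected at the minimum observed RTT with the rate actually achieved, and adjust the window before any packet is lost. The function name and the alpha/beta thresholds are illustrative assumptions, not something specified in the talk.

```python
def vegas_style_update(cwnd, base_rtt, rtt, alpha=2.0, beta=4.0):
    """Return the next congestion window (in packets).

    cwnd     -- current window, in packets
    base_rtt -- smallest RTT seen so far (seconds), taken as "no queueing"
    rtt      -- latest RTT sample (seconds)
    alpha, beta -- hypothetical thresholds on estimated queued packets
    """
    expected_rate = cwnd / base_rtt      # rate if nothing were queued
    actual_rate = cwnd / rtt             # rate actually achieved
    # Estimated number of this flow's packets sitting in network queues.
    queued = (expected_rate - actual_rate) * base_rtt
    if queued < alpha:
        return cwnd + 1                  # path looks underused: grow
    if queued > beta:
        return max(cwnd - 1, 1.0)        # queues building: back off early
    return cwnd                          # inside the target band: hold


# Example with a 100 us base RTT, a typical order of magnitude inside a DC.
print(vegas_style_update(10, 100e-6, 100e-6))  # no queueing delay
print(vegas_style_update(10, 100e-6, 200e-6))  # RTT doubled
```

The point of the design is that the sender reacts to rising delay, so the window shrinks before buffers overflow and packets are dropped.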

Page 13: Transport Layer Enhancements for Unified Ethernet in Data Centers

Collective Congestion Control
Control connections thru a congested device together (control set)
Determining control set is challenging
BW requirement estimated automatically during non-congested periods
[Figure: Cong. Control topology. Clients CL1, CL2; switches SW0, SW1, SW2; servers S11, S13, S21, S23]
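The slide does not say how the bandwidth requirement is estimated. One plausible reading, sketched here purely as an assumption, is a smoothed average of each application's observed rate that is updated only while the path is not congested, so that a throttled rate never lowers the recorded demand. The class name, the gain value, and the congestion flag are all hypothetical.

```python
class DemandEstimator:
    """Exponentially weighted estimate of an app's bandwidth demand."""

    def __init__(self, gain=0.25):
        self.gain = gain        # hypothetical smoothing gain in (0, 1]
        self.estimate = None    # Mb/s; None until the first clean sample

    def observe(self, rate_mbps, congested):
        """Feed one rate sample; ignore samples taken under congestion."""
        if congested:
            return self.estimate          # throttled rate is not the demand
        if self.estimate is None:
            self.estimate = rate_mbps     # first uncongested sample
        else:
            self.estimate += self.gain * (rate_mbps - self.estimate)
        return self.estimate


est = DemandEstimator(gain=0.5)
est.observe(4.0, congested=False)
est.observe(6.0, congested=False)
est.observe(1.0, congested=True)   # ignored: measured while congested
print(est.estimate)
```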

Page 14: Transport Layer Enhancements for Unified Ethernet in Data Centers

Sample Collective Control
App 1: client1 -> server1
– Database queries over a single connection
– Drives ~5.0 Mb/s BW
App 2: client2 -> server1
– Similar to App 1
– Drives 2.5 Mb/s BW
App 3: client3 -> server2
– FTP, starts at t=30 secs
– 25 conn., 8 Mb/s

Page 15: Transport Layer Enhancements for Unified Ethernet in Data Centers

Sample Results
[Figure: Cong. Control results; plot not recoverable from the transcript]
Modified TCP can maintain 2:1 throughput ratio
– Also yields lower losses & smaller RTT.
Collective control highly desirable within a DC
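To see why per-connection fairness hurts here, the sketch below contrasts two idealized allocations for the three sample apps: an equal share per connection (roughly what independent TCP flows converge to) and a share per application proportional to its demand. The 10 Mb/s bottleneck capacity and both function names are assumptions made for illustration; the talk does not give the link capacity or the actual control algorithm.

```python
def per_connection_share(capacity, connections):
    """Equal share per connection: an app's total scales with its flow count."""
    total = sum(connections.values())
    return {app: capacity * n / total for app, n in connections.items()}


def collective_share(capacity, demand):
    """Per-app share proportional to demand, applied only when oversubscribed."""
    total = sum(demand.values())
    if total <= capacity:
        return dict(demand)               # no congestion: everyone gets demand
    return {app: capacity * d / total for app, d in demand.items()}


CAPACITY = 10.0                                         # Mb/s, assumed
connections = {"app1": 1, "app2": 1, "app3": 25}        # from the sample setup
demand = {"app1": 5.0, "app2": 2.5, "app3": 8.0}        # Mb/s, from the slide

flow_fair = per_connection_share(CAPACITY, connections)
app_fair = collective_share(CAPACITY, demand)

print(flow_fair)   # app3's 25 connections take 25/27 of the link
print(app_fair)    # app1 : app2 stays at 2 : 1
```

Under the per-connection model the two database apps are squeezed to equal, small shares once the FTP transfer starts; the proportional model preserves the 2:1 ratio that the results slide reports for the modified TCP.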

Page 16: Transport Layer Enhancements for Unified Ethernet in Data Centers

Adaptation to Media
Problem: TCP assumes loss implies congestion, and was designed for WAN (high loss/delay)
Effects:
– Wireless (e.g. UWB) attractive in DC (wiring reduction, mobility, self configuration) …
– … but TCP is not a suitable transport.
– Overkill for communications within a DC.
Solution: A self-adjusting transport
– Support multiple congestion/flow-control regimes.
– Automatically selected during connection setup.

Page 17: Transport Layer Enhancements for Unified Ethernet in Data Centers

High Availability Issues
Problem: Single failure -> broken connection, weak robustness check, …
Effect: Difficult to achieve high availability.
[Figure: endpoints A and B connected by Path 1 and Path 2]
Solution:
– Multi-homed connections w/ load sharing among paths.
– Ideally, controlled diversity & path management
– Difficult: need topology awareness, spanning tree problem, …
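As a toy illustration of a multi-homed connection with load sharing, the sketch below rotates sends across the paths that are currently up and keeps the connection usable when one path fails. It is not how any particular protocol implements multihoming; the class, its method names, and the round-robin policy are assumptions chosen for brevity.

```python
class MultiHomedConnection:
    """One logical connection spread over several network paths."""

    def __init__(self, paths):
        self.paths = list(paths)
        self.up = {p: True for p in self.paths}   # path health, all up at start
        self._turn = 0                            # round-robin counter

    def mark(self, path, is_up):
        """Record a path failure or recovery (e.g. from heartbeats)."""
        self.up[path] = is_up

    def pick_path(self):
        """Choose the path for the next send, skipping failed paths."""
        alive = [p for p in self.paths if self.up[p]]
        if not alive:
            raise ConnectionError("no usable path")
        path = alive[self._turn % len(alive)]
        self._turn += 1
        return path


conn = MultiHomedConnection(["path1", "path2"])
print([conn.pick_path() for _ in range(4)])   # alternates between both paths
conn.mark("path1", False)                     # single failure
print(conn.pick_path())                       # connection survives on path2
```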

Page 18: Transport Layer Enhancements for Unified Ethernet in Data Centers

Summary & Call to Action
Data Centers are evolving
– Transport must evolve too, but a difficult proposition
– TCP is heavily entrenched, change needs an industry wide effort
Call to Action
– Need to get an industry effort going to define
  – New features & their implementation
  – Deployment & compatibility issues.
– Change will need push from data center administrators & planners.

Page 19: Transport Layer Enhancements for Unified Ethernet in Data Centers

Additional Resources
Presentation can be downloaded from the IDF web site; when prompted enter:
– Username: idf
– Password: fall2005
Additional backup slides
Several relevant papers available at http://kkant.ccwebhost.com/download.html
– Analysis of collective bandwidth control.
– SCTP performance in data centers.

Page 20: Transport Layer Enhancements for Unified Ethernet in Data Centers

Backup

Page 21: Transport Layer Enhancements for Unified Ethernet in Data Centers

Comparative Fabric Features

Feature                            | TCP       | SCTP      | IBA
Scalability to 100 Gb/s            | difficult | difficult | Easy?
Message based & ULP support        | No        | Yes       | Yes
QoS friendly transport?            | No        | No        | Yes
Virtual channel support            | No        | No        | Yes
DC centric flow/cong. control      | No        | No        | Yes
Point to multipoint communication  | No        | No        | Yes
High availability features         | Poor      | Fair      | Good
Offload latency (end-pt only)      | ~1 us     | >1 us     | <0.5 us
Compatible w/ TCP/IP base          | Yes; limited (only two of the three values survive in the transcript)
Unordered data delivery            | No        | Yes       | Yes
Protection against DoS attacks     | Poor      | Good      | Poor
Multiple traffic streams           | No        | Yes       | Yes

DC requirements

TCP lacks many desirable features; SCTP has some

Page 22: Transport Layer Enhancements for Unified Ethernet in Data Centers

Transport Layer QoS
Needed at multiple levels
– Between transport uses
– Conn. of a given transport
– Logical streams
[Figure: DB App (iSCSI, ntwk, IPC; cntrl and data streams) and Web app (page; text and images streams), annotated with the levels Inter-app, Intra-app, and Intra-conn]
• May be on two VMs on same physical machine.
• Best BW subdivision to maximize performance?
Requirements
– Must be compatible with lower level QoS
  – PCI-Exp, MAC, etc.
– Automatic estimation of bandwidth requirements
– Automatic BW control
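The three levels on this slide (inter-app, intra-app, intra-conn) suggest a nested division of bandwidth. The sketch below shows one simple way to express that, a recursive weighted split; it does not answer the slide's open question of which subdivision is best. The tree layout, the weights, and the 1000 Mb/s total are made-up values for illustration.

```python
def subdivide(bandwidth, tree):
    """Split bandwidth over a tree of {name: (weight, subtree_or_None)}.

    Returns a flat dict mapping "app/conn/stream" paths to their share.
    """
    total_weight = sum(weight for weight, _ in tree.values())
    shares = {}
    for name, (weight, subtree) in tree.items():
        share = bandwidth * weight / total_weight
        if subtree:
            # Recurse: this node's share is divided among its children.
            for child, value in subdivide(share, subtree).items():
                shares[f"{name}/{child}"] = value
        else:
            shares[name] = share
    return shares


# Hypothetical weights: inter-app, then intra-app, then intra-conn.
tree = {
    "db_app": (3, {
        "iscsi": (2, {"cntrl": (1, None), "data": (3, None)}),
        "ipc": (1, None),
    }),
    "web_app": (1, {
        "page": (1, {"text": (1, None), "images": (1, None)}),
    }),
}

for path, mbps in subdivide(1000.0, tree).items():
    print(f"{path}: {mbps} Mb/s")
```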

Page 23: Transport Layer Enhancements for Unified Ethernet in Data Centers

Multicasting in DC
Software/patch distribution
– Multicast to all machines w/ same version.
– Characteristics
  – Medium to large file transfer
  – Time to finish matters, BW doesn't.
  – Scale: 10s to 1000s.
High performance computing
– MPI collectives need multicasting
– Characteristics
  – Small but frequent transfers
  – Latency premium, BW not an issue mostly.
  – Scale: 10s to 100s

Page 24: Transport Layer Enhancements for Unified Ethernet in Data Centers

Transport Layer Multicasting
[Figure: IP multicasting vs. TL multicasting; each panel shows node A, subnet1, subnet2, and an outer router]

DC needs                  | IP multicasting                     | TL multicasting
Legacy infrast.           | Needs specialized routers           | Std. routers adequate
Short msgs, dynamic group | Usually designed for long transfers | Appropriate mechanism?
Topology aware?           | Yes (routing alg. based)            | No (need new mechanisms)
Low overhead              | No (complex mgmnt)                  | Simpler, done in TL engine
Low latency               | Primarily BW focused                | Need latency centric design
Reliable mcast.           | Built on top                        | Part of TL

Page 25: Transport Layer Enhancements for Unified Ethernet in Data Centers

TL Multicasting Value
Assumptions
– A 16 node cluster w/ 4-node subclusters.
– Mcast group: 2 nodes in each sub-cluster
– Latencies:
  – endpt: 2 us, ack proc: 1 us, switch: 1 us
  – App-TL interface: 5 us
Latency w/o mcast
– send: 7x2 + 3x1 + 2 = 19 us
– ack: 1 + 3x1 + 7x1 = 11 us
– reply: 5 + 2 + 7x2 = 21 us
– Total: 19 + 11 + 21 = 51 us
Latency w/ mcast
– send: 3x2 + 3x1 + 2 + 2x(1+1) + 2 = 17 us
– ack: 1 + 1 + 2x1 + 3x1 + 3x1 = 10 us
– Total: 17 + 10 + 5 = 32 us
Larger savings in full network mcast.
[Figure: nodes A, B, C, D in subnet1 to subnet4 around an outer router]
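The latency sums above can be checked mechanically. The snippet below copies the terms exactly as they appear on the slide and adds the resulting saving. The dictionary keys are labels of convenience, and reading the final "+ 5" in the multicast total as the App-TL interface cost is an assumption.

```python
# Terms copied from the slide; all values in microseconds.
without_mcast = {
    "send": 7 * 2 + 3 * 1 + 2,
    "ack": 1 + 3 * 1 + 7 * 1,
    "reply": 5 + 2 + 7 * 2,
}
with_mcast = {
    "send": 3 * 2 + 3 * 1 + 2 + 2 * (1 + 1) + 2,
    "ack": 1 + 1 + 2 * 1 + 3 * 1 + 3 * 1,
    "app_tl": 5,   # the slide adds 5 us here, presumably the App-TL interface
}

total_without = sum(without_mcast.values())
total_with = sum(with_mcast.values())
saving = total_without - total_with

print(total_without, total_with)                      # 51 32
print(saving, round(100 * saving / total_without))    # 19 37
```

So for this 8-member group the slide's own numbers give a saving of 19 us, about 37% of the unicast-based latency.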

Page 26: Transport Layer Enhancements for Unified Ethernet in Data Centers

Hierarchical Connections
Choose a "leader" in each subnet.
– Topology directed
Multicast connections to other nodes via leaders
– Ack consolidation at leaders (multicast)
– Msg consolidation at leaders (reverse multicast)
Done by a layer above? (layer 4.5?)
[Figure: sender A in subnet1; leaders S2, S3, S4 in subnet2 to subnet4, each serving nodes n1 and n2; subnets joined by an outer router]

