Date post: | 16-Apr-2017 |
Category: |
Technology |
Upload: | cumulus-networks |
BGP in the Datacenter
Pete Lumbis – @PeteCCDE
Datacenter Architect
CCIE #28677, CCDE 2012::3
cumulusnetworks.com 1
Pete Who?
CCIE R&S #28677, CCDE 2012::3
Former Cisco TAC Routing Escalation
Current Cumulus Networks SE
DC Automation and Architecture
Agenda
The history of L2
Routing in the datacenter
BGP in the datacenter
Troubleshooting improvements
BGP on Servers
In the Beginning…
There was L2…
…but it had problems
50% bandwidth loss due to STP
Unexpected Root change
STP Brownout
Flooding!
Temporary loops!
STP Block on TCN!
Layer 3 Clos
Server gateway is the attached Leaf
Routing Between Spine and Leafs
10.1.1.0/24 10.2.2.0/24 10.3.3.0/24
OSPF or BGP
Layer 3 – Spine and Leaf
Full ECMP
Manageable Oversubscription
48 x 10Gig = 480 Gigs
2 x 40Gig = 80 Gigs = 6:1 Oversubscription
Easy to Adjust
48 x 10Gig = 480 Gigs
3 x 40Gig = 120 Gigs = 4:1 Oversubscription
Massive Scale
Controlled Failures
Leaf Failure Reduces Compute
Spine Failure Increases Oversubscription
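The oversubscription arithmetic on these slides can be reproduced with a few lines of Python. This is only a sketch: the function name and its parameters are made up for illustration, and the port counts are the ones shown on the slides.

```python
from fractions import Fraction

def oversubscription(server_ports, server_gbps, uplinks, uplink_gbps):
    """Ratio of server-facing bandwidth to spine-facing bandwidth on one leaf."""
    downstream = server_ports * server_gbps   # e.g. 48 x 10Gig = 480
    upstream = uplinks * uplink_gbps          # e.g. 2 x 40Gig = 80
    return Fraction(downstream, upstream)

two_spines = oversubscription(48, 10, 2, 40)
three_spines = oversubscription(48, 10, 3, 40)

print(f"2 uplinks: {two_spines}:1")    # 480 / 80
print(f"3 uplinks: {three_spines}:1")  # 480 / 120
```

Losing one of three spines takes a leaf from the three-uplink ratio back to the two-uplink ratio, which is the "Spine Failure Increases Oversubscription" case.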
BGP as an IGP
RFC Draft submitted 2014
Microsoft and Facebook
Targeting DC
All the hows and whys
But I thought BGP was…
…slow: Nope. Not with BFD and timer tuning. Just as fast as OSPF.
…hard to configure: We’ll get to that one later, but it can be easy.
…only for service providers: SPs build for scale and stability. You should too.
…hard to troubleshoot: Nice and easy when everything is defined + recent advances.
BGP Datacenter Design
Single ASN for Spines
Unique ASN for Leafs
Use Private ASN range
2-byte (1023): 64512 – 65534
4-byte (94 million): 4200000000 – 4294967294
Spines: 65534 65534
Leafs: 64512 64513 64514
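A quick way to sanity-check an ASN plan against the private ranges quoted on the slide is sketched below. The constant and function names are hypothetical; only the range boundaries come from the slide.

```python
# Private ASN ranges as quoted on the slide (end of range() is exclusive).
PRIVATE_2BYTE = range(64512, 65535)            # 64512 .. 65534
PRIVATE_4BYTE = range(4200000000, 4294967295)  # 4200000000 .. 4294967294

def is_private_asn(asn: int) -> bool:
    """True if the ASN falls in either private range."""
    return asn in PRIVATE_2BYTE or asn in PRIVATE_4BYTE

# The example design: one shared ASN for the spines, a unique ASN per leaf.
spine_asn = 65534
leaf_asns = [64512, 64513, 64514]

print(len(PRIVATE_2BYTE))   # size of the 2-byte private range
print(len(PRIVATE_4BYTE))   # size of the 4-byte private range
print(all(is_private_asn(a) for a in [spine_asn, *leaf_asns]))
```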
Reducing BGP Configuration Complexity
Classically lots to manage
router bgp 65534
 router-id 10.0.0.1
 neighbor 10.1.1.1 remote-as 64512
 neighbor 10.1.1.2 remote-as 64513
 neighbor 10.1.1.3 remote-as 64514
 neighbor 10.1.1.1 timers 1 3
 neighbor 10.1.1.2 timers 1 3
 neighbor 10.1.1.3 timers 1 3
 neighbor 10.1.1.1 timers connect 3
 neighbor 10.1.1.2 timers connect 3
 neighbor 10.1.1.3 timers connect 3
First – Simplify Remote AS
router bgp 65534
 router-id 10.0.0.1
 neighbor 10.1.1.1 remote-as external
 neighbor 10.1.1.2 remote-as external
 neighbor 10.1.1.3 remote-as external
 neighbor 10.1.1.1 timers 1 3
 neighbor 10.1.1.2 timers 1 3
 neighbor 10.1.1.3 timers 1 3
 neighbor 10.1.1.1 timers connect 3
 neighbor 10.1.1.2 timers connect 3
 neighbor 10.1.1.3 timers connect 3
remote-as internal as well
Next – Use Peer Groups
router bgp 65534
 router-id 10.0.0.1
 neighbor 10.1.1.1 peer-group leafs
 neighbor 10.1.1.2 peer-group leafs
 neighbor 10.1.1.3 peer-group leafs
 neighbor leafs remote-as external
 neighbor leafs timers 1 3
 neighbor leafs timers connect 3
Finally – BGP Unnumbered
router bgp 65534
 router-id 10.0.0.1
 neighbor swp1 peer-group leafs
 neighbor swp2 peer-group leafs
 neighbor swp3 peer-group leafs
 neighbor leafs remote-as external
 neighbor leafs timers 1 3
 neighbor leafs timers connect 3
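Because the unnumbered configuration differs between switches only in ASN, router ID and port list, it lends itself to templating. The following is a minimal sketch, not a Cumulus tool: the function name and arguments are hypothetical, and the emitted lines simply mirror the slide's configuration text.

```python
def render_spine_bgp(asn, router_id, interfaces, group="leafs"):
    """Render the slide's unnumbered peer-group config for one switch."""
    lines = [f"router bgp {asn}", f" router-id {router_id}"]
    # One line per fabric-facing port, all sharing the same peer group.
    lines += [f" neighbor {ifname} peer-group {group}" for ifname in interfaces]
    # Settings are applied once, on the peer group.
    lines += [
        f" neighbor {group} remote-as external",
        f" neighbor {group} timers 1 3",
        f" neighbor {group} timers connect 3",
    ]
    return "\n".join(lines)

config = render_spine_bgp(65534, "10.0.0.1", ["swp1", "swp2", "swp3"])
print(config)
```

Adding a leaf then means adding one interface name to the list rather than three per-neighbor statements.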
BGP Unnumbered
Uses IPv6 Link Local addresses
 Automatically assigned, no address management
No need for infrastructure IPs
 Only need Loopbacks
Advertises both IPv4 and IPv6 Routes
 RFC 5549. Full interop with Cisco, Arista, Juniper
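To illustrate what "automatically assigned" can look like, here is a sketch that derives an IPv6 link-local address from an interface MAC using the classic modified EUI-64 method. This is an assumption for illustration: hosts may be configured to generate link-local addresses differently, and the function name and sample MAC are made up.

```python
import ipaddress

def link_local_from_mac(mac: str) -> ipaddress.IPv6Address:
    """Build fe80::/64 + modified EUI-64 interface ID from a MAC address."""
    octets = [int(part, 16) for part in mac.split(":")]
    if len(octets) != 6:
        raise ValueError("expected a MAC address with six octets")
    octets[0] ^= 0x02                                # flip the universal/local bit
    eui64 = octets[:3] + [0xFF, 0xFE] + octets[3:]   # insert ff:fe in the middle
    packed = bytes([0xFE, 0x80] + [0] * 6 + eui64)   # fe80:: prefix + interface ID
    return ipaddress.IPv6Address(packed)

addr = link_local_from_mac("00:11:22:33:44:55")
print(addr, addr.is_link_local)
```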
BGP Troubleshooting Improvements - Traceroute
How do you troubleshoot links without IPs?
Traceroute improvements
 Report back loopback IP
BGP Troubleshooting Improvements - Hostnames
Who is the peer?
Hostname BGP extension
draft-walton-bgp-hostname-capability
Comparing BGP Configurations
Traditional Config
router bgp 65534
 router-id 10.0.0.1
 maximum-paths 64
 bgp bestpath as-path multipath-relax
 neighbor 10.1.1.1 remote-as 64512
 neighbor 10.1.1.2 remote-as 64513
 neighbor 10.1.1.3 remote-as 64514
 neighbor 10.1.1.1 timers 1 3
 neighbor 10.1.1.2 timers 1 3
 neighbor 10.1.1.3 timers 1 3
 neighbor 10.1.1.1 timers connect 3
 neighbor 10.1.1.2 timers connect 3
 neighbor 10.1.1.3 timers connect 3
Cumulus Config
router bgp 65534
 router-id 10.0.0.1
 neighbor swp1 peer-group leafs
 neighbor swp2 peer-group leafs
 neighbor swp3 peer-group leafs
 neighbor leafs remote-as external
BGP to the Server
Why stop at the top of rack?
BGP to the Server!
Cumulus Quagga, GoBGP, Bird.
 Just Linux Apps!
No L2, No mLAG, No Infrastructure IPs
 Use BGP Unnumbered
Same troubleshooting and monitoring
Summary
L3 > L2
 At least 1 better
Routing provides better scale and stability
Easy to configure, automate, troubleshoot
BGP all the way to the server!
Smart defaults and Configuration Simplifications
© 2014 Cumulus Networks. Cumulus Networks, the Cumulus Networks Logo, and Cumulus Linux are trademarks or registered trademarks of Cumulus Networks, Inc. or its affiliates in the U.S. and other countries. Other names may be trademarks of their respective owners. The registered trademark Linux® is used pursuant to a sublicense from LMI, the exclusive licensee of Linus Torvalds, owner of the mark on a world-wide basis.
Thank You!
Asaf Wachtel, Sr. Director Enterprise
July 2016
25GbE Technology Update
© 2016 Mellanox Technologies 44- Mellanox Confidential -
Open Composable Networks
Open APIs
Automation
End-to-End Interconnect
Network OS Choice
SONiC
Open Networking is Real: OCP Summit March 2016
25/50/100GbE: The Future is Here!
                 Before    After     Gain
Compute Nodes    10GbE     25GbE     150% Higher Bandwidth
Storage Nodes    40GbE     50GbE     25% Higher Bandwidth
Network          40GbE     100GbE    150% Higher Bandwidth
Similar Connectors
Similar Infrastructure
Similar Cost / Power
Who needs more than 10GbE?
Latest multi-core Intel CPUs can easily drive more than 10Gb/s
Cloud (public or private)
• Multi-tenancy
• Need to deliver higher SLAs with lower predictability
Hyperconverged / Software Defined Storage / NVMe
• Network & Storage on the same wire
• Faster & Cheaper storage media
Database / Big Data
• Increasing volumes
• Moving from batch to real-time
Network Function Virtualization (NFV)
• I/O intensive data plane
Why 25GbE? Do the Math!
Best match for current PCI technology
• PCIe3 x8 = ~52Gb/s; 2 x 25 = 50Gb/s
Most efficient switch silicon design
• Maximizes both ports and bandwidth
• 40GbE requires 4 lanes per port == cost + power
Unmatched price-performance / Best price per Gb/s
• 25G = 2.5X BW at 1.5x the price
Lower OPEX & TCO
• Cut number of NICs, cables, switch ports in half
• Lower power & cooling
Better switch port density
• Fewer uplinks needed to maintain 1:1 subscription
Uses existing fiber infrastructure (single lane)
Fully backward compatible
• Mix/match new 25GbE components and existing 10GbE
Future proof + economies of scale (50/100GbE)
• 50Gb is 2x25G, 100G is 4x25G
2.5X bandwidth with single-lane technology
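The "do the math" claims can be checked directly. The sketch below uses only the ratios quoted on the slide (2.5X bandwidth at 1.5x the price, ~52Gb/s for PCIe3 x8, 25G lanes); the variable names are illustrative.

```python
# Price per Gb/s of 25GbE relative to 10GbE, from the slide's ratios.
bandwidth_ratio = 25 / 10      # 2.5X bandwidth
price_ratio = 1.5              # 1.5x the price
relative_cost_per_gb = price_ratio / bandwidth_ratio

# Lane counts when every lane runs at 25G.
lane_gbps = 25
lanes_needed = {speed: speed // lane_gbps for speed in (25, 50, 100)}

# A dual-port 25GbE adapter against the slide's PCIe3 x8 figure.
pcie3_x8_gbps = 52             # approximate figure quoted on the slide
dual_port_gbps = 2 * 25

print(f"cost per Gb/s vs 10GbE: {relative_cost_per_gb:.2f}")
print(lanes_needed)
print(dual_port_gbps <= pcie3_x8_gbps)
```

On those numbers, each Gb/s of 25GbE costs roughly 60% of what it costs on 10GbE.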
25GbE Industry Timeline
March 2014: Microsoft presents proposal for 25GbE to IEEE, leveraging existing activities, such as 25G PHY (100GbE) & SFP28 (32G FC)
July 2014: Open Industry Consortium to Bring 25 and 50 Gigabit Ethernet to Cloud-Scale Networks
August 2015: First products ship to end customers
September 2015: The 25G Ethernet Consortium specification draft completed
December 2015: Multi-vendor interoperability validated by multiple customers
Q4 2015 – Q2 2016: Ecosystem grows and matures
June 2016: IEEE 802.3by standard approved by the IEEE-SA Standards Board
25GbE vs 10GbE
                                  25GbE                           10GbE
Standard                          SFP28                           SFP+
Physical Form Factor              SFP                             SFP
Number of lanes                   1                               1
Lane speed                        25Gbps                          10Gbps
Encoding                          64b/66b                         64b/66b
Backward/Forward Compatibility    Fully interoperable @ 10Gb/s    Fully interoperable @ 10Gb/s
Max Copper Reach                  5m                              7m
MM Fiber Reach                    100m                            300m
SM Fiber Reach                    10KM                            10KM
3 Types of Connectivity Products
Direct Attach Copper (DAC)
• Copper wires. Directly attaches one system to another
• Key feature = Lowest Priced Link
• <3m reaches
Optical Transceiver
• Converts electrical signals to optical. Transmits blinking laser light over optical fiber.
• Key feature = long reach - up to 10Km.
Active Optical Cable
• 2 Transceivers with optical fiber bonded in.
• Key feature = Lowest Priced Optical Link
• 100m/200m Reaches
“Transceiver”: 4-channels Transmit, 4-channels Receiver
10G/40G, 14G/56G, 25G/50G/100G
InfiniBand: DAC & AOCs
Ethernet: DAC & AOCs & Transceivers
SFP28 LC Transceiver
QSFP28 LC Transceiver
QSFP28 MPO Transceiver
VCSEL & Silicon Photonics
Multi-mode & Single-mode
MPO & LC Connectors
As Data Rates Increase, Distances Decrease Favoring Silicon Photonics + Single-mode Fiber
[Chart: data rate per lane (Gb/s) versus link length (m), with regions for copper, multi-mode fiber (OM3, OM4), single-mode fiber, and Silicon Photonics]
Direct Attach Copper (DACs)
• Zero power
• Demo’d 8m at 100G
• Best fit 3m
Active Optical Cables
• VCSEL 100m
• Silicon Photonics 200m
• Best fit for 5-20m
SR/SR4 VCSEL Transceivers
• Reaches to 100m
• Best fit for MMF
• Structured cabling
Silicon Photonics Transceivers
• Reaches to 2km
• Best fit for SMF
• Parallel PSM4 or WDM4
Reach markers: 3-5m, 70m, 100m, 2Km/10Km
MMF = multi-mode fiber, SMF = single-mode fiber
Webscale IT Innovation: QSFP TOR for 4x Density and Lower COGS
Break-out cabling vs standard cabling
Break-out: EST = $166, single cable!
Standard: Qty (4) cables @ $54, 4 cables = $216
Benefits
• Easier cable management
• Fewer cables
• 23% lower cost
$1000 less cable cost per rack
Ideal port density and configuration deployment options
Benefits
• Flexible configuration options
• Highest port density
• Lowest power consumption
• Half-width deployment option
Mellanox
• 16 QSFP28 ports (32 in 1 RU*)
• Up to 128 10/25GbE ports in 1 RU
• Logical configuration options:
  • Redundant “48 + 4” in 1 RU
Competition (to achieve equivalent bandwidth)
• 4 SFP+ plus 4 QSFP+ ports
• Up to 128 ports of 10GbE in 2 RU
• Illogical configuration with wasted ports
* RU = rack unit
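The cabling comparison reduces to simple arithmetic. The sketch below uses only the prices shown on the slide ($54 per standard cable, an estimated $166 for one break-out cable); variable names are illustrative.

```python
standard_cable_price = 54      # one standard cable, per the slide
standard_cables_needed = 4
breakout_cable_price = 166     # estimated price of one break-out cable

standard_total = standard_cable_price * standard_cables_needed
saving = standard_total - breakout_cable_price
saving_pct = 100 * saving / standard_total

print(f"standard: ${standard_total}, break-out: ${breakout_cable_price}")
print(f"saving: ${saving} ({saving_pct:.0f}%)")
```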
Summary: 25/50/100GbE is Here!
Interconnect
Adapter
• 100GbE Adapter
• 150 million messages per second
• 10 / 25 / 40 / 50 / 56 / 100GbE
Switch
• 32 100GbE Ports, 64 25/50GbE Ports
• 10 / 25 / 40 / 50 / 56 / 100GbE
• Throughput of 6.4Tb/s
Software
Transceivers, Active Optical and Copper Cables
• 10 / 25 / 40 / 50 / 56 / 100GbE
• VCSELs, Silicon Photonics and Copper