
Oct 2018 David Wetherall (presenter) Nandita Dukkipati ... Talks/2018/David_Wetherall… · NSDI...

Date post: 04-Jul-2020
Page 1: Oct 2018 David Wetherall (presenter) Nandita Dukkipati ... Talks/2018/David_Wetherall… · NSDI 2016 Maglev: A Fast and Reliable Software Network Load Balancer SIGCOMM 2015 TIMELY:

Confidential + Proprietary

From Queues to Earliest Departure Time

Nandita Dukkipati, Eric Dumazet, Van Jacobson, Amin Vahdat & …

David Wetherall (presenter)

Oct 2018

1


Google Platforms and Infrastructure designs...
● Datacenters
● Computers
● Networks
● Software

... to turn a network of computers into a single global computer.

2


Scaling: Data Center Network Bandwidth Growth

[Figure: Traffic generated by servers in our datacenters. Y-axis: aggregate traffic, 1x to 50x. X-axis: time, with ticks Jul ‘08, Jun ‘09, May ‘10, Apr ‘11, Mar ‘12, Feb ‘13, Dec ‘13, Nov ‘14.]

3


Google Datacenter Network Innovation

And hardware scale that we could not buy

[Figure: capacity vs. time. Labels: 4 Post, Firehose 1.0, Firehose 1.1, Watchtower, Saturn, Jupiter. Annotation: 1.3Pb/s clusters in 2013.]

4


Google’s data center in Douglas County, GA

https://www.google.com/about/datacenters/

5


Google Network

More than a collection of data centers

[Map figure. Legend: Google Global Cache edge nodes; Points of presence >100; Network fiber. Labels: Unity (US, JP) 2010; SJC (JP, HK, SG) 2013; FASTER (US, JP, TW) 2016.]

6


Google Network innovations

Our distributed computing infrastructure required networks that did not exist

[Timeline figure, 2006 to 2016. Labels: B4, Google Global Cache, BwE, Jupiter, gRPC, Onix, Freedome, Watchtower, QUIC, Andromeda, TIMELY, Maglev, BBR.]

7


Subset of Google networking publications

● SIGCOMM 2018 B4 and After: Managing Hierarchy, Partitioning, and Asymmetry for …
● NSDI 2018 Andromeda: Performance, Isolation, and Velocity at Scale in Cloud ...
● SIGCOMM 2017 Carousel: Scalable Traffic Shaping at End-Hosts
● SIGCOMM 2017 Taking the Edge off with Espresso: Scale, Reliability and Programmability ...
● CACM 2017 BBR: Congestion-Based Congestion Control
● SIGCOMM 2016 An Internet-Wide Analysis of Traffic Policing
● SIGCOMM 2016 Evolve or Die: High-Availability Design Principles Drawn from Failures in a ...
● NSDI 2016 Maglev: A Fast and Reliable Software Network Load Balancer
● SIGCOMM 2015 TIMELY: RTT-based Congestion Control for the Datacenter
● SIGCOMM 2015 Condor: Better Topologies through Declarative Design
● SIGCOMM 2015 Bandwidth Enforcer: Flexible, Hierarchical Bandwidth Allocation for WAN ...
● SIGCOMM 2015 Jupiter Rising: A Decade of Clos Topologies and Centralized Control in ...
● NSDI 2014 Libra: Divide and Conquer to Verify Forwarding Tables in Huge Networks
● SIGCOMM 2013 B4: Experience With a Globally-Deployed Software Defined WAN

See http://g.co/research/networkinfra

8



TCP sends AFAP (as fast as possible)

TCP’s reliable delivery constrains how much is sent but not how fast.

● This leads to an AFAP (as fast as possible)* output contract, with shaping usually implemented by device output queue(s).

● queue length limit or receive window determines inflight between protocol & device.

● ‘how fast’ is implicit in the queue drain rate, a constraint that is local & upstream of the wire

10

*a.k.a. work conserving


Traffic Shaping with Queues

[Diagram: Packet Sources (Socket Buffers in Host OS or Guest VM) → Classifier → Shaper (AFAP) with queues Rate1, Rate2, Rate3 → Scheduler → To NIC]

Multiple Token Bucket queues in shaper for bandwidth policies
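
As a rough illustration of the token bucket queues mentioned on this slide, here is a minimal Python sketch of a single token bucket. The class name, field names and units are assumptions made for the example; the slides do not show an implementation, and a real shaper would keep one such bucket per rate class and a queue behind each.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Token bucket for one traffic class (hypothetical helper, not from the talk)."""
    rate_bytes_per_s: float   # refill rate
    burst_bytes: float        # bucket depth, i.e. the largest burst allowed
    tokens: float = 0.0       # tokens currently available, in bytes
    last_refill_s: float = 0.0

    def refill(self, now_s: float) -> None:
        # Add tokens for the time elapsed since the last refill, capped at the depth.
        elapsed = max(0.0, now_s - self.last_refill_s)
        self.tokens = min(self.burst_bytes, self.tokens + elapsed * self.rate_bytes_per_s)
        self.last_refill_s = now_s

    def try_send(self, size_bytes: int, now_s: float) -> bool:
        """Consume tokens and return True if the packet may leave now."""
        self.refill(now_s)
        if self.tokens >= size_bytes:
            self.tokens -= size_bytes
            return True
        return False  # caller must keep the packet queued and retry later


bucket = TokenBucket(rate_bytes_per_s=1000.0, burst_bytes=1500.0, tokens=1500.0)
first = bucket.try_send(1500, now_s=0.0)    # bucket starts full, so this passes
second = bucket.try_send(1500, now_s=0.5)   # only 500 bytes of tokens have accrued
third = bucket.try_send(1500, now_s=1.5)    # another second of refill is enough
```

Note that a rejected packet has to wait in a queue until enough tokens accrue, which is where the per-class queues in the diagram come from.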

11


AFAP helped TCP/IP speed up 10,000x

(25 years of Ethernet evolution)

12


… but AFAP makes bottleneck run at 100%

Queuing theory says this is fragile. E.g., for M/D/1:
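
To make the M/D/1 point concrete, the sketch below uses the standard textbook result for the mean time a job waits in an M/D/1 queue, W = ρ·s / (2·(1 − ρ)), where ρ is the utilization and s the fixed service time. The function name and the sample utilizations are assumptions chosen for illustration; they are not taken from the slides.

```python
def md1_mean_wait(utilization: float, service_time: float = 1.0) -> float:
    """Mean waiting time in queue for M/D/1 (Poisson arrivals, fixed service time).

    Standard result: W = rho * s / (2 * (1 - rho)).
    """
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return utilization * service_time / (2.0 * (1.0 - utilization))


# Waiting time, in units of one service time, as the bottleneck nears 100%.
for rho in (0.5, 0.8, 0.9, 0.95, 0.99):
    print(f"utilization {rho:.2f}: mean wait = {md1_mean_wait(rho):.1f} service times")
```

The wait grows without bound as utilization approaches 1, which is the fragility the slide refers to.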

13


Mitigating Factors

Pain in the form of delay/loss explosion comes from running above the line rate at the bottleneck for too long. This is less of an issue if:

● bandwidth-delay products are small and/or

● there’s a fat-buffered router in front of every bottleneck and/or

● links from hosts to ToRs run slower than fabric

The first of these saved us until ~1995, then the second & third until ~2012.

Since then pain has been increasing.
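
For a sense of scale on the first mitigating factor, the bandwidth-delay product is simply link rate times round-trip time. The sketch below computes it for two link profiles; both sets of numbers are illustrative assumptions, not measurements from the talk.

```python
def bdp_bytes(bandwidth_bps: float, rtt_us: float) -> float:
    """Bandwidth-delay product in bytes: (bits per second * seconds) / 8."""
    return bandwidth_bps * rtt_us / 8e6


# Hypothetical older WAN path: 10 Mb/s with a 100 ms round trip.
old_path = bdp_bytes(10e6, 100_000)
# Hypothetical modern datacenter path: 100 Gb/s with a 100 us round trip.
new_path = bdp_bytes(100e9, 100)

print(f"10 Mb/s x 100 ms  -> {old_path / 1e3:.0f} kB in flight")
print(f"100 Gb/s x 100 us -> {new_path / 1e3:.0f} kB in flight")
```

Under these assumed numbers the in-flight data per path grows by an order of magnitude even though the round trip shrinks a thousandfold, so a sender running AFAP can dump far more into a bottleneck before feedback arrives.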

14


After 2000, going faster got hard

Difficult to keep fabric switches 10x faster than server NICs

15


Pain Relief

If we have less space (queues) to work with then we need to rely more on time
● Determine what’s AFAP at the bottleneck and run at that rate ...

Examples:
● HULL (NSDI’12) – ‘Less is more: trading a little bandwidth for ultra-low latency’
● BwE (Sigcomm’15) – ‘Flexible, Hierarchical Bandwidth Allocation for WAN’
● FQ/pacing (IETF88’13) – ‘TSO, fair queuing, pacing: three’s a charm’
● Timely (Sigcomm’15) – ‘RTT-based congestion control for the datacenter’
● BBR (CACM v60’17) – ‘Congestion-based congestion control’
● Carousel (Sigcomm’17) – ‘Scalable traffic shaping at end hosts’

16


Moving from AFAP to EDT (earliest departure time)

AFAP isn’t working for us now because it’s local and our problems aren’t

We need a model that allows more nuanced control of packet spacing on the wire
● The EDT model is a great candidate for a replacement
● Specify the earliest departure time of packets to control release

The enforcement mechanism needs to be just in front of (or in) the NIC to enforce relationships between all outgoing packets.
● Carousel (details soon) is a great example of such a mechanism.

17



Requirements for Traffic Shapers are Two Sides of the Same Coin

Meet the needs of network policies & congestion control, e.g. TCP
● Pace packets
● Provide backpressure
● Avoid HOL blocking

Use CPU & memory efficiently

19


Challenges:

CPU cost of maintaining queues grows super-linearly with #queues.

Synchronization cost on multi-CPU systems is dominated by locking and contention overhead when sharing queues amongst CPUs.

Example 1: Hierarchical Token Bucket (HTB) Linux Queuing Discipline.

20


Example 2: FQ/pacing Linux Queueing Discipline

21


We don’t need queues and their associated cost

Using time as a basic construct gives us all the control we need, and at very low cost

22


Carousel’s core idea is to replace a complex of slow, brittle, concatenated queues with two simple pieces:

1. An Earliest Departure Time (EDT) timestamp in every socket buffer (skb)

2. A timing-wheel* scheduler replacing the queue in front of (or in) the NIC.

* See, for example, Hashed and Hierarchical Timing Wheels, Varghese & Lauck, SOSP ’87.
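
The second piece, a timing wheel, can be sketched in a few lines of Python. This is a simplified single-level wheel written for illustration, not Carousel's code: the class name, the nanosecond units, the rounding of timestamps up to the next slot (so nothing leaves early) and the error raised beyond the horizon are all assumptions.

```python
from collections import deque


class TimingWheel:
    """Single-level timing wheel indexed by departure time (illustrative sketch)."""

    def __init__(self, num_slots: int, slot_ns: int, start_ns: int = 0):
        self.num_slots = num_slots
        self.slot_ns = slot_ns                      # time span covered by one slot
        self.slots = [deque() for _ in range(num_slots)]
        self.current_tick = start_ns // slot_ns     # next tick that will be drained

    def insert(self, packet, departure_ns: int) -> None:
        """O(1) insert into the slot for the packet's earliest departure time."""
        # Round up so a packet is never released before its timestamp;
        # timestamps already in the past go into the next slot to be drained.
        tick = max(-(-departure_ns // self.slot_ns), self.current_tick)
        if tick >= self.current_tick + self.num_slots:
            raise ValueError("departure time is beyond the wheel's horizon")
        self.slots[tick % self.num_slots].append(packet)

    def advance(self, now_ns: int) -> list:
        """Drain every slot whose time has arrived and return its packets."""
        due = []
        last_tick = now_ns // self.slot_ns
        while self.current_tick <= last_tick:
            slot = self.slots[self.current_tick % self.num_slots]
            due.extend(slot)
            slot.clear()
            self.current_tick += 1
        return due


wheel = TimingWheel(num_slots=8, slot_ns=1000)
wheel.insert("a", departure_ns=2500)
wheel.insert("b", departure_ns=500)
wheel.insert("c", departure_ns=2100)
sent_at_1000 = wheel.advance(1000)   # "b" is due once its slot is reached
sent_at_3000 = wheel.advance(3000)   # "a" and "c" share the slot ending at 3000 ns
```

Insertion touches exactly one slot regardless of how many packets or flows are pending, which is the property the slide contrasts with per-class queues.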

23


Design of Carousel

[Diagram: Socket Buffers → Timestamper → Shaper (EDT) → To NIC]

1) Single, O(1), time-indexed queue, ordered by packet departure timestamps

24


[Diagram: Timestamper → Shaper (EDT) → To NIC]

1) Single, O(1), time-indexed queue, ordered by packet departure timestamps

2) Apply Backpressure

25


[Diagram: Timestamper → Shaper (EDT) → To NIC]

1) Single, O(1), time-indexed queue, ordered by packet departure timestamps

2) Apply Backpressure

3) One Shaper per Core

26


Life of a Packet in Carousel

[Diagram: Timestamper → Shaper → To NIC]

● Compute Earliest Departure Time based on shaping rate
● Enqueue Packet in Timing Wheel
● Dequeue packet and deliver completion
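
The first step, computing an earliest departure time from a shaping rate, can be sketched as follows. This is one common way to derive pacing timestamps and is assumed here for illustration; the class and attribute names are hypothetical and the slides do not give the exact formula.

```python
class RatePacer:
    """Stamps packets of one flow or aggregate with earliest departure times."""

    def __init__(self, rate_bytes_per_s: float):
        self.rate = rate_bytes_per_s
        self.next_free_s = 0.0   # earliest time the next packet may depart

    def stamp(self, size_bytes: int, now_s: float) -> float:
        """Return the packet's EDT and reserve its transmission time at the rate."""
        edt = max(now_s, self.next_free_s)           # never schedule in the past
        self.next_free_s = edt + size_bytes / self.rate
        return edt


pacer = RatePacer(rate_bytes_per_s=1000.0)
# Three 500-byte packets handed over back to back at t = 0 are spaced 0.5 s apart.
stamps = [pacer.stamp(500, now_s=0.0) for _ in range(3)]
# After an idle period the flow does not accumulate credit: it departs at 'now'.
after_idle = pacer.stamp(500, now_s=5.0)
```

The only per-flow state is a single timestamp, so no per-flow queue is needed: the stamped packet goes straight into the shared timing wheel.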

27


A timing wheel does what a queue does (and more) but is faster

Timing wheel insert & delete is O(1) like a queue but with a smaller multiplier:
● cache friendly (no pointer chains)
● RCU friendly (single slot to update)

Driver (or NIC) gets to choose ‘event horizon’ (wheel length) so can do BQL-like tuning for long enough to fill wire but short enough to not blow away caches.

28


A timing wheel does what a queue does (and more) but is faster

Packets that would be sent after the event horizon can get a TSQ-like callback when they can be sent, or get an ETooFar.

This replaces TSQ and fixes the problem of many simultaneous writers generating huge queues.

It also puts hard bounds on the # of active output bytes, increasing the probability of L3 cache hits for systems that can DMA from L3.

29
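
The event-horizon check described on this slide can be sketched as a small admission function. Only the name ETooFar comes from the slide; the `TooFar` result type, its `retry_at_ns` field and the `admit` function are hypothetical, and the callback machinery is left out.

```python
from dataclasses import dataclass


@dataclass
class TooFar:
    """Hypothetical rejection result: the sender may retry at `retry_at_ns`."""
    retry_at_ns: int


def admit(departure_ns: int, now_ns: int, horizon_ns: int):
    """Return None if the packet fits inside the event horizon, else a TooFar."""
    if departure_ns - now_ns <= horizon_ns:
        return None   # caller may insert the packet into the wheel
    # The packet first fits once 'now' has advanced to departure - horizon.
    return TooFar(retry_at_ns=departure_ns - horizon_ns)


ok = admit(departure_ns=1500, now_ns=1000, horizon_ns=1000)
rejected = admit(departure_ns=5000, now_ns=1000, horizon_ns=1000)
```

Because a sender is told when to come back instead of being allowed to enqueue, the bytes held below the socket layer stay bounded by the horizon rather than by the number of writers.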


Qdiscs accomplish more with less

Qdiscs become purely computational – no more intermediate queues.

→ driver gets to see all packets in its event horizon so can easily do informed interrupt mitigation, lazy reclaim, (wifi) endpoint aggregation…

→ sender learns packet send time on send() and can handle deadlines, seek alternatives, do phase correction...

30


Qdiscs accomplish more with less

In essence, a timing wheel is an in-memory representation of how packets will appear on the wire. It can represent almost any causal scheduling policy.

(Policies like ‘Maximize Completion Rate’ are impossible to express with rates but easy with timestamps, so we can finally make transactions ‘fair’ without stupidly slowing everything down.)

31


Summary

It’s time to change the host network model for sending from AFAP to EDT

It’s a match for our modern need to control packet spacing on the wire

It’s more efficient and effective than complex arrangements of queues

It admits rich scheduling policies

We hope it will unleash a wave of innovation

We’re converts, and we hope that you are too!

32


Thank You. Questions?

33

