Ethernet and TCP optimizations

Date post: 12-May-2015
Upload: jeff-squyres
Description:
With a trivial bit of tuning, you can extract fairly amazing small message latencies out of TCP. This ain't your father's Ethernet (or TCP).
Transcript
Page 1: Ethernet and TCP optimizations

© 2012 Cisco and/or its affiliates. All rights reserved. Cisco Confidential

Ethernet: Hidden Secrets

Jeff Squyres

Page 2: Ethernet and TCP optimizations

First: some background information…

Page 3: Ethernet and TCP optimizations

Jeff’s work: Parallel computing at Cisco

Using lots and lots and lots of servers simultaneously to solve one computational problem

Page 4: Ethernet and TCP optimizations

Supercomputing applications

Racks of 36 1U servers

Tend to send lots and lots and lots of small messages across the network to stay in sync with each other

Page 5: Ethernet and TCP optimizations

Network message traversal

[Diagram: A sends a message across the underlying network; B receives the message]

Page 6: Ethernet and TCP optimizations

Network message traversal

[Diagram: A sends a message across the underlying network; B receives the message]

Today’s fastest networks: 1-3μs (!)

Page 7: Ethernet and TCP optimizations

Today’s fastest networks

• Typically not Ethernet networks

• Usually supercomputer-specific networks
  Example: highly tuned for short message latency

• …but that is changing

Ethernet Ethernot

Page 8: Ethernet and TCP optimizations

Cisco’s ultra low latency Ethernet

• Userspace NIC (“USNIC”)
  Expose Cisco NIC hardware directly to Linux userspace
  Bypass the OS
  Bypass the TCP stack

• Send raw Ethernet frames directly from user applications
  Much, much faster than traditional TCP-based networking
  Especially for latency of short messages

Page 9: Ethernet and TCP optimizations

Normal TCP software architecture

Userspace: Application, MPI library, Userspace sockets library

Kernel: TCP / IP stack, Cisco VIC driver

Hardware: Cisco VIC hardware

Page 10: Ethernet and TCP optimizations

Cisco USNIC software

Userspace: Application, MPI library, Userspace verbs library

Kernel: Verbs IB core, Cisco USNIC driver

Hardware: Cisco VIC hardware

Paths shown in the diagram: bootstrapping and setup; send and receive fast path

Page 11: Ethernet and TCP optimizations

With all that background…

Page 12: Ethernet and TCP optimizations

Doing some performance testing last week…

Two servers

Page 13: Ethernet and TCP optimizations

Doing some performance testing last week…

Two servers

Each with a 2 x 10Gb NIC

Page 14: Ethernet and TCP optimizations

Doing some performance testing last week…

Two servers

Each with a 2 x 10Gb NIC

Connected back-to-back

Page 15: Ethernet and TCP optimizations

“Ping pong” latency test

Send a message from here

Receive the message here

Ping!

Page 16: Ethernet and TCP optimizations

“Ping pong” latency test

Get the message back

Send the message back

Pong!

Page 17: Ethernet and TCP optimizations

“Ping pong” latency test

Because each ping and pong are soooo short, do this ping-pong exchange N times

Ping! / Pong!

Page 18: Ethernet and TCP optimizations

“Ping pong” latency test

Time for one ping-pong = (Total time for N ping-pongs) / N

Page 19: Ethernet and TCP optimizations

“Ping pong” latency test

Time for one ping-pong = (Total time for N ping-pongs) / N

Time for one ping = (Time for one ping-pong) / 2

Page 20: Ethernet and TCP optimizations

Time for one ping = Half-round trip (HRT) ping pong latency
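The measurement described on the last few slides can be sketched in a few lines of Python. This is only an illustration over the loopback interface, not the benchmark used for the results in this talk; the function names, the loopback address, and the choice to set TCP_NODELAY are all assumptions made for the example.

```python
import socket
import threading
import time


def recv_exact(sock, n):
    """Read exactly n bytes from a TCP socket."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf.extend(chunk)
    return bytes(buf)


def pong_side(listener, size, n):
    """Side B: receive each ping and send it straight back."""
    conn, _ = listener.accept()
    with conn:
        conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        for _ in range(n):
            conn.sendall(recv_exact(conn, size))


def hrt_latency_us(size=1, n=1000):
    """Side A: time n ping-pongs and return the half-round-trip latency in microseconds."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # loopback only; a real test uses two servers
    listener.listen(1)
    peer = threading.Thread(target=pong_side, args=(listener, size, n))
    peer.start()
    with socket.create_connection(listener.getsockname()) as sock:
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        payload = b"x" * size
        start = time.perf_counter()
        for _ in range(n):
            sock.sendall(payload)      # ping
            recv_exact(sock, size)     # pong
        total = time.perf_counter() - start
    peer.join()
    listener.close()
    # Total time / N = one ping-pong; divide by 2 for one ping (HRT).
    return total / n / 2 * 1e6


if __name__ == "__main__":
    print(f"1-byte HRT latency: {hrt_latency_us():.1f} us")
```

Numbers from loopback say nothing about a NIC's coalescing behavior; the point is only the "total time / N / 2" bookkeeping.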

Page 21: Ethernet and TCP optimizations

Results: using 1x10G Ethernet port

1 byte: ~60μs

8MB: ~150ms

Page 22: Ethernet and TCP optimizations

Results: using 2x10G Ethernet ports

1 port: 1 byte ~60μs, 8MB ~150ms

2 ports: 1 byte ~30μs (!), 8MB ~8.3ms

Page 23: Ethernet and TCP optimizations

Results: using 2x10G Ethernet ports

1 port: 1 byte ~60μs, 8MB ~150ms

2 ports: 1 byte ~30μs (!), 8MB ~8.3ms

WHOA!

Page 24: Ethernet and TCP optimizations

Results: just the small messages

The facts:
From 1-1024 bytes: flat latency

Using 1 interface: ~60μs
Using 2 interfaces: ~30μs

Page 25: Ethernet and TCP optimizations

Results: just the small messages

The facts:
From 1-1024 bytes: flat latency

Using 1 interface: ~60μs
Using 2 interfaces: ~30μs

WHY?

Page 26: Ethernet and TCP optimizations

Must look at how TCP works…

1. Ethernet frame arrives

Page 27: Ethernet and TCP optimizations

Must look at how TCP works…

1. Ethernet frame arrives

2. NIC sends interrupt to OS Ethernet driver

Page 28: Ethernet and TCP optimizations

Must look at how TCP works…

1. Ethernet frame arrives

2. NIC sends interrupt to OS Ethernet driver

3. OS Ethernet driver copies the packet to RAM

Page 29: Ethernet and TCP optimizations

Must look at how TCP works…

1. Ethernet frame arrives

2. NIC sends interrupt to OS Ethernet driver

3. OS Ethernet driver copies the packet to RAM

4. OS TCP stack hands packet off to (whatever)

Page 30: Ethernet and TCP optimizations

The Costco Rule

It’s always better in bulk

Page 31: Ethernet and TCP optimizations

Why copy one packet at a time?

Let’s optimize this part

Page 32: Ethernet and TCP optimizations

Two (commonly used) optimizations

1. Copy a bunch of packets across PCI at one time

Page 33: Ethernet and TCP optimizations

Two (commonly used) optimizations

1. Copy a bunch of packets across PCI at one time

2. Only raise one interrupt for all of those packet copies

Page 34: Ethernet and TCP optimizations

Two (commonly used) optimizations

1. Copy a bunch of packets across PCI at one time

2. Only raise one interrupt for all of those packet copies

A.k.a. “Interrupt Coalescing”

Page 35: Ethernet and TCP optimizations

Interrupt coalescing

1. Ethernet frame arrives

Page 36: Ethernet and TCP optimizations

Interrupt coalescing

1. Ethernet frame arrives

2. Has N time passed since we sent an interrupt to the OS?

Page 37: Ethernet and TCP optimizations

Interrupt coalescing

1. Ethernet frame arrives

2. Has N time passed since we sent an interrupt to the OS?

✖ No: queue up the frame

Page 38: Ethernet and TCP optimizations

Interrupt coalescing

1. Ethernet frame arrives

2. Has N time passed since we sent an interrupt to the OS?

✖ No: queue up the frame

✔ Yes: send all queued frames and interrupt
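The decision on this slide can be written down as a tiny model. This is a sketch of the logic only, not how any particular NIC or driver implements it; the class name, method names, and the use of microsecond timestamps are hypothetical. One simplification is flagged in the comments: this model only checks the timer when a frame arrives, whereas real hardware can also fire the interrupt when the timer itself expires.

```python
class CoalescingNic:
    """Toy model of the per-frame interrupt coalescing decision."""

    def __init__(self, timer_us):
        self.timer_us = timer_us        # the "N time" from the slide
        self.last_interrupt_us = 0      # when we last interrupted the OS
        self.queue = []                 # frames waiting for the next interrupt
        self.interrupts = 0

    def frame_arrives(self, now_us, frame):
        """Return the frames handed to the OS by this arrival (possibly none).

        Simplification: the timer is only checked on arrival.
        """
        self.queue.append(frame)
        if now_us - self.last_interrupt_us >= self.timer_us:
            # Yes: send all queued frames and raise one interrupt.
            delivered, self.queue = self.queue, []
            self.last_interrupt_us = now_us
            self.interrupts += 1
            return delivered
        # No: leave the frame queued.
        return []


if __name__ == "__main__":
    nic = CoalescingNic(timer_us=125)
    for t, name in [(10, "a"), (50, "b"), (130, "c")]:
        print(t, nic.frame_arrives(t, name))
```

With a 125μs timer, the frames arriving at 10μs and 50μs sit in the queue until the arrival at 130μs delivers all three with a single interrupt; with a timer of 0, every frame is delivered immediately.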

Page 39: Ethernet and TCP optimizations

Ok… So what?

Page 40: Ethernet and TCP optimizations

The key: NIC interrupt coalescing timers

Page 41: Ethernet and TCP optimizations

Timeline of a ping pong

[Timeline diagram: NIC A and NIC B, each with a periodic interrupt coalescing timeout of 125μs]

1. A sends ping frame

2. B receives ping frame

Page 42: Ethernet and TCP optimizations

Timeline of a ping pong

[Timeline diagram: NIC A and NIC B]

3. Coalesce timer expires; B sends interrupt

4. B sends pong frame

Page 43: Ethernet and TCP optimizations

Timeline of a ping pong

[Timeline diagram: NIC A and NIC B]

5. Coalesce timer expires; A sends interrupt

6. A sends ping frame

7. Rinse, repeat

Page 44: Ethernet and TCP optimizations

Timeline of a ping pong

[Timeline diagram: NIC A and NIC B]

4 ping-pongs in ~8x timer duration

Page 45: Ethernet and TCP optimizations

Timeline of a ping pong

[Timeline diagram: NIC A and NIC B]

In general, coalescing interrupts is a very Very Good Thing

Page 46: Ethernet and TCP optimizations

Timeline of a ping pong

[Timeline diagram: NIC A and NIC B]

But it definitely hurts low-latency traffic

Page 47: Ethernet and TCP optimizations

How do we reduce those artificial delays?

Page 48: Ethernet and TCP optimizations

Two Ethernet ports with out-of-sync timers

[Timeline diagram: NIC A and NIC B on port 0; NIC A and NIC B on port 1]

Page 49: Ethernet and TCP optimizations

Get more round trips in same amount of time

[Timeline diagram: NIC A and NIC B on port 0; NIC A and NIC B on port 1]

Page 50: Ethernet and TCP optimizations

Get more round trips in same amount of time

[Timeline diagram: NIC A and NIC B on port 0; NIC A and NIC B on port 1]

In reality, sender and receiver timers on each port are wholly unrelated; they don’t line up as nicely as they do in these examples.

Meaning: in general, you usually get better overlap

Page 51: Ethernet and TCP optimizations

Results: just the small messages

1 port: ~60μs

2 ports: ~30μs

In this case, we got such good asymmetry that the 2 port case is ~2x as fast (i.e., roughly twice as many interrupts in the same amount of time)
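A toy timeline model makes the ~2x effect easier to see. It rests on several assumptions that are not stated in the slides: each message waits at the receiver for that port's next periodic timer expiry, successive messages alternate between ports round-robin, a given port uses the same timer phase on both NICs, and the wire time is a made-up 1μs. The function names are hypothetical, and the model is not meant to reproduce the measured ~60μs and ~30μs figures.

```python
def next_tick(t_ns, period_ns, phase_ns):
    """Earliest timer expiry at or after t_ns for a periodic timer."""
    if t_ns <= phase_ns:
        return phase_ns
    ticks = -(-(t_ns - phase_ns) // period_ns)  # ceiling division on integers
    return phase_ns + ticks * period_ns


def model_hrt_ns(n_pingpongs, period_ns, wire_ns, port_phases_ns):
    """Average half-round-trip time for n ping-pongs under the toy model.

    port_phases_ns holds one timer phase per port; messages are assumed to
    alternate between ports (an assumption, not something the slides state).
    """
    t = 0
    port = 0
    for _ in range(2 * n_pingpongs):            # each ping-pong is two messages
        arrival = t + wire_ns                    # frame reaches the receiving NIC
        t = next_tick(arrival, period_ns, port_phases_ns[port])  # wait for interrupt
        port = (port + 1) % len(port_phases_ns)
    return t / (2 * n_pingpongs)


if __name__ == "__main__":
    period = 125_000   # 125 us timer, in nanoseconds
    wire = 1_000       # assumed 1 us on the wire
    one_port = model_hrt_ns(1000, period, wire, [0])
    two_ports = model_hrt_ns(1000, period, wire, [0, period // 2])
    print(f"1 port:  {one_port / 1000:.1f} us per ping")
    print(f"2 ports: {two_ports / 1000:.1f} us per ping")
```

In this model a single port spends one full timer period per ping, while two ports whose timers are half a period out of phase spend roughly half a period per ping. Each individual frame still waits for an interrupt; the gain comes from a second, differently phased timer being available sooner.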

Page 52: Ethernet and TCP optimizations

Lies, damn lies, and statistics

Remember: these are AVERAGE latencies!

Individual ping-pong times are the same as the 1 port case (from the network)

…but you get higher throughput because we’re reducing the gaps between each ping-pong

Page 53: Ethernet and TCP optimizations

Now let’s try something else…

Page 54: Ethernet and TCP optimizations

Set the coalesce timer at 0
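The slides do not say how the timer was changed for this test. On Linux, receive interrupt coalescing is commonly adjusted with `ethtool -C <device> rx-usecs <N>`, so the sketch below assumes that route. The interface name `eth0` and the helper names are hypothetical, not every driver accepts this setting, and some NICs are configured through vendor tools instead.

```python
import subprocess


def coalesce_command(device, rx_usecs):
    """Build the ethtool command line that sets the receive coalescing timer."""
    if rx_usecs < 0:
        raise ValueError("rx_usecs must be non-negative")
    return ["ethtool", "-C", device, "rx-usecs", str(rx_usecs)]


def set_coalesce_timer(device, rx_usecs, dry_run=True):
    """Print-friendly wrapper: only runs ethtool when dry_run is False."""
    cmd = coalesce_command(device, rx_usecs)
    if not dry_run:
        # Needs root privileges and a driver that supports the option.
        subprocess.run(cmd, check=True)
    return " ".join(cmd)


if __name__ == "__main__":
    # "eth0" is a placeholder interface name.
    print(set_coalesce_timer("eth0", 0))
```

The default is a dry run so that the example can be read and tried without touching a live interface.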

Page 55: Ethernet and TCP optimizations

New ping-pongs much faster!

1 port: ~10.5μs

2 ports: ~10.6μs

1 port: ~7.2ms

2 ports: ~5.5ms

Page 56: Ethernet and TCP optimizations

What are the tradeoffs?

Pros
• (Much) faster TCP latency …without changing app!
• Faster speeds seem to scale up to large messages, too
• Great for low-latency, sparse comms apps
• Best for NICs that are dedicated to MPI comms

Cons
• May not scale well for case of MPI process running on every core
• Lots and lots of interrupts going to socket:0.core:0
• May need to run (N-1) MPI processes…?
  May also want to avoid socket:0.core:0, or move IRQ affinity

Page 57: Ethernet and TCP optimizations

Your mileage may vary

Page 58: Ethernet and TCP optimizations

But it’s interesting, nonetheless!

• Some experimentation might be worth trying with real world HPC apps:

• Allow TCP to wholly utilize core 0 (i.e., run MPI processes only on cores 1-15)

• Set the coalesce timer to something more than 0μs, but less than 125μs – there’s a whole spectrum with which to play
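As a rough illustration of the first experiment, a process can be kept off core 0 on Linux with `os.sched_setaffinity`. This is a minimal sketch, not a recommendation for how to launch MPI jobs (launchers usually provide their own binding options); the helper names and the 16-core count, taken from the "cores 1-15" example above, are assumptions.

```python
import os


def cores_excluding(reserved, total_cores):
    """Return the set of core IDs left over after reserving some for interrupts."""
    return set(range(total_cores)) - set(reserved)


def pin_current_process(cores):
    """Restrict the calling process to the given cores (Linux only)."""
    os.sched_setaffinity(0, cores)  # pid 0 means "this process"


if __name__ == "__main__":
    # Leave core 0 to TCP/interrupt handling; use cores 1-15 for compute.
    compute_cores = cores_excluding({0}, total_cores=16)
    print(sorted(compute_cores))
    # pin_current_process(compute_cores)  # uncomment on a 16-core Linux machine
```

The pinning call is left commented out because it fails on machines that do not have the assumed cores.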

Page 59: Ethernet and TCP optimizations

My overall points:

• Many in HPC have Ethernot networks… but as HPC continues to commoditize itself, lots of HPC users have Ethernet-based environments

• Today’s Ethernet switches and NICs are actually quite a bit faster and more advanced than what we old-time-HPCers grew up with

• Even good ol’ TCP is amazingly fast and optimized today

• You may be able to tune your NIC and/or fabric to extract pretty darn good MPI TCP performance

The default settings on your Ethernet NIC / fabric are likely tuned for general TCP traffic, which produces very different performance characteristics from what HPC applications typically need

Page 60: Ethernet and TCP optimizations

Thank you.

