Date post: | 12-May-2015 |
Category: | Technology |
Upload: | jeff-squyres |
Cisco Confidential 1© 2012 Cisco and/or its affiliates. All rights reserved.
Ethernet: Hidden Secrets Jeff Squyres
First: some background information…
Jeff’s work: Parallel computing at Cisco
Using lots and lots and lots of servers simultaneously to solve one computational problem
Supercomputing applications
Racks of 36 1U servers
Tend to send lots and lots and lots of small messages across the network to stay in sync with each other
Network message traversal
Underlying network
A: send a message
B: receive the message
Today’s fastest networks: 1-3 μs (!)
Today’s fastest networks
• Typically not Ethernet networks
• Usually have supercomputer-specific networks
  Example: highly tuned for short message latency
• …but that is changing
Ethernet vs. “Ethernot”
Cisco’s ultra low latency Ethernet
• Userspace NIC (“USNIC”)
  Expose Cisco NIC hardware directly to Linux userspace
  Bypass the OS
  Bypass the TCP stack
• Send raw Ethernet frames directly from user applications
  Much, much faster than traditional TCP-based networking
  Especially for latency of short messages
Normal TCP software architecture
Userspace: Application, MPI library, userspace sockets library
Kernel: TCP / IP stack, Cisco VIC driver
Hardware: Cisco VIC hardware
Cisco USNIC software architecture
Userspace: Application, MPI library, userspace verbs library
Kernel: Verbs IB core, Cisco USNIC driver
Hardware: Cisco VIC hardware
Bootstrapping and setup: through the kernel
Send and receive fast path: directly between userspace and the hardware
With all that background…
Doing some performance testing last week…
Two servers
Each with a 2 x 10Gb NIC
Connected back-to-back
“Ping pong” latency test
Ping! A sends a message; B receives the message
Pong! B sends the message back; A gets the message back
Because each ping and pong is soooo short, do this ping-pong exchange N times
Ping! / Pong!
Time for one ping-pong = (Total time for N ping-pongs) / N
Time for one ping = (Time for one ping-pong) / 2
Time for one ping = Half-round trip (HRT) ping pong latency
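The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and `do_pingpong` is a hypothetical stand-in for whatever actually sends the ping and waits for the pong (an MPI send/receive pair, for instance).

```python
import time


def half_round_trip_latency(total_seconds, n_pingpongs):
    """Return the half-round-trip (HRT) latency in seconds."""
    if n_pingpongs <= 0:
        raise ValueError("n_pingpongs must be positive")
    # Time for one ping-pong = total time for N ping-pongs / N
    time_per_pingpong = total_seconds / n_pingpongs
    # Time for one ping = time for one ping-pong / 2
    return time_per_pingpong / 2


def time_pingpongs(do_pingpong, n_pingpongs):
    """Time n_pingpongs calls of do_pingpong() and return the HRT latency.

    do_pingpong is a hypothetical callable that performs one full
    ping-pong exchange (send, then wait for the reply).
    """
    start = time.perf_counter()
    for _ in range(n_pingpongs):
        do_pingpong()
    total = time.perf_counter() - start
    return half_round_trip_latency(total, n_pingpongs)
```

For example, a hypothetical total of 0.6 s for 5,000 ping-pongs works out to 120 μs per ping-pong, i.e., an HRT latency of 60 μs.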
Results: using 1x10G Ethernet port
1 byte: ~60 μs
8 MB: ~150 ms
Results: using 2x10G Ethernet ports
1 port: 1 byte ~60 μs, 8 MB ~150 ms
2 ports: 1 byte ~30 μs (!), 8 MB ~8.3 ms
WHOA!
Results: just the small messages
The facts:
From 1-1024 bytes: flat latency
Using 1 interface: ~60 μs
Using 2 interfaces: ~30 μs
WHY?
Must look at how TCP works…
1. Ethernet frame arrives
2. NIC sends interrupt to OS Ethernet driver
3. OS Ethernet driver copies the packet to RAM
4. OS TCP stack hands packet off to (whatever)
The Costco Rule
It’s always better in bulk
Why copy one packet at a time?
Let’s optimize this part
Two (commonly used) optimizations
1. Copy a bunch of packets across PCI at one time
2. Only raise one interrupt for all of those packet copies
A.k.a. “Interrupt Coalescing”
Interrupt coalescing
1. Ethernet frame arrives
2. Has N time passed since we sent an interrupt to the OS?
✖ No: queue up the frame
✔ Yes: Send all queued frames and interrupt
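The decision above can be written as a toy Python model. This is a sketch of the logic as stated on the slide, not of any real NIC or driver: the class and method names are invented for illustration, and the check runs only when a frame arrives, whereas real hardware also flushes the queue when its timer expires.

```python
class CoalescingNic:
    """Toy model: queue frames until the coalescing interval has
    elapsed since the last interrupt, then deliver them all at once."""

    def __init__(self, interval_us):
        self.interval_us = interval_us   # the "N time" from the slide
        self.last_interrupt_us = 0
        self.queue = []
        self.interrupts = 0

    def frame_arrives(self, now_us, frame):
        """Return the frames handed to the OS now (empty if still queued)."""
        self.queue.append(frame)
        if now_us - self.last_interrupt_us < self.interval_us:
            return []                    # No: queue up the frame
        delivered, self.queue = self.queue, []   # Yes: send all queued frames
        self.last_interrupt_us = now_us
        self.interrupts += 1             # ...and raise a single interrupt
        return delivered
```

With an interval of 125 μs, frames arriving at 10 μs and 50 μs are queued, and a frame arriving at 130 μs causes all three to be delivered with one interrupt. With an interval of 0, every frame is delivered immediately.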
Ok… So what?
The key: NIC interrupt coalescing timers
Timeline of a ping pong
(Diagram: timelines for NIC A and NIC B; periodic interrupt coalescing timeout every 125 μs)
1. A sends ping frame
2. B receives ping frame
3. Coalesce timer expires; B sends interrupt
4. B sends pong frame
5. Coalesce timer expires; A sends interrupt
6. A sends ping frame
7. Rinse, repeat
4 ping-pongs in ~8x timer duration
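The timeline can be reproduced with a small, idealized simulation. The assumptions here are mine, not the slides': both NICs' timers tick in lockstep (phase 0), the wire time is a nominal 1 μs, and each side replies the instant it is interrupted. It is a sketch of the diagram, not a prediction of the measured latencies.

```python
def next_tick(t_us, period_us, phase_us=0):
    """Earliest timer expiry at or after t_us.

    Ticks occur at phase_us, phase_us + period_us, and so on.
    A period of 0 models coalescing disabled: interrupt immediately.
    """
    if period_us == 0:
        return t_us
    k = -(-(t_us - phase_us) // period_us)   # ceiling division on integers
    return phase_us + k * period_us


def pingpong_finish_time(n, period_us, wire_us=1, phase_a_us=0, phase_b_us=0):
    """Time (in us) at which A has been interrupted for the n-th pong."""
    t = 0
    for _ in range(n):
        # Ping crosses the wire, then waits for B's coalesce timer.
        t = next_tick(t + wire_us, period_us, phase_b_us)
        # Pong crosses the wire, then waits for A's coalesce timer.
        t = next_tick(t + wire_us, period_us, phase_a_us)
    return t
```

Under these assumptions, `pingpong_finish_time(4, 125)` returns 1000, i.e., 4 ping-pongs in 8 timer periods. Nearly all of that time is spent waiting for timers rather than on the wire.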
In general, coalescing interrupts is a Very Good Thing
But it definitely hurts low-latency traffic
How do we reduce those artificial delays?
Two Ethernet ports with out-of-sync timers
(Diagram: NIC A and NIC B timelines for Port 0 and Port 1)
Get more round trips in same amount of time
(Diagram: NIC A and NIC B timelines for Port 0 and Port 1)
In reality, sender and receiver timers on each port are wholly unrelated; they don’t line up nicely like I used in these examples.
Meaning: in practice, you usually get better overlap
Results: just the small messages
1 port: ~60 μs
2 ports: ~30 μs
In this case, we got such good asymmetry that the 2 port case is ~2x as fast (i.e., roughly twice as many interrupts in the same amount of time)
Lies, damn lies, and statistics
Remember: these are AVERAGE latencies!
Individual ping-pong times are the same as the 1 port case (from the network)
…but you get higher throughput because we’re reducing the gaps between each ping-pong
Now let’s try something else…
Set the coalesce timer to 0
New ping-pongs much faster!
Small messages: 1 port ~10.5 μs, 2 ports ~10.6 μs
Large messages: 1 port ~7.2 ms, 2 ports ~5.5 ms
What are the tradeoffs?
Pros
• (Much) faster TCP latency …without changing app!
• Faster speeds seem to scale up to large messages, too
• Great for low-latency, sparse comms apps
• Best for NICs that are dedicated to MPI comms
Cons
• May not scale well for the case of an MPI process running on every core
• Lots and lots of interrupts going to socket:0.core:0
• May need to run (N-1) MPI processes…?
May also want to avoid socket:0.core:0, or move IRQ affinity
Your mileage may vary
But it’s interesting, nonetheless!
• Some experimentation might be worth trying with real world HPC apps:
• Allow TCP to wholly utilize core 0 (i.e., run MPI processes only on cores 1-15)
• Set the coalesce timer to something more than 0μs, but less than 125μs – there’s a whole spectrum with which to play
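Both experiments can be set up with a little scripting. The sketch below only builds the commands and core sets; it does not run anything. It assumes a Linux host whose NIC driver honors the standard `ethtool -C <iface> rx-usecs <N>` coalescing setting (not every driver does), and the interface name `eth0`, the 16-core count, and all function names are placeholders for illustration.

```python
import os


def coalesce_sweep_commands(iface, rx_usecs_values):
    """Build (but do not run) one `ethtool -C` command per timer value."""
    return [["ethtool", "-C", iface, "rx-usecs", str(v)]
            for v in rx_usecs_values]


def mpi_cores(total_cores):
    """Cores left for MPI processes when core 0 is reserved for TCP."""
    return set(range(1, total_cores))


def pin_current_process(cores):
    """Restrict the calling process to the given cores (Linux only)."""
    os.sched_setaffinity(0, cores)


if __name__ == "__main__":
    # Hypothetical sweep between 0 us and the 125 us default.
    for cmd in coalesce_sweep_commands("eth0", [0, 25, 50, 75, 100, 125]):
        print(" ".join(cmd))
    print(sorted(mpi_cores(16)))   # cores 1-15
```

Each printed command would need root privileges to apply, and the ping-pong test would be rerun after each change. In a real MPI job, process placement is normally handled by the MPI launcher's binding options rather than by calling `pin_current_process` by hand.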
My overall points:
• Many in HPC have Ethernot networks …but as HPC continues to commoditize itself, lots of HPC users have Ethernet-based environments
• Today’s Ethernet switches and NICs are actually quite a bit faster and more advanced than what we old-time-HPCers grew up with
• Even good ol’ TCP is amazingly fast and optimized today
• You may be able to tune your NIC and/or fabric to extract pretty darn good MPI TCP performance
The default settings on your Ethernet NIC / fabric are likely set for general TCP traffic, which yields very different performance characteristics than what HPC applications typically need
Thank you.