Recent Linux TCP Updates, and how to tune your 100G host
Brian Tierney and Nate Hanford, ESnet
[email protected], http://fasterdata.es.net
SC16 INDIS Workshop
November 13, 2016
Observation #1
• TCP is more stable in CentOS 7 vs CentOS 6
  – Throughput ramps up much quicker
    • More aggressive slow start
  – Less variability over life of the flow
Berkeley to Amsterdam
New York to Texas
Observation #2
• Turning on FQ helps throughput even more
  – TCP is even more stable
  – Works better with small buffer devices
• Pacing to match bottleneck link works better yet
TCP option: Fair Queuing Scheduler (FQ)
Available in Linux kernel 3.11 (released late 2013) or higher
  – Available in Fedora 20, Debian 8, and Ubuntu 13.10
  – Backported to 3.10.0-327 kernel in v7.2 CentOS/RHEL (Dec 2015)
To enable Fair Queuing (which is off by default), do:
  – tc qdisc add dev $ETH root fq
Or add this to /etc/sysctl.conf:
  net.core.default_qdisc = fq
To both pace and shape the traffic:
  – tc qdisc add dev $ETH root fq maxrate Ngbit
    • Can reliably pace up to a maxrate of 32Gbps on a fast processor
Can also do application pacing using a ‘setsockopt(SO_MAX_PACING_RATE)’ system call
  – iperf3 supports this via the ‘--bandwidth’ flag
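The application-level pacing call can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not how iperf3 implements it: the helper names are hypothetical, the fallback constant 47 is the x86 Linux value of SO_MAX_PACING_RATE (used only if the Python socket module does not export it), and the rate is passed as a 32-bit unsigned value in bytes per second, which caps it at roughly 34 Gbit/s.

```python
import socket
import struct

# x86 Linux value of SO_MAX_PACING_RATE; used only if the socket module
# does not export the constant (assumption for this sketch).
SO_MAX_PACING_RATE = getattr(socket, "SO_MAX_PACING_RATE", 47)

# The option is passed here as a 32-bit unsigned integer (bytes per second).
U32_MAX = 2**32 - 1


def gbit_to_bytes_per_sec(gbit):
    """Convert a rate in Gbit/s to bytes/s."""
    return int(gbit * 1_000_000_000) // 8


def set_max_pacing_rate(sock, gbit):
    """Ask the kernel to pace this socket (hypothetical helper, Linux only).

    Returns the rate in bytes/s that was requested.
    """
    rate = min(gbit_to_bytes_per_sec(gbit), U32_MAX)
    sock.setsockopt(socket.SOL_SOCKET, SO_MAX_PACING_RATE, struct.pack("I", rate))
    return rate
```

For example, set_max_pacing_rate(sock, 15) on a TCP socket requests 15 Gbit/s. On the kernels discussed here the request is enforced by the fq qdisc, so FQ still has to be enabled on the sending interface.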
New York to Texas: With Pacing
FQ packets are much more evenly spaced
tcptrace/xplot output: FQ on left, Standard TCP on right
100G Host Tuning
Test Environment
• Hosts:
  – Supermicro X10DRi DTNs
  – Intel Xeon E5-2643 v3, 2 sockets, 6 cores each
  – CentOS 7.2 running Kernel 3.10.0-327.el7.x86_64
  – Mellanox ConnectX-4 EN/VPI 100G NICs with ports in EN mode
  – Mellanox OFED Driver 3.3-1.0.4 (03 Jul 2016), Firmware 12.16.1020
• Topology
  – Both systems connected to Dell Z9100 100Gbps ON Top-of-Rack Switch
  – Uplink to nersc-tb1 ALU SR7750 Router running 100G loop to Starlight and back
    • 92ms RTT
  – Using Tagged 802.1q to switch between Loop and Local VLANs
  – LAN had 54 usec RTT
• Configuration:
  – MTU was 9000B
  – irqbalance, tuned, and numad were off
  – core affinity was set to cores 7 and 8 (on the NUMA node closest to the NIC)
  – All tests are IPV4 unless otherwise stated
Testbed Topology
[Diagram labels: hosts nersc-tbn-4 and nersc-tbn-5; Dell Z9100 switch; routers nersc-tb1 and star-tb1 (Alcatel 7750 Router); sites Oakland, CA and StarLight (Chicago); 100G loop with RTT = 92ms; 100G and 40G links]
Each host has:
• Mellanox ConnectX-4 (100G)
• Mellanox ConnectX-3 (40G)
Our Current Best Single Flow Results
• TCP
  – LAN: 79Gbps
  – WAN (RTT = 92ms): 36.5Gbps, 49Gbps using ‘sendfile’ API (‘zero-copy’)
  – Test commands:
    • LAN: nuttcp -i1 -xc7/7 -w1m -T30 hostname
    • WAN: nuttcp -i1 -xc7/7 -w900M -T30 hostname
• UDP:
  – LAN and WAN: 33Gbps
  – Test command: nuttcp -l8972 -T30 -u -w4m -Ru -i1 -xc7/7 hostname
Others have reported up to 85Gbps LAN performance with similar hardware
CPU governor
Linux CPU governor (P-States) setting makes a big difference:
  RHEL: cpupower frequency-set -g performance
  Debian: cpufreq-set -r -g performance
57Gbps default settings (powersave) vs. 79Gbps ‘performance’ mode on the LAN
To watch the CPU governor in action:
  watch -n 1 grep MHz /proc/cpuinfo
  cpu MHz : 1281.109
  cpu MHz : 1199.960
  cpu MHz : 1299.968
  cpu MHz : 1199.960
  cpu MHz : 1291.601
  cpu MHz : 3700.000
  cpu MHz : 2295.796
  cpu MHz : 1381.250
  cpu MHz : 1778.492
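The same check can be scripted. The sketch below is only an illustration (the function names are hypothetical) and assumes the usual /proc/cpuinfo layout of "key : value" lines; it collects the per-core clock speeds so that a script can flag cores that have been clocked down.

```python
def parse_cpu_mhz(cpuinfo_text):
    """Return the per-core clock speeds (MHz) found in /proc/cpuinfo text."""
    speeds = []
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "cpu MHz":
            speeds.append(float(value))
    return speeds


def read_cpu_mhz(path="/proc/cpuinfo"):
    """Read and parse the live file (Linux only)."""
    with open(path) as f:
        return parse_cpu_mhz(f.read())
```

A large gap between min(speeds) and max(speeds) during a transfer suggests the governor is not holding the cores at full speed.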
TCP Buffers
# add to /etc/sysctl.conf
# allow testing with 2GB buffers
net.core.rmem_max = 2147483647
net.core.wmem_max = 2147483647
# allow auto-tuning up to 2GB buffers
net.ipv4.tcp_rmem = 4096 87380 2147483647
net.ipv4.tcp_wmem = 4096 65536 2147483647
2GB is the max allowable under Linux
WAN BDP = 12.5GB/s * 92ms = 1150MB (autotuning set this to 1136MB)
LAN BDP = 12.5GB/s * 54us = 675KB (autotuning set this to 2-9MB)
Manual buffer tuning made a big difference on the LAN:
  – 50-60Gbps vs 79Gbps
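The two bandwidth-delay products quoted above can be reproduced with a few lines of Python. This is a small sketch (the function name is hypothetical) that uses integer arithmetic with the RTT in microseconds to avoid floating-point rounding.

```python
def bdp_bytes(rate_gbps, rtt_us):
    """Bandwidth-delay product in bytes for a link rate (Gbit/s) and RTT (microseconds)."""
    bytes_per_sec = rate_gbps * 10**9 // 8
    return bytes_per_sec * rtt_us // 10**6


wan = bdp_bytes(100, 92_000)  # 92 ms WAN loop
lan = bdp_bytes(100, 54)      # 54 usec LAN
print(wan, lan)  # 1150000000 675000
```

Both values are below the 2147483647-byte limit set in the sysctl lines, so a single flow's window can cover either path.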
zerocopy (sendfile) results
• iperf3 -Z option
• No significant difference on the LAN
• Significant improvement on the WAN
  – 36.5Gbps vs 49Gbps
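To show what "zero-copy" means at the API level, here is a minimal Python sketch. It is not the iperf3 code (iperf3 is written in C) and the helper name is hypothetical; it relies on Python's socket.sendfile(), which uses os.sendfile() where the platform supports it and falls back to ordinary send() otherwise.

```python
import socket


def send_file_zero_copy(sock, path):
    """Send a whole file over a connected stream socket; return bytes sent."""
    with open(path, "rb") as f:
        # socket.sendfile() uses os.sendfile() when available, so the kernel
        # moves file data to the socket without copying it through user space.
        return sock.sendfile(f)
```

The saving is the user-space copy on the sender, which is one plausible reason the gain shows up on the sender-limited WAN test rather than on the LAN.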
IPv4 vs IPv6 results
• IPV6 is slightly faster on the WAN, slightly slower on the LAN
• LAN:
  – IPV4: 79Gbps
  – IPV6: 77.2Gbps
• WAN
  – IPV4: 36.5Gbps
  – IPV6: 37.3Gbps
Don’t Forget about NUMA Issues
• Up to 2x performance difference if you use the wrong core.
• If you have a 2 CPU socket NUMA host, be sure to:
  – Turn off irqbalance
  – Figure out what socket your NIC is connected to:
    cat /sys/class/net/ethN/device/numa_node
  – Run Mellanox IRQ script:
    /usr/sbin/set_irq_affinity_bynode.sh 1 ethN
  – Bind your program to the same CPU socket as the NIC:
    numactl -N 1 program_name
• Which cores belong to a NUMA socket?
  – cat /sys/devices/system/node/node0/cpulist
  – (note: on some Dell servers, that might be: 0,2,4,6,...)
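The sysfs lookups above can be combined in a short script. The following Python sketch uses hypothetical function names and assumes the cpulist file contains comma-separated core numbers and ranges (for example "0-5,12-17" or "0,2,4,6"); it returns the cores that sit on the same NUMA node as a given NIC.

```python
def parse_cpulist(text):
    """Expand a kernel cpulist string such as '0-5,12-17' or '0,2,4,6'."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def cpus_near_nic(iface):
    """Cores on the NIC's NUMA node (reads sysfs; Linux only)."""
    with open(f"/sys/class/net/{iface}/device/numa_node") as f:
        node = int(f.read())
    if node < 0:  # -1 means the kernel reports no NUMA affinity
        return []
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        return parse_cpulist(f.read())
```

The resulting list could be passed to os.sched_setaffinity(0, cpus) to pin the current process to those cores, which is similar in spirit to the numactl -N binding.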
Settings to leave alone in CentOS7
Recommend leaving these at the default settings, and none of these seem to impact performance much
• Interrupt Coalescence
• Ring Buffer size
• LRO (off) and GRO (on)
• net.core.netdev_max_backlog
• txqueuelen
• tcp_timestamps
Tool Selection
• iperf3, nuttcp, and iperf2 have different strengths.
• nuttcp is about 10% faster on LAN tests, and has lots of cool options.
• iperf3 has nice retransmit / congestion window report, supports FQ pacing, and JSON output option is great for producing plots
• Iperf2 is multi-threaded, and better for parallel stream testing
• Use all! All are part of the ‘perfsonar-tools’ package
  – Installation instructions at: http://fasterdata.es.net/performance-testing/network-troubleshooting-tools/
BIOS Settings
• DCA/IOAT/DDIO: ON
  – Allows the NIC to directly address the cache in DMA transfers
• PCIe Max Read Request: Turn it up to 4096, but our results suggest it doesn’t seem to hurt or help
• Turbo boost: ON
• Hyperthreading: OFF
  – Added excessive variability in LAN performance (51G to 77G)
FQ on 100G Hosts
100G Host, Parallel Streams: no pacing vs 20G pacing
We also see consistent loss on the LAN with 4 streams, no pacing
Packet loss due to small buffers in Dell Z9100 switch?
100G Host to 10G Host
FQ as a fix for underbuffered devices
Finding the optimal sending rate
• If iperf3 tests show lots of retransmits, try gradually reducing the send rate
  bwctl -c bwctl100g.sc16.org
  bwctl -c bwctl100g.sc16.org -b 20G
  bwctl -c bwctl100g.sc16.org -b 15G
• Then configure your host to use that as a max send rate:
  /sbin/tc qdisc add dev eth1 root fq maxrate 15gbit
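The selection step can be written down as a small helper. This is a sketch only: the function names are hypothetical, the retransmit counts are made-up illustration values rather than measurements from the talk, and "acceptable" is assumed to mean a retransmit count at or below a chosen threshold.

```python
def pick_send_rate(results, max_retransmits=0):
    """Return the highest tested rate (Gbit/s) whose retransmit count is acceptable.

    results maps a tested rate to the retransmits observed at that rate.
    Returns None if no tested rate qualifies.
    """
    ok = [rate for rate, retrans in results.items() if retrans <= max_retransmits]
    return max(ok) if ok else None


def fq_maxrate_command(dev, rate_gbit):
    """Build the tc command shown on the slide for a chosen rate."""
    return f"/sbin/tc qdisc add dev {dev} root fq maxrate {rate_gbit}gbit"


# Hypothetical measurements, for illustration only: rate -> retransmits.
sample = {25: 5200, 20: 340, 15: 0, 10: 0}
rate = pick_send_rate(sample)
print(fq_maxrate_command("eth1", rate))
# /sbin/tc qdisc add dev eth1 root fq maxrate 15gbit
```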
Summary of our 100G results
• New Enhancements to Linux Kernel make tuning easier in general.
• A few of the standard 10G tuning knobs no longer apply
• TCP buffer autotuning does not work well on a 100G LAN
• Use the ‘performance’ CPU governor
• Use FQ Pacing to match receive host speed if possible
• Important to be using the Latest driver from Mellanox
  – version: 3.3-1.0.4 (03 Jul 2016), firmware-version: 12.16.1020
What’s next in the TCP world?
• TCP BBR (Bottleneck Bandwidth and RTT) from Google
  – https://patchwork.ozlabs.org/patch/671069/
  – Google Group: https://groups.google.com/forum/#!topic/bbr-dev
• A detailed description of BBR published in ACM Queue, Vol. 14 No. 5, September-October 2016:
  – "BBR: Congestion-Based Congestion Control".
• Google reports 2-4 orders of magnitude performance improvement on a path with 1% loss and 100ms RTT.
  – Sample result: cubic: 3.3Mbps, BBR: 9150Mbps!!
  – Early testing on ESnet less conclusive, but seems to help on some paths
Initial BBR TCP results (bwctl, 3 streams, 40 sec test)
Remote Host               Throughput               Retransmits
perfsonar.nssl.noaa.gov   htcp: 183   bbr: 803     htcp: 1070  bbr: 240340
kstar-ps.nfri.re.kr       htcp: 4301  bbr: 4430    htcp: 1641  bbr: 98329
ps1.jpl.net               htcp: 940   bbr: 935     htcp: 1247  bbr: 399110
uhmanoa-tp.ps.uhnet.net   htcp: 5051  bbr: 3095    htcp: 5364  bbr: 412348
Varies between 4x better and 30% worse, all with WAY more retransmits.
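For readers who want to compare the rows directly, the sketch below computes BBR relative to HTCP for each host. The numbers are copied from the table, read as throughput first and retransmits second per the column headers; the variable and function names are hypothetical.

```python
# (throughput, retransmits) per congestion control algorithm, copied from the table.
results = {
    "perfsonar.nssl.noaa.gov": {"htcp": (183, 1070), "bbr": (803, 240340)},
    "kstar-ps.nfri.re.kr": {"htcp": (4301, 1641), "bbr": (4430, 98329)},
    "ps1.jpl.net": {"htcp": (940, 1247), "bbr": (935, 399110)},
    "uhmanoa-tp.ps.uhnet.net": {"htcp": (5051, 5364), "bbr": (3095, 412348)},
}


def bbr_vs_htcp(row):
    """Return (throughput ratio, retransmit ratio) of BBR relative to HTCP."""
    (h_tput, h_retx), (b_tput, b_retx) = row["htcp"], row["bbr"]
    return b_tput / h_tput, b_retx / h_retx


for host, row in results.items():
    tput_ratio, retx_ratio = bbr_vs_htcp(row)
    print(f"{host}: throughput x{tput_ratio:.2f}, retransmits x{retx_ratio:.0f}")
```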
More Information
• http://fasterdata.es.net/host-tuning/packet-pacing/
• http://fasterdata.es.net/host-tuning/100g-tuning/
• Talk on Switch Buffer size experiments:
  – http://meetings.internet2.edu/2015-technology-exchange/detail/10003941/
• Mellanox Tuning Guide:
  – https://community.mellanox.com/docs/DOC-1523
• Email: [email protected]
Extra Slides
FQ Background
• Lots of discussion around ‘buffer bloat’ starting in 2011
  – https://www.bufferbloat.net/
• Google wanted to be able to get higher utilization on their network
  – Paper: “B4: Experience with a Globally-Deployed Software Defined WAN”, SIGCOMM 2013
• Google hired some very smart TCP people
  • Van Jacobson, Matt Mathis, Eric Dumazet, and others
• Result: Lots of improvements to the TCP stack in 2013-14, including most notably the ‘fair queuing’ pacer
Benchmarking vs. Production Host Settings
There are some settings that will give you more consistent results for benchmarking, but that you may not want to use on a production DTN
Benchmarking:
• Use a specific core for IRQs:
  /usr/sbin/set_irq_affinity_cpulist.sh 8 ethN
• Use a fixed clock speed (set to the max for your processor):
  – /bin/cpupower -c all frequency-set -f 3.4GHz
Production DTN:
  /usr/sbin/set_irq_affinity_bynode.sh 1 ethN
  /bin/cpupower frequency-set -g performance
Fast Host to Slow host
Throttled the receive host using ‘cpupower’ command:
  /bin/cpupower -c all frequency-set -f 1.2GHz
A small amount of packet loss makes a huge difference in TCP performance
[Figure labels: distance categories Local (LAN), Metro Area, Regional, Continental, International; series Measured (TCP Reno), Measured (HTCP), Theoretical (TCP Reno), Measured (no loss)]
With loss, high performance beyond metro distances is essentially impossible
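The slide does not say which model produced the theoretical TCP Reno curve. One commonly used approximation is the Mathis et al. bound, rate <= (MSS / RTT) * (C / sqrt(p)), and the sketch below assumes that model with C = sqrt(3/2). The RTT and loss values are illustrative choices, not numbers read from the figure.

```python
import math


def mathis_throughput_bps(mss_bytes, rtt_s, loss_rate, c=math.sqrt(1.5)):
    """Approximate upper bound on steady-state TCP Reno throughput in bits/s.

    Mathis et al. model: rate <= (MSS / RTT) * (C / sqrt(p)).
    """
    return (mss_bytes * 8 / rtt_s) * (c / math.sqrt(loss_rate))


# Same loss rate, increasing RTT: the bound shrinks in proportion to 1/RTT.
# The RTT and loss values below are illustrative, not taken from the figure.
for label, rtt_ms in [("LAN", 1), ("metro", 5), ("continental", 50), ("international", 150)]:
    gbps = mathis_throughput_bps(8960, rtt_ms / 1000, 1e-5) / 1e9
    print(f"{label:14s} RTT {rtt_ms:4d} ms: {gbps:8.3f} Gbit/s")
```

Because the bound scales with 1/RTT, the same loss rate that is tolerable on a LAN becomes crippling on a continental path, which is the point the slide makes.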
TCP’s Congestion Control
© 2015 Internet2
50ms simulated RTT
Congestion w/ 2Gbps UDP traffic
HTCP / Linux 2.6.32
Slide from Michael Smitasin, LBLnet
Fair Queuing and Small Switch Buffers
TCP Throughput on Small Buffer Switch (Congestion w/ 2Gbps UDP background traffic)
Requires CentOS 7.2 or higher
Enable Fair Queuing:
  tc qdisc add dev EthN root fq
Pacing side effect of Fair Queuing yields ~1.25Gbps increase in throughput @ 10Gbps on our hosts
TSO differences still negligible on our hosts w/ Intel X520
Slide from Michael Smitasin, LBL
More examples of pacing helping
Parallel Stream Test 1
Left side: sum of 4 streams
Right side: tput of each stream
Streams appear to be much better balanced with FQ, pacing to 2.4 performed best
Run your own tests
• Find a remote perfSONAR host on a path of interest
  – Most of the 2000+ worldwide perfSONAR hosts will accept tests
• See: http://stats.es.net/ServicesDirectory/
• Run some tests
  – bwctl -c hostname -t60 --parsable > results.json
• Convert JSON to gnuplot format:
  – https://github.com/esnet/iperf/tree/master/contrib
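As an illustration of the conversion step, here is a minimal Python sketch. It is not the ESnet contrib script, and the function names are hypothetical. It assumes results.json uses the iperf3 JSON layout, in which each entry of "intervals" has a "sum" object with "end", "bits_per_second" and, for a TCP sender, "retransmits".

```python
import json


def iperf3_json_to_gnuplot(doc):
    """Return 'time Gbit/s retransmits' lines, one per reporting interval."""
    lines = ["# end_time_s throughput_gbps retransmits"]
    for interval in doc.get("intervals", []):
        total = interval["sum"]
        gbps = total["bits_per_second"] / 1e9
        retrans = total.get("retransmits", 0)  # treated as 0 when not reported
        lines.append(f"{total['end']:.1f} {gbps:.3f} {retrans}")
    return "\n".join(lines)


def convert_file(json_path, dat_path):
    """Read an iperf3-style JSON file and write whitespace-separated columns."""
    with open(json_path) as f:
        doc = json.load(f)
    with open(dat_path, "w") as out:
        out.write(iperf3_json_to_gnuplot(doc) + "\n")
```

The output file can then be plotted in gnuplot with a command such as: plot "results.dat" using 1:2 with lines.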