Linux TCP/IP Tuning


Copyright 2004 OSDL, All rights reserved.

Analyzing TCP Performance

Stephen Hemminger, Sr. Staff Engineer

Linux Kongress 2004

2004-09-09


Agenda

■ Introduction
■ TCP for muggles
■ Engineering Process
■ Problem examples
■ Network Tools
■ Wrapup

Outside of scope

■ Non-TCP protocols
  ■ SCTP, multicast, etc.
■ Queuing theory - “no math”
■ Hardware and product comparisons

My Background

■ Did TCP back in the “old school”
  ■ BSD 4.2, Ethernet
  ■ SMP Unix versions of OSI, Netware, Appletalk, ...
  ■ Plan9 Hypercube communication
■ Linux
  ■ Incorporation of TCP research in 2.6 kernel
  ■ Performance tests for LWE
  ■ Wizard gap

Limits of my knowledge

■ Only worked with current Linux (2.4/2.6)
■ Will mention tools here that I have not used extensively
■ Involved in development of Linux, not deployment or research


TCP for “muggles”

■ connection establishment
■ slow start
■ windows
■ congestion control
■ silly window

Connection establishment

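[Diagram: message sequence between Client and Server. The client calls connect and sends SYN; the server answers SYN+ACK and its accept returns. The client's write sends Data 1 (10 bytes); the server's read receives it and the server returns Ack 11.]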
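
The connect / accept / write / read calls in the diagram can be tried on a loopback connection. This is a minimal sketch, not something from the talk: the function name `run_echo_once` and the use of an ephemeral port on 127.0.0.1 are assumptions made for illustration.

```python
import socket
import threading

def run_echo_once(payload: bytes) -> bytes:
    """Send payload over a loopback TCP connection; return what the server read."""
    received = []

    # Server side: bind to an ephemeral loopback port and listen.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    port = listener.getsockname()[1]

    def serve():
        conn, _ = listener.accept()          # accept: returns after the handshake
        with conn:
            chunks = []
            while True:
                data = conn.recv(4096)       # read
                if not data:                 # empty read means the client closed
                    break
                chunks.append(data)
            received.append(b"".join(chunks))

    worker = threading.Thread(target=serve)
    worker.start()

    # connect: the kernel performs the SYN / SYN+ACK exchange here.
    with socket.create_connection(("127.0.0.1", port)) as client:
        client.sendall(payload)              # write

    worker.join()
    listener.close()
    return received[0]
```

Calling `run_echo_once(b"0123456789")` sends 10 bytes, the same size as the "Data 1(10)" segment in the diagram. Capturing the loopback interface while it runs should show the handshake packets.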

ethereal


tcpdump trace

13:28:21.745624 IP 172.20.1.60.38052 > 216.239.39.99.http: S 1765497548:1765497548(0) win 5840 <mss 1460,sackOK,timestamp 1563951453 0,nop,wscale 7>

13:28:21.831935 IP 216.239.39.99.http > 172.20.1.60.38052: S 227058185:227058185(0) ack 1765497549 win 8190 <mss 1460>

13:28:21.832035 IP 172.20.1.60.38052 > 216.239.39.99.http: . ack 1 win 5840

13:28:21.832321 IP 172.20.1.60.38052 > 216.239.39.99.http: P 1:126(125) ack 1 win 5840

13:28:21.939237 IP 216.239.39.99.http > 172.20.1.60.38052: . ack 126 win 31460

13:28:21.972448 IP 216.239.39.99.http > 172.20.1.60.38052: P 1:485(484) ack 126 win 31460

13:28:21.972529 IP 172.20.1.60.38052 > 216.239.39.99.http: . ack 485 win 6432

13:28:21.973016 IP 172.20.1.60.38052 > 216.239.39.99.http: F 126:126(0) ack 485 win 6432
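
The timestamps in this trace give a rough round-trip time: the gap between the client's SYN and the server's SYN+ACK. The sketch below does that subtraction. The helper name `elapsed_ms` is an assumption; the two timestamps are copied from the first two trace lines.

```python
from datetime import datetime

def elapsed_ms(start: str, end: str) -> float:
    """Milliseconds between two tcpdump-style HH:MM:SS.ffffff timestamps."""
    fmt = "%H:%M:%S.%f"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() * 1000.0

syn_sent = "13:28:21.745624"      # client SYN leaves
synack_seen = "13:28:21.831935"   # server SYN+ACK arrives
handshake_rtt = elapsed_ms(syn_sent, synack_seen)
print(f"handshake RTT is about {handshake_rtt:.1f} ms")
```

This measures one sample only, taken at the capturing host, so it is an estimate of path delay rather than a statistic.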

Flow control

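[Diagram: message sequence. The receiver sends ACK 1010 advertising a 5000-byte window. The sender's write goes out as Data 1011 (1400), Data 2411 (1400), Data 3811 (1400) and Data 5211 (800). The receiver answers Ack 6010 with window 0. After the application does read (1000), the receiver sends Ack 6010 with window 1000.]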
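
The data segments in the flow control diagram follow from the advertised window: 5000 bytes of window cut into segments of at most 1400 bytes. A small sketch of that arithmetic follows. The function name and the reading of 1400 as the segment size are assumptions taken from the numbers on the slide.

```python
def segments_for_window(next_seq: int, window: int, mss: int):
    """Split the bytes allowed by the advertised window into (seq, length) segments."""
    segments = []
    remaining = window
    while remaining > 0:
        length = min(mss, remaining)   # never send more than the window allows
        segments.append((next_seq, length))
        next_seq += length
        remaining -= length
    return segments

burst = segments_for_window(next_seq=1011, window=5000, mss=1400)
for seq, length in burst:
    print(f"Data {seq} ({length})")
```

Once the burst is sent the window is used up, which is why the sender has to wait for the receiver to advertise more space.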

Retransmission

[Diagram: message sequence. The sender's write goes out as Data 1 and Data 2; the receiver returns Ack 1 twice.]

Duplicate ACKs = fast retransmit
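
Counting duplicate ACKs is simple to sketch. The slide does not say how many duplicates are needed; the threshold of 3 below is an assumption, taken from the conventional value in RFC 5681, and the function name is hypothetical.

```python
DUP_ACK_THRESHOLD = 3  # assumption: conventional value from RFC 5681, not from the slide

def fast_retransmit_points(acks, threshold=DUP_ACK_THRESHOLD):
    """Return the ACK numbers at which a sender would trigger fast retransmit."""
    triggers = []
    last_ack = None
    dup_count = 0
    for ack in acks:
        if ack == last_ack:
            dup_count += 1
            if dup_count == threshold:   # fire once, on the Nth duplicate
                triggers.append(ack)
        else:
            last_ack = ack               # new data acknowledged, reset the counter
            dup_count = 0
    return triggers
```

For the ACK stream `[1, 1, 1, 1, 2]` the first ACK is the original and the next three are duplicates, so the function reports one trigger at ACK 1.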

Tcptrace

http://tcptrace.org

Tool to convert captured data into graphs
■ Time sequence graph
■ Throughput
■ RTT

Lots more than there is time to cover here!

Xplot

http://xplot.org
■ Takes plot command scripts
■ Mouse
  ■ Zoom – drag with the left button
  ■ Zoom out – click the left button
  ■ Scroll – drag with middle button
  ■ Dump – shift-left button produces postscript
  ■ Shift-middle and shift-right also

Time Sequence Graph


Windows & Buffering

■ Used to isolate TCP from application read/write
■ Used for congestion control
■ Upper bound determined by system parameters

Congestion window

■ slow start
  ■ Window normally starts small
  ■ Grows in response to ack
■ congestion control
  ■ Packet loss = congestion

Silly Window

[Diagram: message sequence. The sender writes 8k bytes and sends Data (2000). The receiver returns Ack [10]. Sender: “Hey, I am not going to try and send this data now, give me a bigger window first.” The receiver reads 8k bytes and returns Ack [2000]. Sender: “OK, thanks.”]

Model of TCP networks

[Diagram: Sender with its Send Window on one side, Receiver with its Receive Window on the other, and the Network between them carrying Data one way and Ack the other.]

BDP = Bandwidth (bytes/sec) * Delay (sec)

BDP - Bandwidth Delay Product

■ BDP = amount of data in transit
■ Examples
  ■ DSL/Cable modem (international)
    1,000,000 bit/sec * 1/8 byte/bit * 500 ms = 62,500 bytes
  ■ Gigabit across US
    1,000,000,000 bit/sec * 1/8 byte/bit * 70 ms = 8.75 Mbytes
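
The two BDP examples can be checked with a few lines of integer arithmetic. This is only a sketch; the function name `bdp_bytes` and the choice of bits/sec and milliseconds as input units are assumptions.

```python
def bdp_bytes(bandwidth_bits_per_sec: int, delay_ms: int) -> int:
    """Bandwidth delay product in bytes: bits/sec * sec, divided by 8 bits per byte."""
    return bandwidth_bits_per_sec * delay_ms // (8 * 1000)

dsl = bdp_bytes(1_000_000, 500)          # DSL/cable modem, international path
gigabit = bdp_bytes(1_000_000_000, 70)   # gigabit across the US

print(f"DSL:     {dsl} bytes")
print(f"Gigabit: {gigabit / 1_000_000} Mbytes")
```

A window smaller than the BDP cannot keep the link full, which is why the gigabit case needs buffers far beyond the 64K that an unscaled TCP window allows.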

Bandwidth Delay Product (BDP)

[Figure: log-log plot of bandwidth (Mbits/sec, 0.1 to 1000) against delay (ms, 0.1 to 1000), with BDP lines at 8K, 64K and 1M and regions labeled LAN, Broadband and Research.]

Internet

■ Router queues
■ Delays
  ■ Speed of light (70ms coast/coast)
  ■ Slow routers
■ Packet correlation, sizes
■ DoS

Extensions for larger windows

■ TCP Selective Acknowledgement (SACK) RFC2018
  ■ Don't have to retransmit everything
■ Window scaling (RFC1323)
  ■ Window size multiplied by 2^n
■ Protection Against Wrapped Sequence (PAWS)
  ■ Timestamp inside each packet

TCP options negotiation 1

IP 172.20.1.60.32820 > 216.239.39.99.http: S 3599527174:3599527174(0) win 5840 <mss 1460,sackOK,timestamp 2519711 0,nop,wscale 2>

IP 216.239.39.99.http > 172.20.1.60.32820: S 3820474812:3820474812(0) ack 3599527175 win 8190 <mss 1460>

IP 172.20.1.60.32820 > 216.239.39.99.http: . ack 1 win 5840

IP 172.20.1.60.32820 > 216.239.39.99.http: P 1:126(125) ack 1 win 5840

Window scale by 4

But server doesn't support scaling


TCP options negotiation 2

IP 172.20.1.60.32823 > 65.172.181.13.http: S 4120108902:4120108902(0) win 5840 <mss 1460,sackOK,timestamp 3036627 0,nop,wscale 2>

IP 65.172.181.13.http > 172.20.1.60.32823: S 2295773021:2295773021(0) ack 4120108903 win 5792 <mss 1460,sackOK,timestamp 1818411318 3036627,nop,wscale 0>

IP 172.20.1.60.32823 > 65.172.181.13.http: . ack 1 win 1460 <nop,nop,timestamp 3036628 1818411318>

IP 172.20.1.60.32823 > 65.172.181.13.http: P 1:144(143) ack 1 win 1460 <nop,nop,timestamp 3036628 1818411318>

Window scale by 4

Your scaling is okay, but don't scale mine

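
The "win" values in the two negotiation traces decode differently. Scaling applies only when both SYN packets carry a wscale option, and each side's window is shifted by the value that side sent. A sketch of that rule follows; the function name is hypothetical and `None` stands for "option absent".

```python
def effective_window(raw_window, own_wscale, peer_wscale):
    """Bytes advertised by a non-SYN segment.

    own_wscale is the shift the advertising host sent in its SYN,
    peer_wscale the shift the other host sent; None means no option.
    """
    if own_wscale is None or peer_wscale is None:
        return raw_window              # scaling stays off unless both sides offered it
    return raw_window << own_wscale    # multiply by 2^own_wscale

# Negotiation 1: the server's SYN+ACK has no wscale, so "win 5840" is literal.
first = effective_window(5840, own_wscale=2, peer_wscale=None)

# Negotiation 2: both sides sent wscale, so the client's "win 1460" means 1460 * 4.
second = effective_window(1460, own_wscale=2, peer_wscale=0)
```

Both cases come out at 5840 bytes, which matches the client advertising the same real window in the two traces.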

Linux TCP window tuning

■ Send window - net.ipv4.tcp_wmem
  ■ three values: min default max
  ■ default is 4K 16K 128K
  ■ also limited by net.core.wmem_max
■ Receive window – net.ipv4.tcp_rmem
  ■ three values: min default max
  ■ default is 4K 85K 170K
  ■ also limited by net.core.rmem_max
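
The three values are exposed as one line of whitespace-separated byte counts. A minimal parsing sketch follows. The names `TcpMem` and `parse_tcp_mem` are hypothetical, and the sample string is an assumption chosen to match the rounded 4K 85K 170K on the slide.

```python
from typing import NamedTuple

class TcpMem(NamedTuple):
    minimum: int
    default: int
    maximum: int

def parse_tcp_mem(text: str) -> TcpMem:
    """Parse the three byte counts found in tcp_rmem or tcp_wmem."""
    minimum, default, maximum = (int(field) for field in text.split())
    return TcpMem(minimum, default, maximum)

# Assumed sample contents; on a live system read the real values instead.
sample = "4096\t87380\t174760\n"
rmem = parse_tcp_mem(sample)
print(f"receive buffer: default {rmem.default} bytes, max {rmem.maximum} bytes")
```

On a Linux machine the same function can be fed the contents of `/proc/sys/net/ipv4/tcp_rmem` or `/proc/sys/net/ipv4/tcp_wmem`.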

Linux TCP window tuning

■ Overall memory – net.ipv4.tcp_mem
  ■ three values: low pressure max
  ■ automatic value based on system memory
■ Application window – net.ipv4.tcp_app_win
  ■ reserved space to handle slow applications

But!

■ Some firewalls and routers are buggy
  ■ Corrupt window scale, change N to 0
  ■ Forget to track state, or read RFC wrong
  ■ Connections will hang because initial window looks like a silly window
  ■ 1% of the net is buggy..
■ Linux 2.6.9 chooses window scale based on maximum possible receive window
  ■ Default tcp_rmem => window scale of 2
  ■ Buggy devices will see ¼ of the real window
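
The "window scale of 2" follows from fitting the maximum receive window into the 16-bit window field. The sketch below picks the smallest shift that works. It is an illustration and not the kernel's code; the 174760-byte maximum is an assumption matching the 170K default, and 14 is the largest shift RFC1323 allows.

```python
MAX_UNSCALED = 65535   # largest value of the 16-bit window field
MAX_SHIFT = 14         # upper limit on the shift from RFC1323

def window_scale_for(max_receive_window: int) -> int:
    """Smallest shift that lets the 16-bit field cover max_receive_window."""
    scale = 0
    while scale < MAX_SHIFT and (MAX_UNSCALED << scale) < max_receive_window:
        scale += 1
    return scale

rmem_max = 174760                      # assumed default maximum (about 170K)
scale = window_scale_for(rmem_max)
raw_field = rmem_max >> scale          # what actually goes on the wire
print(f"scale {scale}: a device that ignores scaling sees {raw_field} bytes")
```

A device that resets the scale to 0 reads the raw field as the whole window, so it sees a quarter of the real value.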

Break


Performance Engineering process

■ Define your goal
■ Capture information
■ Analyze and form hypothesis
■ Prototype to validate hypothesis
■ If successful
  ■ Make changes on production system
  ■ Report problems or patches to others

Goal setting

■ Know what is possible:
  ■ bus bandwidth, network latency, etc.
■ Know your application
  ■ Compare with similar applications

TCP performance testing

■ Goal: Improve TCP performance over high bandwidth * delay links
■ Plan:
  ■ New TCP congestion control
  ■ Validate and test

Testing TCP over WAN

■ Want to test performance of TCP over high BDP links
■ Can't afford a 10Gbit trans-continental link
■ Proposal: emulate network delay over 1Gbit Ethernet

Existing network emulation tools

■ Dummynet
  http://info.iet.unipi.it/~luigi/ip_dummynet/
  I don't want to set up a separate FreeBSD machine
■ NISTnet
  http://snad.ncsl.nist.gov/itg/nistnet/
  Only on 2.4 and not ready to be in main tree

Netem

http://developer.osdl.org/shemminger/netem
■ Started out as a simple delay-only hack
■ Has grown to cover all the functionality of NISTnet

[Diagram: protocol stack with TCP on top, then IP, then netem, then Ethernet (eth0).]
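
Netem is configured as a queueing discipline through the `tc` command. The sketch below only builds the command line for adding a fixed delay and does not run it, since that needs root and a real interface. The helper name and the 100 ms figure are assumptions for illustration.

```python
def netem_delay_command(device: str, delay_ms: int) -> list:
    """Build the tc argument list that attaches netem with a fixed delay."""
    return [
        "tc", "qdisc", "add",
        "dev", device,          # interface whose outgoing packets are delayed
        "root", "netem",
        "delay", f"{delay_ms}ms",
    ]

command = netem_delay_command("eth0", 100)
print(" ".join(command))
```

The delay applies to packets leaving the interface, so emulating a symmetric WAN path needs the same setting on both test machines or half the round trip on each.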

Current TCP research

■ Alternative TCP congestion control
  ■ Vegas
  ■ Westwood
  ■ Binary Increase Congestion Control (BIC)
■ Research community based around Web100

TCP Reno

■ Standard default in 2.4/2.6
■ Adjusts congestion window based on packet loss
■ Slow start – window starts small and grows with each Ack
■ Additive Increase of window on each Ack
■ Multiplicative Decrease on loss
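
The shape of Reno's window over time can be sketched with a per-round-trip simulation. This is a simplified model and not the kernel algorithm: windows are counted in whole segments, a loss halves the window and continues from there, and all names and parameters are assumptions.

```python
def reno_cwnd_trace(rtts: int, ssthresh: int, loss_at: set, initial_cwnd: int = 1):
    """Congestion window (in segments) at the start of each round trip."""
    cwnd = initial_cwnd
    trace = []
    for rtt in range(rtts):
        trace.append(cwnd)
        if rtt in loss_at:
            ssthresh = max(cwnd // 2, 2)        # multiplicative decrease
            cwnd = ssthresh
        elif cwnd < ssthresh:
            cwnd = min(cwnd * 2, ssthresh)      # slow start: doubles per round trip
        else:
            cwnd += 1                           # additive increase: +1 segment per round trip
    return trace

trace = reno_cwnd_trace(rtts=8, ssthresh=8, loss_at={5})
print(trace)
```

With one loss in round trip 5 the trace climbs quickly to the threshold, creeps up by one segment per round trip, then drops to half. On a long-delay link each of those steps takes a full round trip, which is why recovery is slow at high BDP.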

TCP Vegas

■ Original work by Larry Peterson
■ Patches existed for 2.2, 2.4 and part of web100
■ sysctl net.ipv4.tcp_vegas_cong_avoid
■ Measure bandwidth based on RTT
■ Adjust congestion window on bandwidth
■ Avoids packet loss
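
The Vegas idea of adjusting the window from measured RTT can be sketched in its textbook form: compare the rate expected at the minimum RTT with the rate actually achieved, and treat the difference as data sitting in queues. The thresholds `alpha` and `beta` and the function name are assumptions for illustration, not values from the talk.

```python
def vegas_adjust(cwnd: float, base_rtt: float, rtt: float,
                 alpha: float = 2.0, beta: float = 4.0) -> float:
    """Return the next congestion window (segments) after one round trip."""
    expected = cwnd / base_rtt                 # rate if no packets were queued
    actual = cwnd / rtt                        # rate seen during this round trip
    queued = (expected - actual) * base_rtt    # estimated segments held in queues
    if queued < alpha:
        return cwnd + 1                        # path looks idle, probe for more
    if queued > beta:
        return cwnd - 1                        # queues building, back off before loss
    return cwnd                                # inside the target band, hold steady
```

Because the window shrinks when RTT rises, the sender backs off before a queue overflows. The same property can leave bandwidth unused when the RTT measurements are noisy.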

TCP Westwood

■ Work by Claudio Casetti
■ Patches for 2.4 by Angelo Dell'Aera
■ sysctl net.ipv4.tcp_westwood
■ Focused on wireless
  ■ packet loss != congestion
■ Measure bandwidth based on RTT
■ Use normal Reno till congestion, then adjust congestion window based on bandwidth

Binary Increase Congestion Control (BIC)

■ Work by Lisong Xu
■ Patches for Web100 (2.4)
■ sysctl net.ipv4.tcp_bic
■ Designed for high-speed networks
■ Modification of Reno
■ Use additive increase when congestion window is large
■ Binary search increase when window is small

Tuning

■ Default tcp parameters not big enough
  ■ Need bigger send and receive window
■ Send window autosized based on rtt already
■ Receive window autosizing was done in Web100

Receiver Tuning

■ Patches from John Heffner
■ sysctl net.ipv4.tcp_moderate_rcvbuf
■ Dynamic Right Sizing (DRS)
  ■ adjust receive window based on RTT
  ■ If application doesn't set window then do it for them
  ■ Window will grow from default to max

Receiver auto-tuning

[Figure: throughput (Mbits/sec, 0 to 1000) against delay (ms, 0 to 200) for the Default and Auto Tuned receive windows.]

Throughput vs Delay (initial run)

[Figure: bandwidth (Mbits/sec, 0 to 800) against delay (ms, 0 to 200) for Reno, Vegas, Westwood and BIC.]

What's happening

■ NAPI
  ■ Driver API to allow avoiding interrupts
  ■ Trades off latency for overall performance
■ E1000 driver
  ■ Uses NAPI for transmit

Answer: Transmit ring gets full and driver flow blocks

Solution: set TxDescriptors=1000

Throughput vs Delay (rerun)

[Figure: throughput (axis labeled bits/sec, 0 to 800) against delay (ms, 0 to 200) for Reno, Vegas, Westwood and BIC.]

Performance still slow

■ Vegas and Westwood are terrible
■ Not at full link speed
■ Performance falling off with delay

Vegas trace with 100ms delay

Vegas detail

Westwood (70ms)

Westwood detail

BIC trace (100ms)

BIC detail (100ms)

How to squeeze out more performance

■ Large MTU (4k): + 63%
■ LAN driver built in (not a module): up to 10%
■ Turn off timestamps: + 4%
■ Bind IRQ to processor: varies

Congestion more work

■ Vegas doesn't use available window
  ■ Does it underestimate bandwidth?
■ Westwood
  ■ Another bandwidth problem
■ BIC
  ■ When does it make it into binary mode?
  ■ What is holding back window?
■ Netem
  ■ Higher resolution? Packet groups?

Break


Other tools

■ Information about
  ■ ISP connection
  ■ Sockets open
■ Testing infrastructure
■ More data capture
■ Monitoring

Tools: basic

■ Network path information
  ■ Ping – send icmp echo
    ■ Measure of round trip time and loss
    ■ Can be blocked by firewall
  ■ Traceroute – sends probes with increasing TTL
    ■ Usually blocked now
  ■ Pathcapture (pcap)
    ■ Bandwidth and delay measurement

Tools: Network interface

■ ifconfig
  ■ Basic statistics, packets sent/received/errors
■ ip -stats link
  ■ Newer alternative, may have more info
■ SNMP
  ■ Remote access to same information
  ■ Slightly more work

Tools: Sockets

■ Netstat
  ■ TCP statistics
  ■ Open sockets
■ Ss
  ■ More statistics available (rtt, etc)
■ Recvmsg
  ■ Application can see TCP info (cmsg)

Tools: test servers

■ SYN test
  telnet syntest.psc.edu 7960
■ TCP bandwidth
  http://www.epm.ornl.gov/~dunigan/java/misc/tcpbw.html
  http://dslreports.com
■ ANL network config
  http://miranda.ctd.anl.gov:7123
■ Path MTU
  http://www.ncne.org/jumbogram/mtu_discovery.php

Tools: testing

■ Ttcp
  ■ Basic send/receive throughput
■ Iperf
  ■ Longer running tests and turnaround
■ Netperf
  ■ Includes cpu and other statistics
■ Dbs
  ■ Multiclient testing

Tools: monitoring

■ Ntop
  ■ Measure of network activity by service
  ■ Nice web interface
■ Mailgraph
  ■ Long term mail statistics
■ Web server activity log analysis

Tools: data capture

■ Tcpdump
  ■ Filter packets by protocol, address, etc
  ■ Decode many protocols
■ Ethereal
  ■ GUI interface
■ RMON
  ■ Remote monitoring
■ Kismet
  ■ Wireless activity

Tools: generators

■ Pktgen
  ■ Kernel level packet generation
  ■ Can generate maximum hardware packet rate
■ Network packet generator
  ■ Application level

Tools: simulation

■ Ns
  ■ Describe overall system
  ■ Event based simulation
  ■ Used for protocol analysis
■ SSFnet
  ■ More detailed models of real hardware

Tools: client simulator

■ Web
  ■ SPECweb, Apache (ab), httpload
■ NFS
  ■ Nfsstone
■ FTP
  ■ Dkftpbench

Conclusion

■ Data capture can provide clues of:
  ■ Application problems
  ■ Device problems
  ■ TCP/IP problems
■ Nothing is ever simple