PFLDnet, Nara, Japan, 2-3 Feb 2006
Transport Benchmarking
Panel Discussion
Richard Hughes-Jones, The University of Manchester
www.hep.man.ac.uk/~rich/ then “Talks”
Packet Loss and New TCP Stacks: TCP Response Function
Throughput vs loss rate – the further to the right, the faster the recovery. Packets are dropped in the kernel.
MB-NG rtt 6 ms; DataTAG rtt 120 ms
Packet Loss and New TCP Stacks: TCP Response Function
UKLight London–Chicago–London, rtt 177 ms, 2.6.6 kernel
Agreement with theory is good
Some new stacks are good at high loss rates
[Plots (sculcc1-chi-2, iperf, 13 Jan 05): TCP achievable throughput (Mbit/s) vs packet drop rate (1 in n), shown on logarithmic and linear throughput scales, for A0 standard 1500, A1 HSTCP, A2 Scalable, A3 HTCP, A5 BICTCP, A7 Vegas and A8 Westwood, with theory curves for standard and Scalable TCP.]
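For reference (a hedged sketch, not taken from these slides), the standard-TCP curve in plots like these is usually approximated by the Mathis steady-state response function, while Scalable TCP and HSTCP are designed to have flatter responses in the loss probability p:

```latex
% Approximate steady-state response functions (p = packet loss probability,
% i.e. 1/n for "drop 1 in n"); constants omitted for the new stacks.
\[
  T_{\mathrm{Reno}} \;\approx\; \frac{\mathrm{MSS}}{\mathrm{RTT}}\sqrt{\frac{3}{2p}},
  \qquad
  T_{\mathrm{Scalable}} \;\propto\; \frac{\mathrm{MSS}}{\mathrm{RTT}\,p},
  \qquad
  T_{\mathrm{HSTCP}} \;\propto\; \frac{\mathrm{MSS}}{\mathrm{RTT}\,p^{\,0.84}}
\]
```

On a log–log plot of throughput against n = 1/p these appear as straight lines with slopes of roughly 0.5, 1 and 0.84, which is why the new stacks hold higher throughput at higher loss rates.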
TCP Throughput – DataTAG
Different TCP stacks tested on the DataTAG network, rtt 128 ms, drop 1 in 10^6
High-Speed: rapid recovery
Scalable: very fast recovery
Standard: recovery would take ~20 mins
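A hedged back-of-envelope estimate (assuming a 1 Gbit/s path and 1500-byte packets; not from the slide) of why standard TCP takes of order tens of minutes to recover at rtt 128 ms:

```python
# Standard TCP after one loss: cwnd is halved, then grows by ~1 packet per RTT.
rate_bps, rtt, mss_bits = 1e9, 0.128, 1500 * 8

w = rate_bps * rtt / mss_bits            # window at full rate, ~10,700 packets
recovery_s = (w / 2) * rtt               # ~5,300 RTTs of additive increase
print(f"recovery ~ {recovery_s:.0f} s (~{recovery_s / 60:.0f} min)")
# ~680 s (~11 min) for one loss; with a faster bottleneck or repeated losses
# the figure grows toward the ~20 min quoted above.
```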
MTU and Fairness
Two TCP streams share a 1 Gb/s bottleneck, RTT = 117 ms
MTU = 3000 bytes: avg. throughput over a period of 7000 s = 243 Mb/s
MTU = 9000 bytes: avg. throughput over a period of 7000 s = 464 Mb/s
Link utilisation: 70.7%
[Diagram: Host #1 and Host #2 at CERN (GVA), each on 1 GE into a GbE switch, across a 2.5 Gbps POS link to Starlight (Chi) and on 1 GE to the receiving hosts; the 1 GE links are the bottleneck.]
[Plot: throughput (Mbps) of two streams with different MTU sizes sharing a 1 Gbps bottleneck vs time (0–6000 s), with the average over the life of each connection, for MTU = 3000 bytes and MTU = 9000 bytes.]
Sylvain Ravot DataTag 2003
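A small hedged check on the numbers above (not from the slide): the simple response function scales with MSS, so a larger-MTU flow recovers faster from each loss and takes the larger share; here the measured ratio is roughly 1.9:1.

```python
# Measured averages from the slide, 1 Gb/s bottleneck.
t_3000, t_9000, capacity = 243, 464, 1000   # Mb/s
print(f"9000-byte / 3000-byte MTU throughput ratio: {t_9000 / t_3000:.2f}")  # ~1.9
print(f"link utilisation: {(t_3000 + t_9000) / capacity:.1%}")               # 70.7%
```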
RTT and Fairness
[Diagram: Host #1 and Host #2 at CERN (GVA), each on 1 GE into a GbE switch, across POS 2.5 Gb/s to Starlight (Chi) and POS 10 Gb/s / 10GE on to Sunnyvale, then 1 GE to the receiving hosts; the 1 GE links are the bottleneck.]
Two TCP streams share a 1 Gb/s bottleneck, MTU = 9000 bytes
CERN <-> Sunnyvale, RTT = 181 ms: avg. throughput over a period of 7000 s = 202 Mb/s
CERN <-> Starlight, RTT = 117 ms: avg. throughput over a period of 7000 s = 514 Mb/s
Link utilisation: 71.6%
[Plot: throughput (Mbps) of two streams with different RTT sharing a 1 Gbps bottleneck vs time (0–7000 s), with the average over the life of each connection, for RTT = 181 ms and RTT = 117 ms.]
Sylvain Ravot DataTag 2003
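Similarly for RTT bias, a hedged check (not from the slide): the response function scales as 1/RTT, so the shorter-RTT flow dominates; measured, the 117 ms flow gets about 2.5 times the throughput of the 181 ms flow, a stronger bias than the simple 1/RTT model alone predicts.

```python
t_181, t_117, capacity = 202, 514, 1000   # Mb/s, from the slide
print(f"measured ratio (117 ms / 181 ms): {t_117 / t_181:.2f}")   # ~2.5
print(f"simple 1/RTT prediction:          {181 / 117:.2f}")       # ~1.5
print(f"link utilisation: {(t_181 + t_117) / capacity:.1%}")      # 71.6%
```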
TCP Sharing & Recovery: Methodology (1 Gbit/s)
Chose 3 paths from SLAC (California): Caltech (10 ms), Univ. Florida (80 ms), CERN (180 ms)
Used iperf/TCP and UDT/UDP to generate traffic
Each run was 16 minutes, in 7 regions
[Diagram: iperf or UDT traffic plus 1/s ping (ICMP) traffic sent from SLAC across the TCP/UDP bottleneck to Caltech/UFL/CERN; the 7 regions are stepped in 2- and 4-minute intervals.]
Les Cottrell PFLDnet 2005
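A minimal hedged sketch (not the SLAC scripts) of the measurement pattern: one 16-minute iperf TCP run with 1/s ping traffic to the same destination; the host name and log files are placeholders.

```python
# Run iperf for 16 minutes while logging 1/s pings to the same remote host.
import shlex
import subprocess

remote = "remote.example.net"            # placeholder destination
duration = 16 * 60                       # one 16-minute run, as in the methodology

with open("rtt.log", "w") as rtt_log, open("iperf.log", "w") as iperf_log:
    ping = subprocess.Popen(shlex.split(f"ping -i 1 {remote}"), stdout=rtt_log)
    subprocess.run(shlex.split(f"iperf -c {remote} -t {duration} -i 5"),
                   stdout=iperf_log)
    ping.terminate()
```

Starting and stopping extra flows at 2- and 4-minute boundaries would reproduce the 7-region structure of each run.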
TCP Reno single stream
Low performance on fast long-distance paths
AIMD: add a = 1 packet to cwnd per RTT; decrease cwnd by factor b = 0.5 on congestion (sketched below)
Net effect: recovers slowly and does not use the available bandwidth effectively, so throughput is poor
Unequal sharing
Congestion has a dramatic effect
Recovery is slow
Increase recovery rate
RTT increases when it achieves best throughput
Remaining flows do not take up the slack when a flow is removed
SLAC to CERN
Les Cottrell, PFLDnet 2005
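A minimal sketch of the AIMD behaviour described above (illustrative only, per-RTT granularity, assuming 1500-byte packets and a ~1 Gbit/s, 180 ms path):

```python
# AIMD per RTT: +1 packet in congestion avoidance, cwnd *= 0.5 on a loss.
def aimd(start_cwnd, n_rtts, loss_at=frozenset()):
    cwnd, trace = start_cwnd, []
    for i in range(n_rtts):
        cwnd = cwnd * 0.5 if i in loss_at else cwnd + 1
        trace.append(cwnd)
    return trace

rtt = 0.180
full_window = int(1e9 * rtt / (1500 * 8))              # ~15,000 packets at 1 Gbit/s
trace = aimd(full_window, 20_000, loss_at={0})          # one loss at the start
recovery = next(i for i, w in enumerate(trace) if w >= full_window)
print(f"{recovery} RTTs (~{recovery * rtt / 60:.0f} min) to regain the full window")
```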
Hamilton TCP – one of the best performers
Throughput is high
Big effects on RTT when it achieves best throughput
Flows share equally
Appears to need > 1 flow to achieve best throughput
Two flows share equally
SLAC-CERN
> 2 flows appear less stable
Implementation problems: SACKs… Look into what's happening at the algorithmic level, e.g. with web100:
Strange hiccups in cwnd – the only correlation is with SACK arrivals
SACK processing is inefficient for large bandwidth-delay products
Sender write queue (linked list) walked for each SACK block, to mark lost packets and to re-transmit (see the sketch below)
Processing takes so long that the input queue becomes full and timeouts occur
Scalable TCP on MB-NG with 200 Mbit/s CBR background
Yee-Ting Li
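A hedged illustration (not the kernel code) of why walking the whole write queue per SACK block is costly at large bandwidth-delay products: the cost is roughly O(window × SACK blocks) for every ACK that arrives.

```python
# Illustrative only: a naive retransmission-queue scan per SACK block,
# mimicking the O(window) walk described above. Not the Linux implementation.
class Segment:
    def __init__(self, seq):
        self.seq, self.sacked, self.lost = seq, False, False

def process_sack(write_queue, sack_blocks):
    walked = 0
    for start, end in sack_blocks:          # for each SACK block in the ACK...
        for seg in write_queue:             # ...walk the whole linked list
            walked += 1
            if start <= seg.seq < end:
                seg.sacked = True
    for seg in write_queue:                 # second pass: mark holes as lost
        walked += 1
        if not seg.sacked:
            seg.lost = True
    return walked

window = 10_000                             # ~a 1 Gbit/s, 120 ms rtt window in packets
queue = [Segment(i) for i in range(window)]
ops = process_sack(queue, [(2000, 2100), (5000, 5050), (8000, 8010)])
print(f"segments touched for one ACK: {ops}")   # ~40,000 here, repeated for every ACK
```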
TCP Stacks & CPU Load – a real user problem! An end-host TCP flow at 960 Mbit/s with rtt 1 ms falls to 770 Mbit/s when rtt is 15 ms
[Plots (mk5-606-g7_10Dec05): % CPU in each mode on the sender (kernel, user, nice, idle) and TCP throughput (Mbit/s) vs the nice value of the CPU-load process (large value = low priority), with the no-CPU-load level marked.]
1.2 GHz PIII, rtt 1 ms, TCP iperf 980 Mbit/s
Kernel mode 95%, idle 1.3%, with the CPU load run at nice priority
Throughput falls as the CPU-load priority increases
No loss, no timeouts
Not enough CPU power
[Plots (mk5-606-g7_17Jan05): % CPU in each mode on the sender (kernel, user, nice, idle) and TCP throughput (Mbit/s) vs the nice value of the CPU-load process (large value = low priority), with the no-CPU-load level marked.]
2.8 GHz Xeon, rtt 1 ms, TCP iperf 916 Mbit/s
Kernel mode 43%, idle 55%, with the CPU load run at nice priority
Throughput is constant as the CPU-load priority increases
No loss, no timeouts
Kernel mode includes the TCP stack and the Ethernet driver
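A hedged sketch of one way to reproduce this per-mode CPU split on Linux while a transfer runs, by sampling /proc/stat; not the tooling used for these slides.

```python
# Sample /proc/stat twice and report the share of CPU time in each mode
# over the interval. Linux only; run while iperf is sending.
import time

def cpu_times():
    with open("/proc/stat") as f:
        v = list(map(int, f.readline().split()[1:8]))
    user, nice, system, idle, iowait, irq, softirq = v
    # lump interrupt time in with kernel: the TCP stack and NIC driver run there
    return {"user": user, "nice": nice,
            "kernel": system + irq + softirq, "idle": idle + iowait}

before = cpu_times()
time.sleep(10)
after = cpu_times()

delta = {k: after[k] - before[k] for k in before}
total = sum(delta.values()) or 1
for mode, jiffies in delta.items():
    print(f"{mode:6s} {100 * jiffies / total:5.1f} %")
```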
Check out the end host: bbftp
What is the end host doing with your protocol? Look at the PCI-X buses
3Ware 9000 controller, RAID0; 1 Gbit Ethernet link; 2.4 GHz dual Xeon; ~660 Mbit/s
PCI-X bus with RAID controller: read from disk for 44 ms every 100 ms
PCI-X bus with Ethernet NIC: write to network for 72 ms
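A hedged consistency check (the 920 Mbit/s active rate is an assumption, not from the slide): if the NIC is writing to the network for about 72 ms of every 100 ms at near GigE TCP rate, the average comes out close to the ~660 Mbit/s observed.

```python
wire_rate_mbps = 920        # assumed TCP rate while the NIC bus is active
duty_cycle = 72 / 100       # write to network for 72 ms of every 100 ms
print(f"average ~ {wire_rate_mbps * duty_cycle:.0f} Mbit/s")   # ~662 Mbit/s
```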
Transports for LightPaths
For a lightpath with a BER of 10^-16, i.e. a packet loss rate of ~10^-12 or 1 loss in about 160 days, what do we use?
Host-to-host lightpath: one application; no congestion; lightweight framing
Lab-to-lab lightpath: many applications share it; classic congestion points; TCP stream sharing and recovery; advanced TCP stacks
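A hedged back-of-envelope for these figures, assuming 1500-byte packets (at lower average utilisation the interval stretches toward the ~160 days quoted):

```python
# BER -> per-packet loss probability -> mean time between losses.
ber, packet_bits, line_rate = 1e-16, 1500 * 8, 1e9

loss_prob = ber * packet_bits                 # ~1.2e-12 per 1500-byte packet
pkts_per_s = line_rate / packet_bits          # ~83,000 packets/s at 1 Gbit/s
days = 1 / (loss_prob * pkts_per_s) / 86400
print(f"p ~ {loss_prob:.1e}; one loss every ~{days:.0f} days at a constant 1 Gbit/s")
```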
Transports for LightPaths
Some applications suffer when using TCP and may prefer to use UDP, DCCP, XCP, …
E.g. with e-VLBI the data wave-front gets distorted and correlation fails
Consider & include other transport-layer protocols when defining tests.
User-controlled lightpaths, Grid scheduling of CPUs & network: many application flows; no congestion; lightweight framing
A Few Items for Discussion
Achievable throughput
Sharing of link capacity (OK, what is sharing? – one candidate metric is sketched below)
Convergence time
Responsiveness
rtt fairness (OK, what is fairness?)
mtu fairness
TCP friendliness
Link utilisation (by this flow or by all flows)
Stability of achievable throughput
Burst behaviour
Packet loss behaviour
Packet re-ordering behaviour
Topology – maybe some "simple" setups
Background or cross traffic – how realistic does it need to be? What protocol mix?
Reverse traffic
Impact on the end host – CPU load, bus utilisation, offload
Methodology – simulation, emulation and real links ALL help
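One hedged candidate metric for the "what is sharing/fairness?" questions above (a common choice, not prescribed by the slide): Jain's fairness index over per-flow throughputs, alongside aggregate link utilisation.

```python
# Jain's fairness index: 1.0 = perfectly equal shares, 1/n = one flow takes everything.
def jain_index(throughputs):
    n = len(throughputs)
    return sum(throughputs) ** 2 / (n * sum(x * x for x in throughputs))

# Example with the RTT-fairness figures quoted earlier (Mb/s on a 1 Gb/s bottleneck):
flows, capacity = [202, 514], 1000
print(f"fairness:    {jain_index(flows):.2f}")          # ~0.84
print(f"utilisation: {sum(flows) / capacity:.1%}")      # ~71.6%
```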
Any Questions?
Backup Slides
10 Gigabit Ethernet: Tuning PCI-X
16080-byte packets every 200 µs, Intel PRO/10GbE LR adapter; PCI-X bus occupancy vs mmrbc
Measured times are based on PCI-X timings from the logic analyser
Expected throughput ~7 Gbit/s; measured 5.7 Gbit/s
[Plots (kernel 2.6.1 #17, HP Itanium, Intel 10GE, Feb 04): measured and expected PCI-X transfer time (µs) and transfer rate (Gbit/s) vs max memory read byte count, for mmrbc = 512, 1024, 2048 and 4096 bytes (5.7 Gbit/s at 4096), with the maximum PCI-X throughput marked. Logic-analyser trace showing the CSR access, PCI-X sequence, data transfer and interrupt & CSR update phases.]
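A hedged illustration of why a larger mmrbc reduces bus occupancy: each 16080-byte DMA is split into fewer PCI-X memory-read transactions, and each transaction carries fixed per-transaction overhead (transaction counts only, no timing model):

```python
import math

packet_bytes = 16080
for mmrbc in (512, 1024, 2048, 4096):
    reads = math.ceil(packet_bytes / mmrbc)   # memory-read transactions per packet
    print(f"mmrbc {mmrbc:4d} bytes -> {reads:2d} read transactions per packet")
```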
More Information – Some URLs 1
UKLight web site: http://www.uklight.ac.uk
MB-NG project web site: http://www.mb-ng.net/
DataTAG project web site: http://www.datatag.org/
UDPmon / TCPmon kit + writeup: http://www.hep.man.ac.uk/~rich/net
Motherboard and NIC tests: http://www.hep.man.ac.uk/~rich/net/nic/GigEth_tests_Boston.ppt & http://datatag.web.cern.ch/datatag/pfldnet2003/
"Performance of 1 and 10 Gigabit Ethernet Cards with Server Quality Motherboards", FGCS special issue 2004, http://www.hep.man.ac.uk/~rich/
TCP tuning information may be found at: http://www.ncne.nlanr.net/documentation/faq/performance.html & http://www.psc.edu/networking/perf_tune.html
TCP stack comparisons: "Evaluation of Advanced TCP Stacks on Fast Long-Distance Production Networks", Journal of Grid Computing, 2004
PFLDnet: http://www.ens-lyon.fr/LIP/RESO/pfldnet2005/
Dante PERT: http://www.geant2.net/server/show/nav.00d00h002
More Information – Some URLs 2
Lectures, tutorials etc. on TCP/IP:
www.nv.cc.va.us/home/joney/tcp_ip.htm
www.cs.pdx.edu/~jrb/tcpip.lectures.html
www.raleigh.ibm.com/cgi-bin/bookmgr/BOOKS/EZ306200/CCONTENTS
www.cisco.com/univercd/cc/td/doc/product/iaabu/centri4/user/scf4ap1.htm
www.cis.ohio-state.edu/htbin/rfc/rfc1180.html
www.jbmelectronics.com/tcp.htm
Encyclopaedia: http://www.freesoft.org/CIE/index.htm
TCP/IP resources: www.private.org.il/tcpip_rl.html
Understanding IP addresses: http://www.3com.com/solutions/en_US/ncs/501302.html
Configuring TCP (RFC 1122): ftp://nic.merit.edu/internet/documents/rfc/rfc1122.txt
Assigned protocols, ports etc. (RFC 1010): http://www.es.net/pub/rfcs/rfc1010.txt & /etc/protocols
High Throughput Demonstrations
Manchester rtt 6.2 ms; (Geneva) rtt 128 ms
[Diagram: hosts man03 (Manchester) and lon01 (London / (Chicago)), dual 2.2 GHz Xeon, each connected by 1 GEth to a Cisco 7609, across Cisco GSRs and the 2.5 Gbit SDH MB-NG core.]
Send data with TCP; drop packets
Monitor TCP with Web100
High Performance TCP – MB-NG
Drop 1 in 25,000; rtt 6.2 ms; recover in 1.6 s
Standard, HighSpeed, Scalable
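A hedged check of the 1.6 s recovery figure for standard TCP, assuming a 1 Gbit/s path and 1500-byte packets:

```python
rate_bps, rtt, mss_bits = 1e9, 6.2e-3, 1500 * 8
w = rate_bps * rtt / mss_bits           # ~520 packets at full rate
recovery_s = (w / 2) * rtt              # +1 packet per RTT after the loss halves cwnd
print(f"~{recovery_s:.1f} s to recover")   # ~1.6 s, matching the measurement
```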
FAST TCP vs newReno
[Plots: traffic flow on Channel #1 (newReno) and Channel #2 (FAST); utilisation 70% and 90% in the two cases.]
FAST TCP
As well as packet loss, FAST uses RTT to detect congestion
RTT is very stable: σ(RTT) ~ 9 ms vs 37±0.14 ms for the others
SLAC-CERN
Big drops in throughput which take several seconds to recover from
2nd flow never gets equal share of bandwidth