Slide: 1 — Richard Hughes-Jones, e-VLBI Network Meeting, 28 Jan 2005
R. Hughes-Jones, Manchester

TCP/IP Overview & Performance
Richard Hughes-Jones, The University of Manchester
MB-NG
Slide: 2
TCP (Reno) – What’s the problem?
TCP has 2 phases:
Slow start – probe the network to estimate the available BW; exponential growth
Congestion avoidance – the main data-transfer phase; the transfer rate grows "slowly"
AIMD and high-bandwidth, long-distance networks
Poor performance of TCP in high-bandwidth wide-area networks is due in part to the TCP congestion control algorithm.
For each ACK in an RTT without loss:
  cwnd -> cwnd + a / cwnd   (Additive Increase, a = 1)
For each window experiencing loss:
  cwnd -> cwnd – b × cwnd   (Multiplicative Decrease, b = 1/2)
Packet loss is a killer !!
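The AIMD rules above can be sketched in a few lines. This toy model (window counted in segments, no slow start or timeouts) shows why a single loss is so costly: halving takes one event, but regrowing takes one segment per RTT.

```python
def on_ack(cwnd, a=1.0):
    """Additive increase: each ACK grows cwnd by a/cwnd,
    i.e. by ~a segments per RTT."""
    return cwnd + a / cwnd

def on_loss(cwnd, b=0.5):
    """Multiplicative decrease: a loss event halves the window."""
    return cwnd - b * cwnd

cwnd = 100.0
cwnd = on_loss(cwnd)            # one loss drops the window to 50
rtts = 0
while cwnd < 100.0:             # regrow at ~1 segment per RTT
    for _ in range(int(cwnd)):  # one ACK per in-flight segment
        cwnd = on_ack(cwnd)
    rtts += 1
print(rtts)                     # ~50 RTTs to win back the lost half
```

On a long-RTT path those ~50 round trips translate directly into the multi-minute recovery times on the next slide.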
Slide: 3
TCP (Reno) – Details
Time for TCP to recover its throughput from 1 lost packet is given by:
    τ = (C × RTT²) / (2 × MSS)
For an rtt of ~200 ms: ~2 min
[Figure: time to recover (s, log scale 0.0001–100000) vs rtt (0–200 ms), curves for 10 Mbit, 100 Mbit, 1 Gbit, 2.5 Gbit and 10 Gbit]
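A quick calculation from the recovery-time formula, assuming an MSS of 1460 bytes, reproduces the scale of the curves in the figure:

```python
def recovery_time_s(link_bps, rtt_s, mss_bytes=1460):
    """tau = C * RTT^2 / (2 * MSS): Reno regains 1 MSS per RTT,
    so regrowing the halved window (C * RTT / 2 segments' worth)
    takes C * RTT / (2 * MSS) RTTs = C * RTT^2 / (2 * MSS) seconds."""
    mss_bits = mss_bytes * 8
    return link_bps * rtt_s ** 2 / (2 * mss_bits)

for name, c in [("10 Mbit", 10e6), ("100 Mbit", 100e6),
                ("1 Gbit", 1e9), ("10 Gbit", 10e9)]:
    print(f"{name:>8} at rtt 200 ms: {recovery_time_s(c, 0.2):10.0f} s")
```

At 200 ms the result runs from seconds at 10 Mbit/s to hours at 10 Gbit/s, which is the point of the slide.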
Slide: 4
Investigation of new TCP Stacks
The AIMD algorithm – standard TCP (Reno)
For each ACK in an RTT without loss:
  cwnd -> cwnd + a / cwnd   (Additive Increase, a = 1)
For each window experiencing loss:
  cwnd -> cwnd – b × cwnd   (Multiplicative Decrease, b = 1/2)
High Speed TCP
a and b vary depending on the current cwnd, using a table:
  a increases more rapidly with larger cwnd – returns to the 'optimal' cwnd size sooner for the network path
  b decreases less aggressively and, as a consequence, so does the cwnd – the effect is that there is not such a decrease in throughput
Scalable TCP
a and b are fixed adjustments for the increase and decrease of cwnd:
  a = 1/100 – the increase is greater than TCP Reno
  b = 1/8 – the decrease on loss is less than TCP Reno
  Scalable over any link speed
Fast TCP
Uses round-trip time as well as packet loss to indicate congestion, with rapid convergence to a fair equilibrium for throughput
HSTCP-LP, H-TCP, BiC-TCP
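A back-of-envelope comparison of the recovery times implied by the Reno and Scalable rules above (a toy model ignoring slow start and burst losses): with a = 1/100 per ACK the window grows by a factor of ~1.01 per RTT, so Scalable's recovery time in RTTs is constant, independent of the window size.

```python
import math

def reno_recovery_rtts(w):
    # Reno halves the window, then regains 1 segment per RTT
    return w / 2

def scalable_recovery_rtts(w):
    # Scalable cuts cwnd to (7/8)w; +0.01 per ACK means the window
    # grows by a factor of ~1.01 per RTT, so recovery takes
    # log(8/7) / log(1.01) RTTs -- independent of the window size
    return math.log(8 / 7) / math.log(1.01)

for w in (1_000, 10_000, 100_000):
    print(f"cwnd {w:>7}: Reno {reno_recovery_rtts(w):6.0f} RTTs, "
          f"Scalable {scalable_recovery_rtts(w):.1f} RTTs")
```

Reno's recovery scales with the window (and hence link speed); Scalable's stays at ~13 RTTs, which is what "scalable over any link speed" means.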
Slide: 5
Packet Loss and new TCP Stacks
TCP Response Function
Throughput vs loss rate – the further to the right, the faster the recovery
Drop packets in the kernel
MB-NG rtt 6 ms; DataTAG rtt 120 ms
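The Reno response function plotted on these slides can be approximated with the standard Mathis formula, throughput ≈ (MSS/RTT)·√(3/2)/√p. A sketch assuming a 1460-byte MSS shows why filling a gigabit pipe at long RTT demands extremely low loss rates:

```python
import math

def reno_response_mbps(loss_rate, rtt_s, mss_bytes=1460):
    """Mathis approximation to Reno's steady-state throughput:
    BW ~= (MSS / RTT) * sqrt(3/2) / sqrt(p), returned in Mbit/s."""
    mss_bits = mss_bytes * 8
    return (mss_bits / rtt_s) * math.sqrt(1.5 / loss_rate) / 1e6

# At rtt 120 ms (DataTAG scale), Reno needs p ~ 1e-8 to reach ~1 Gbit/s
for p in (1e-4, 1e-6, 1e-8):
    print(f"drop rate 1 in {round(1 / p):>9}: "
          f"{reno_response_mbps(p, 0.120):8.1f} Mbit/s")
```

The 1/√p shape is why the new stacks' curves sit further to the right on the following plots.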
Slide: 6
Packet Loss and new TCP Stacks
TCP Response Function
UKLight London-Chicago-London; rtt 180 ms; 2.6.6 kernel
Agreement with theory is good
[Figures: sculcc1-chi-2 iperf 13Jan05 – TCP achievable throughput (Mbit/s) vs packet drop rate (1 in n, 100 to 10⁸), log and linear scales; series: A0 1500, A1 HSTCP, A2 Scalable, A3 HTCP, A5 BICTCP, A8 Westwood, A7 Vegas, plus A0 Theory and Scalable Theory curves]
Slide: 7
High Throughput Demonstrations
[Diagram: Manchester (Geneva) man03 – 1 GEth – Cisco 7609 – MB-NG core, 2.5 Gbit SDH – Cisco GSR – Cisco GSR – Cisco 7609 – 1 GEth – lon01 London (Chicago)]
Dual Xeon 2.2 GHz hosts at each end
Send data with TCP; drop packets
Monitor TCP with Web100
Slide: 8
High Performance TCP – MB-NG
Drop 1 in 25,000; rtt 6.2 ms; recover in 1.6 s
Stacks shown: Standard, HighSpeed, Scalable
Slide: 9
High Performance TCP – DataTAG
Different TCP stacks tested on the DataTAG network; rtt 128 ms; drop 1 in 10⁶
High-Speed: rapid recovery
Scalable: very fast recovery
Standard: recovery would take ~20 mins
Slide: 10
On the way to Higher Bandwidth
Slide: 11
End Hosts & NICs – SuperMicro P4DP6
Latency
Throughput
Bus activity
Use UDP packets from udpmon to characterise the host & NIC
SuperMicro P4DP6 motherboard: dual Xeon 2.2 GHz CPUs; 400 MHz system bus; 66 MHz 64-bit PCI bus
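udpmon itself is a C tool; the sketch below only illustrates its basic idea, sending paced UDP frames at a chosen inter-frame spacing so the offered rate can be compared with the receive wire rate. The loopback sink, frame size and spacing are illustrative choices, not udpmon's interface.

```python
import socket, time

def send_burst(dest, size=1400, count=1000, spacing_us=20):
    """Send `count` UDP frames of `size` bytes at a fixed inter-frame
    spacing and return the offered rate in Mbit/s."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    payload = b"\x00" * size
    t0 = time.perf_counter()
    for i in range(count):
        # crude pacing: busy-wait until this frame's send slot
        while time.perf_counter() - t0 < i * spacing_us * 1e-6:
            pass
        sock.sendto(payload, dest)
    elapsed = time.perf_counter() - t0
    sock.close()
    return size * 8 * count / elapsed / 1e6

sink = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sink.bind(("127.0.0.1", 0))          # throwaway local sink so sends succeed
rate = send_burst(sink.getsockname())
print(f"offered rate ~{rate:.0f} Mbit/s at 20 us spacing")
```

Sweeping the spacing and frame size, as in the plots that follow, maps out where the host, NIC or network starts dropping frames.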
[Figures: gig6-7 Intel PCI 66 MHz, 27 Nov 02 –
 Throughput: receive wire rate (Mbit/s) vs transmit time per frame (µs), frame sizes 50–1472 bytes;
 Latency histograms: N(t) vs latency (µs) for 64, 512, 1024 and 1400 byte frames, Intel 64-bit 66 MHz;
 Latency vs message length (bytes): fits y = 0.0093x + 194.67 (send) and y = 0.0149x + 201.75 (receive);
 Bus activity: send PCI and receive PCI traces, 1400 bytes to NIC, 1400 bytes to memory, PCI Stop asserted]
Slide: 12
Network switch limits behaviour
End-to-end UDP packets from udpmon
Only 700 Mbit/s throughput
Lots of packet loss
The packet-loss distribution shows the throughput is limited
[Figures: w05gva-gig6, 29 May 04, UDP – receive wire rate (Mbit/s) and % packet loss vs spacing between frames (µs), frame sizes 50–1472 bytes; 1-way delay (µs) vs packet number at 12 µs wait, full trace (packets 0–600) and zoom on packets 500–550]
Slide: 13
TCP Window Scale factor not set correctly
SC2004 London-Chicago-London tests
Server-quality hosts – 2.8 GHz dual Xeon; 133 MHz PCI-X bus
The TCP window scale factor should allow the pipe to be filled: Delay × BW ≈ 22 Mbytes
Web100 output shows:
  Cwnd does not open; data is sent at line speed but as 1 burst/rtt; data stops at Cwnd
Average throughput: 100 Mbit/s – limited by the sender
Kernel configuration problem
[Figure: TCP achieved bandwidth (Mbit/s, 0–2500) and Cwnd (0–8,000,000) vs time (0–10000 ms); series: Instantaneous BW, Ave BW, CurCwnd]
Slide: 14
Network & Disk Interactions
Hosts: Supermicro X5DPE-G2 motherboards; dual 2.8 GHz Xeon CPUs with 512 kbyte cache and 1 Mbyte memory; 3Ware 8506-8 controller on 133 MHz PCI-X bus configured as RAID0; six 74.3 GByte Western Digital Raptor WD740 SATA disks; 64 kbyte stripe size
Measure memory-to-RAID0 transfer rates with & without UDP traffic:
Disk write: 1735 Mbit/s
Disk write + 1500 MTU UDP: 1218 Mbit/s – a 30% drop
Disk write + 9000 MTU UDP: 1400 Mbit/s
[Figures: RAID0 6-disk 1 Gbyte write 64k, 3w8506-8 – throughput (Mbit/s) vs trial number, alone and with 1500/9000 MTU UDP traffic; CPU load: % cpu system mode L3+4 vs % cpu system mode L1+2 for 8k and 64k writes, fits y = 178 − 1.05x]
Slide: 15
iperf Throughput + Web100
SuperMicro on MB-NG network: HighSpeed TCP; average: line speed, 940 Mbit/s; DupACKs < 10 (expect ~400)
BaBar on production network: Standard TCP; average: 425 Mbit/s; DupACKs 350-400 – re-transmits
Slide: 16
Disk-Disk bbftp
The bbftp file transfer program uses TCP/IP
UKLight path: London-Chicago-London; PCs: Supermicro + 3Ware RAID0
MTU 1500 bytes; socket size 22 Mbytes; rtt 177 ms; SACK off
Move a 2 Gbyte file; Web100 plots:
Standard TCP: average 825 Mbit/s
Scalable TCP: average 875 Mbit/s
[Figures: TCP achieved bandwidth (Mbit/s, 0–2500) and Cwnd (0–45,000,000) vs time (0–20000 ms), one plot per stack; series: Instantaneous BW, Ave BW, CurCwnd]
Slide: 17
Parameters to Consider – Only some of them!
Server-quality hosts
Check that UDP packets use the BW expected
Poor (old), or wrongly configured, routers / switches
Overloaded access links – campus / country
Hunt down packet loss at your desired sending rate
Fill the pipe with packets in flight: set the socket buffer to 2 × Delay × BW
Kernel configuration settings:
  Allow large socket buffer (TCP window) settings
  Set the length of the transmit queue large (~2000)
  The TCP window scale factor should allow the pipe to be filled
  Disallow "moderation" in the TCP stack
  Consider turning off SACKs in 2.4.x, and maybe up to 2.6.6
  Large MTUs – reduce CPU load
  Enable interrupt coalescence – reduces CPU load
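A small helper for the socket-buffer rule above. The 1 Gbit/s rate and 177 ms rtt are taken from the UKLight tests earlier in the talk, where Delay × BW comes out at the ~22 Mbytes quoted on the window-scale slide:

```python
def socket_buffer_bytes(link_bps, rtt_s, factor=2):
    """Socket buffer = factor x Delay x BW, per the rule of thumb above."""
    bdp_bytes = link_bps / 8 * rtt_s      # bandwidth-delay product in bytes
    return int(factor * bdp_bytes)

bdp = socket_buffer_bytes(1e9, 0.177, factor=1)
buf = socket_buffer_bytes(1e9, 0.177)
print(f"BDP ~{bdp / 1e6:.0f} Mbytes, socket buffer ~{buf / 1e6:.0f} Mbytes")
```

The kernel must then permit buffers this large (and the window-scale option must be negotiated) or the pipe cannot be filled.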
Slide: 18
Real Time TCP in e-VLBI
Slide: 19
Does TCP delay the data? Work in progress!!
Send blocks of data (10 kbytes) at regular intervals
Drop every 10,000th packet
Measure the arrival time of the data
[Figures: mark5-g6_A0_10k_26Jan05 – delta t (ms) vs block number (0–10000), series expect-send (µs) and expect-recv (µs); TCP achieved bandwidth (Mbit/s, 0–140) vs packets in (0–80000), series Instantaneous BW and BW in]
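The measurement above can be sketched as follows. This loopback toy uses the block size and interval from the slide but does not simulate loss; it just shows the bookkeeping of comparing each block's arrival time with its expected, send-clocked time.

```python
import socket, threading, time

BLOCK, INTERVAL, COUNT = 10_000, 0.01, 50    # 10 kbyte blocks every 10 ms

def receiver(srv, delays):
    conn, _ = srv.accept()
    start = None
    for i in range(COUNT):
        buf = b""
        while len(buf) < BLOCK:              # reassemble one block
            buf += conn.recv(BLOCK - len(buf))
        now = time.perf_counter()
        if start is None:
            start = now                       # clock starts at first block
        delays.append(now - (start + i * INTERVAL))  # lateness vs schedule
    conn.close()

srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
delays = []
t = threading.Thread(target=receiver, args=(srv, delays))
t.start()

snd = socket.socket()
snd.connect(srv.getsockname())
t0 = time.perf_counter()
for i in range(COUNT):
    while time.perf_counter() - t0 < i * INTERVAL:   # clock the sends
        time.sleep(0.001)
    snd.sendall(b"\x00" * BLOCK)
snd.close()
t.join()
print(f"worst arrival lateness: {max(delays) * 1e3:.1f} ms over {COUNT} blocks")
```

Over a lossy WAN a single drop stalls delivery for a retransmission timeout or more, which is exactly the effect the plots above are probing.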
Slide: 20
More Information – Some URLs
MB-NG project web site: http://www.mb-ng.net/
DataTAG project web site: http://www.datatag.org/
UDPmon / TCPmon kit + writeup: http://www.hep.man.ac.uk/~rich/net
Motherboard and NIC tests: www.hep.man.ac.uk/~rich/net/nic/GigEth_tests_Boston.ppt & http://datatag.web.cern.ch/datatag/pfldnet2003/
"Performance of 1 and 10 Gigabit Ethernet Cards with Server Quality Motherboards", FGCS special issue, 2004
TCP tuning information may be found at: http://www.ncne.nlanr.net/documentation/faq/performance.html & http://www.psc.edu/networking/perf_tune.html
TCP stack comparisons: "Evaluation of Advanced TCP Stacks on Fast Long-Distance Production Networks", Journal of Grid Computing, 2004
Slide: 21
Backup Slides
Slide: 22
UKLight in the UK
Slide: 23
SC2004 UKLIGHT Overview
[Diagram: Manchester (MB-NG 7600 OSR) – ULCC UKlight – UCL HEP / UCL network – UKlight 10G – Chicago Starlight – Amsterdam; SC2004 show floor: Caltech booth (UltraLight IP, Caltech 7600), SLAC booth, Cisco 6509; links: UKlight 10G carrying four 1GE channels, UKlight 10G, Surfnet / EuroLink 10G carrying two 1GE channels, NLR Lambda NLR-PITT-STAR-10GE-16]
Slide: 24
Topology of the MB-NG Network
[Diagram – Key: Gigabit Ethernet; 2.5 Gbit POS access; MPLS admin. domains.
Manchester domain: man01, man02, man03; edge router Cisco 7609; boundary router Cisco 7609
UCL domain: lon01, lon02, lon03; boundary router Cisco 7609
RAL domain: ral01, ral02; boundary router Cisco 7609
UKERNA development network; HW RAID at the end hosts]
Slide: 25
The Bandwidth Challenge at SC2003
Peak bandwidth 23.21 Gbit/s; 6.6 TBytes in 48 minutes
10 Gbit/s throughput from SC2003 to Chicago & Amsterdam
Phoenix - Amsterdam: 4.35 Gbit, HighSpeed TCP; rtt 175 ms; window 200 MB
[Figure: router traffic to Abilene – throughput (Gbit/s, 0–10) vs date & time, 19 Nov 03, 15:59–17:25; series: Phoenix-Chicago, Phoenix-Amsterdam]
Slide: 26
Average Transfer Rates (Mbit/s)

App      TCP Stack   SuperMicro     SuperMicro       BaBar on
                     on MB-NG       on SuperJANET4   SuperJANET4
iperf    Standard    940            350-370          425
         HighSpeed   940            510              570
         Scalable    940            580-650          605
bbcp     Standard    434            290-310          290
         HighSpeed   435            385              360
         Scalable    432            400-430          380
bbftp    Standard    400-410        325              320
         HighSpeed   –              370-390          380
         Scalable    430            345-532          380
apache   Standard    425            260              300-360
         HighSpeed   430            370              315
         Scalable    428            400              317
Gridftp  Standard    405            240              –
         HighSpeed   –              320              –
         Scalable    –              335              –
Slide: 27
bbftp: Host & Network Effects
2 Gbyte file; RAID5 disks: 1200 Mbit/s read, 600 Mbit/s write
Scalable TCP
BaBar + SuperJANET: instantaneous 220 - 625 Mbit/s
SuperMicro + SuperJANET: instantaneous 400 - 665 Mbit/s for 6 s, then 0 - 480 Mbit/s
SuperMicro + MB-NG: instantaneous 880 - 950 Mbit/s for 1.3 s, then 215 - 625 Mbit/s
Slide: 28
Applications: Throughput Mbit/s
HighSpeed TCP; 2 GByte file; RAID5; SuperMicro + SuperJANET
bbcp
bbftp
Apache
Gridftp
Previous work used RAID0 (not disk limited)
Slide: 29
Host, PCI & RAID Controller Performance
RAID0 (striped) & RAID5 (striped with redundancy)
Controllers tested: 3Ware 7506 parallel 66 MHz; 3Ware 7505 parallel 33 MHz; 3Ware 8506 serial ATA 66 MHz; ICP serial ATA 33/66 MHz
Tested on a dual 2.2 GHz Xeon Supermicro P4DP8-G2 motherboard
Disk: Maxtor 160 GB 7200 rpm, 8 MB cache
Read-ahead kernel tuning: /proc/sys/vm/max-readahead
Slide: 30
RAID Controller Performance
[Figure: read speed and write speed for RAID0 and RAID5 across the controllers listed above]