Slide 1
Bringing High-Performance Networking to HEP users
Richard Hughes-Jones
Stephen Dallison, Nicola Pezzi, Yee-Ting Lee
MB-NG
Slide 2
The Bandwidth Challenge at SC2003
Peak bandwidth 23.21 Gbit/s; 6.6 TBytes moved in 48 minutes
10 Gbit/s throughput from SC2003 to Chicago & Amsterdam
[Plot: router traffic to Abilene, 19 Nov 03 15:59–17:25 – throughput (Gbit/s, 0–10) vs date & time for the Phoenix–Chicago and Phoenix–Amsterdam flows]
Phoenix – Amsterdam: 4.35 Gbit/s, HighSpeed TCP, rtt 175 ms, window 200 MB
Slide 3
TCP (Reno) – What's the problem?
TCP has 2 phases: Slow Start & Congestion Avoidance
AIMD and High Bandwidth – Long Distance networks
Poor performance of TCP in high-bandwidth wide-area networks is due in part to the TCP congestion control algorithm.
For each ACK in an RTT without loss:
  cwnd -> cwnd + a / cwnd   (Additive Increase, a = 1)
For each window experiencing loss:
  cwnd -> cwnd – b·cwnd   (Multiplicative Decrease, b = 1/2)
Time to recover from 1 lost packet for a round-trip time of ~100 ms:
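The recovery-time values shown on the slide are not in the extracted text. As an illustrative back-of-envelope calculation, assuming a 1 Gbit/s path, 1500-byte segments and RTT ≈ 100 ms with the Reno rules above:

```latex
% Bandwidth-delay product in segments (assumed: 1 Gbit/s, 1500-byte segments, RTT = 100 ms)
W = \frac{10^{9}\,\mathrm{bit/s}\times 0.1\,\mathrm{s}}{1500\times 8\,\mathrm{bit}} \approx 8300\ \text{segments}
% After one loss cwnd is halved, then grows by ~1 segment per RTT (a = 1):
t_{\text{recover}} \approx \frac{W}{2}\,\mathrm{RTT} \approx 4150 \times 0.1\,\mathrm{s} \approx 7\ \text{minutes}
```

At 10 Gbit/s the same argument gives a recovery time of order an hour, which is why the new stacks on the next slide matter.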
Slide 4
Investigation of new TCP Stacks
The AIMD Algorithm – Standard TCP (Reno)
  For each ACK in an RTT without loss:
    cwnd -> cwnd + a / cwnd   (Additive Increase, a = 1)
  For each window experiencing loss:
    cwnd -> cwnd – b·cwnd   (Multiplicative Decrease, b = 1/2)
High Speed TCP
  a and b vary with the current cwnd, using a table.
  a increases more rapidly for larger cwnd – the connection returns to the 'optimal' cwnd for the network path sooner.
  b decreases less aggressively, so cwnd (and hence throughput) does not drop as far on loss.
Scalable TCP
  a and b are fixed adjustments for the increase and decrease of cwnd.
  a = 1/100 – the increase is greater than TCP Reno.
  b = 1/8 – the decrease on loss is less than TCP Reno.
  Scalable over any link speed.
Fast TCP
  Uses round-trip time as well as packet loss to indicate congestion, with rapid convergence to a fair equilibrium for throughput.
Others: HSTCP-LP, H-TCP, BiC-TCP
(A sketch of these cwnd update rules follows this list.)
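A minimal sketch of the cwnd update rules quoted above, for Reno and Scalable TCP only (the HighSpeed TCP a/b table and FAST's RTT-based control are not reproduced). The per-RTT approximation and the example numbers are assumptions for illustration:

```python
# Toy model of the per-RTT behaviour implied by the per-ACK rules on this slide.
# One "round" approximates one RTT: either cwnd ACKs arrive, or one loss event occurs.

def reno_round(cwnd, loss=False, a=1.0, b=0.5):
    """Standard TCP (Reno): +a/cwnd per ACK, -b*cwnd for a window with loss."""
    if loss:
        return cwnd - b * cwnd
    return cwnd + a              # cwnd ACKs * (a / cwnd) = +a per RTT

def scalable_round(cwnd, loss=False, a=0.01, b=0.125):
    """Scalable TCP: +a per ACK (so +a*cwnd per RTT), -b*cwnd for a window with loss."""
    if loss:
        return cwnd - b * cwnd
    return cwnd + a * cwnd

def rtts_to_recover(update, cwnd_target, rtt_s):
    """RTTs (and seconds) needed to climb back to cwnd_target after one loss."""
    cwnd = update(cwnd_target, loss=True)
    n = 0
    while cwnd < cwnd_target:
        cwnd = update(cwnd)
        n += 1
    return n, n * rtt_s

# Example: a 1 Gbit/s path, 1500-byte segments, rtt 100 ms -> cwnd of ~8300 segments
print(rtts_to_recover(reno_round, 8300, 0.1))      # thousands of RTTs: minutes
print(rtts_to_recover(scalable_round, 8300, 0.1))  # roughly a dozen RTTs: ~1 s
```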
Slide 5
Packet Loss with new TCP Stacks
TCP Response Function: throughput vs loss rate – the further a curve lies to the right, the faster the recovery
Packets were dropped in the kernel
MB-NG rtt 6 ms; DataTAG rtt 120 ms
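For reference, the response function of standard TCP that these curves generalise is, in the usual Mathis et al. approximation (MSS, RTT and loss probability p as normally defined):

```latex
\mathrm{Throughput} \;\approx\; \frac{\mathrm{MSS}}{\mathrm{RTT}}\,\sqrt{\frac{3}{2p}}
```

For a given loss rate the achievable rate falls as RTT grows, which is why the 120 ms DataTAG path is far more demanding than the 6 ms MB-NG path.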
Slide 6
High Throughput Demonstrations
[Diagram: Manchester (Geneva) host man03 – 1 GEth – Cisco 7609 – Cisco GSR – 2.5 Gbit SDH MB-NG core – Cisco GSR – Cisco 7609 – 1 GEth – host lon01, London (Chicago); both end hosts dual Xeon 2.2 GHz]
Send data with TCP; drop packets
Monitor TCP with Web100
Slide 7
High Performance TCP – MB-NG
Drop 1 in 25,000; rtt 6.2 ms; recover in 1.6 s
[Web100 traces for Standard, HighSpeed and Scalable TCP]
Slide 8
High Performance TCP – DataTAG
Different TCP stacks tested on the DataTAG network; rtt 128 ms; drop 1 in 10^6
HighSpeed: rapid recovery
Scalable: very fast recovery
Standard: recovery would take ~20 mins
Slide 9
End Systems: NICs & Disks
Slide 10
End Hosts & NICs: SuperMicro P4DP6
Use UDP packets to characterise the Host & NIC:
  Latency
  Throughput
  Bus Activity
SuperMicro P4DP6 motherboard; dual Xeon 2.2 GHz CPU; 400 MHz system bus; 66 MHz 64-bit PCI bus
[Plots: UDP receive wire rate (Mbit/s) vs transmit time per frame (µs) for 50–1472 byte packets (gig6-7, Intel PCI 66 MHz, 27 Nov 02); latency histograms N(t) for 64, 512, 1024 and 1400 byte packets; latency (µs) vs message length (bytes) with fits y = 0.0093x + 194.67 and y = 0.0149x + 201.75; PCI traces showing send PCI, receive PCI, 1400 bytes to NIC, 1400 bytes to memory, and PCI STOP asserted]
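The UDP request-response measurements summarised above were made with UDPmon; the following is only a minimal sketch of the idea using standard sockets (the host, port, packet count and size are example values, and the remote end is assumed to echo each packet back):

```python
# Minimal UDP request-response latency probe (illustrative; not the UDPmon code).
import socket, time, statistics

HOST, PORT, N, SIZE = "192.168.0.2", 5001, 1000, 1400  # example parameters

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.settimeout(1.0)
payload = b"\x00" * SIZE

samples = []
for _ in range(N):
    t0 = time.perf_counter()
    sock.sendto(payload, (HOST, PORT))          # remote end must echo the packet
    data, _ = sock.recvfrom(SIZE)
    samples.append((time.perf_counter() - t0) * 1e6)  # round-trip latency in us

print(f"mean {statistics.mean(samples):.1f} us, "
      f"stdev {statistics.pstdev(samples):.2f} us, n={len(samples)}")
```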
Slide 11
Host, PCI & RAID Controller Performance
RAID5 (striped with redundancy) controllers tested:
  3Ware 7506 Parallel 66 MHz
  3Ware 7505 Parallel 33 MHz
  3Ware 8506 Serial ATA 66 MHz
  ICP Serial ATA 33/66 MHz
Tested on a dual 2.2 GHz Xeon Supermicro P4DP8-G2 motherboard
Disks: Maxtor 160 GB 7200 rpm, 8 MB cache
Read-ahead kernel tuning: /proc/sys/vm/max-readahead = 512
RAID0 (striped): Read 1040 Mbit/s, Write 800 Mbit/s
[Plots: disk–memory read speeds and memory–disk write speeds]
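A minimal way to apply the read-ahead setting quoted above on a 2.4-series kernel (needs root; the path is the one given on the slide):

```python
# Apply the read-ahead tuning quoted above (2.4-series kernels; run as root).
READAHEAD_PROC = "/proc/sys/vm/max-readahead"

with open(READAHEAD_PROC, "w") as f:
    f.write("512\n")

with open(READAHEAD_PROC) as f:
    print("max-readahead =", f.read().strip())
```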
Slide 12
BaBar Case Study: RAID BW & PCI Activity – the performance of the end host / disks
3Ware 7500-8 RAID5, parallel EIDE; the 3Ware controller forces the PCI bus to 33 MHz
Transfers from the BaBar Tyan host to the MB-NG SuperMicro host
Network memory-to-memory: 619 Mbit/s
Disk-to-disk throughput with bbcp: 40–45 MBytes/s (320–360 Mbit/s) – the PCI bus is effectively full!
User throughput ~250 Mbit/s
[Plots: PCI activity reading from and writing to the RAID5 disks]
Slide 13
Data Transfer Applications
Slide 14
The Tests (being) Made
Each application is run with each TCP stack on three host/network combinations: SuperMicro on MB-NG, SuperMicro on SuperJANET4, and BaBar on SuperJANET4.

App      TCP stacks tested
Iperf    Standard, HighSpeed, Scalable
bbcp     Standard, HighSpeed, Scalable
bbftp    Standard, HighSpeed, Scalable
apache   Standard, HighSpeed, Scalable
Gridftp  Standard, HighSpeed, Scalable
Slide 15
Topology of the MB-NG Network
[Diagram. Key: Gigabit Ethernet; 2.5 Gbit POS access; MPLS; admin. domains.
Manchester domain (man01, man02, man03), UCL domain (lon01, lon02, lon03) and RAL domain (ral01, ral02), each behind Cisco 7609 edge/boundary routers, interconnected across the UKERNA development network; HW RAID on two of the end hosts.]
Slide 16
Topology of the Production Network
[Diagram. Key: Gigabit Ethernet; 2.5 Gbit POS access; 10 Gbit POS.
Manchester domain (man01, HW RAID) and RAL domain (ral01, HW RAID) connected across the production network via 3 routers and 2 switches.]
Slide 17
Average Transfer Rates (Mbit/s)

App      TCP Stack   SuperMicro on  SuperMicro on  BaBar on
                     MB-NG          SuperJANET4    SuperJANET4
Iperf    Standard    940            350-370        425
         HighSpeed   940            510            570
         Scalable    940            580-650        605
bbcp     Standard    434            290-310        290
         HighSpeed   435            385            360
         Scalable    432            400-430        380
bbftp    Standard    400-410        325            320
         HighSpeed   370-390        380
         Scalable    430            345-532        380
apache   Standard    425            260            300-360
         HighSpeed   430            370            315
         Scalable    428            400            317
Gridftp  Standard    405            240
         HighSpeed   320
         Scalable    335
Slide 18
iperf Throughput + Web100
SuperMicro on MB-NG network, HighSpeed TCP: line speed 940 Mbit/s; DupACKs? < 10 (expect ~400)
BaBar on production network, Standard TCP: 425 Mbit/s; DupACKs 350–400 – re-transmits
Slide 19
bbftp: Host & Network Effects
2 GByte file; RAID5 disks: 1200 Mbit/s read, 600 Mbit/s write; Scalable TCP
BaBar + SuperJANET: instantaneous 220–625 Mbit/s
SuperMicro + SuperJANET: instantaneous 400–665 Mbit/s for 6 s, then 0–480 Mbit/s
SuperMicro + MB-NG: instantaneous 880–950 Mbit/s for 1.3 s, then 215–625 Mbit/s
Slide 20
bbftp: What else is going on?
Scalable TCP; BaBar + SuperJANET and SuperMicro + SuperJANET
Congestion window – dupACK
Is the variation not TCP related? Candidates: disk speed / bus transfer, the application itself
Slide 21
Applications: Throughput (Mbit/s)
HighSpeed TCP; 2 GByte file; RAID5; SuperMicro + SuperJANET
[Plots for bbcp, bbftp, Apache and Gridftp]
Previous work used RAID0 (not disk limited)
Slide 22
Summary, Conclusions & Thanks
Motherboards, NICs, RAID controllers and disks matter
The NICs should be well designed:
  NIC should use 64-bit 133 MHz PCI-X (66 MHz PCI can be OK)
  NIC/drivers: CSR access / clean buffer management / good interrupt handling
Worry about the CPU–memory bandwidth as well as the PCI bandwidth
  Data crosses the memory bus at least 3 times
Separate the data transfers – use motherboards with multiple 64-bit PCI-X buses
  32-bit 33 MHz is too slow for Gigabit rates
  64-bit 33 MHz is > 80% used
Choose a modern high-throughput RAID controller
  Consider SW RAID0 of RAID5 HW controllers
Need plenty of CPU power for sustained 1 Gbit/s transfers
Work with campus network engineers to eliminate bottlenecks and packet loss
  High-bandwidth link to your server
  Look for access-link overloading / old Ethernet equipment / flow-limitation policies
Use of jumbo frames, interrupt coalescence and tuning the PCI-X bus helps
New TCP stacks are stable and run with 10 Gigabit Ethernet NICs
  New stacks give better response & performance
Still need to set the TCP buffer sizes (see the sketch after this list):
  System maximums in collaboration with the sysadmin
  Socket sizes in the application
Application architecture & implementation is also important
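A minimal sketch of both halves of the buffer-size advice above, assuming a Linux host (the sysctl names are the standard Linux ones; the 16 MB value is illustrative, not taken from the slide):

```python
# Illustrative TCP buffer tuning for bulk transfers on Linux (values are examples).
import socket

BUF_BYTES = 16 * 1024 * 1024  # example: ~16 MB, sized to the bandwidth*RTT product

# 1) System maximums (set by the sysadmin), e.g. in /etc/sysctl.conf:
#    net.core.rmem_max = 16777216
#    net.core.wmem_max = 16777216
#    net.ipv4.tcp_rmem = 4096 87380 16777216
#    net.ipv4.tcp_wmem = 4096 65536 16777216

# 2) Socket sizes in the application, requested before connect()/listen():
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, BUF_BYTES)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, BUF_BYTES)
print("requested", BUF_BYTES,
      "got", sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
```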
Slide 23
More Information – Some URLs
MB-NG project web site: http://www.mb-ng.net/
DataTAG project web site: http://www.datatag.org/
UDPmon / TCPmon kit + writeup: http://www.hep.man.ac.uk/~rich/net
Motherboard and NIC tests: www.hep.man.ac.uk/~rich/net/nic/GigEth_tests_Boston.ppt & http://datatag.web.cern.ch/datatag/pfldnet2003/
"Performance of 1 and 10 Gigabit Ethernet Cards with Server Quality Motherboards", FGCS special issue, 2004
TCP tuning information: http://www.ncne.nlanr.net/documentation/faq/performance.html & http://www.psc.edu/networking/perf_tune.html
TCP stack comparisons: "Evaluation of Advanced TCP Stacks on Fast Long-Distance Production Networks", Journal of Grid Computing, 2004
Slide 24
Backup Slides
Slide 25
SuperMicro P4DP6: Throughput, Intel Pro/1000
Max throughput 950 Mbit/s; no packet loss
CPU utilisation on the receiving PC was ~25% for packets larger than 1000 bytes, 30–40% for smaller packets
Motherboard: SuperMicro P4DP6; chipset: Intel E7500 (Plumas); CPU: dual Xeon Prestonia 2.2 GHz; PCI 64-bit 66 MHz; RedHat 7.2, kernel 2.4.19
[Plots (gig6-7, Intel PCI 66 MHz, 27 Nov 02): receive wire rate (Mbit/s) and % CPU idle on the receiver vs transmit time per frame (µs), for packet sizes 50–1472 bytes]
Slide 26
SuperMicro P4DP6: Latency, Intel Pro/1000
Some steps; slope 0.009 µs/byte; slope of the flat sections: 0.0146 µs/byte; expect 0.0118 µs/byte
Latency histograms: no variation with packet size; FWHM 1.5 µs; confirms the timing is reliable
Motherboard: SuperMicro P4DP6; chipset: Intel E7500 (Plumas); CPU: dual Xeon Prestonia 2.2 GHz; PCI 64-bit 66 MHz; RedHat 7.2, kernel 2.4.19
[Plots (Intel, 64-bit 66 MHz): latency (µs) vs message length (bytes) with fits y = 0.0093x + 194.67 and y = 0.0149x + 201.75; latency histograms N(t) for 64, 512, 1024 and 1400 byte packets]
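The "expect 0.0118 µs/byte" slope can be reproduced from the bus and wire speeds; a back-of-envelope check, assuming each byte crosses a 64-bit 66 MHz PCI bus (~528 MByte/s) once on each host plus the Gigabit Ethernet wire (125 MByte/s):

```latex
\frac{1}{528\ \mathrm{MB/s}} + \frac{1}{125\ \mathrm{MB/s}} + \frac{1}{528\ \mathrm{MB/s}}
\;\approx\; (0.0019 + 0.0080 + 0.0019)\ \mu\mathrm{s/byte}
\;\approx\; 0.0118\ \mu\mathrm{s/byte}
```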
Slide 27
SuperMicro P4DP6: PCI, Intel Pro/1000
1400 bytes sent, wait 12 µs: ~5.14 µs on the send PCI bus (~68% bus occupancy); ~3 µs on the PCI bus for the data receive
CSR access inserts PCI STOPs: the NIC takes ~1 µs per CSR access – the CPU is faster than the NIC!
Similar effect with the SysKonnect NIC
Motherboard: SuperMicro P4DP6; chipset: Intel E7500 (Plumas); CPU: dual Xeon Prestonia 2.2 GHz; PCI 64-bit 66 MHz; RedHat 7.2, kernel 2.4.19
Slide 28
RAID0 Performance (1)
3Ware 7500-8 RAID0, parallel EIDE; Maxtor 3.5 Series DiamondMax Plus 9, 120 GB, ATA/133
RAID stripe size 64 kBytes
Write: slight increase with the number of disks
Read: 3 disks OK
Write 100 MBytes/s, Read 130 MBytes/s
[Plots: read and write throughput (MBytes/s) vs file size (MBytes) for 2-, 3- and 4-disk RAID0, 64k stripe]
Slide 29
RAID0 Performance (2)
Maxtor 3.5 Series DiamondMax Plus 9, 120 GB, ATA/133
Write: no difference with stripe size
Read: larger stripes lower the performance
Write 100 MBytes/s, Read 120 MBytes/s
[Plots: disk–memory write and read throughput (MBytes/s) vs file size (MBytes) for a 3-disk RAID0 with stripe sizes 64, 128, 256, 512 and 1000]
Slide 30
RAID5 Disk Performance vs readahead_max
BaBar disk server: Tyan Tiger S2466N motherboard, one 64-bit 66 MHz PCI bus, Athlon MP2000+ CPU, AMD-760 MPX chipset, 3Ware 7500-8 RAID5, 8 × 200 GB Maxtor IDE 7200 rpm disks
Note the VM parameter readahead_max
Disk to memory (read): max throughput 1.2 Gbit/s (150 MBytes/s)
Memory to disk (write): max throughput 400 Mbit/s (50 MBytes/s) [not as fast as RAID0]
Slide 31
Host, PCI & RAID Controller Performance
RAID0 (striped) & RAID5 (striped with redundancy) controllers tested:
  3Ware 7506 Parallel 66 MHz
  3Ware 7505 Parallel 33 MHz
  3Ware 8506 Serial ATA 66 MHz
  ICP Serial ATA 33/66 MHz
Tested on a dual 2.2 GHz Xeon Supermicro P4DP8-G2 motherboard
Disks: Maxtor 160 GB 7200 rpm, 8 MB cache
Read-ahead kernel tuning: /proc/sys/vm/max-readahead
Slide 32
Serial ATA RAID Controllers, RAID5
3Ware 66 MHz PCI and ICP 66 MHz PCI
[Plots: read and write throughput (Mbit/s) vs file size (MBytes) for 4-disk RAID5 on the 3Ware and ICP 66 MHz SATA controllers, for readahead_max = 31, 63, 127, 256, 512 and 1200]
Slide 33
RAID Controller Performance
[Plots: read and write speeds for RAID0 and RAID5]
Slide 34
GridFTP Throughput + Web100
RAID0 disks: 960 Mbit/s read, 800 Mbit/s write
Throughput (Mbit/s): see it alternate between 600/800 Mbit/s and zero; data rate 520 Mbit/s
Cwnd smooth; no dup ACKs / send stalls / timeouts
Slide 35
HTTP Data Transfers
HighSpeed TCP; same hardware; RAID0 disks
Bulk data moved by web servers: Apache web server out of the box!
Prototype client using the curl HTTP library; 1 MByte TCP buffers; 2 GByte file
Throughput ~720 Mbit/s
Cwnd: some variation; no dup ACKs / send stalls / timeouts
Slide 36
bbcp & GridFTP Throughput
RAID5, 4 disks; Manchester – RAL; 2 GByte file transferred
bbcp: mean ~710 Mbit/s
GridFTP: see many zeros; mean ~620 Mbit/s
DataTAG altAIMD kernel in BaBar & ATLAS